| Programming Competition |
|
|---|
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 22 May 2011 17:52
| Hi, I would like to propose a little programming competition. This competition is targeted for 68060 and 68050 CPUs. Who can write the fastest memclear routine? Who can write the fastest memcopy routine? And who can write the fastest routine which sums up a memblock? I'm curious to see what clever ideas you have. Cheers
| |
Istvan Szekeres Hungary
| | Posts 60 22 May 2011 18:07
| Competition for 68060? Isn't it too early for this? I'm a coder, but I only use 68000 code... I think the 060 is not a popular CPU yet. Or is it?
| |
André Jernung Sweden
| | (MX-Board Owner) Posts 988 22 May 2011 18:40
| Istvan Szekeres wrote:
| Competition for 68060? Isn't it too early for this? I'm a coder, but I only use 68000 code... I think the 060 is not a popular CPU yet. Or is it?
|
The 68060 has been available for 17 years now and is one of the most popular 68k-based architectures when it comes to competitive optimized coding, for example in the Amiga or Atari demoscene.
| |
Team Chaos Leader USA
| | (Moderator) Posts 2094 22 May 2011 20:02
| Gunnar von Boehn wrote:
| Who can write the fastest memcopy routine?
|
Matt Hey FTW!
| |
Team Chaos Leader USA
| | (Moderator) Posts 2094 22 May 2011 20:05
| Szia :) Istvan Szekeres wrote:
| Competition for 68060?
|
68060 is Awesome!
| I'm a coder, but I only use 68000 code...
|
There are rumors that 68060 can run 68000 code ;)
| |
Wojtek P Poland
| | Posts 1597 22 May 2011 20:07
| Gunnar von Boehn wrote:
| Hi, I would like to propose a little programming competition. This competition is targeted for 68060 and 68050 CPUs. Who can write the fastest memclear routine? Who can write the fastest memcopy routine?
|
Please tell me why you are so biased towards the very special and not really common case of primitive memcopy? This is not a measure of CPU performance. If you think that fast-fastRAM memcopy is so important, implement it in hardware: a small FPGA on the 060 card, and then in the main FPGA for the 050. It would work at DDR2 DRAM speed.
| |
Jakob Eriksson Sweden
| | (Moderator) Posts 1097 22 May 2011 22:49
| Memcopy is primitive yes, but shuffling stuff around in memory is a very neat thing to do in many programs.
| |
Istvan Szekeres Hungary
| | Posts 60 22 May 2011 23:05
| Team Chaos Leader wrote:
| Szia :) |
Szia to you too! Do you seriously speak Hungarian? I would never have thought I'd meet people like that here.
| |
Istvan Szekeres Hungary
| | Posts 60 22 May 2011 23:18
| André Jernung wrote:
| The 68060 has been available for 17 years now and is one of the most popular 68k-based architectures when it comes to competitive optimized coding, for example in the Amiga or Atari demoscene. |
Yes. But how many active Amiga users around the world do you think use an 060 CPU? And how many 060 owners do you think can program in machine code? That was my original point.
| |
Matt Hey USA
| | Posts 729 23 May 2011 04:23
| Gunnar von Boehn wrote:
| I would like to propose a little programming competition. This competition is targeted for 68060 and 68050 CPUs.
|
What size memory and dynamic or static size? What type of caching for source and destination and is the source data already in the cache? What wait states for source and destination memory? What alignment of source and destination? Gunnar von Boehn wrote:
| Who can write the fastest memclear routine?
|
Generally for 68060, I would expect a simple clr.l (a0)+ in a loop to do a nice job. Unrolling is not necessary. Perhaps clearing a cache line to 0 in the cache and using move16 from that line in a loop would be fastest for very large clears (move16 has no (a0),(a1)+ form, so a fixed source would need the absolute form, move16 xxx.L,(a1)+). It's probably not practical on the Amiga, though, because the sizes to be cleared are small. Gunnar von Boehn wrote:
| Who can write the fastest memcopy routine?
|
Generally for 68060, move.l (a0)+,(a1)+ in a loop is fastest for small to medium sized copies. Unrolling does not improve performance (beyond 2 anyway). move16 (a0)+,(a1)+ is fastest for large copies. Unrolling improves performance for move16 but a larger and larger memory copy is needed to see the gain. Very few copies are large enough on the classic Amiga to make move16 worthwhile. We have looked at memory copy routines before. Memory copying is not difficult on the 68060. What more is there to explore? Gunnar von Boehn wrote:
| And who can write the fastest routine which sums up a memblock?
|
Why sum? Do you mean checksum (which kind)? Wojtek P wrote:
| If you think that fast-fastRAM memcopy is so important implement it in hardware - small FPGA on 060 card and then in main FPGA for 050.
|
Chip to fast memory copies may need to be relatively fast if DMA only works in chip memory. A coprocessor memory copy would have the lowest setup overhead and be more efficient for smaller copies. move16 already acts as a coprocessor but is basic (and simple).
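For readers who don't write 68k assembly, here is a rough C rendering of the two simple longword loops described above. It is only a sketch: the function names are made up for illustration, the buffers are assumed to be longword-aligned and non-overlapping, and a compiler is not guaranteed to emit exactly clr.l (a0)+ or move.l (a0)+,(a1)+ for these loops.

```c
#include <stddef.h>
#include <stdint.h>

/* Clear n_longs 32-bit words starting at dst.
   Rough C shape of a "clr.l (a0)+" loop (hypothetical helper name). */
void clear_longs(uint32_t *dst, size_t n_longs)
{
    while (n_longs--)
        *dst++ = 0;
}

/* Copy n_longs 32-bit words from src to dst.
   Rough C shape of a "move.l (a0)+,(a1)+" loop (hypothetical helper name).
   Assumes the two buffers do not overlap. */
void copy_longs(uint32_t *dst, const uint32_t *src, size_t n_longs)
{
    while (n_longs--)
        *dst++ = *src++;
}
```

Both loops are deliberately not unrolled, in line with the remark above that unrolling brings little on the 68060.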
| |
Chris Dennett United Kingdom
| | Posts 135 23 May 2011 04:35
| The post above you needs fixing :D It only has two [/_quote] tags when it needs 3 :)
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 23 May 2011 07:52
| Hello, I see a number of questions. Let me try to answer them. Istvan Szekeres wrote:
| I think the 060 is not a popular CPU yet. Or is it?
|
Well, the times of the 68000, 68020, 68030 and 68040 are over. The NATAMI comes with a 68060 CPU and a 68050 CPU. Both cores are much better than those 68000-68040 CPUs. I think it makes sense now to look forward and to write programs that take full advantage of the 68060 or 68050. Wojtek P wrote:
| Please tell me why you are so biased towards the very special and not really common case of primitive memcopy?
|
You are right, memcopy is not CPU performance but system performance. A memory interface has 3 main attributes:
1) Read performance
2) Write performance
3) Copy performance
A CPU has other performance attributes. For a CPU the most important attributes are:
1) Arithmetic performance
2) Control flow performance (IF/THEN/ELSE)
3) Subroutine call performance
4) Floating point performance
Wojtek P wrote:
| If you think that fast-fastRAM memcopy is so important implement it in hardware.
|
All software will at the end of the day be limited by one of the above 7 system attributes. If your attributes are generally high then all the software will benefit from this.
Example: The 68060 CPU card of the NATAMI has the fastest memory around. The SRAM on the NATAMI CPU card is really fast. Therefore the 68060 has no problem even beating 600 MHz AMIGA PowerPC systems in memory performance. If you use the SRAM in the 2nd level cache mode, all the available software will benefit from this. All software, old and new, will run swiftly.
All programs that actually "DO" something will have to work with memory. The 3 main memory use cases will always be:
1) Reading memory (and processing)
2) Storing some results (writing memory)
3) Copying buffers or data around
This is common to all software around. I think it makes good sense to review those 3 main types of memory usage together. If a clever coder knows a clever trick to increase READ performance, then let's talk about this! Let's measure this and let's show how to use this. We can then all learn from this together and use it in our code.
Tricks to improve read performance can be used in so many code routines, be they CRC checking, network stack routines or datatype routines. Having to read and process data is a common task. If there is a clever way to speed this up, then let's share this knowledge. Matt Hey wrote:
| What size memory and dynamic or static size?
|
For the sake of the competition I would say:
- Let's read and process 1MB of fast memory.
- Let's write 1MB of fast memory.
- Let's copy 512KB from fast memory to fast memory, and let's copy 512KB from fast memory to chip memory.
I would propose that we mimic the popular STREAM benchmark, but tuned for integer and 68k. :-) Matt Hey wrote:
| What wait states for source and destination memory?
|
0 wait states. But if you know different tricks for different wait states, then please share them all. People might want to learn them all. Matt Hey wrote:
| What alignment of source and destination?
|
Well aligned, to 16-byte boundaries. Matt Hey wrote:
| Why sum? Do you mean checksum (which kind)?
|
Sum like: ADD.L (A0)+,D0. Just as an example of reading with very simple processing. My main goal is to share tricks to improve read performance. I'm looking forward to see your
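As a reference for what the summing routine is expected to compute, here is a minimal C sketch of the ADD.L (A0)+,D0 idea. The function name is hypothetical, and the sum is assumed to wrap around at 32 bits the way a repeated ADD.L into D0 would.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n_longs 32-bit words starting at src, wrapping modulo 2^32,
   like repeating ADD.L (A0)+,D0 (hypothetical helper name). */
uint32_t sum_longs(const uint32_t *src, size_t n_longs)
{
    uint32_t sum = 0;
    while (n_longs--)
        sum += *src++;
    return sum;
}
```

An optimized assembly entry can be checked against this routine on the same 1MB block before its speed is compared.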
| |
Phil "meynaf" G. France
| | (Natami Team) Posts 393 23 May 2011 08:51
| This thread reminds me of the c2p competitions I saw at a coding party long ago :) But the results might be disappointing; the fastest will probably not be much more than 5-10% better than "standard" code. Not worth doing, apart from the fun of it. That said, I don't know how my ol' 030 copymem behaves on an 060 ;-)
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 23 May 2011 11:04
| Phil G. wrote:
| But the results might be disappointing; the fastest will probably not be much more than 5-10% better than "standard" code. |
Actually I would be VERY VERY happy with this result. This would show us that all code will run very well. I can tell you a "war-time story" of a brand new high-end system. This high-end system reaches on paper a memory throughput of 50,000 MB/sec. Super value, right? But the reality is different. Normal programs, C code, Glibc memcopy, Linux Kernel, all reach just about 300 MB/sec. I prefer an honest "lowend" system which reaches for all code e.g. 160MB/sec over a "paper tiger" which is advertised with 1GB/sec but in reality only reaches 120MB/sec.
| |
Wojtek P Poland
| | Posts 1597 23 May 2011 12:31
| Jakob Eriksson wrote:
| Memcopy is primitive yes, but shuffling stuff around in memory is a very neat thing to do in many programs.
|
If any program spends CONSIDERABLE time doing memcopy, with the exception of very very short blocks, it should be improved.
| |
Wojtek P Poland
| | Posts 1597 23 May 2011 12:46
| Gunnar von Boehn wrote:
| I prefer an honest "lowend" system which reaches for all code e.g. 160MB/sec over a "paper tiger" which is advertised with 1GB/sec but in reality only reaches 120MB/sec.
|
Me too, paper values are useless. But please concentrate on good software. Good software does not need to regularly copy large blocks of RAM, but it often needs to do very small memcopies, often fitting entirely in the L1 cache. For such cases the simplest code, like move.l (A0)+,(A1)+ and dbra, is the best.
But some programs do a lot of not-so-complex computing on large blocks. For such cases, instead of concentrating on slight improvements for the 060, which is a temporary thing for the Natami, concentrate on automatic prefetch for the 050.
Another possible solution is to add an opcode, PREFETCH options,register, where the options are:
00) standard operation
01) read prefetch, cache
10) read prefetch, cache only the current and one future cache line
11) write buffering, flush the cache line immediately when the address register points to another one
This mode should be automatically reset to the default on any write to that register other than automatic post/pre de/increment. This improves both memcopy and linear performance.
For example, if you want to process a huge amount of data linearly, or a small amount that you are sure will not be needed again soon, then do PREFETCH 10,A0, process the data using, say, (A0)+, and at the end just reload A0 with something else or do PREFETCH 00,A0.
When the processor hits the PREFETCH instruction it should immediately prefetch the cache line pointed to by A0. At the first instruction which uses pre/post de/increment addressing it should prefetch the next or previous line on read. When A0 points to another cache line, it should immediately start a memory read replacing the last processed cache line. Similar behaviour for 11.
Give the programmer a chance to tell the CPU what he or she knows well while writing the program and what the CPU has no chance to know. This way you get ZERO latency for linear processing with up to 8 streams of data processed (read or written).
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 23 May 2011 15:14
| Wojtek P wrote:
| Other possible solution is to add opcode PREFETCH options,register
|
And this is absolutely the wrong concept, as it needs "manual programming" to be effective. This means normal code will run badly. Besides, you don't need this on 68K. If you want to fetch a cache line then just use it, e.g. TST (EA)
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 23 May 2011 19:18
| Move16 is the fastest, but when using normal loops it's probably best to do tst <EA> prefetching. I found this on ada.untergrund.net: - When writing to uncached memory, it pays off to do a read of each cache line some time before the write, in order to get the 8 cycle stall on the read rather than the 16 cycle stall on the write. Blueberry wrote:
| #6 - Posted: 10 Mar 2011 16:33 ...
Move16 from uncached memory takes 38 cycles. The first cycle can overlap with the last cycle of a floating point arithmetic instruction, but apart from that, it doesn't seem to permit overlap with anything.
Move16 from cached memory takes 22 cycles. The last two cycles can overlap with anything that does not access memory or the cache.
Some further test results, while we are at it:
A read from an uncached cache line which does not cause a dirty cache line to be evicted causes a stall of 8 cycles before the instruction. During the following 11 cycles, no instruction (apart from the original read) can access memory or the cache (not even the same cache line).
A write to an uncached cache line which causes a dirty cache line to be evicted causes a stall of 16 cycles before the instruction. During the following 4 cycles, no instruction (apart from the original write) can access memory or the cache (not even the same cache line). During the next 11 cycles, instructions can access the cache, but a cache miss causes a stall until the push buffer (writing the dirty cache line) is empty.
Thus, some lower bounds (when not using move16):
- Reading a large piece of memory: 19 cycles per cache line
- Writing a large piece of memory: 31 cycles per cache line
- Reading and writing the same large piece of memory: 32 cycles per cache line
- Reading and writing different large pieces of memory: 50 cycles per cache line
Some conclusions:
- Fast memory is not so fast after all (though still many times faster than chip).
- There is a lot of potential for combining memory accesses with computations, as long as the computations do not need to access the cache.
- When writing to uncached memory, it pays off to do a read of each cache line some time before the write, in order to get the 8 cycle stall on the read rather than the 16 cycle stall on the write. |
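To put the quoted lower bounds into MB/sec, here is a small worked calculation in C. It is a sketch under two assumptions that are not in the quote: the 16-byte cache line size of the 68060, and a 50 MHz clock picked purely as an example. The function name is hypothetical.

```c
/* Bytes per 68060 cache line (assumption stated in the text above). */
#define LINE_BYTES 16.0

/* Throughput in MB/sec (1 MB = 10^6 bytes) for a loop that needs
   cycles_per_line CPU cycles per cache line at clock_mhz MHz.
   Hypothetical helper for illustration. */
double mb_per_sec(double clock_mhz, double cycles_per_line)
{
    return clock_mhz * LINE_BYTES / cycles_per_line;
}
```

At the assumed 50 MHz this gives roughly 42 MB/sec for reading (19 cycles), 26 MB/sec for writing (31 cycles), 25 MB/sec for reading and writing the same block (32 cycles), and 16 MB/sec of copied data for reading and writing different blocks (50 cycles).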
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 23 May 2011 19:37
| Here is an untested copy loop that uses the FPU. Assume a0 and a1 point to two buffers aligned to 16 bytes and d7 is the loop count.
        fmove.d (a0)+,fp0
        fmove.d fp0,(a1)+       ; align memory by 8
        bra.b   .inn
.loop   fmove.d fp0,(a1)+       ; cached
        fmove.d fp1,(a1)+       ; cached
        fmove.d fp2,(a1)+       ; cached
        fmove.d fp3,(a1)+       ; cached
.inn    tst.w   23(a1)          ; fetch 2 cache lines (write): ((a1+8) mod 32)-1
        subq.l  #1,d7           ; free
        fmove.d (a0)+,fp0       ; cached
        fmove.d (a0)+,fp1       ; fetch cache line
        fmove.d (a0)+,fp2       ; cached
        fmove.d (a0)+,fp3       ; fetch cache line
        bne.b   .loop           ; free
        fmove.d fp0,(a1)+       ; store the last group loaded above
        fmove.d fp1,(a1)+
        fmove.d fp2,(a1)+
        fmove.d fp3,(a1)+
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 23 May 2011 20:53
| Here is the untested integer version (d7 is the loop count):
        move.l  (a0)+,(a1)+     ; align: a0 and a1 are now at (mod 16) + 4
        bra.b   .inn
.loop   move.l  d0,(a1)+        ; cached
        move.l  d1,(a1)+        ; cached
        move.l  d2,(a1)+        ; cached
        move.l  a3,(a1)+        ; cached
.inn    tst.w   12(a1)          ; fetch next cache line
        move.l  (a0)+,d0        ; cached
        move.l  (a0)+,d1        ; cached
        move.l  (a0)+,d2        ; cached
        subq.l  #1,d7           ; free
        move.l  (a0)+,a3        ; fetch cache line (movea leaves the flags from subq intact)
        bne.b   .loop           ; free
        move.l  d0,(a1)+        ; store the last group loaded above
        move.l  d1,(a1)+
        move.l  d2,(a1)+
        move.l  a3,(a1)+
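The control flow of these loops (enter in the middle, load a group, then loop back to store it) can be hard to follow, so here is a C model of the same structure. It is only a sketch of the idea, not a translation: the function name is hypothetical, the volatile read merely stands in for the tst.w that touches the destination line before it is written, and a C compiler will not reproduce the 68060 timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n_groups groups of four longwords from src to dst.
   Each pass touches the destination, loads one group into locals,
   and the stores for that group happen on the next pass
   (or after the loop for the final group). Hypothetical helper. */
void copy_pipelined(uint32_t *dst, const uint32_t *src, size_t n_groups)
{
    uint32_t r0, r1, r2, r3;

    if (n_groups == 0)
        return;

    for (;;) {
        /* Read the destination first, standing in for "tst.w 12(a1)". */
        (void)*(volatile uint32_t *)dst;

        /* Load one group (the part after the .inn label). */
        r0 = src[0]; r1 = src[1]; r2 = src[2]; r3 = src[3];
        src += 4;

        if (--n_groups == 0)
            break;

        /* Store the group just loaded (the part at the .loop label). */
        dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
        dst += 4;
    }

    /* The final group is still in the locals and must be written out. */
    dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
}
```

The model makes it visible that the last group loaded is only written by the stores after the loop.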
| |
|