 |
Welcome to the Natami / Amiga ForumThis forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.
|
Do you have ideas and feature wishes? Post them here and discuss your ideas. |
|
|---|
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 17 Mar 2011 09:35
| S P wrote:
| My algorithm solves the same problem.
|
The CRC algortihm is defined. To be able to check CRC on Ethernet packets or CRC on ZIPS you need to implement a compatible algorithms. S P wrote:
| To optimize the slow CRC: Instead of using a lookup on each byte, you can expand the tablesize to lookup two bytes in one lookuptable 2^16*4 = 256KB. With a 256KB table you can halve the number of cycles needed, at a cost of more cachemisses. (256KB is still small enough to fit in a cache)
|
No 256KB is too big to cache. This "optimization" will slow the whole down a lot.
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 17 Mar 2011 09:45
| Gunnar von Boehn wrote:
| No 256KB is too big to cache. This "optimization" will slow the whole down a lot.
|
Depends on the CPU and memory speed. On Mc68000-Mc68030 it would probobly be a good Idea. How many cycles is a cacheline read on the Natami?
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 17 Mar 2011 09:57
| S P wrote:
| Gunnar von Boehn wrote:
| No 256KB is too big to cache. This "optimization" will slow the whole down a lot. |
Depends on the CPU and memory speed. |
The calculation is very simple. For a lookup to work good it should NOT be bigger than a single way of the Cache. Even the 2nd level cache of 512KB of many PowerPC Cores is too small for such a big lookup. Also warming up your the cache with the big lookup will take very very long. For the normal sceneario that you want to check the CRC on an Ethernet package or on a compressed file. The 256 entry lookup looks quite optimal too me. S P wrote:
| On Mc68000-Mc68030 it would probobly be a good Idea. |
68030 and previous 68Ks are slow and somewhat "stone age". We should advocate people how to code properly for new systems. S P wrote:
| How many cycles is a cacheline read on the Natami? |
The read is done in 1 cycle. And it has a latency of about 13 cycles. Its planned that for the 68070 a cache prefetch can be done in parallel to the normal cache usage. This means you should be able to do two independant cache reads and one cache write in parallel to one line prefetch per CPU cycle.I do not think that the 256 Entry Lookup Table is the bootleneck. The 68070 would technically allow you to do 2 reads per cycle ...
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 17 Mar 2011 10:35
| With a bigger table the code will be something like this: .nextword eor.w d0,d2 ;1 (first pipe) move.l d3,d0 ;1 (second pipe) swap d0 ;1 (second fused) ;2 cycle bubble move.l (a1,d2.w*4),d3 ;4 (first pipe) eor.l d0,d3 ;4 (fused) move.w (a0)+,d2 ;4 (second pipe) bne.b .nextword We save 4 cycles per word, but loose latancy in cachemisses. Cachesize: With a 64kb level1 cache, The average latency will be around 4cycles per word... But with cache prefetching this will be reduced..
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 17 Mar 2011 11:01
| The slow CRC can probobly be rewritten to: .loop eor.l d3,d0 move.l (a1,d2.w),d3 move.b (a0)+,d2 bne.b .loop With A1 pointing to a 1024kb table.. But than you need to change the table precalculation.This can be rewritten again remove the ALU stall.. .loop4 move.b (a0)+,d1 move.b (a0)+,d2 move.b (a0)+,d3 move.b (a0)+,d4 eor.l (a1,d1.w),d0 eor.l (a1,d2.w),d0 eor.l (a1,d3.w),d0 eor.l (a1,d4.w),d0 subq.l #1,d7 bne.b .loop4
| |
Wojtek P Poland
| | Posts 1597 17 Mar 2011 11:36
| Gunnar von Boehn wrote:
| The read is done in 1 cycle. And it has a latency of about 13 cycles.
|
What you mean exactly. That: a) cache hit have 1 cycle time and 1 cycle latency or b) cache hit have 1 cycle time and 13 cycle latency (terrible!) c) cache MISS have 13 cycle latency+memory latency. d) cache miss have 13 cycle average latency including memory
| |
Megol .
| | Posts 671 17 Mar 2011 14:48
| Found a interesting algorithm for CRC calulation here: EXTERNAL LINK It is not directly comparable with the byte-munching variant earlier in this thread as it consumes the input in 64 bit chunks/loop. Does anybody see errors or optimizations?Timing assuming 68050+no cache misses, move x, reg; op y, reg considered fused in all cases. ---- .next64 move.l (a0)+, d0 ; 1 fused eor.l d1, d0 ; 1 move.l d0, d1 ; 2 fused and.l #$ff, d1 ; 2 move.l d0, d2 ; 3 fused lsr.l #8, d2 ; 3 and.l #$ff, d2 ; 4 move.l tab88(d1*4), d1 ; 5 no stall lsr.l #16, d0 ; 6 move.l d0, d3 ; 7 fused and.l #$ff, d3 ; 7 lsr.l #8, d0 ; 8 and.l #$ff, d0 ; 9 move.l tab80(d2*4), d2 ; 10 no stall/fused eor.l d1, d2 ; 10 move.l tab72(d3*4), d1 ; 11 no stall/fused eor.l d2, d1 ; 11 move.l tab64(d0*4), d2 ; 12 no stall/fused eor.l d1, d2 ; 12 move.l (a0)+, d0 ; 13 move.l d0, d1 ; 14 fused and.l #$ff, d1 ; 14 lsr.l #8, d0 ; 15 move.l d0, d3 ; 16 and.l #$ff, d3 ; 16 lsr.l #8, d0 ; 17 move.l d0, d6 ; 18 fused and.l #$ff, d6 ; 18 lsr.l #8, d0 ; 19 (no and required) move.l tab56(d1*4), d4 ; 20 no stall/fused eor.l d2, d4 ; 20 move.l tab48(d3*4), d5 ; 21 no stall/fused eor.l d4, d5 ; 21 move.l tab40(d6*4), d4 ; 22 no stall/fused eor.l d5, d4 ; 22 move.l tab32(d0*4), d4 ; 23 no stall/fused eor.l d4, d1 ; 23 dbra d7, .next64
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 17 Mar 2011 17:46
| Ok gunnar. Here is a compatible loop without bubbles (68050) ------------------ bra.b .enter .loop4 move.l (a1,d4.w*4),d0 ;17 eor.l d5,d0 ;17 .enter move.b (a0)+,d1 ;1 eor.b d0,d1 ;2 move.b (a0)+,d2 ;3 lsr.l #8,d0 ;4 move.l (a1,d1.w*4),d5 ;5 eor.l d0,d5 ;5 eor.b d5,d2 ;6 move.b (a0)+,d3 ;7 lsr.l #8,d5 ;8 move.l (a1,d2.w*4),d0 ;9 eor.l d5,d0 ;9 eor.b d0,d3 ;11 move.b (a0)+,d4 ;10 lsr.l #8,d0 ;12 move.l (a1,d3.w*4),d5 ;13 eor.l d0,d5 ;13 eor.b d5,d4 ;14 lsr.l #8,d5 ;15 subq.l #1,d7 ;16 bne.b .loop4
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 17 Mar 2011 17:59
| Megol . wrote:
| Found a interesting algorithm for CRC calulation here: EXTERNAL LINK |
... lsr.l #16, d0 should be swap d0 ... move.l tab72(d3*4), d1 should be converted to (a0,d3*4) and you need 8 free registers- dbra d7, .next64 this is 1-2 cycles too. You can change from longword to word shifts and remove and'slsr.w #8, d2 ; 3 and.l #$ff, d2 ; 4 (remove) same here: lsr.w #8, d0 ; 8 and.l #$ff, d0 ; 9 (remove) This will save 2 cycles. I think 2 more ands can be removed farther down in the code too..
| |
Megol .
| | Posts 671 17 Mar 2011 18:30
| S P wrote:
|
Megol . wrote:
| Found a interesting algorithm for CRC calulation here: EXTERNAL LINK |
... lsr.l #16, d0 should be swap d0 ... move.l tab72(d3*4), d1 should be converted to (a0,d3*4) and you need 8 free registers- dbra d7, .next64 this is 1-2 cycles too. You can change from longword to word shifts and remove and's lsr.w #8, d2 ; 3 and.l #$ff, d2 ; 4 (remove) same here: lsr.w #8, d0 ; 8 and.l #$ff, d0 ; 9 (remove) This will save 2 cycles. I think 2 more ands can be removed farther down in the code too..
|
Thank you! I had completely forgotten the shift immediate limitation :/ Now let's hope the code is correctly translated from the original C code ;)
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 17 Mar 2011 18:59
| There are more tricks you can do.. Take a look at this: I only included half of the loop. It seems to be around 20 or 21 cycles in total. (for 8 bytes) ------------------------------------------------------------ .loop32 move.l (a0)+, d0 ; 1 fused eor.l d4, d0 ; 1 move.l d0,d1 ; 2 fused and.l #$00ff00ff,d1 ; 2 move.l d0,d2 ; 3 fused and.l #$ff00ff00,d2 ; 3 lsr.l #8,d2 ; 4 move.l tab88(a0,d1.w*4),d3 ; 5 fused eor.l d4, d3 ; 5 swap d1 ; 6 move.l tab80(a1,d2.w*4),d4 ; 7 fused eor.l d3, d4 ; 7 swap d2 ; 8 move.l tab72(a2,d1.w*4),d3 ; 9 fused eor.l d4, d3 ; 9 move.l tab80(a3,d2.w*4),d4 ; 10 fused eor.l d3, d4 ; 10 (...) subq.l #1,d7 ; 11 bne.b .loop32
| |
Matt Hey USA
| | Posts 726 18 Mar 2011 03:19
| Megol . wrote:
| move.l d0, d1 ; 2 fused and.l #$ff, d1 ; 2
|
On N68k, these can also be... mvz.b d0,d1 It would only be 1 word instead of 3 on N68k. It's 4 words on 68k which is a pretty nice size optimization. Of course, SP's suggestion may make this unnecessary. S P wrote:
| move.l d0,d1 ; 2 fused and.l #$00ff00ff,d1 ; 2 move.l d0,d2 ; 3 fused and.l #$ff00ff00,d2 ; 3
|
Yea, that's a good trick. I've used it to average the color values of 2 consecutive 16 bit pixels for scaling (creation of mipmaps). It makes a huge difference in performance on the 68060 where using long values can be twice as fast.
| |
Megol .
| | Posts 671 18 Mar 2011 10:15
| S P wrote:
| There are more tricks you can do.. Take a look at this: I only included half of the loop. It seems to be around 20 or 21 cycles in total. (for 8 bytes) ------------------------------------------------------------ .loop32 move.l (a0)+, d0 ; 1 fused eor.l d4, d0 ; 1 move.l d0,d1 ; 2 fused and.l #$00ff00ff,d1 ; 2 move.l d0,d2 ; 3 fused and.l #$ff00ff00,d2 ; 3 lsr.l #8,d2 ; 4 move.l tab88(a0,d1.w*4),d3 ; 5 fused eor.l d4, d3 ; 5 swap d1 ; 6 move.l tab80(a1,d2.w*4),d4 ; 7 fused eor.l d3, d4 ; 7 swap d2 ; 8 move.l tab72(a2,d1.w*4),d3 ; 9 fused eor.l d4, d3 ; 9 move.l tab80(a3,d2.w*4),d4 ; 10 fused ! STALL, d2 changed less than 2 clocks ago eor.l d3, d4 ; 10 (...) subq.l #1,d7 ; 11 bne.b .loop32
|
I haven't had time to check your optimizations yet but they look promising! However there are at least one stall in your code, please look above.(My optimizations was just to try avoid address generation stalls and allow as much fused execution as possible)
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 18 Mar 2011 10:27
| Yeah, I was too quick there.. The new N68050 is very powerful. 19 CISC instructions executed in 11 cycles (Without superscalar mode) This loop is without bubbles: bra.b .enter .loop32 move.l (a4,d2.w*4),d4 ; 11 eor.l d3, d4 ; 11 .enter move.l (a0)+, d0 ; 1 fused eor.l d4, d0 ; 1 move.l d0,d1 ; 2 fused and.l #$00ff00ff,d1 ; 2 move.l d0,d2 ; 3 fused and.l #$ff00ff00,d2 ; 3 lsr.l #8,d2 ; 4 move.l (a1,d1.w*4),d3 ; 5 fused eor.l d4, d3 ; 5 swap d1 ; 6 move.l (a2,d2.w*4),d4 ; 7 fused eor.l d3, d4 ; 7 swap d2 ; 8 move.l (a3,d1.w*4),d3 ; 9 fused eor.l d4, d3 ; 9 subq.l #1,d7 ; 10 bne.b .loop32 move.l (a4,d2.w*4),d4 eor.l d3, d4
| |
Megol .
| | Posts 671 18 Mar 2011 15:57
| S P wrote:
| Yeah, I was too quick there.. The new N68050 is very powerful. 19 CISC instructions executed in 11 cycles (Without superscalar mode) This loop is without bubbles: bra.b .enter .loop32 move.l (a4,d2.w*4),d4 ; 11 eor.l d3, d4 ; 11 .enter move.l (a0)+, d0 ; 1 fused eor.l d4, d0 ; 1 move.l d0,d1 ; 2 fused and.l #$00ff00ff,d1 ; 2 move.l d0,d2 ; 3 fused and.l #$ff00ff00,d2 ; 3 lsr.l #8,d2 ; 4 move.l (a1,d1.w*4),d3 ; 5 fused eor.l d4, d3 ; 5 swap d1 ; 6 move.l (a2,d2.w*4),d4 ; 7 fused eor.l d3, d4 ; 7 swap d2 ; 8 move.l (a3,d1.w*4),d3 ; 9 fused eor.l d4, d3 ; 9 subq.l #1,d7 ; 10 bne.b .loop32 move.l (a4,d2.w*4),d4 eor.l d3, d4
|
Very nice code :) Yeah the 68050 looks like a processor for us that like asm coding!
| |
Megol .
| | Posts 671 18 Mar 2011 18:29
| For something completely different, here's one example how the extended addressing word could be used: Standard: |D|REG|W|SC|1|B|I|BD|0|IS|Proposed: |D|REG|W|SC|1|M|A|BD|1|S|Y| This would allow things like: add.l {d1,d4*16}, d2 ; d2=d2+{d1+d4*16} move.l (a0,-d0*4), d0 lea 12(d2,-d3*16), a0 D=Data/Address register selection REG=register W=Word/Long selection SC=Scale B=Base suppress I=Index suppress BD=Base displacement IS=Index/indirect selection M=Memory suppress (1=no cache access, passes the calculated data) Could be used for other things for the LEA instruction. A=Address/data selection (for base register) X=indeX negate (alternatively invert) S=extra scaling bit
| |
Matt Hey USA
| | Posts 726 19 Mar 2011 03:37
| Megol . wrote:
| Proposed: |D|REG|W|SC|1|M|A|BD|1|S|Y| |
Should the Y above be X for indeX negate? Y is not in your reference. Sometimes it is better to work backwards from a memory anchor but I can usually make the index register negative and leave it negative as I work with it. I can't see where I would use negate very often and it is more work for the EA unit to do in series which is potentially slower. I prefer an encoding more like the standard that would allow a larger scale factor, base register update, and direct modes... Proposed 2: |D|REG|W|SC|1|B|I|BD|S|IS| S=extra scaling bit I=0 IS=%100 Direct (addressing) mode with index I=1 IS=%100 Direct (addressing) mode no index I=1 IS=%101 Base register update * * The base register can't be suppressed for base register update. The index suppress bit (I) must be 1 for this mode which would normally indicate to always suppress the index register but now the base register suppress bit (B) indicates whether the index register is suppressed for this mode only. I don't think the direct mode would be very easy to use because of the change/use delay (although 2 cycle delay like 68060 would help tremendously and should be possible on N68070). This would mostly limit the use by a compiler except as an LEA for a data register. It would allow more optimization for a few hand optimized assembler routines. The best part of the 3 new addressing enhancements is that they are almost free to add (little slowdown). I think a new optional N68k syntax would make programming easier for beginners and converts from other processors. I suggest... 1) Lower case instructions Example: move.l a0,d0 Reason: easier to read, no more SHOUTING, and more modern look 2) Allow a '+' sign instead of ',' in addressing modes. Example: move.l (1234+a0+d0*4) Reason: more understandable, better representation 3) Allow "<<" and a shift size as well as '*' for index scale Example: move.l (1234+a0+d0<<7) Reason: <<7 might be easier than *128 for some people/applications The older representations would of course be allowed too but something along these lines could be used for official documentation. The new style Motorola 68k syntax that primarily changed 1234(a0) to (1234,a0) is a nice improvement.
| |
Team Chaos Leader USA
| | (Moderator) Posts 2094 19 Mar 2011 05:09
| Matt Hey wrote:
| 2) Allow a '+' sign instead of ',' in addressing modes. Example: move.l (1234+a0+d0*4) Reason: more understandable, better representation
|
GreatIdea*16!
| |
Cesare Di Mauro Italy
| | Posts 526 19 Mar 2011 08:17
| Matt Hey wrote:
|
Megol . wrote:
| Proposed: |D|REG|W|SC|1|M|A|BD|1|S|Y| |
Sometimes it is better to work backwards from a memory anchor but I can usually make the index register negative and leave it negative as I work with it. I can't see where I would use negate very often and it is more work for the EA unit to do in series which is potentially slower. |
I agree. It's quite uncommon.
I prefer an encoding more like the standard that would allow a larger scale factor, base register update, and direct modes... Proposed 2: |D|REG|W|SC|1|B|I|BD|S|IS| S=extra scaling bit I=0 IS=%100 Direct (addressing) mode with index I=1 IS=%100 Direct (addressing) mode no index I=1 IS=%101 Base register update * * The base register can't be suppressed for base register update. The index suppress bit (I) must be 1 for this mode which would normally indicate to always suppress the index register but now the base register suppress bit (B) indicates whether the index register is suppressed for this mode only. I don't think the direct mode would be very easy to use because of the change/use delay (although 2 cycle delay like 68060 would help tremendously and should be possible on N68070). This would mostly limit the use by a compiler except as an LEA for a data register. It would allow more optimization for a few hand optimized assembler routines. The best part of the 3 new addressing enhancements is that they are almost free to add (little slowdown). |
What does it mean "direct mode"?
I think a new optional N68k syntax would make programming easier for beginners and converts from other processors. I suggest... 1) Lower case instructions Example: move.l a0,d0 Reason: easier to read, no more SHOUTING, and more modern look |
I agree. Lower case is my preferred way to write code.
2) Allow a '+' sign instead of ',' in addressing modes. Example: move.l (1234+a0+d0*4) Reason: more understandable, better representation |
Not for a 68000-addicted. :DAnyway, for novel people you are right.
3) Allow "<<" and a shift size as well as '*' for index scale Example: move.l (1234+a0+d0<<7) Reason: <<7 might be easier than *128 for some people/applications |
I prefer multiplication, which is easier to read.For me: * 4 -> longword. << 2 -> ? Not so immediate to understand.
| The older representations would of course be allowed too but something along these lines could be used for official documentation. The new style Motorola 68k syntax that primarily changed 1234(a0) to (1234,a0) is a nice improvement. |
I agree.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 19 Mar 2011 08:58
| I'm a little bit afraid of making the EA to complex. If we make the calcuation in the EA more complex, then the timing weill get worse and we will reduce our maximum clockrate. Regarding using DN as base: This is not good IMHO because of two reasons. We would then need another new DN Read-Port: This drives Chip-costs higher. We would also create a new hazard point. Instructions working with the DN register in the two previous cycle would create a bubble. The correct recognise these hazards we will need to add extra tacking logic. Also we can NOT update Dn Base registers as the EA does not have a write Port to the Dn Registers. These are a lot drawbacks... Regarding the extra Scale modes. Adding an extra Bit for the scale mode looks tempting from the software point of few. But it will double our MUX area needed to do the SHIFT in the EA-Unit. This looks bad from a HW designers point of view because its costly and could reduce our clockrate. I'll have to measure this to see how much room we have here before we looks performance...
| |
|
|
|
|