Home   News   Concept   AMIGA-Compatible   Hardware   Forum   Questions+Answers   Pictures   Contact & Team

Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



All TopicsNewsQAFeaturesTalkTEAMLogin to post    Create account
Do you have ideas and feature wishes? Post them here and discuss your ideas.

N68k Enhancements Revisitedpage  << 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 
Gunnar von Boehn
Germany
(Moderator)
Posts 5775
17 Mar 2011 09:35


S P wrote:

My algorithm solves the same problem.

The CRC algortihm is defined.
To be able to check CRC on Ethernet packets or CRC on ZIPS
you need to implement a compatible algorithms.

S P wrote:

To optimize the slow CRC:
 
Instead of using a lookup on each byte, you can expand the tablesize to lookup two bytes in one lookuptable 2^16*4 = 256KB.
 
With a 256KB table you can halve the number of cycles needed, at a cost of more cachemisses. (256KB is still small enough to fit in a cache)

No 256KB is too big to cache.
This "optimization" will slow the whole down a lot.


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
17 Mar 2011 09:45


Gunnar von Boehn wrote:

  No 256KB is too big to cache.
  This "optimization" will slow the whole down a lot.

Depends on the CPU and memory speed.

On Mc68000-Mc68030 it would probobly be a good Idea.

How many cycles is a cacheline read on the Natami?

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
17 Mar 2011 09:57


S P wrote:

 
Gunnar von Boehn wrote:

    No 256KB is too big to cache.
    This "optimization" will slow the whole down a lot.
 

 
  Depends on the CPU and memory speed.
 

  The calculation is very simple.
  For a lookup to work good it should NOT be bigger than a single way of the Cache.
 
  Even the 2nd level cache of 512KB of many PowerPC Cores is too small for such a big lookup.
 
  Also warming up your the cache with the big lookup will take very very long.
 
  For the normal sceneario that you want to check the CRC on an Ethernet package or on a compressed file. The 256 entry lookup looks quite optimal too me.
 
 
 
S P wrote:

  On Mc68000-Mc68030 it would probobly be a good Idea.
 

  68030 and previous 68Ks are slow and somewhat "stone age".
  We should advocate people how to code properly for new systems.
 
 
 
S P wrote:

  How many cycles is a cacheline read on the Natami?
 

The read is done in 1 cycle.
And it has a latency of about 13 cycles.
 
Its planned that for the 68070 a cache prefetch can be done in parallel to the normal cache usage.
 
This means you should be able to do two independant cache reads and one cache write in parallel to one line prefetch per CPU cycle.

I do not think that the 256 Entry Lookup Table is the bootleneck.
The 68070 would technically allow you to do 2 reads per cycle ...


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
17 Mar 2011 10:35


With a bigger table the code will be something like this:
 
  .nextword
eor.w  d0,d2          ;1 (first pipe)
move.l d3,d0          ;1 (second pipe)
swap d0                ;1 (second fused)
                        ;2 cycle bubble
move.l  (a1,d2.w*4),d3 ;4 (first pipe)
eor.l  d0,d3          ;4 (fused)
move.w  (a0)+,d2      ;4 (second pipe)
bne.b .nextword
 
  We save 4 cycles per word, but loose latancy in cachemisses.
 
  Cachesize: With a 64kb level1 cache, The average latency will be around 4cycles per word... But with cache prefetching this will be reduced..

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
17 Mar 2011 11:01


The slow CRC can probobly be rewritten to:
 
  .loop
  eor.l d3,d0
  move.l (a1,d2.w),d3
  move.b (a0)+,d2
  bne.b .loop
 
  With A1 pointing to a 1024kb table.. But than you need to change the table precalculation.

This can be rewritten again remove the ALU stall..

.loop4

move.b (a0)+,d1
move.b (a0)+,d2
move.b (a0)+,d3
move.b (a0)+,d4

eor.l (a1,d1.w),d0
eor.l (a1,d2.w),d0
eor.l (a1,d3.w),d0
eor.l (a1,d4.w),d0

subq.l #1,d7
bne.b .loop4


Wojtek P
Poland

Posts 1597
17 Mar 2011 11:36


Gunnar von Boehn wrote:

  The read is done in 1 cycle.
  And it has a latency of about 13 cycles.

What you mean exactly.
That:
a) cache hit have 1 cycle time and 1 cycle latency or
b) cache hit have 1 cycle time and 13 cycle latency (terrible!)
c) cache MISS have 13 cycle latency+memory latency.
d) cache miss have 13 cycle average latency including memory


Megol .

Posts 671
17 Mar 2011 14:48


Found a interesting algorithm for CRC calulation here: EXTERNAL LINK  It is not directly comparable with the byte-munching variant earlier in this thread as it consumes the input in 64 bit chunks/loop.
Does anybody see errors or optimizations?

Timing assuming 68050+no cache misses, move x, reg; op y, reg considered fused in all cases.
----
.next64
  move.l (a0)+, d0      ; 1 fused
  eor.l  d1, d0          ; 1 
  move.l d0, d1          ; 2 fused
  and.l  #$ff, d1        ; 2 
  move.l d0, d2          ; 3 fused
  lsr.l  #8, d2          ; 3
  and.l  #$ff, d2        ; 4
  move.l tab88(d1*4), d1 ; 5 no stall
  lsr.l #16, d0          ; 6
  move.l d0, d3          ; 7 fused
  and.l #$ff, d3        ; 7
  lsr.l #8, d0          ; 8
  and.l #$ff, d0        ; 9
  move.l tab80(d2*4), d2 ; 10 no stall/fused
  eor.l d1, d2          ; 10
  move.l tab72(d3*4), d1 ; 11 no stall/fused
  eor.l  d2, d1          ; 11
  move.l tab64(d0*4), d2 ; 12 no stall/fused
  eor.l  d1, d2          ; 12
  move.l (a0)+, d0      ; 13
  move.l d0, d1          ; 14 fused
  and.l  #$ff, d1        ; 14
  lsr.l #8, d0          ; 15
  move.l d0, d3          ; 16
  and.l #$ff, d3        ; 16
  lsr.l #8, d0          ; 17
  move.l d0, d6          ; 18 fused
  and.l #$ff, d6        ; 18
  lsr.l #8, d0          ; 19 (no and required)
  move.l tab56(d1*4), d4 ; 20 no stall/fused
  eor.l d2, d4          ; 20
  move.l tab48(d3*4), d5 ; 21 no stall/fused
  eor.l d4, d5          ; 21
  move.l tab40(d6*4), d4 ; 22 no stall/fused
  eor.l d5, d4          ; 22
  move.l tab32(d0*4), d4 ; 23 no stall/fused
  eor.l d4, d1          ; 23
  dbra d7, .next64

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
17 Mar 2011 17:46


Ok gunnar. Here is a compatible loop without bubbles (68050)
   
    ------------------
   

    bra.b .enter
.loop4
    move.l (a1,d4.w*4),d0 ;17
    eor.l  d5,d0  ;17
.enter
    move.b (a0)+,d1 ;1
    eor.b d0,d1  ;2
    move.b (a0)+,d2 ;3
    lsr.l  #8,d0  ;4
   
    move.l (a1,d1.w*4),d5 ;5
    eor.l  d0,d5    ;5
    eor.b d5,d2  ;6
   
    move.b (a0)+,d3 ;7
    lsr.l  #8,d5  ;8
   
    move.l (a1,d2.w*4),d0 ;9
    eor.l  d5,d0  ;9
   
    eor.b d0,d3  ;11
    move.b (a0)+,d4 ;10
    lsr.l  #8,d0  ;12
    move.l (a1,d3.w*4),d5 ;13
    eor.l  d0,d5  ;13
   
    eor.b d5,d4  ;14
    lsr.l  #8,d5  ;15
   
    subq.l #1,d7  ;16
    bne.b .loop4
 

   

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
17 Mar 2011 17:59


Megol . wrote:

  Found a interesting algorithm for CRC calulation here: EXTERNAL LINK 

  ...
    lsr.l #16, d0
 
    should be swap d0
  ...
 
  move.l tab72(d3*4), d1
  should be converted to (a0,d3*4)
  and you need 8 free registers-
 
    dbra d7, .next64  this is 1-2 cycles too.
 
You can change from longword to word shifts and remove and's

lsr.w  #8, d2        ; 3
and.l  #$ff, d2        ; 4 (remove)

same here:

lsr.w #8, d0          ; 8
and.l #$ff, d0        ; 9 (remove)

This will save 2 cycles. I think 2 more ands can be removed farther down in the code too..

Megol .

Posts 671
17 Mar 2011 18:30


S P wrote:

Megol . wrote:

  Found a interesting algorithm for CRC calulation here: EXTERNAL LINK   

  ...
      lsr.l #16, d0
 
      should be swap d0
  ...
 
  move.l tab72(d3*4), d1
  should be converted to (a0,d3*4)
  and you need 8 free registers-
 
      dbra d7, .next64  this is 1-2 cycles too.
 
  You can change from longword to word shifts and remove and's
 
  lsr.w  #8, d2        ; 3
  and.l  #$ff, d2        ; 4 (remove)
 
  same here:
 
  lsr.w #8, d0          ; 8
  and.l #$ff, d0        ; 9 (remove)
 
  This will save 2 cycles. I think 2 more ands can be removed farther down in the code too..

Thank you! I had completely forgotten the shift immediate limitation :/
Now let's hope the code is correctly translated from the original C code ;)

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
17 Mar 2011 18:59


There are more tricks you can do..
  Take a look at this:
 
  I only included half of the loop. It seems to be around 20 or 21 cycles in total. (for 8 bytes)
 

  ------------------------------------------------------------
 
   
  .loop32
 
    move.l (a0)+, d0      ; 1 fused
    eor.l  d4, d0        ; 1 
 
    move.l d0,d1          ; 2 fused
    and.l  #$00ff00ff,d1  ; 2
 
    move.l d0,d2          ; 3 fused
    and.l  #$ff00ff00,d2  ; 3
 
    lsr.l  #8,d2          ; 4
 
    move.l tab88(a0,d1.w*4),d3 ; 5 fused
    eor.l  d4, d3              ; 5
 
    swap  d1                  ; 6
 
    move.l tab80(a1,d2.w*4),d4 ; 7 fused
    eor.l  d3, d4              ; 7
 
    swap  d2                  ; 8
 
    move.l tab72(a2,d1.w*4),d3 ; 9 fused
    eor.l  d4, d3              ; 9
 
    move.l tab80(a3,d2.w*4),d4 ; 10 fused
    eor.l  d3, d4              ; 10
 
    (...)
 
    subq.l #1,d7      ; 11
    bne.b .loop32
 
 


Matt Hey
USA

Posts 726
18 Mar 2011 03:19


Megol . wrote:
 
    move.l d0, d1          ; 2 fused
    and.l  #$ff, d1        ; 2

On N68k, these can also be...

  mvz.b d0,d1

It would only be 1 word instead of 3 on N68k. It's 4 words on 68k which is a pretty nice size optimization. Of course, SP's suggestion may make this unnecessary.

S P wrote:

   

      move.l d0,d1          ; 2 fused
      and.l  #$00ff00ff,d1  ; 2
   
      move.l d0,d2          ; 3 fused
      and.l  #$ff00ff00,d2  ; 3
   

Yea, that's a good trick. I've used it to average the color values of 2 consecutive 16 bit pixels for scaling (creation of mipmaps). It makes a huge difference in performance on the 68060 where using long values can be twice as fast.


Megol .

Posts 671
18 Mar 2011 10:15


S P wrote:

There are more tricks you can do..
    Take a look at this:
   
    I only included half of the loop. It seems to be around 20 or 21 cycles in total. (for 8 bytes)
   

    ------------------------------------------------------------
   
   
    .loop32
   
      move.l (a0)+, d0      ; 1 fused
      eor.l  d4, d0        ; 1 
   
      move.l d0,d1          ; 2 fused
      and.l  #$00ff00ff,d1  ; 2
   
      move.l d0,d2          ; 3 fused
      and.l  #$ff00ff00,d2  ; 3
   
      lsr.l  #8,d2          ; 4
   
      move.l tab88(a0,d1.w*4),d3 ; 5 fused
      eor.l  d4, d3              ; 5
   
      swap  d1                  ; 6
   
      move.l tab80(a1,d2.w*4),d4 ; 7 fused
      eor.l  d3, d4              ; 7
   
      swap  d2                  ; 8
   
      move.l tab72(a2,d1.w*4),d3 ; 9 fused
      eor.l  d4, d3              ; 9
   
      move.l tab80(a3,d2.w*4),d4 ; 10 fused ! STALL, d2 changed less than 2 clocks ago
      eor.l  d3, d4              ; 10
   
      (...)
   
      subq.l #1,d7      ; 11
      bne.b .loop32
   
   


I haven't had time to check your optimizations yet but they look promising! However there are at least one stall in your code, please look above.

(My optimizations was just to try avoid address generation stalls and allow as much fused execution as possible)

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
18 Mar 2011 10:27


Yeah, I was too quick there.. The new N68050 is very powerful. 19 CISC instructions executed in 11 cycles (Without superscalar mode)
   
   
    This loop is without bubbles:
   
   

        bra.b .enter
    .loop32
          move.l (a4,d2.w*4),d4 ; 11
          eor.l  d3, d4              ; 11
    .enter
          move.l (a0)+, d0      ; 1 fused
          eor.l  d4, d0        ; 1 
       
          move.l d0,d1          ; 2 fused
          and.l  #$00ff00ff,d1  ; 2
       
          move.l d0,d2          ; 3 fused
          and.l  #$ff00ff00,d2  ; 3
       
          lsr.l  #8,d2          ; 4
       
          move.l (a1,d1.w*4),d3 ; 5 fused
          eor.l  d4, d3              ; 5
       
          swap  d1                  ; 6
       
          move.l (a2,d2.w*4),d4 ; 7 fused
          eor.l  d3, d4              ; 7
       
          swap  d2                  ; 8
       
          move.l (a3,d1.w*4),d3 ; 9 fused
          eor.l  d4, d3              ; 9
       
          subq.l #1,d7      ; 10
          bne.b .loop32
         
          move.l (a4,d2.w*4),d4       
          eor.l  d3, d4
   


Megol .

Posts 671
18 Mar 2011 15:57


S P wrote:

Yeah, I was too quick there.. The new N68050 is very powerful. 19 CISC instructions executed in 11 cycles (Without superscalar mode)
   
   
    This loop is without bubbles:
   
   

          bra.b .enter
    .loop32
          move.l (a4,d2.w*4),d4 ; 11
          eor.l  d3, d4              ; 11
    .enter
          move.l (a0)+, d0      ; 1 fused
          eor.l  d4, d0        ; 1 
       
          move.l d0,d1          ; 2 fused
          and.l  #$00ff00ff,d1  ; 2
       
          move.l d0,d2          ; 3 fused
          and.l  #$ff00ff00,d2  ; 3
       
          lsr.l  #8,d2          ; 4
       
          move.l (a1,d1.w*4),d3 ; 5 fused
          eor.l  d4, d3              ; 5
       
          swap  d1                  ; 6
       
          move.l (a2,d2.w*4),d4 ; 7 fused
          eor.l  d3, d4              ; 7
       
          swap  d2                  ; 8
       
          move.l (a3,d1.w*4),d3 ; 9 fused
          eor.l  d4, d3              ; 9
       
          subq.l #1,d7      ; 10
          bne.b .loop32
         
          move.l (a4,d2.w*4),d4       
          eor.l  d3, d4
   


Very nice code :)
Yeah the 68050 looks like a processor for us that like asm coding!

Megol .

Posts 671
18 Mar 2011 18:29


For something completely different, here's one example how the extended addressing word could be used:
Standard:
|D|REG|W|SC|1|B|I|BD|0|IS|

Proposed:
|D|REG|W|SC|1|M|A|BD|1|S|Y|

This would allow things like:

add.l {d1,d4*16}, d2 ; d2=d2+{d1+d4*16}
move.l (a0,-d0*4), d0
lea 12(d2,-d3*16), a0

D=Data/Address register selection
REG=register
W=Word/Long selection
SC=Scale
B=Base suppress
I=Index suppress
BD=Base displacement
IS=Index/indirect selection

M=Memory suppress (1=no cache access, passes the calculated data)
  Could be used for other things for the LEA instruction.
A=Address/data selection (for base register)
X=indeX negate (alternatively invert)
S=extra scaling bit

Matt Hey
USA

Posts 726
19 Mar 2011 03:37


Megol . wrote:

  Proposed:
  |D|REG|W|SC|1|M|A|BD|1|S|Y|
 

 
  Should the Y above be X for indeX negate? Y is not in your reference.
 
  Sometimes it is better to work backwards from a memory anchor but I can usually make the index register negative and leave it negative as I work with it. I can't see where I would use negate very often and it is more work for the EA unit to do in series which is potentially slower.
 
  I prefer an encoding more like the standard that would allow a larger scale factor, base register update, and direct modes...
 
  Proposed 2:
  |D|REG|W|SC|1|B|I|BD|S|IS|
 
  S=extra scaling bit
 
  I=0 IS=%100 Direct (addressing) mode with index
  I=1 IS=%100 Direct (addressing) mode no index
  I=1 IS=%101 Base register update *
 
  * The base register can't be suppressed for base register update. The index suppress bit (I) must be 1 for this mode which would normally indicate to always suppress the index register but now the base register suppress bit (B) indicates whether the index register is suppressed for this mode only.
 
  I don't think the direct mode would be very easy to use because of the change/use delay (although 2 cycle delay like 68060 would help tremendously and should be possible on N68070). This would mostly limit the use by a compiler except as an LEA for a data register. It would allow more optimization for a few hand optimized assembler routines. The best part of the 3 new addressing enhancements is that they are almost free to add (little slowdown).
 
  I think a new optional N68k syntax would make programming easier for beginners and converts from other processors. I suggest...
 
  1) Lower case instructions
  Example: move.l a0,d0
  Reason: easier to read, no more SHOUTING, and more modern look
 
  2) Allow a '+' sign instead of ',' in addressing modes.
  Example: move.l (1234+a0+d0*4)
  Reason: more understandable, better representation
 
  3) Allow "<<" and a shift size as well as '*' for index scale
  Example: move.l (1234+a0+d0<<7)
  Reason: <<7 might be easier than *128 for some people/applications
 
  The older representations would of course be allowed too but something along these lines could be used for official documentation. The new style Motorola 68k syntax that primarily changed 1234(a0) to (1234,a0) is a nice improvement.
 

Team Chaos Leader
USA
(Moderator)
Posts 2094
19 Mar 2011 05:09


Matt Hey wrote:

  2) Allow a '+' sign instead of ',' in addressing modes.
    Example: move.l (1234+a0+d0*4)
    Reason: more understandable, better representation

GreatIdea*16!

Cesare Di Mauro
Italy

Posts 526
19 Mar 2011 08:17


Matt Hey wrote:

Megol . wrote:

    Proposed:
    |D|REG|W|SC|1|M|A|BD|1|S|Y|
 

Sometimes it is better to work backwards from a memory anchor but I can usually make the index register negative and leave it negative as I work with it. I can't see where I would use negate very often and it is more work for the EA unit to do in series which is potentially slower.

I agree. It's quite uncommon.
I prefer an encoding more like the standard that would allow a larger scale factor, base register update, and direct modes...
 
    Proposed 2:
    |D|REG|W|SC|1|B|I|BD|S|IS|
 
    S=extra scaling bit
 
  I=0 IS=%100 Direct (addressing) mode with index
  I=1 IS=%100 Direct (addressing) mode no index
  I=1 IS=%101 Base register update *
 
  * The base register can't be suppressed for base register update. The index suppress bit (I) must be 1 for this mode which would normally indicate to always suppress the index register but now the base register suppress bit (B) indicates whether the index register is suppressed for this mode only.
 
  I don't think the direct mode would be very easy to use because of the change/use delay (although 2 cycle delay like 68060 would help tremendously and should be possible on N68070). This would mostly limit the use by a compiler except as an LEA for a data register. It would allow more optimization for a few hand optimized assembler routines. The best part of the 3 new addressing enhancements is that they are almost free to add (little slowdown).

What does it mean "direct mode"?
I think a new optional N68k syntax would make programming easier for beginners and converts from other processors. I suggest...
 
  1) Lower case instructions
    Example: move.l a0,d0
    Reason: easier to read, no more SHOUTING, and more modern look

I agree. Lower case is my preferred way to write code.
2) Allow a '+' sign instead of ',' in addressing modes.
    Example: move.l (1234+a0+d0*4)
    Reason: more understandable, better representation

Not for a 68000-addicted. :D

Anyway, for novel people you are right.
3) Allow "<<" and a shift size as well as '*' for index scale
    Example: move.l (1234+a0+d0<<7)
    Reason: <<7 might be easier than *128 for some people/applications

I prefer multiplication, which is easier to read.

For me:
* 4 -> longword.
<< 2 -> ? Not so immediate to understand.
The older representations would of course be allowed too but something along these lines could be used for official documentation. The new style Motorola 68k syntax that primarily changed 1234(a0) to (1234,a0) is a nice improvement.

I agree.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
19 Mar 2011 08:58


I'm a little bit afraid of making the EA to complex.

If we make the calcuation in the EA more complex,
then the timing weill get worse and we will reduce our maximum clockrate.

Regarding using DN as base:
This is not good IMHO because of two reasons.
We would then need another new DN Read-Port: This drives Chip-costs higher.
We would also create a new hazard point.
Instructions working with the DN register in the two previous cycle would create a bubble. The correct recognise these hazards we will need to add extra tacking logic.
Also we can NOT update Dn Base registers as the EA does not have a write Port to the Dn Registers.
These are a lot drawbacks...

Regarding the extra Scale modes.
Adding an extra Bit for the scale mode looks tempting from the software point of few. But it will double our MUX area needed to do the SHIFT in the EA-Unit. This looks bad from a HW designers point of view because its costly and could reduce our clockrate.
I'll have to measure this to see how much room we have here before we looks performance...


posts 435page  << 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22