
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



Do you have ideas and feature wishes? Post them here and discuss your ideas.

N68k Enhancements Revisited
Matt Hey
USA

Posts 733
14 Mar 2011 15:19


Phil G. wrote:

Team Chaos Leader wrote:

 
Jostein Aarbakk wrote:

    1) EOR instruction - More addressing modes
 

  I need this too.
 

  So do I. Not related to speed, but to register use. Sometimes you just don't have a temporary.
 

Are you guys using EOR that much? Personally, I don't use it much but I could see it used frequently for particular uses. Adding EOR.size EA,Dn would need to be a word encoding to be worthwhile and word encodings do take up a lot of encoding space.


Team Chaos Leader
USA
(Moderator)
Posts 2094
14 Mar 2011 15:39


Matt Hey wrote:

Phil G. wrote:

 
Team Chaos Leader wrote:

 
Jostein Aarbakk wrote:

    1) EOR instruction - More addressing modes
   

    I need this too.
 

  So do I. Not related to speed, but to register use. Sometimes you just don't have a temporary.
 

 
  Are you guys using EOR that much? Personally, I don't use it much but I could see it used frequently for particular uses. Adding EOR.size EA,Dn would need to be a word encoding to be worthwhile and word encodings do take up a lot of encoding space.
 

I use it in the deep inner loop of my encryption and decryption routines. It needs to be as fast as possible.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
14 Mar 2011 16:56


Team Chaos Leader wrote:

I use it in the deep inner loop of my encryption and decryption routines. It needs to be as fast as possible.

Would you mind posting the workloop for us as reference?

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
14 Mar 2011 18:49


Wojtek P wrote:

  Can MOVEM be improved to read 4 registers or store 2 registers per cycle?
  That would make quite a big difference.
 

I agree, this would be a real improvement, given that most subroutines start with saving the register set on the stack.

Greetings,
Thomas


Cesare Di Mauro
Italy

Posts 526
14 Mar 2011 19:33


Team Chaos Leader wrote:

Matt Hey wrote:
Are you guys using EOR that much? Personally, I don't use it much but I could see it used frequently for particular uses. Adding EOR.size EA,Dn would need to be a word encoding to be worthwhile and word encodings do take up a lot of encoding space.

I use it in the deep inner loop of my encryption and decryption routines. It needs to be as fast as possible.

You don't need an enhanced EOR for this, since this instruction is rarely used in common code.

You need a SIMD unit with binary operations support, so your loop will be A LOT faster.

Megol .

Posts 675
14 Mar 2011 19:40


Cesare Di Mauro wrote:

Team Chaos Leader wrote:

 
Matt Hey wrote:
Are you guys using EOR that much? Personally, I don't use it much but I could see it used frequently for particular uses. Adding EOR.size EA,Dn would need to be a word encoding to be worthwhile and word encodings do take up a lot of encoding space.

  I use it in the deep inner loop of my encryption and decryption routines. It needs to be as fast as possible.

  You don't need an enhanced EOR for this, since this instruction is rarely used in common code.
 
  You need a SIMD unit with binary operations support, so your loop will be A LOT faster.

Depends on whether he uses a real encryption algorithm or just some "xor with something" algorithm; the former would need SIMD support for either scatter/gather or register-wide lookups.
SIMD acceleration for Rijndael (AKA AES) or IDEA would be nice... ;)

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
14 Mar 2011 22:04


Shifts and swaps are commonly used when working with fixed points.

I understand that the current fusion design will let the following 4 instructions execute in 2 cycles on the N050, and in 1 cycle on the superscalar N070.

move.l d0,d1
swap d1  ;useful for 16:16 fixed point

move.l d2,d3
lsr.l #8,d3  ;useful for 8:8 fixed point

How about these:

add.l d0,d1
swap d1

or.l d0,d1
swap d1

sub.l d0,d1
swap d1

muls.l d0,d1
swap d1

eor.l d0,d1
swap d1

neg.l d1
swap d1

2 reads and 1 write..

etc..

Here are two common merge passes from Chunky2Planar conversion routines. The merges can also be used when combining 2 buffers with shadetables.

input:

d0=ABCD
d2=abcd

output:

d0=AaCc
d2=BbDd

  move.l d2,d3 ;swap 8 (d2,d4)
  lsr.l #8,d3
  eor.l d4,d3
  and.l d7,d3 ;d7=0x00ff00ff
  eor.l d3,d4
  lsl.l #8,d3
  eor.l d3,d2

input:

d0=AABB
d2=aabb

output:

d0=AAaa
d2=BBbb

  move.l d2,d4  ;swap 16 (d0Xd2)
  move.w d0,d4
  swap d4
  move.w d4,d0
  move.w d2,d4

Most Amiga programs which contain chunky graphics will use a variation of these merges.
So I suggest a new instruction:

c2pmerge d0,d2,#n  (n=1,2,4,8,16)

The new N050 can interpret 128 bits per clock.

That means that it can look at 8 (16-bit) instructions and search for a pattern per clock.

The first c2pmerge is 7 Instructions(112 bit wide)
The second c2pmerge is 5 Instructions(80 bit wide)

Can they be fused?
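
For readers who want to check what the two merge passes in this post compute, here is a minimal Python sketch. The function names `merge8` and `merge16` are made up for illustration, and the register mapping is an assumption read off the listings (`hi` plays the role of d4 in the first listing and d0 in the second, `lo` plays the role of d2). It only models the data movement; it says nothing about fusion or timing.

```python
MASK32 = 0xFFFFFFFF

def merge8(hi, lo):
    """8-bit merge pass (hypothetical helper, not from the thread).

    Mirrors move/lsr/eor/and/eor/lsl/eor from the first listing:
    t is d3, the constant 0x00FF00FF is d7.
    hi=ABCD, lo=abcd  ->  hi=AaCc, lo=BbDd
    """
    t = ((lo >> 8) ^ hi) & 0x00FF00FF   # bytes that have to change places
    hi ^= t                             # put lo's bytes into hi
    lo ^= (t << 8) & MASK32             # put hi's bytes into lo
    return hi, lo

def merge16(hi, lo):
    """16-bit merge pass (hypothetical helper, not from the thread).

    Same result as the move.w/swap sequence in the second listing:
    hi=AABB, lo=aabb  ->  hi=AAaa, lo=BBbb
    """
    new_hi = (hi & 0xFFFF0000) | (lo >> 16)
    new_lo = ((hi & 0x0000FFFF) << 16) | (lo & 0x0000FFFF)
    return new_hi, new_lo

if __name__ == "__main__":
    hi, lo = merge8(0x41424344, 0x61626364)      # "ABCD", "abcd"
    print(f"{hi:08X} {lo:08X}")
    hi, lo = merge16(0x11112222, 0x33334444)
    print(f"{hi:08X} {lo:08X}")
```

Feeding the ASCII codes of "ABCD" and "abcd" through `merge8` makes it easy to read the byte order of the result directly.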


Jostein Aarbakk
Norway

Posts 7
14 Mar 2011 23:05


Megol . wrote:

EOR should be handled by fusion already IIRC?

Matt Hey wrote:

The N68050 should be able to fuse the 2 instructions into 1 so they take just 1 cycle.
...
The code will still be 2 words which might be slightly slower in rare cases but eor isn't that common and encoding space is valuable.

Fusion sounds like a good idea.
If there isn't enough space for adding these new addressing modes to the EOR-instructions, I agree that other optimizations might be more important.

However, if enough space is available, and EOR gets new addressing modes, I could speed up my program even further, because the fusion could be made with (perhaps) the next instruction in my program.

Megol . wrote:

Not overwriting registers is (again IIRC) handled by instruction fusion

Matt Hey wrote:

We have been discussing how valuable and needed 3 op instructions are in the 68k.
...but it's not that much of a priority because code fusion is so good and speeds up existing 68k code.

Ok. To me it sounds like fusion is a good replacement for 3 OP instructions.

Megol . wrote:

Lookup tables would be "free" in the N68 design as long as it hits cache. Complicating the cache access path for this special case would slow down general code execution.

Matt Hey wrote:

There would be no locking method but if you are not using too much data then the data will stay in the cache.

Ok. However, in my case, there are 2 parts of the program that I want to make use of the cache, but 1 of them (the 1st) will probably "mess up" the cache for the other one:
1) The outer loop, which loops through each byte in a file (which can be large).
2) The inner loop. For each byte in the outer loop, it uses the byte to fetch a longword from the correct position in the lookup table.

I guess the problem with the automatic cache in this case, is that the outer loop after a while will fill the cache, thus removing the "lookup table" from the cache.
It would be great if the CPU could manage to cache the data read in the outer loop without removing the "lookup table" from the cache.

Matt Hey wrote:

A series of TST.L (An) in a loop with An pointing to the start of each cache line (16 bytes) in the table will work to force data caching.

Good idea. I have to test this out.



Jostein Aarbakk
Norway

Posts 7
14 Mar 2011 23:16


Gunnar von Boehn wrote:

Caches are going to be a lot bigger than on classic systems.
So far we have tested caches ranging from 4 KB to 128 KB.

Sounds very good indeed.

Gunnar von Boehn wrote:
 
One question: You said you precalculate values for the CRC.
How does this algorithm work?
What are you precalculating in detail, how many instructions would it take to calculate this on the fly?

Calculating on the fly is not an option, because it requires a series of bitwise shifts plus (in the worst case) an EOR for each bit shifted.
See the documented example code below (it uses a precalculated lookup table).
Calculating "on the fly" could be done by something similar to the "Calc_CRC_longword" subroutine below.

Start:
  Bsr.s Calc_CRC_table  ; Create the CRC lookup table

  ; Calculate CRC checksum of our string.
  Lea TestString(PC),a0 ; String to check
  Lea CRC_Table(PC),a1 ; CRC lookup table
  Moveq.l #-1,d0  ; Initialize result value
  Moveq.l #0,d2
.ByteLP Move.b (a0)+,d2
  Beq.s .End
      Eor.b d0,d2
  Lsr.l        #8,d0
      Move.l (a1,d2.w*4),d3
      Eor.l        d3,d0
  Bra.s .ByteLP
.End Not.l d0
  Rts

; Precalculates 256 LONGWORDs (for values 0-255), and stores them in memory.
Calc_CRC_table:
  Move.l #255,d6
  Moveq.l #0,d1
  Lea CRC_Table(PC),a1
.Loop Movem.l d1/d6/a1,-(sp)
  Bsr.s Calc_CRC_longword
  Movem.l (sp)+,d1/d6/a1
  Move.l d0,(a1)+
  Addq.l #1,d1
  Dbf d6,.Loop
  Rts

; Precalculates 1 LONGWORD.
; For every 8 bits in the source BYTE, the loop does this:
;  1) Shifts source BYTE 1 bit right
;  2) Shifts result LONGWORD 1 bit right
;  3) If ONLY 1 of the 2 outshifted bits is 1,
;    EOR the result (d0) with the QUOTIENT.
;
; INPUT PARAMETER:
;    d1=Source BYTE (the byte to precalculate a code for).
;    You can change this between 0 and 255.
;      Test values (Source leads to this result in d0):
;    0 = $00000000
;    1 = $77073096
;    2 = $ee0e612c
;    3 = $990951ba
;    4 = $076dc419
;
Calc_CRC_longword:
  Move.l #$EDB88320,d5 ; QUOTIENT
  Moveq.l #0,d0 ; Result LONGWORD. Don't change this.

  Moveq #7,d7
.Loop Lsr.b        #1,d1
  Scs.b d2
  Lsr.l        #1,d0
  Scs.b d3
  Eor.b d2,d3
  Extb.l d3
  And.l d5,d3
  Eor.l        d3,d0
  Dbf          d7,.Loop
  Rts

CRC_Table: ds.l 256

TestString: dc.b "AB",0  ; NULL-terminated string.
;Testresults: "A"=D3D99E8B, "AB"=30694C07
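
As a cross-check of the listing above, here is a minimal Python sketch of the same table-driven CRC. The function names are illustrative only; `calc_crc_longword` follows the bit loop of the `Calc_CRC_longword` subroutine, and `crc32_bytes` follows the `.ByteLP` loop, except that it takes a bytes object instead of stopping at a NUL byte. The comparison with Python's `zlib.crc32` is an addition for verification, not something the original post relies on.

```python
import zlib

QUOTIENT = 0xEDB88320  # same constant as in the assembly listing

def calc_crc_longword(source):
    """Table entry for one source byte (mirrors Calc_CRC_longword)."""
    result = 0
    for _ in range(8):
        src_bit = source & 1      # bit shifted out of the source byte
        source >>= 1
        res_bit = result & 1      # bit shifted out of the result longword
        result >>= 1
        if src_bit ^ res_bit:     # only one of the two bits was set
            result ^= QUOTIENT
    return result

# 256 precalculated longwords, as built by Calc_CRC_table
CRC_TABLE = [calc_crc_longword(b) for b in range(256)]

def crc32_bytes(data):
    """Byte loop of the listing: eor.b, lsr.l #8, table lookup, eor.l."""
    crc = 0xFFFFFFFF                      # moveq #-1,d0
    for byte in data:
        index = (crc ^ byte) & 0xFF       # eor.b d0,d2
        crc = (crc >> 8) ^ CRC_TABLE[index]
    return crc ^ 0xFFFFFFFF               # not.l d0

if __name__ == "__main__":
    for text in (b"A", b"AB"):
        print(text, f"{crc32_bytes(text):08X}", f"{zlib.crc32(text):08X}")
```

The first five table entries and the two test strings can be compared against the values given in the comments of the assembly source.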


Cesare Di Mauro
Italy

Posts 526
15 Mar 2011 06:24


Jostein Aarbakk wrote:
  Fusion sounds like a good idea.
  If there isn't enough space for adding these new addressing modes to the EOR-instructions, I agree that other optimizations might be more important.

There's no room to add them in a 16-bit opcode. You would need at least a 32-bit one, so opcode fusion is the better option.

Jostein Aarbakk wrote:
  Ok. However, in my case, there are 2 parts of the program that I want to make use of the cache, but 1 of them (the 1st) will probably "mess up" the cache for the other one:
  1) The outer loop, which loops through each byte in a file (which can be large).
  2) The inner loop. For each byte in the outer loop, it uses the byte to fetch a longword from the correct position in the lookup table.
 
  I guess the problem with the automatic cache in this case, is that the outer loop after a while will fill the cache, thus removing the "lookup table" from the cache.
  It would be great if the CPU could manage to cache the data read in the outer loop without removing the "lookup table" from the cache.

Don't worry: the cache is designed to fit your needs.

The lookup table's cache lines will not be evicted, since the whole LUT needs only 1 KB of space (consider, however, aligning it properly on a 16- or 32-byte boundary, depending on the CPU).

If your outer loop works with a file, it will usually read (or write) data through a small buffer (128 bytes is a common value), so the buffer will fit entirely in a few cache lines.

The same applies if the OS caches file blocks (AmigaOS does not, if I remember correctly), since blocks are usually small (512 bytes for AmigaOS, 1 KB for old Unixes, 4 KB for modern OS filesystems).

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
15 Mar 2011 09:12


Hi SP,
 
 
S P wrote:

  I understand that the current fusion design will let the following 4 instructions execute in 2 cycles on the N050, and in 1 cycle on the superscalar N070.
 
 
 

  move.l d0,d1
  swap d1  ;useful for 16:16 fixed point
 
 
  move.l d2,d3
  lsr.l #8,d3  ;useful for 8:8 fixed point
 

 

  Yes, the above 2 examples can be fused.
  Fusing works under the following conditions:
  1) 2 sources, 1 destination, 1 ALU operation.
  Technically, this means you can fuse a move together with another instruction.
 
 
 
 
S P wrote:

  How about these:
 
  add.l d0,d1
  swap d1
 

 
  2 ALU operations; these cannot be fused.
 
 
S P wrote:

  or.l d0,d1
  swap d1
 
  sub.l d0,d1
  swap d1
 
  muls.l d0,d1
  swap d1
 
  eor.l d0,d1
  swap d1
 
  neg.l d1
  swap d1
 

 
  None of them can be fused.
 
  Fusing requires that the combined operation of BOTH instructions already exists as an ALU operation in the CPU.
 
 
 
 
S P wrote:

  Here are two common merge passes from Chunky2Planar conversion routines. The merges can also be used when combining 2 buffers with shadetables.
 
  input:
 
  d0=ABCD
  d2=abcd
 
  output:
 
  d0=AaCc
  d2=BbDd
 
  1  move.l d2,d3 ;swap 8 (d2,d4)
  2  lsr.l #8,d3
  3  eor.l d4,d3
  4  and.l d7,d3 ;d7=0x00ff00ff
  5  eor.l d3,d4
  6  lsl.l #8,d3
  7  eor.l d3,d2
 

  Only 1 and 2 can be fused.
 
 
 
S P wrote:

  input:
 
  d0=AABB
  d2=aabb
 
  output:
 
  d0=AAaa
  d2=BBbb
 
  1  move.l d2,d4  ;swap 16 (d0Xd2)
  2  move.w d0,d4
  3  swap d4
  4  move.w d4,d0
  5  move.w d2,d4
 

  1 and 2 could maybe be fused.
  I'll have to doublecheck this one.
 
 
 
S P wrote:

  Most Amiga programs which contain chunky graphics will use a variation of these merges.
  So I suggest a new instruction
 
  c2pmerge d0,d2,#n  (n=1,2,4,8,16)
 
 
  The new N050 can interpret 128 bits per clock.
 
  That means that it can look at 8 (16-bit) instructions and search for a pattern per clock.
 

  It means that the CPU can execute UP TO 128 bits per clock.
 
 
S P wrote:

  The first c2pmerge is 7 Instructions(112 bit wide)
  The second c2pmerge is 5 Instructions(80 bit wide)
 

 
  In these examples fusing will improve performance only a little.
  But luckily C2P routines are no longer needed at all on NATAMI, as you now have native chunky support.
  This means that leaving these routines OUT will accelerate such programs a LOT on NATAMI. :-D

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
15 Mar 2011 13:12


Hi,
 

OK, let's look at your code and analyse it.
1) First, let's look at it optimistically.

The MOVE and the EOR can not be fused as the combination does not use the same DST.
 
 
  But a superscalar CPU could forward the result to the second ALU.
  Here is the number of cycles the workloop will take.
  By slightly changing the code you can remove one instruction and get down from 7 to 6 instructions for the inner loop.
 
 

                                         68050            68070
  ByteLP:
    Move.b  (a0)+,d2                      1                1
    Beq.s   End                           2                1
    Eor.b   d0,d2                         3                2
    Lsr.l   #8,d0                         4                2
    Move.l  (a1,d2.w*4),d3                5                3
    Eor.l   d3,d0                         6                3 *Forwarding
    Bra.s   ByteLP                        7                3
 

 
 
 

    bra.b Start
  ByteLP:
    Eor.b   d0,d2                         1                1
    Lsr.l   #8,d0                         2                1
    Move.l  (a1,d2.w*4),d3                3                2
    Eor.l   d3,d0                         4                2 *Forwarding
  Start:
    Move.b  (a0)+,d2                      5                3
    Bne.s   ByteLP                        6                3
 

In this special case having an EOR ea,Dn would be beneficial, as you could then write it like this:


    bra.b Start
  ByteLP:
    Eor.b   d0,d2                         1                1
    Lsr.l   #8,d0                         2                1
    EOR.l   (a1,d2.w*4),d0                3                2
  Start:
    Move.b  (a0)+,d2                      4                2
    Bne.s   ByteLP                        5                2
 

This would get the workloop down from 3 cycles to 2 cycles for the 070.  This would be rather swift then.

2) Now let's look at the code realistically.
 


    bra.b Start
  ByteLP:
    Eor.b   d0,d2                         1                1
    Lsr.l   #8,d0                         2                1

    -- ALU to EA flow usage penalty of 1/2 cycles!!
    Move.l  (a1,d2.w*4),d3                3,4              2,3,4

    Eor.l   d3,d0                         5                2,3,4 *Forwarding
  Start:
    Move.b  (a0)+,d2                      6                5
    Bne.s   ByteLP                        7                5
 


 

The table lookup using an index will lose performance on the 68050, 68060 and 68070. You could try to rework the code to get the D2 update and the D2 usage further apart from each other.

Matt Hey
USA

Posts 733
15 Mar 2011 14:36


@Gunnar
    Don't forget we have ColdFire MVZ and MVS now!
     

      ByteLP:
        Mvz.b  (a0)+,d2
        Beq.s  End
        Eor.l  d0,d2
        Lsr.l  #8,d0
        Move.l (a1,d2.l*4),d3
        Eor.l  d3,d0
        Bra.s  ByteLP
     

   
    It looks like an immediate can be extended for free in a code fusion but can a register be extended for free also?
   
      moveq #7,d0 ;free extb.l
      eor.l d0,d1
   
    so is this possible too...
   
      mvz.b (a0)+,d0 ;is this zero extension free also?
      eor.l  d0,d1
   
    and can something like this be fused even though the first move is not long and there are 3 instructions...
   
      move.b (a0)+,d0
      extb.l d0
      eor.l  d0,d1
   
    It also looks like code fusion and result forwarding is still possible for multiple sizes if the first instruction is a long. Is this so?
   
      move.l (a0),d0
      and.w  #$f0f0,d0 ;upper 16 bits do not need to be reread
   
    The source is long, data is anded to the long source and output is long. Even this would be great. I understand that starting with a word or byte won't work as you explained. We can use MVS and MVZ to start with a long if we can just do word and byte operations after. Is code fusion possible for this...
   
      mvz.b (a0)+,d0
      swap d0
   

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
15 Mar 2011 15:53


Matt Hey wrote:

@Gunnar
    Don't forget we have ColdFire MVZ and MVS now!
     

      ByteLP:
        Mvz.b  (a0)+,d2
        Beq.s  End
        Eor.l  d0,d2
        Lsr.l  #8,d0
        Move.l (a1,d2.l*4),d3
        Eor.l  d3,d0
        Bra.s  ByteLP
     

   

Where would be the benefit of this change?

Matt Hey wrote:

    It looks like an immediate can be extended for free in a code fusion but can a register be extended for free also?
   
        moveq #7,d0 ;free extb.l
        eor.l d0,d1

Sorry this does not work.
The requirement for fusing is that the destinations of BOTH instructions are the same. This means only one destination has to be updated by both instructions.
Your example will update two registers - Therefore it can't be fused.

Cheers

Megol .

Posts 675
15 Mar 2011 16:10


But MVZ isn't useful for this routine, right?

Here is my simple attempt to optimize it a bit by making the move and eor potentially fusable:

  move.b (a0)+,d2
  beq.s .end
.nextbyte
  eor.b d0, d2
  lsr.l #8, d0
  move.l (a1,d2.w*4), d3
  eor.l d0, d3          ; d3 replaces d0
  move.b (a0)+, d2
  beq.s .end2

  eor.b d3, d2
  lsr.l #8, d3
  move.l (a1, d2.w*4), d0
  eor.l d3, d0          ; d0 is now the crc again
  move.b (a0)+, d2
  bne.s .nextbyte
.end
  move.l d0, d3         ; copy the crc so both exits leave it in d3
.end2
  ; here d3 is the crc value...


Matt Hey
USA

Posts 733
15 Mar 2011 17:23


Gunnar von Boehn wrote:

   
Matt Hey wrote:

      @Gunnar
          Don't forget we have ColdFire MVZ and MVS now!
           

            ByteLP:
              Mvz.b  (a0)+,d2
              Beq.s  End
              Eor.l  d0,d2
              Lsr.l  #8,d0
              Move.l (a1,d2.l*4),d3
              Eor.l  d3,d0
              Bra.s  ByteLP
           

     
   
     
      Where would be the benefit of this change?
     

   
    At least on the 68060, move.l (a1,d2.l*4),d3 instead of move.l (a1,d2.w*4),d3 would save a cycle. I believe there is a penalty here also on 68020/68030 and maybe 68040. Unfortunately, the 68060 doesn't have the mvz.b instruction but it would have only saved an instruction before the loop and it's better to use long instructions if possible ;). For the 68020+, a moveq #0,d2 before the loop and using move.l (a1,d2.l*4) should save at least a cycle per iteration of the loop after the first...

     


      Lea TestString(PC),a0 ; String to check
      Lea CRC_Table(PC),a1 ; CRC lookup table
      Moveq.l #0,d2
      Moveq.l #-1,d0  ; Initialize result value
    .ByteLP Move.b (a0)+,d2
      Beq.s .End
      Eor.b d0,d2
      Lsr.l #8,d0
      Move.l (a1,d2.l*4),d3 ;this is better than (a1,d2.w*4)
      Eor.l  d3,d0
      Bra.s .ByteLP
    .End Not.l d0
      Rts
     

    Will the N68k be able to sign extend and do all shift sizes in the EA without penalty?
   
   
Gunnar von Boehn wrote:
 
     
Matt Hey wrote:

          It looks like an immediate can be extended for free in a code fusion but can a register be extended for free also?
         
            moveq #7,d0 ;free extb.l
            eor.l d0,d1
     

     
      Sorry this does not work.
      The requirement for fusing is that the destinations of BOTH instructions are the same. This means only one destination has to be updated by both instructions.
      Your example will update two registers - Therefore it can't be fused.
   

   
    Doh! I should be more careful. I should have put...
   
            moveq #7,d0 ;free extb.l
            eor.l d1,d0


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
15 Mar 2011 17:43


Matt Hey wrote:

Will the N68k be able to sign extend and do all shift sizes in the EA without penalty?

 
Yes - the 68050 can already do EVERY EA for free.
(zero extra cycles!)

Exceptions are, of course, the memory indirect modes...

This means the timing on the 68050 looks like this: 


Move.l D0,d3                    -- 1 cycle
Move.l (a1),d3                  -- 1 cycle
Move.l (a1)+,d3                  -- 1 cycle
Move.l (a1,d2.l),d3              -- 1 cycle
Move.l (a1,d2.w*8),d3            -- 1 cycle
Move.l (12345678,a1,d2.w*8),d3  -- 1 cycle

What is a problem is:


Eor.l  d0,d2              -- Register write to D2 in ALU
Move.l (a1,d2.l*4),d3      -- Register usage of D2 in EA Unit

The two ALUs (EA-Unit and main ALU) are below each other in the pipeline. You need to have 3 instructions between an instruction which updates a Data-register in the ALU and an Instruction which needs this Data-Register in the EA-Unit.

BTW, this is conceptual: it is how the ideal 68K core works.
The 68060 behaves the same way; this is quite well explained in the 060 manual.

   
Matt Hey wrote:

  I should be more careful. I should have put...
  moveq #7,d0 ;free extb.l
  eor.l d1,d0 

Yes this is fuseable!

MVS is currently handled by the 68050 like a fused MOVE and EXT instruction. This means the extension is done in the ALU.
Which means MVS can NOT be fused atm with another instruction.

MOVEQ is extended earlier in the pipeline, as the immediate value is already available during decoding: the decoder does the extending in the very first pipeline stage.

Cheers

Matt Hey
USA

Posts 733
15 Mar 2011 18:05


Gunnar von Boehn wrote:

   
Matt Hey wrote:

      Will the N68k be able to sign extend and do all shift sizes in the EA without penalty?
     

       
      Yes - the 68050 can already do EVERY EA for free.
      (zero extra cycles!)
   

   
    Sweet!
   
   
Gunnar von Boehn wrote:
 
      What is a problem is:
     

      Eor.l  d0,d2              -- Register write to D2 in ALU
      Move.l (a1,d2.l*4),d3      -- Register usage of D2 in EA Unit
     

     
      The two ALUs (EA-Unit and main ALU) are below each other in the pipeline. You need to have 3 instructions between an instruction which updates a Data-register in the ALU and an Instruction which needs this Data-Register in the EA-Unit.
     
      BTW, this is conceptual: it is how the ideal 68K core works.
      The 68060 behaves the same way; this is quite well explained in the 060 manual.
   

    Jostein already does a better job of instruction scheduling than all 68k compilers I've seen 8-/.

    The 68060 change/use stall is only 2 cycles in some cases. Is this because it's superscalar or because of register result forwarding? Possible on N68070 at least?
   
    "The OEP does not experience any sequence-related pipeline stalls. The most common example of this type of stall is a change/use register stall. This type of stall results from a register being modified by an instruction and a subsequent instruction generating an address using the previously modified register. The second instruction must stall in the OEP until the register is actually updated by the previous instruction. For example:
 
    muls.l #<data>,d0
    move.l (a0,d0.l*4),d1
 
    In this sequence, the second instruction is held for 2 clock cycles stalling for the first instruction to complete the update of the d0 register. If consecutive instructions load a register and then use that register as the base for an address calculation (An), a 2-clock-cycle wait may be incurred. This represents the maximum change/use penalty for a base register. The maximum change/use penalty for an index register (Xi) is 3 clock cycles (for Xi.l*2, Xi.l*8, and Xi.w). The change/use penalty for an index register if Xi.l*1 or Xi.l*4 is 2 clock cycles. Certain instructions have been optimized to ensure no change/use stall occurs on subsequent instructions. The destination register of the following instructions is available for subsequent instructions:
 
    lea
    mov.l &imm,Rn
    movq
    clr.l Dn,
    any op (An)+
    any op -(An)
 
    as a base register for address calculation with no stall, or as an index register for address calculation with no stall, if Xi.l*{1,4}. If the index register used is Xi.l*2, Xi.l*8, or Xi.w, then the previously described 3 cycle stall occurs."
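
The penalties in the manual passage quoted above can be tabulated in a few lines of Python. This is only a sketch of the quoted text: the function name and its parameters are made up here, it returns the maximum stall the passage states, and it does not model the listed exceptions (lea, movq, clr.l and so on) or anything about the N68050/N68070.

```python
def change_use_stall(role, size="l", scale=1):
    """Maximum 68060 change/use stall in clock cycles, per the quoted text.

    role  -- "base" for An used as base register, "index" for Xi
    size  -- "l" or "w" (index register size; ignored for a base register)
    scale -- 1, 2, 4 or 8 (index scale factor; ignored for a base register)
    """
    if role == "base":
        return 2                      # maximum penalty for a base register
    if role == "index":
        if size == "l" and scale in (1, 4):
            return 2                  # Xi.l*1 and Xi.l*4
        return 3                      # Xi.l*2, Xi.l*8 and any Xi.w
    raise ValueError("role must be 'base' or 'index'")

if __name__ == "__main__":
    # The two index forms compared earlier in the thread:
    print("d2.w*4:", change_use_stall("index", "w", 4))
    print("d2.l*4:", change_use_stall("index", "l", 4))
```

Under these rules the (a1,d2.l*4) form has a one-cycle smaller maximum stall than (a1,d2.w*4), which matches the suggestion to prefer the long index.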
 

Megol .

Posts 675
15 Mar 2011 18:14


Gunnar von Boehn wrote:

(stuff removed)
 
  MVS is currently handled by the 68050 like a fused MOVE and EXT instruction. This means the extension is done in the ALU.
  Which means MVS can NOT be fused atm with another instruction.
 
  MOVEQ is extended earlier in the pipeline, as the immediate value is already available during decoding: the decoder does the extending in the very first pipeline stage.
 
  Cheers

I understand why MVS requires the execution stage but does the same apply to MVZ too?

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
15 Mar 2011 18:15


Matt Hey wrote:

  lea
  mov.l &imm,Rn
  movq
  clr.l Dn,
  any op (An)+
  any op -(An)

The manual is written in a complicated way.
What they are saying is that they do not need to use the lower ALU for executing those instructions.

This is basically what I said before, in different words.
