Optimizing for N050
Matt Hey
USA

Posts 204
26 Feb 2010 00:03


Is there any preliminary timing info or optimization tips for N050 instructions or addressing modes yet?
   
    Maybe the following could be filled out...
   
    PROC     CACHE      RAdd  MAdd  Mul  Index  Bcc
    68000    0/0        6     18    40   18     10/6
    68020    256/0      2     6     28   9      6/4
    68030    256/256    2     5     28   8      6/4
    CPU32    0/0        2     9     26   12     8/4
    68040    4K/4K      1     1     16   3      2/3
    68060    8K/8K      1     1     2    1      0/1
    N050
   
    RAdd: Register to register 32 bit add (add.l  d0,d1).
    MAdd: Absolute long address to register add (add.l _mem,d1).
      Mul: 16x16 multiplication (max. time) (mulu.w d0,d1).
    Index: Indexed addressing mode (move.l 2(a0,d0),d1).
      Bcc: Byte conditional branch taken/not taken (bne.b label)
   
    And maybe add N050 answers for these questions...
   
    Operations with long immediate values between -128 and 127:
   
      A:  add.l  #20,d1        B:  moveq    #20,d0
                                      add.l      d0,d1
   
      68040/xx:    A
      68000/20/60:  B
   
    Byte/word operations that could be replaced with long operations:
   
      A:  add.w  d0,d1        B:  add.l d0,d2
   
      68000/xx:  A
      68020/40:  Any
      68060:      B
   
    Keep memory operands in registers:
   
      A:  add.l  _var,d1      B:  move.l  _var,d0
          add.l  _var,d2          add.l    d0,d1
                                    add.l    d0,d2
   
      68040:          A (as long as total # of instructions are less)
      68000/20/60/xx:  B
   
    Reschedule operations using address registers:
   
      A:  add.l  d0,d1        B:  move.l  (a1),a0
          move.l  (a1),a0          add.l    d0,d1
          move.l  (a0),d2          move.l  (a0),d2
   
      68000/20:    Any
      68040/60/xx:  B
   
    Replace constant multiplications with adds/subs/shifts:
   
      A:  mulu.w  #254,d1      B:  move.l  d1,d0
                                    lsl.l    #8,d1
                                    lsl.l    #1,d0
                                    sub.l    d0,d1
   
      68060:          A
      68000/20/40/xx:  B
   
    Operations using indexing modes:
   
      A:  add.l  (a0,d7),d1  B:  add.l  d7,a0
            add.l  (a0,d7),d2        add.l  (a0),d1
                                      add.l  (a0),d2
   
      68000/60:      A
      68020/40/xx:  B
   
    Saving/restoring registers:
   
      A:  movem.l  d4-d7,-(a7)  B:  move.l  d7,-(a7)
                                        move.l  d6,-(a7)
                                        move.l  d5,-(a7)
                                        move.l  d4,-(a7)
   
      68000/20/60/xx:  A
      68040:            B (if time critical)
   
    Any tips like these...
   
    68020:
      Use short instructions
      Keep values in registers
      Almost no scheduling necessary
      Code optimized for the 68060 runs great
   
    68040:
      Use as few instructions as possible (even if they are longer)
      Values can be kept in memory
      Avoid pipe-line stalls for some effective addresses
      Avoid subtracts to address registers
   
    68060:
      Use short instructions
      Keep values in registers
      Schedule instructions for superscalar execution
      Inline short functions
   
    N050:
   
    I expect it will be smart to schedule instructions for superscalar execution for future N070 compatibility. Is there anything else to watch out for or shy away from for future compatibility?
   

Gunnar von Boehn
Germany
(Natami Team Member)
Posts 3738
26 Feb 2010 05:27


Matt Hey wrote:

Is there any preliminary timing info or optimization tips for N050 instructions or addressing modes yet?

   
    Maybe the following could be filled out...

    
    PROC     CACHE      RAdd  MAdd  Mul  Index  Bcc
    68000    0/0        6     18    40   18     10/6
    68020    256/0      2     6     28   9      6/4
    68030    256/256    2     5     28   8      6/4
    CPU32    0/0        2     9     26   12     8/4
    68040    4K/4K      1     1     16   3      2/3
    68060    8K/8K      1     1     2    1      0/1
    N050     variable*  1     1     1    1      (0/0)*

1) The cache size of the 050 is variable and can be set at compile time. Depending on your FPGA space you can have, e.g., 64/64 KB.
We aim for a cache of 32/32 or 64/64 KB in the NATAMI.

2) The 68050 supports some branch acceleration, but it is not
100% finished yet. Currently some branches can be folded away.
We are working on adding branch merging, which means that short
branches will be merged with the following instruction so that
it becomes conditional. The advantage of this technique is that
the branch will ALWAYS be predicted correctly. As of today, with
the half-finished branch acceleration, some branches can still
take 5 cycles.

3) We are going to add a LINK stack, which means that
subroutine returns (RTS instruction) will be fast on the 68050.
   

  RAdd: Register to register 32 bit add (add.l  d0,d1).
  MAdd: Absolute long address to register add (add.l _mem,d1).
  Mul: 16x16 multiplication (max. time) (mulu.w d0,d1).
Index: Indexed addressing mode (move.l 2(a0,d0),d1).
  Bcc: Byte conditional branch taken/not taken (bne.b label)
   

And maybe add N050 answers for these questions...
   
Operations with long immediate values between -128 and 127:
   
        A:  add.l  #20,d1        B:  moveq    #20,d0
                                      add.l      d0,d1
   
        68040/xx:    A
        68000/20/60:  B
    N68050:      A 
The 050 can load 16 bytes per clock from the ICache.
This means the length of an instruction does not slow the CPU down.

   
Byte/word operations that could be replaced with long operations:
   
        A:  add.w  d0,d1        B:  add.l d0,d2
   
        68000/xx:  A
        68020/40:  Any
        68060:      B
      N68050:    Any
Byte/Word/Long operations always take 1 clock on the 050.

Keep memory operands in registers: 
        A:  add.l  _var,d1      B:  move.l  _var,d0
            add.l  _var,d2          add.l    d0,d1
                                      add.l    d0,d2
   
        68040:          A (as long as total # of instructions are less)
        68000/20/60/xx:  B
    N68050:          A
The 050 can do a memory read per clock therefore A is faster.

   
Reschedule operations using address registers:
   
        A:  add.l  d0,d1        B:  move.l  (a1),a0
            move.l  (a1),a0          add.l    d0,d1
            move.l  (a0),d2          move.l  (a0),d2
   
        68000/20:    Any
        68040/60/xx:  B
    N68050:      B
Like the 040 and the 060, the 050 has a load/use delay for address registers.
Using and updating an address register like this incurs no penalty:
1) (A0)+,Dn
2) (A0)+,Dn
But if you load a register from memory and then use it in the next instruction, there is a bubble between the two instructions.
Just like on the 040 and on the 060.

Replace constant multiplications with adds/subs/shifts:
   
        A:  mulu.w  #254,d1      B:  move.l  d1,d0
                                      lsl.l    #8,d1
                                      lsl.l    #1,d0
                                      sub.l    d0,d1
   
        68060:          A
        68000/20/40/xx:  B
    N68050:      A
Mul is fast (1 clock) on the 050.
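
Option B is only a fair alternative if it really produces the same product, so here is a minimal sketch that checks the arithmetic outside the CPU. It is plain Python, not 68k code. The function names are made up for illustration, and it assumes a 16-bit unsigned operand (as mulu.w uses) and 32-bit register wrap-around.

```python
MASK32 = 0xFFFFFFFF

def mulu_w_254(d1):
    """Model of mulu.w #254,d1: unsigned 16x16 -> 32-bit multiply."""
    return ((d1 & 0xFFFF) * 254) & MASK32

def shift_sub_254(d1):
    """Model of option B: d1*256 - d1*2, computed in 32-bit registers."""
    d0 = d1                     # move.l d1,d0
    d1 = (d1 << 8) & MASK32     # lsl.l  #8,d1
    d0 = (d0 << 1) & MASK32     # lsl.l  #1,d0
    return (d1 - d0) & MASK32   # sub.l  d0,d1

# Spot-check a few 16-bit operands, including the extremes.
for value in (0, 1, 20, 255, 0x7FFF, 0xFFFF):
    assert mulu_w_254(value) == shift_sub_254(value)
```

The identity used is 254*x = 256*x - 2*x. Note that option B also overwrites d0, which option A leaves alone.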

   
    Operations using indexing modes:
   
        A:  add.l  (a0,d7),d1  B:  add.l  d7,a0
            add.l  (a0,d7),d2        add.l  (a0),d1
                                      add.l  (a0),d2
   
        68000/60:      A
        68020/40/xx:  B
    N68050:      A
Indexed addressing mode is free.
Option A takes 2 clocks, Option B takes 3 clocks.
   

    Saving/restoring registers:
   
        A:  movem.l  d4-d7,-(a7)  B:  move.l  d7,-(a7)
                                        move.l  d6,-(a7)
                                        move.l  d5,-(a7)
                                        move.l  d4,-(a7)
   
        68000/20/60/xx:  A
        68040:            B (if time critical)
    N68050:          A
MOVEM takes 1 clock per loaded/stored register.
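
The answers above can be tallied with a very rough counting model. This is only a sketch under stated assumptions: every instruction costs 1 clock, MOVEM costs 1 clock per register, and load/use bubbles, cache misses and branch effects are ignored. The function name and the table layout are invented for this illustration.

```python
def n050_clocks(instructions):
    """Rough clock estimate: 1 per instruction, MOVEM 1 per register."""
    total = 0
    for mnemonic, registers_moved in instructions:
        total += registers_moved if mnemonic == "movem.l" else 1
    return total

# (mnemonic, registers moved); the second field only matters for MOVEM.
comparisons = {
    "long immediate":    ([("add.l", 0)],
                          [("moveq", 0), ("add.l", 0)]),
    "memory operands":   ([("add.l", 0), ("add.l", 0)],
                          [("move.l", 0), ("add.l", 0), ("add.l", 0)]),
    "constant multiply": ([("mulu.w", 0)],
                          [("move.l", 0), ("lsl.l", 0), ("lsl.l", 0), ("sub.l", 0)]),
    "indexed operands":  ([("add.l", 0), ("add.l", 0)],
                          [("add.l", 0), ("add.l", 0), ("add.l", 0)]),
    "save registers":    ([("movem.l", 4)],
                          [("move.l", 0)] * 4),
}

# Map each case to (clocks for option A, clocks for option B).
tallies = {name: (n050_clocks(a), n050_clocks(b))
           for name, (a, b) in comparisons.items()}

for name, (clocks_a, clocks_b) in tallies.items():
    print(f"{name:18} A={clocks_a}  B={clocks_b}")
```

Under these assumptions option A wins or ties in every case. For saving registers the model shows a tie in clocks, so there the shorter MOVEM encoding is the only difference it can show.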

    Any tips like these...
   
    68020:
        Use short instructions
        Keep values in registers
        Almost no scheduling necessary
        Code optimized for the 68060 runs great
   
    68040:
        Use as few instructions as possible (even if they are longer)
        Values can be kept in memory
        Avoid pipe-line stalls for some effective addresses
        Avoid subtracts to address registers
   
    68060:
        Use short instructions
        Keep values in registers
        Schedule instructions for superscalar execution
        Inline short functions
   
    N050:
        Almost no scheduling necessary
        Use as few instructions as possible (even if they are longer)
        Values can be kept in memory
        Avoid memory indirect addressing modes.

   

I expect it will be smart to schedule instructions for superscalar execution for future N070 compatibility. Is there anything else to watch out for or shy away from for future compatibility?
   

Yes, Jens is currently reworking the Cache to prepare for Superscalarity.

There are a few things which are unique to the N68K line.
1) Like the 040 the length of an instruction does not matter.
This means even LONG instructions will only take 1 cycle.

2) The 050 is internally a 3-operand machine.
Because of this the 050 can sometimes combine two 68K instructions into one. This feature is currently 50% finished in the core. We need to add a hint to the ICache to finish it. When it is fully finished, the following will happen:

Example:
move.l D0,D1
add.l (A0),D1
The 050 will in the future do both instructions together in 1 clock.
What the 050 internally does is ADD.l (a0)+D0,D1
The 070 is planned to be able to do this twice per clock.

This means that, to enable support for this in the future, please keep such dependent instructions together. Do not put another instruction between them.
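
Why the merge is legal can be seen in a small model of the register contents. The sketch below is Python, not 68k code. The function names are hypothetical, mem_a0 stands for the long word that (A0) points to, and condition codes are ignored.

```python
MASK32 = 0xFFFFFFFF

def two_instructions(d0, d1, mem_a0):
    """move.l D0,D1 followed by add.l (A0),D1."""
    d1 = d0                        # move.l D0,D1 (old D1 is discarded)
    d1 = (d1 + mem_a0) & MASK32    # add.l  (A0),D1
    return d1

def fused_three_operand(d0, d1, mem_a0):
    """Assumed internal 3-operand form: D1 <- (A0) + D0."""
    return (mem_a0 + d0) & MASK32  # the old D1 is never read

assert two_instructions(5, 99, 7) == fused_three_operand(5, 99, 7)
```

The move only provides the second source operand of the add, so the pair behaves like a single three-operand add. If an unrelated instruction sits between the two, they are no longer adjacent and cannot be paired.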

3) Branch converting.
The 050 will convert a short conditional branch plus the following instruction into a single conditional instruction.

Example:
bne .dontadd
add.l (A0),D1
.dontadd
Will be converted to:
addeq.l (A0),D1
This combined instruction will only take 1 clock.
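
The equivalence of the two forms can be checked with a tiny model. This is a sketch in Python. The function names are made up, z_flag is the Z condition code before the sequence, mem_a0 is the long word at (A0), and the flags written by the add are ignored.

```python
MASK32 = 0xFFFFFFFF

def branch_version(z_flag, d1, mem_a0):
    """bne .dontadd / add.l (A0),D1 / .dontadd"""
    if not z_flag:                   # bne is taken when Z is clear
        return d1                    # the add is skipped
    return (d1 + mem_a0) & MASK32    # add.l (A0),D1

def predicated_version(z_flag, d1, mem_a0):
    """Assumed conditional form addeq.l (A0),D1: add only when Z is set."""
    total = (d1 + mem_a0) & MASK32
    return total if z_flag else d1

for z in (False, True):
    assert branch_version(z, 10, 3) == predicated_version(z, 10, 3)
```

Since no branch is executed in the conditional form, there is nothing left to mispredict.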

This means in theory the best throughput that the 050 could reach will be:


bne .dontadd
move.l D0,D1
add.l (A0),D1
.dontadd
bra somewhere else

The 050 is designed to do all the above 4 instructions together in 1 cycle.

The 050 can do an unconditional branch for free. *working
The 050 can merge 2 instructions. *needs hint tag in cache.
The 050 can rewrite BCC to conditional instructions. *needs hint tag in cache.

I believe the 050 is very easy to program for.
All instructions are fast and all addressing modes are very fast.
You can use long instructions as you like.
This means there is no need to split complex instructions into several simpler instructions, as people did on the older CPUs to speed them up.
What the 050 does NOT like are memory indirect addressing modes.
But I believe that this is not a disadvantage as memory indirect addressing modes ALWAYS were very slow and were very rarely used.

Does this answer all of your questions?
Or do you have more questions?


Ayodele Stephenson
USA

Posts 58
26 Feb 2010 14:58


Great Question with a Great explanation.  Both were well laid out and even easy for a 68k novice like myself to understand.  Thank You.

Matt Hey
USA

Posts 204
27 Feb 2010 03:34


Gunnar von Boehn wrote:

  2) The 68050 supports some branch acceleration, but it is not
  100% finished yet. Currently some branches can be folded away.
  We are working on adding branch merging, which means that short
  branches will be merged with the following instruction so that
  it becomes conditional. The advantage of this technique is that
  the branch will ALWAYS be predicted correctly. As of today, with
  the half-finished branch acceleration, some branches can still
  take 5 cycles.

Is a branch (prediction) cache like the 68060 still planned?
 
 
  Keep memory operands in registers: 
        A:  add.l  _var,d1      B:  move.l  _var,d0
            add.l  _var,d2          add.l    d0,d1
                                      add.l    d0,d2
     
        68040:          A (as long as total # of instructions are less)
        68000/20/60/xx:  B
      N68050:          A
  The 050 can do a memory read per clock therefore A is faster.

Would a superscalar N68070 still have a 1 memory read per cycle limitation? Could B become faster on the N68070?
   

  Example:
  move.l D0,D1
  add.l (A0),D1
  The 050 will in the future do both instructions together in 1 clock.
  What the 050 internally does is ADD.l (a0)+D0,D1
  The 070 is planned to be able to do this twice per clock.
 
  This means that, to enable support for this in the future, please keep such dependent instructions together. Do not put another instruction between them.

That's very powerful but runs counter to 68040+ scheduling teachings. It will take some getting used to. Is it not possible to combine instructions other than 2 consecutive instructions in this way?


  What the 050 does NOT like are memory indirect addressing modes.
  But I believe that this is not a disadvantage as memory indirect addressing modes ALWAYS were very slow and were very rarely used.

That was a problem of the 68020+ too ;). I believe adding the indirect addressing modes was the biggest mistake of the 68k line. It's only useful when out of registers. At least the N68050 shouldn't have as much problem with running out of registers as the 68060 because it can deal with memory more efficiently and doesn't need long operations to be efficient.
 
Thanks Gunnar. Your answers were exactly what I was looking for.


Gunnar von Boehn
Germany
(Natami Team Member)
Posts 3738
27 Feb 2010 06:29


Matt Hey wrote:

Gunnar von Boehn wrote:

  2) The 68050 supports some branch acceleration, but it is not
  100% finished yet. Currently some branches can be folded away.
  We are working on adding branch merging, which means that short
  branches will be merged with the following instruction so that
  it becomes conditional. The advantage of this technique is that
  the branch will ALWAYS be predicted correctly. As of today, with
  the half-finished branch acceleration, some branches can still
  take 5 cycles.
 

 
  Is a branch (prediction) cache like the 68060 still planned?

Currently the following optimizations are working or being worked on:
* BRA and BSR acceleration. (working)
  BRA and BSR only need 1 cycle. (working)

* BRA can be folded (executed together with an earlier instruction). Therefore BRA can be free. (needs hint bit in cache / being worked on)

* BCC can be converted into conditional instructions (needs hint bit / being worked on). BCC then becomes free. This could also be turned into a general static acceleration or combined with a direction cache.
The cache/direction cache is currently not being worked on.

Matt Hey wrote:

 
 
  Keep memory operands in registers: 
          A:  add.l  _var,d1      B:  move.l  _var,d0
              add.l  _var,d2          add.l    d0,d1
                                        add.l    d0,d2
     
          68040:          A (as long as total # of instructions are less)
          68000/20/60/xx:  B
      N68050:          A
  The 050 can do a memory read per clock therefore A is faster.
 

 
  Would a superscalar N68070 still have a 1 memory read per cycle limitation? Could B become faster on the N68070?

Yes, 1 memory read per cycle.

The current plan for the 070 design is:
* Two even pipelines able to do any instruction. (Up to two muls per cycle)
* 1 of these two instructions can do a memory access (read/write).
* In addition to these two instructions the core can do a static BRA per cycle.
* Both pipelines are 3-operand pipelines; this means two instructions could be merged into one.

This means both code variants A) and B) would take 2 clocks.
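
The 2-clock figure for both variants can be reproduced with a toy issue model. This is only a sketch under loud assumptions: in-order issue, at most two instructions per clock, at most one of them touching memory, no pairing when the second instruction depends on the first, and no instruction merging or branch folding. The class and function names are hypothetical and say nothing about how the real core is built.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    text: str
    reads: frozenset
    writes: frozenset
    uses_memory: bool = False

def clocks_dual_issue(program):
    """Count clocks for in-order dual issue with one memory op per clock."""
    cycles = 0
    i = 0
    while i < len(program):
        first = program[i]
        cycles += 1
        i += 1
        if i < len(program):
            second = program[i]
            # The second instruction may not touch a register the first writes.
            depends = bool(first.writes & (second.reads | second.writes))
            both_memory = first.uses_memory and second.uses_memory
            if not depends and not both_memory:
                i += 1  # second instruction issues in the same clock
    return cycles

def regs(*names):
    return frozenset(names)

variant_a = [
    Instr("add.l _var,d1", regs("d1"), regs("d1"), uses_memory=True),
    Instr("add.l _var,d2", regs("d2"), regs("d2"), uses_memory=True),
]
variant_b = [
    Instr("move.l _var,d0", regs(), regs("d0"), uses_memory=True),
    Instr("add.l d0,d1", regs("d0", "d1"), regs("d1")),
    Instr("add.l d0,d2", regs("d0", "d2"), regs("d2")),
]

print(clocks_dual_issue(variant_a), clocks_dual_issue(variant_b))
```

In variant A the two memory reads cannot share a clock. In variant B the load goes first and the two register adds pair up in the second clock.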

Matt Hey wrote:
     
 

  Example:
  move.l D0,D1
  add.l (A0),D1
  The 050 will in the future do both instructions together in 1 clock.
  What the 050 internally does is ADD.l (a0)+D0,D1
  The 070 is planned to be able to do this twice per clock.
 
  This means that, to enable support for this in the future, please keep such dependent instructions together. Do not put another instruction between them.
 

 
  That's very powerful but counter 68040+ scheduling teachings. It will take some getting used to. Is it not possible to combine instructions other than 2 consecutive instructions in this way?

It did not hurt to have dependent instructions scheduled like this.
No, merging is not planned for instructions that are not in sequence.
BTW, the Coldfire V5 works the same as we do in this regard.

Matt Hey wrote:

 

  What the 050 does NOT like are memory indirect addressing modes.
  But I believe that this is not a disadvantage as memory indirect addressing modes ALWAYS were very slow and were very rarely used.
 

 
  That was a problem of the 68020+ too ;). I believe adding the indirect addressing modes was the biggest mistake of the 68k line. It's only useful when out of registers. At least the N68050 shouldn't have as much problem with running out of registers as the 68060 because it can deal with memory more efficiently and doesn't need long operations to be efficient.
 
  Thanks Gunnar. Your answers were exactly what I was looking for.
 

Thanks for your excellent questions.

Matt Hey
USA

Posts 204
28 Feb 2010 17:41


Gunnar von Boehn wrote:

        Keep memory operands in registers: 
              A:  add.l  _var,d1      B:  move.l  _var,d0
                  add.l  _var,d2          add.l    d0,d1
                                            add.l    d0,d2
           
              68040:          A (as long as total # of instructions are less)
              68000/20/60/xx:  B
            N68050:          A
        The 050 can do a memory read per clock therefore A is faster.
   

   
Matt Hey wrote:
   
      Would a superscalar N68070 still have a 1 memory read per cycle limitation? Could B become faster on the N68070?
   

   
Gunnar von Boehn wrote:

      Yes, 1 memory read per cycle.
     
      The current plan for the 070 design is:
      * Two even pipelines able to do any instruction. (Up to two muls per cycle)
      * 1 of these two instructions can do a memory access (read/write).
      * In addition to these two instructions the core can do a static BRA per cycle.
      * Both pipelines are 3-operand pipelines; this means two instructions could be merged into one.
     
      This means both code variants A) and B) would take 2 clocks.
     

     
      Will the 1 cycle read limit apply when reading from cache? Will the N68050 have a write buffer like the 68060? How many writes for a buffer and/or per cycle?
   
    Also, would it be possible for one of the two 2-operand instructions that get compacted into one 3-operand instruction to use an immediate value? Example from above:
   
    moveq    #20,d0
    add.l    d0,d1
   
    would compact to:
   
    #20+d1->d1
   
    This would be very nice because it would allow for more compact code. This saves a word over add.l #20,d1. Compact code is one of the nice features of the 68k family. The 68060 was a move back in this direction after the 68040's big instructions and code. If this could be done, I believe it would be a better solution than adding 8 bit long instructions like cmpq (compare quick). Also, this would allow 68020/68030/68060 optimized code to perform better on the N68050+, and more instructions would fit in the cache. I'm also thinking that having small instructions and compact code may perform better with a superscalar N68070.


Matt Hey
USA

Posts 204
03 Mar 2010 03:22


Gunnar von Boehn wrote:

    What the 050 does NOT like are memory indirect addressing modes.
    But I believe that this is not a disadvantage as memory indirect addressing modes ALWAYS were very slow and were very rarely used.
 

 
  I would like to make a clarification. I believe Gunnar is referring to the post and pre indexed addressing modes here. More specifically, the memory indirect post-indexed and pre-indexed modes. I thought that is what he was talking about and also what I referred to in my follow up post. However, I dropped the "memory" from "memory indirect" which makes my post incorrect. It is not 100% clear for Gunnar to call these modes memory indirect addressing modes. It's even worse for me to drop the "memory". Here is an example of indirect modes...
 
  (An) address register indirect
  (An)+ address register indirect with post-increment
  -(An) address register indirect with pre-decrement
  (d16,An) address register indirect with displacement
  (bd,An,Xn.SIZE*SCALE) address register indirect with index
  ([bd,An],Xn.SIZE*SCALE,od) memory indirect post-indexed
  ([bd,An,Xn.SIZE*SCALE],od) memory indirect pre-indexed
 
  Only the last 2 modes listed above are what I think Gunnar is referring to and what I was also referring to. They are all indirect modes. All but the last 2 modes could implicitly be thought of as memory indirect. It's not enough to call them only "indexed" modes either as this could apply to several modes as well. At the minimum, we should refer to them as post and pre indexed modes. It's a small but important distinction.
 
