Home   News   Concept   AMIGA-Compatible   Hardware   Forum   Questions+Answers   Pictures   Contact & Team

Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



All TopicsNewsQAFeaturesTalkTEAMLogin to post    Create account
Do you have questions about the Natami?
Post it here and we will answer it!

68050 Intern: the Power of the 68K Instructoin Setpage  1 2 3 4 5 6 7 8 9 10 11 12 13 14 
Gabriele Budelacci
Italy

Posts 111
11 Jul 2009 00:25


Another solution may this:

  cmp.l #30,d0
  sgt d1
  and.l #10,d1
  sub.l d1,d0

Not limited to have 16 bit d0. 4 cycles and not needed EXTB instruction since #10 is contained in the lower 8 bits => plain 68000 code.

Obviously, these instructions are coupled too, so no advantage in superscalar execution.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
11 Jul 2009 01:19


Thomas Richter wrote:

 
  How do you handle the cases with overlapping instructions, that is, if the effective address of the addi instruction is used by the following instruction? In that case, decoding and probably fetching the input data of the following instruction is already completed before the previous instruction (let's say addi) had the chance to write its output to the cache?
 
  Example (considerably stupid, but whatever):
 
    addq.l #1,(a0)+
    subq.l #1,-(a0)

The above is not a problem.
I'll explain you why.

1) The 68050 has to write ports.
    One for the EA-Unit, and one for the ALU.
2) Also the 68050 uses result forwarding on both of them.

The Write-port of the EA-Unit is used by the following instructions:
1) Address modes that alter the Adress pointer
  (An)+  and -(An)
2) By ADDI, SUBI, ADDQ AN, ... and friends 

These instructions use the EA-Writeport to update the Addres-register.
The result of the EA-Unit is forwarded to the next Instruction using the EA-Unit and also to following instructions using the Addressregister.

This means stuff like this works without any problem
clr (An)+
clr (An)+
clr (An)+

The updated content of An is forwarded to the next instruction in time.

The 2nd write port belongs to the ALU.
The ALU does also forward is results.
This means a instruction group like this works without any penalty:

move D1,(A0)+
add D0,D1
neg D0

All the instructions are depending, nevertheless the 68050 can execute them all in 1 cycle each.

But there is one "gotcha".
This is happens when an ALU result is used by one of the following 2 instructions in the EA-Unit. The 68050 can not forward in time in this case of course. The reason that result forwarding is impossible is that the ALU stage is simple executed later than the EA-unit. This means when the first instruction reaches the ALU the 2nd already left the EA_unit.

Typical example is this one:
move (A0,D0),D1
neg D0

D0 is used in the EA-stage - but only updated in the ALU stage.
If the 68050 runs into such code is has to issues 2 wait cycles.
The bubble can be reduced and removed by adding other instructions inbetween.

This behavior is by design of the 68K architecture.
This problem is shared with all other 68K CPUs.
E.g. The 68060 will create just the same penalty in this example.

But this case is relative rare and the bubble can easily be avoided by instrution reordering.

Did this help?

Cheers

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
11 Jul 2009 01:47


Gabriele Budelacci wrote:

  Another solution may this:
 
    cmp.l #30,d0
    sgt d1
    and.l #10,d1
    sub.l d1,d0
 

 
  This is very clever!
  Also 4 cycles is an excellent result!
 
 
 
 
I think the 68050 is the only 68K CPU which can beat this.
And it can beat this even without any cleverness on the programmer side :-)
 
If you just code it this way:
    cmp.l #30,d0
    ble  .jump
    sub.l #10,d0
  .jump
 
Then its 3 instruction, which are executed in 3 cycles on the 68050.
 
 
Normally the BRANCH after the CMP is a pain in the arse.
We all know this.
The branch needs the condition code - but the condition codes are generated at the end of the pipeline. The branch wants to alter the PC which sits at the beginning of the pipeline.
This is normally the problem which makes a CPU pay with its pipeline length for such code.

So normally this combination would need to wait until the CMP went down the complete pipeline before the jump can be executed.
 
 
So we look at the pipeline diagram, then we can see:
 
Unit diagram:
  1) Fetch/ICache
  2) Decoder
  3) Register load
  4) EA-Unit
  5) DCache-Load
  6) ALU Operation
  7) Register-Store/DCache-Store
 
That "normally" a jump after the CMP is expensive.
In our case the pipeline is even short "only" 6 clocks for this.
Other CPU would pay up to 30 clocks for a jump.
 
Luckily the 68050 does simply cheat on this one.
The decoder can rewrite the instructions to:
 
  cmp.l #30,d0
  subgt.l #10,d0

The SUB is now conditional.
The clue is that the SUB can be executed just normally.
The condition is not needed early like with the branch - its only needed in the very last stage, the writeback stage.
This means the rewritten instruction does not hurt the pipeline anymore. And the SUB can be executed just after the CMP.
 

Nice isnt't it?

Cheers
Gunnar

Mike Sieweke
USA

Posts 11
11 Jul 2009 02:56


Thomas Richter wrote:

      i = i - (((30 - i) >> 31) & 10)
     
      This, indeed, will do faster on many modern processors (and is of course a hack) due to the barrel shifter.
     

     
      This is a clever solution, but not correct in all cases.  If i == -MAXINT then the subtraction will overflow and give the wrong answer.  Also, some "modern" CPUs (like the P4) are quite slow in the barrel shifter.
     
      The MIPS32 does best with the simple "if" statement.  The Gnu compiler produces this code, which executes in 3 cycles.  The other suggestions take at least 4 instructions and 4 cycles.
   

        # if i > 30 then i = i - 10
   
        slti  t0, i, 31  # t0 = i .lt. 31
        addi  t1, i, -10  # t1 = i-10
        movz  i, t1, t0  # if t0 .eq. 0 then i=t1
   

   
  ARM also supports conditional execution, so it would use similar code.  It's best to check what your compiler does before writing clever code.

Gone Gahgah
Australia

Posts 237
11 Jul 2009 05:09


Gabriele Budelacci wrote:
Another solution may this:
  cmp.l #30,d0
  sgt d1
  and.l #10,d1
  sub.l d1,d0

Very clever.  I imagine the best way is if we could just write the branch form and the compiler reworked it to the most efficient way for the CPU.  And if it could do this for all patterns then we wouldn't have to worry about it and just write the most easily read.

I imagine that human ingenuity means that someone somewhere will be able to create code patterns that weren't predicted and which the compiler/optimiser didn't know how to rework more efficiently than a clever assembler coder could.

Gunnar von Boehn wrote:
The instruction does indeed quite a lot of work. :-)
The 68050 can do this amount of work in 1 clock because of its pipelined design.
Unit diagram:
1) Fetch/ICache
2) Decoder
3) Register load
4) EA-Unit
5) DCache-Load
6) ALU Operation
7) Register-Store/DCache-Store

Thanks for that; now I understand and appreciate pipelining.  Cool existing solution.

Gunnar von Boehn wrote:
The predication conversion is supposed to rewrite BCC branches to become much faster!

Cool.  I didn't even know to think that such a solution was about so thanks for volunteering this.  It adds to my general understanding of things.  I imagine it increases the number of required instructions fairly significantly?  But so beneficial.

All of this CPU writing and testing you are doing at the moment is being done virtually at the moment isn't it Gunnar?  Do you actually get to have any FPGAs to burn too and test your ideas?  Does that come later when it is all 'debugged'?  Does an FPGA version always behave exactly like the virtual version?  If it doesn't what sort of things cause the variances?  These would be interesting things for me just to know.  I'm unlikely to ever delve into CPU coding in a broad way but it is still fun to know these little things.

Team Chaos Leader
USA
(Moderator)
Posts 2094
11 Jul 2009 09:01



subgt.l #10,d0

Is that a real instruction on the 050 that I can write in my own asm programs?

I have been wanting instructions like that since 1992!

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
11 Jul 2009 12:19


gone gahgah wrote:

  All of this CPU writing and testing you are doing at the moment is being done virtually at the moment isn't it Gunnar?

Yes, I do all my development with chip simulator.

gone gahgah wrote:

Do you actually get to have any FPGAs to burn too and test your ideas? 

Yes, we have several FPGA dev-boards even including big Virtex and Stratix boards. But development in the simulator is faster so I prefer doing this.

This is by the way very normal.
Today all CPUs are developed in using simulator environments.

Cheers


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
11 Jul 2009 12:28


Team Chaos Leader wrote:


  subgt.l #10,d0
 

  Is that a real instruction on the 050 that I can write in my own asm programs?
 
  I have been wanting instructions like that since 1992!

Inside the 68050 :
- all instructions are 3 operant instructions
- and all instructions have a condition code.

The 68K family sometimes uses conditions codes and sometimes 3 opperants - so its easier for the hardware to have this internally always.

So far we do not expose this to the user with new opcodes.
Doing this would mean we would need to "invent" new instructions.

To be honest I'm not sure if we should ever do this.
What do you think?

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
11 Jul 2009 21:04


Here is a small (untested) piece of code that Rotate 3d points around the y-axis and perform perspective transformation.. The code is not optimized.
 
  The obvious bottleneck in this loop is the two divs's per point.
  The divses can be transformed to muls if I set the zcoord in the buffer to 1/(Z1+dist) but let us assume that the zcoords are being changed from frame to frame.
 
 
If the N050 could perform the division in paralell to the other instructions this loop could be optimized almost 100%. Another improvement would be if movem.w could move to 2 or more registers in one clock(if the buffer is in the datacache) For faster fixed point transformations, a muls.w dx,dx,#xx where xx is asymertric shift right. would be extremely powerful. I always wanted a swap.b
 
 
 
The optimal loop is different from processor to processor.
If somebody could compile the c-code I provided we can check how my code is doing compared to the compiler.
If I get the complete cycle-timing of the N050 I will try to create the fastest loop...
 
 
  Execution timing(could be small errors). Assuming the code is in the cache, and no superscalar.
 
  Mc68000
 
  move  4 cycles
  muls.w  43 cycles
  sub.l  6 cycles
  asr.l(12) 32 cycles (8+2*12)
  lsl.l(12) 32 cycles
  ext.l  4 cycles
  divs.w  158 cycles
  cmpa.l  6 cycles
  bne.b  10 cycles
 
  Mc68030
 
  move  2 cycles
  muls.w  28 cycles
  sub.l  2 cycles
  asr.l(12) 4 cycles
  lsl.l(12) 4 cycles
  ext.l  2 cycles
  divs.w  58 cycles
  cmpa.l  2 cycles
  bne.b  6 cycles
 
  Mc68060
 
  move  1 cycle
  muls.w  2 cycle
  sub.l  1 cycle
  asr.l(12) 1 cycle
  lsl.l(12) 1 cycle
  ext.l  1 cycle
  divs.w  22 cycle
  cmpa.l  1 cycles
  bne.b  1-7 cycle
 
  N050
 
  move  1 cycle
  muls.w  1 cycle
  sub.l  1 cycle
  asr.l(12) 1 cycle(?)
  lsl.l(12) 1 cycle(?)
  ext.l  1 cycle
  divs.w  x cycle(?)
  cmpa.l  1 cycle
  bne.b  1-7 cycle(?)
 
 
  D0=X
  D1=Y
  D2=Z
 
  D3=SIN(V) (FIXED POINT 4:12)
  D7=COS(V) (FIXED POINT 4:12)
 
  A0=points to buffer with 3d coords.(input)
  A1=points to buffer with 2d coords.(output)
  a6=points to end of buffer with 3d coords.
  dist=192
  moveq.l #12,d6
  .loop
  movem.w (a0)+,d0-d2 ;x
 
  ;X1=X*COS-Y*SIN
  ;Y1=X*SIN+Y*COS
 
  move.l d0,d4
  muls.w d7,d0  ;X*COS
  muls.w d3,d4  ;X*SIN
  move.l d1,d5
  muls.w d7,d1  ;Y*cos
  muls.w d3,d5  ;y*sin
  sub.l d5,d0  ;X*COS-Y*SIN
  sub.l d1,d4  ;X*SIN+Y*COS
 
  ; x=x1*dist / (z1+dist)
  ; y=y1*dist / (z1+dist)
 
  asr.l d6,d0
  asr.l d6,d4
  muls.w #dist,d0
  muls.w #dist,d4
  add.w #dist,d2
  ext.l d2
  lsl.l d6,d2
  move.l d2,d1
 
  divs.w d0,d1
  divs.w d4,d2
 
  movem.w d1-d2,(a1)+
 
  cmpa.l a0,a6
  bne.b .loop
 
  ..
  C-Code:
 
  for(int i=0;i

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
11 Jul 2009 21:06


The editor is destroying my c-code.


Gone Gahgah
Australia

Posts 237
11 Jul 2009 23:20


S P wrote:
The editor is destroying my c-code.

Such is Natami website life...
 
Gunnar von Boehn wrote:
Luckily the 68050 does simply cheat on this one.
The decoder can rewrite the instructions to:
cmp.l #30,d0
subgt.l #10,d0

Gunnar von Boehn wrote:
So far we do not expose this to the user with new opcodes.
Doing this would mean we would need to "invent" new instructions.
To be honest I'm not sure if we should ever do this.
What do you think?

Another thing I have just learnt/just dawned.  Thanks Gunnar.
Cool.  The decoder has its own sub-psuedo-instruction set that it can modify actual instructions to; eliminating branching.
 
I think that if this is just as quick as allowing the coder do it directly then there isn't any point adding the new commands.
I guess it is just a psychological thing.
I would have a easier time mentally if I were able to code it directly as it is then obvious what is happening and I am happy because I can directly see that I do not to have a branch anymore.
But what I should do is rest easy confident that the N68050 will eliminate the nasty branch for me automatically.
 
At the same time my code maintains backward compatibility so that I can run it on older 680x0s without exceptions.
I imagine also it makes things more jammed if we start adding new instructions; especially multiple flag variations on existing instructions.
 
The above are my interpretations from what you have said Gunnar.
I hope it is correct.
 
I imagine the decoder has to be bigger?  Does it have its own little clever tricks and shortcuts?
 
Gunnar von Boehn wrote:
JMP xxx.w/d16(pc)/xxxx.l = 1 clock
JSR xxx.w/d16(pc)/xxxx.l = 1 clock
All other JMP and JSR = 3 clocks

Could you possibly do a little expose on what makes those other JMP and JSR instructions 3 clocks instead of 1 Gunnar?  I'm interesed by exceptions to the rule and I have not a clue to guess it myself.

Denis Markovic
Germany
(Natami Team)
Posts 41
12 Jul 2009 07:20


Hi Gunnar,

> The 68K family sometimes uses conditions codes and sometimes 3
> opperants - so its easier for the hardware to have this internally > always.

Cool, that makes it really powerfull, just like e.g. ARM in ARM32 mode or other processors :)

> So far we do not expose this to the user with new opcodes.
> Doing this would mean we would need to "invent" new instructions.

> To be honest I'm not sure if we should ever do this.
> What do you think?

I fear, you are right. If you always recognize such code
(conditional branch over following instruction) and take
the  shortcuts, you do optimization while staying quite
compatible, so no need to have new opcodes for that and
it runns on other 68k processors.

On the other hand ... if you maybe decide to add things
like vector instructions, array min/max, norm etc. to
some maybe future (far far away) 68080 then I would also
make the conditional instructions available in the future
(then it does not matter anyway because you have a lot of
new instructions).

Btw.: I think I missed all discussion about Robin capabilities
and 3d Core capabilities. Could you give me a hint in which
forum post to find some information about instruction set,
vector processing, parallel instructions etc. in Robin and
the 3D core? Just interessted as I was programming such things
the last 10 years :-)



Marcel Verdaasdonk
Netherlands

Posts 3977
12 Jul 2009 12:30


I think we should do the same thing IBM is upto with there new compiler.

they are working on a compiler that compilers and then runs code to see how fast is goes then compiles it in a different way and test again, then compairs the run time, and takes the fastest code.

Denis Markovic
Germany
(Natami Team)
Posts 41
12 Jul 2009 15:53


Isn't gcc able to do the same? If I remember right, there
was a mode where you could compile a special executable,
let it run on typical testdata and during that the executable
writes information about loop counts etc. into a file.

Then the same code can be recompiled using the collected
information for optimization, usually gives a few % more speed.

Marcel Verdaasdonk wrote:

I think we should do the same thing IBM is upto with there new compiler.
 
  they are working on a compiler that compilers and then runs code to see how fast is goes then compiles it in a different way and test again, then compairs the run time, and takes the fastest code.



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
12 Jul 2009 17:21


gone gahgah wrote:
 
 
Gunnar von Boehn wrote:
JMP xxx.w/d16(pc)/xxxx.l = 1 clock
  JSR xxx.w/d16(pc)/xxxx.l = 1 clock
  All other JMP and JSR = 3 clocks
 

  Could you possibly do a little expose on what makes those other JMP and JSR instructions 3 clocks instead of 1 Gunnar?  I'm interesed by exceptions to the rule and I have not a clue to guess it myself.
 

 
Sure, this is easy to explain.
 
Unit diagram:
    1) Fetch/ICache
    2) Decoder
    3) Register load
    4) EA-Unit
    5) DCache-Load
    6) ALU Operation
    7) Register-Store/DCache-Store
 
The Branch acceleration unit is located in stage 1).
The Fetch/Branch acceleration unit contains the program counter PC.
 

The 68k knows two main branch types.
Conditional Branches - which need the condition codes which are available at the end of the Pipeline.

And unconditional branches.
Unconditional branches are:
BRA,BSR,JSR,JMP and also RTS,RTE,...

The unconditional branches can ge grouped in four different types of how to address to jump to is calculated.
 
a) JUMP to an absolute address, included in the opcode
JMP $100000
 
b) PC-relative JUMP
BRA.w label
BSR label
JMP label(pc)

 
c) A jump to a calculated Effective Address (EA)
JMP -240(A6)
 
d) Jump to an address located in memory.
RTS

The branch acceleration unit can on its own handle the branch types A and B. (A) is easy to handle as the address is contained in the instructions.
(B) is also easy to handle as only an addition to the PC is needed.
The branch acceleration unit contains three adders-unit for this purpose.
 
The branch type C needs access to the register set and an EA calculation.
This can only be done by the EA-unit. Which comes 3 stages later in the pipe. This is way this type of branch take longer.
 

The branch to memory (RTS) is basicly a
MOVE.l (A7)+,PC
And therefore needs also be done later in the pipeline

Did this answer your question?

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
13 Jul 2009 08:24


Gabriele Budelacci wrote:
Another solution may this:
    cmp.l #30,d0
    sgt d1
    and.l #10,d1
    sub.l d1,d0
 

 
If I understand the main usage of the SCC instruction then its most useful in combination with an AND and then with any other instruction that you want to replace.
 
If we wanted, then we could create a combined SCC/AND instruction which does the same but in 1 cycle.
 
Something like :
SCC.L #immidiate,Dn

Would that make sense?

Marcel Verdaasdonk
Netherlands

Posts 3977
13 Jul 2009 09:06


oddly enough it does make sence.
However this shouldn't force the programmer to allways use it commbined.

Gabriele Budelacci
Italy

Posts 111
13 Jul 2009 11:54


@Gunnar

I disagree, it whould be used only in few cases like this.

Personally, if new instructions will be added, I strongly prefer math+cc instructions, eg:

    cmp.l #30,d0
    subgt.l #10,d0

cheers ;-)

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
13 Jul 2009 20:39


Gabriele Budelacci wrote:

  Personally, if new instructions will be added, I strongly prefer math+cc instructions, eg:
 
    cmp.l #30,d0
    subgt.l #10,d0
 
  cheers ;-)

There is one challenge with doing "CC" on any type of instruction - which I should not hide.

If someone sees a clever solution to this problem please speak up!

CC-ing an instruction can be easily done on any type of instruction with do their alteration only in the ALU-stage.
If an instruction does any type of alteration in the EA-stage then this is very difficult to "CC" is the condition codes are not available in that early stage.

In other words:
move (a0),D0  - can be easily CC'ed
as it only alter D0, and this is done in th ALU-stage.

move (a0)+,D0 -- is a problem as it alter the content of A0 in the EA-stage also.

Any idea?


Gabriele Budelacci
Italy

Posts 111
13 Jul 2009 22:04


Gunnar von Boehn wrote:

 
  In other words:
  move (a0),D0  - can be easily CC'ed
  as it only alter D0, and this is done in th ALU-stage.
 
  move (a0)+,D0 -- is a problem as it alter the content of A0 in the EA-stage also.
 
  Any idea?
 

Firsts, I don't know much on HW development :-P

Idea #1: place the CC test before in the pipeline:

Unit diagram:
    1) Fetch/ICache
    2) Decoder
    3) Register load
    3-bis) CC test, continue if true
    4) EA-Unit
    5) DCache-Load
    6) ALU Operation
    7) Register-Store/DCache-Store

Idea #2: allow CC-test only for specified EA (eg: allowed for (a0), not allowed for (a0)+ )

Idea #3: insert a rollback stage after ALU operation stage:

Unit diagram:
    1) Fetch/ICache
    2) Decoder
    3) Register load
    4) EA-Unit
    5) DCache-Load
    6) ALU Operation
    6-bis) CC rollback test
    7) Register-Store/DCache-Store
 

posts 271page  1 2 3 4 5 6 7 8 9 10 11 12 13 14