
Superscalar Fusion?
Samuel D Crow
USA
(Natami Team)
Posts 1295
25 Oct 2010 15:31


It occurred to me that if code has been written to avoid a pipeline stall on a 68060, an instruction scheduler on a compiler might do something like this:


move xxx, d1
move yyy, d0
add d1, d2
sub d0, d3

Would the '050 or '070 be able to correctly fuse the first move to the add and the second move to the subtract?

Deep Sub Micron
Germany
(MX-Board Owner)
Posts 567
25 Oct 2010 20:37


At the moment I don't see how this can be done without an unreasonably high effort.

Samuel D Crow
USA
(Natami Team)
Posts 1295
25 Oct 2010 20:39


@Deep Sub Micron

I understand.  Since most Amiga code isn't optimized for the '060 it shouldn't be a problem anyway, but I thought I would ask.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
25 Oct 2010 20:52


Samuel D Crow wrote:

It occurred to me that if code has been written to avoid a pipeline stall on a 68060, an instruction scheduler on a compiler might do something like this:
 
 

  move xxx, d1
  move yyy, d0
  add d1, d2
  sub d0, d3
 

 
  Would the '050 or '070 be able to correctly fuse the first move to the add and the second move to the subtract?

While possible in theory, this would make the decoder more complex and would require more complex cache and hint logic.

The above code could be executed on a superscalar unit in 2 cycles, giving an IPC of 2, which is good.

The real benefit of fusion is to remove the bottleneck.
This code:
 


  move xxx, d1
  add d1, d2
  move yyy, d0
  sub d0, d3
 

would need 3 cycles on the 68060, which gives an IPC of about 1.3 (4 instructions in 3 cycles), which is not so good. Fusion would allow even a 1-ALU design like the 68050 to reach an IPC of 2 with this code.
A superscalar design could then in theory reach an IPC of 4.
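
To make the cycle counts in this post concrete, here is a toy Python model. It is only a sketch, not a description of the real N68050/N68070 decoder. It assumes that fusion only pairs a `move src,Dn` with the immediately following ALU instruction that reads `Dn`, that every issue slot (fused or not) takes one cycle, and that slots issued in the same cycle are independent. The names `fuse_pairs` and `ipc` are made up for this illustration.

```python
from math import ceil

def fuse_pairs(program):
    """Greedily fuse a `move src,Dn` with the NEXT instruction if it reads Dn.

    Each instruction is a (mnemonic, source, destination) tuple.
    Returns a list of issue slots; a slot holds one or two instructions.
    """
    slots = []
    i = 0
    while i < len(program):
        mnemonic, _src, dst = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if mnemonic == "move" and nxt and nxt[0] != "move" and nxt[1] == dst:
            slots.append((program[i], nxt))   # fused move+op pair
            i += 2
        else:
            slots.append((program[i],))       # unfused single instruction
            i += 1
    return slots

def ipc(program, pipes):
    """Instructions per cycle, assuming one cycle per slot and `pipes` slots per cycle."""
    cycles = ceil(len(fuse_pairs(program)) / pipes)
    return len(program) / cycles

# Dependent pairs kept adjacent (the second listing in this post).
interleaved = [("move", "xxx", "d1"), ("add", "d1", "d2"),
               ("move", "yyy", "d0"), ("sub", "d0", "d3")]

# '060-style scheduling that separates each move from its consumer.
scheduled = [("move", "xxx", "d1"), ("move", "yyy", "d0"),
             ("add", "d1", "d2"), ("sub", "d0", "d3")]

print(ipc(interleaved, pipes=1))  # 2.0: two fused pairs on a single ALU
print(ipc(interleaved, pipes=2))  # 4.0: both fused pairs in one cycle
print(ipc(scheduled, pipes=2))    # 2.0: nothing fuses, two plain issues per cycle
```

Under these assumptions the scheduled ordering never fuses, because the consumer is not adjacent to its `move`. Supporting that case is the extra decoder complexity discussed above.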

Matt Hey
USA

Posts 735
26 Oct 2010 01:11


@Samuel D Crow
Good Question.

Gunnar von Boehn wrote:

  The real benefit of fusion is to remove the bottleneck

This should make instruction scheduling much easier. It should be easier for the compiler, easier for the assembler programmer and the code should be prettier.


  This code:
 

  move xxx, d1
  add d1, d2
  move yyy, d0
  sub d0, d3
 

  Would need 3 cycles on the 68060 which gives a IPC of 1.3 which is not so good. Fusion would allow even a 1 ALU design like the 68050 to reach an IPC of 2 with this code.

Are you sure? I think the 68060 could handle this in 2 cycles IF the instructions all use long operand size and are only one word in length. That is, xxx and yyy could be Dn, (An), (An)+, or -(An). The 68060 can handle...

move.l EA,Dn
op.l Dn

or

op.l Dn
move.l Dn,EA

by forwarding the result in the same cycle. The limitation to long-sized instructions and the instruction fetch bottleneck do take their toll on the IPC.


  A Super Scalar design could then in theory reach even an IPC of 4.

That's a "CISC" 4 IPC. Very nice :). Actually, I would be very impressed if you could average 3 IPC on the N68070.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
26 Oct 2010 07:16


Matt Hey wrote:

The 68060 can handle...
 
  move.l EA,Dn
  op.l Dn

  or
 
  op.l Dn
  move.l Dn,EA
 
  by forwarding the result in the same cycle.

Gee, you are right. The 68060 could really do this.
I'm getting old. :-/

The 68060 was really a wonderful CPU.
It's a real shame that Moto did not continue producing and enhancing it. I think even with minimal effort Moto could have produced a really powerful 68060 successor.

I think Moto could have brought out a slightly enhanced 68060 as a 68080, with the following changes:
1) Widening the ICache data access width from 4 bytes to 8 bytes. This would have removed the superscalar bottleneck of immediate instructions.
2) Adding a link stack. This would have improved subroutine execution times significantly.
3) Increasing clock rate and cache sizes as new technology shrinks allow.

Such a moderately enhanced 68060 with a higher clock rate and bigger caches would have been a killer.
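
As a rough back-of-the-envelope check on point 1, the Python sketch below estimates the fetch-bound IPC. It is a simplification that ignores prefetch buffering and assumes a dual-issue core whose sustained rate is limited only by bytes fetched per cycle. The function name `fetch_bound_ipc` and the 6-byte example (a 2-byte opcode word plus a 32-bit immediate) are illustrative assumptions, not measured figures.

```python
def fetch_bound_ipc(fetch_bytes_per_cycle, instr_bytes, issue_width=2):
    """Upper bound on sustained IPC when instruction fetch is the only limit."""
    return min(issue_width, fetch_bytes_per_cycle / instr_bytes)

# One-word (2-byte) instructions: a 4-byte fetch can feed both pipes.
print(fetch_bound_ipc(4, 2))

# Instructions carrying a 32-bit immediate (6 bytes): fetch-starved at 4 bytes/cycle.
print(round(fetch_bound_ipc(4, 6), 2))

# The same instructions with the proposed 8-byte fetch.
print(round(fetch_bound_ipc(8, 6), 2))
```

In this model, doubling the fetch width doubles the bound for immediate-heavy code, while one-word instructions already saturate two pipes at 4 bytes per cycle.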


