
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.



Do you have ideas and feature wishes? Post them here and discuss your ideas.

N68k Enhancements Revisited
Rune Stensland
Norway
(MX-Board Owner)
Posts 871
08 Apr 2011 19:27


On the old computers it was common to use self-modifying code (SMC).

Can this be supported on the N050?

2 cycles:

move.l (sp)+,d0
add.l (a0),d0

1 cycle (with SMC):

move.l #xxx,d0
add.l (a0),d0

Instead of storing temporary variables on the stack, the programmer can put them directly into the code as immediate operands. This would allow 2 memory accesses per clock, but it would require some kind of cache snooping.


Team Chaos Leader
USA
(Moderator)
Posts 2094
08 Apr 2011 19:49


S P wrote:

On the old computers it was common to use self-modifying code (SMC).
 
  Can this be supported on the N050?
 
  2 cycles:
 
  move.l (sp)+,d0
  add.l (a0),d0
 
  1 cycle(with SMC):
 
  move.l #xxx,d0
  add.l (a0),d0
 

Where is the SMC?
/me hands you a cup of coffee :)

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
08 Apr 2011 20:01


Sometimes the programmer is out of registers. Take a look at my latency-free matrix multiply routine in the sysinfo topic for an example.

Instead of writing this:
move.l d0,-(sp)
...
move.l (sp)+,d0
add.l (a0),d0

You can write this:

move.l d0,.smc+2(pc)
...
.smc
move.l #xxx,d0
add.l (a0),d0

The N050 could reach superscalar performance with this feature: 2 memory reads per clock. One read is served from the instruction cache, the other from the data cache.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
08 Apr 2011 20:28


Marcel Verdaasdonk wrote:

Doesn't the N68050 already do.
 
 
  load
  ea
  opp
  store
 

  In a single pipeline every cycle?
 

 
Yes it does!

Marcel Verdaasdonk wrote:

It doesn't matter to me. I was asking for the size of a unit with three pipelines and an FPU in an FPGA.

But your question was imprecise.
I did not want to give you a wrong answer.
To answer you correctly, you really need to give some proper examples (EXAMPLES!!) of what type of instructions you want to execute in 1 cycle with your 3 pipes.

I already asked you for examples to clarify this, didn't I?

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
08 Apr 2011 20:34


S P wrote:

On the old computers it was common to use self-modifying code (SMC).

Self-modifying code is quite messy; we certainly all agree on this.
We all know that you could shave off a cycle or two with it on the 68000.

Self-modifying code makes it necessary to invalidate the cache, which can be really costly. I know what you mean and where you are coming from, but for the sake of code clarity I really hope that people will refrain from using such hacks ...

One Thousand
USA

Posts 832
08 Apr 2011 21:41


@SP

I also think SMC is neat.  It is that cache stuff that is messy!  :)  Really!

Here is an interesting example: Sun's MAJC processor even had specific instructions to do it.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
09 Apr 2011 09:28


S P wrote:

  move.l #xxx,d0
  add.l (a0),d0
 
The N050 could reach superscalar performance with this feature: 2 memory reads per clock. One read is served from the instruction cache, the other from the data cache.

Yes, in this special case you could win 1 cycle.

But there is no compiler which would be able to use this feature. :-/

This means that for 99% of all applications this feature would remain unused.
Improving the CPU in a way which benefits all code would be best. General improvements, like being able to do 2 memory reads per cycle, would benefit all code and would even accelerate existing legacy code.


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
09 Apr 2011 11:26


Of course, supporting 2 memory reads per clock would be the best. It will probably require twice the number of LEs and a lot of work. My cache-snooping idea is much easier to implement, isn't it?
 
  In compiled code you often see function calls like this:
 
  movem.l d0-d7/a0-a6,-(sp)
  jsr method
  movem.l (sp)+,d0-d7/a0-a6
 
  I would be happy to implement support in GCC for an SMC-style calling convention to save data cache and time. :D
 
 

    movem.l  a0-a6,-(sp)
    move.l d0,.i+2(pc)
    move.l d1,.i+8(pc)
    move.l d2,.i+14(pc)
    move.l d3,.i+20(pc)
    move.l d4,.i+26(pc)
    move.l d5,.i+32(pc)
    move.l d6,.i+38(pc)
    move.l d7,.i+44(pc)
    jsr  (a6)
    movem.l  (sp)+,a0-a6
  .i
    move.l #0,d0
    move.l #0,d1
    move.l #0,d2
    move.l #0,d3
    move.l #0,d4
    move.l #0,d5
    move.l #0,d6
    move.l #0,d7
 

 
  Another nice instruction would be to have a movem with modulo.
 
  movem.l #2,d0-d7/a0-a6,(a0)+
 
  here a0 will point to:
 
  dc.l d0
  dc.w 0
  dc.l d1
  dc.w 0
  dc.l d2
  ...
 

 
    movem.l #2,d0-d7/a0-a6,.i+2(pc)
    jsr  (a6)
  .i
    move.l #0,d0
    move.l #0,d1
    move.l #0,d2
    move.l #0,d3
    move.l #0,d4
    move.l #0,d5
    move.l #0,d6
    move.l #0,d7
    move.l #0,a0
    move.l #0,a1
    move.l #0,a2
    move.l #0,a3
    move.l #0,a4
    move.l #0,a5
    move.l #0,a6
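
A rough C model of the proposed movem with modulo may make the memory layout easier to picture. This is only a sketch: `movem_with_gap` and `immediate_offset` are made-up names, and it assumes big-endian longword stores (as on the 68k), that the bytes in each gap are left untouched (so the opcode words between the immediates survive), and that every `move.l #imm,Rn` is 6 bytes long.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not a real 68k instruction: store `count` 32-bit
 * register values big-endian at dst, skipping `gap` bytes after each one.
 * The skipped bytes are left untouched. Returns the bytes advanced. */
size_t movem_with_gap(uint8_t *dst, const uint32_t *regs, size_t count, size_t gap)
{
    size_t off = 0;
    for (size_t k = 0; k < count; k++) {
        dst[off + 0] = (uint8_t)(regs[k] >> 24);
        dst[off + 1] = (uint8_t)(regs[k] >> 16);
        dst[off + 2] = (uint8_t)(regs[k] >> 8);
        dst[off + 3] = (uint8_t)(regs[k]);
        off += 4 + gap;   /* 4 bytes of data, then the gap */
    }
    return off;
}

/* Byte offset from label .i of the immediate field of the k-th
 * "move.l #imm,Rn": a 2-byte opcode word first, 6 bytes per instruction. */
size_t immediate_offset(size_t k)
{
    return 2 + 6 * k;
}
```

With a gap of 2 and the destination at `.i+2`, the k-th value lands at `.i+2+6k`, which matches the offsets 2, 8, 14, ... 44 used in the hand-written `move.l dN,.i+x(pc)` version.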
 

 

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
09 Apr 2011 11:34


Gunnar von Boehn wrote:

  Yes in this special case you could win 1 cycle.
 
  But there is no compiler which would be able to use this feature. :-/
...

Then we have to make a compiler that supports this feature. :D
My latency-free 4x4 matrix multiplier will get 32 cycles faster on the N050 with this. Other inner loops can be improved too.

Self-modifying code is messy. But to squeeze more out of every cycle of our low-clocked chip, we need it.



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
09 Apr 2011 11:41


S P wrote:

Of course, supporting 2 memory reads per clock would be the best.

2 memory reads per clock are coming.
Jens has already redesigned the cache to support this.

S P wrote:

My cache-snooping idea is much easier to implement, isn't it?

I'm not so sure about this.
Especially the case where you manipulate code very near to your current instruction could be impossible to handle correctly.

S P wrote:

 
  In compiled code you often see function calls like this:
 
  movem.l d0-d7/a0-a6,-(sp)
  jsr method
  movem.l (sp)+,d0-d7/a0-a6
 
  I would be happy to implement support in GCC for an SMC-style calling convention to save data cache and time. :D
 
 

      movem.l  a0-a6,-(sp)
      move.l d0,.i+2(pc)
      move.l d1,.i+8(pc)
      move.l d2,.i+14(pc)
      move.l d3,.i+20(pc)
      move.l d4,.i+26(pc)
      move.l d5,.i+32(pc)
      move.l d6,.i+38(pc)
      move.l d7,.i+44(pc)
      jsr  (a6)
      movem.l  (sp)+,a0-a6
  .i
      move.l #0,d0
      move.l #0,d1
      move.l #0,d2
      move.l #0,d3
      move.l #0,d4
      move.l #0,d5
      move.l #0,d6
      move.l #0,d7
 
 

Now you confused me.
The lower code option is much longer and slower.
Where is the benefit of it in your opinion?


Marcel Verdaasdonk
Netherlands

Posts 3976
09 Apr 2011 11:48


Gunnar, for some old code the caches need to be disabled anyhow. :/
Does anyone know if the OS makes use of self-modifying code?

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
09 Apr 2011 11:51


Gunnar von Boehn wrote:

  2 memory reads per clock are coming.
  Jens has already redesigned the cache to support this.
 

 
  This is awesome!
 
 
Gunnar von Boehn wrote:

  I'm not so sure about this.
  Especially the case where you manipulate code very near to your current instruction could be impossible to handle correctly.
 

 
So you need some rules. Use a latency term.
For example: SMC must not target code that executes within the next 16 cycles.
 
 
Gunnar von Boehn wrote:

  Now you confused me.
  The lower code option is much longer and slower.
  Where is the benefit of it in your opinion?
 

 
  A movem still takes one cycle per longword transferred, doesn't it?
  But if Jens implements 2 reads per clock, can the movem be 2 times faster?
  movem.l  (sp)+,a0-a6
 
  This example saves data cache space but uses more instruction cache.
 
  With the limit of one memory read per clock, this code will run in 4 cycles with 2 pipes:
 
  move.l #0,d0
  move.l #0,d1
  move.l #0,d2
  move.l #0,d3
  move.l #0,d4
  move.l #0,d5
  move.l #0,d6
  move.l #0,d7
   
  and this will run in 8 cycles:
 
  move.l (sp)+,d0
  move.l (sp)+,d1
  move.l (sp)+,d2
  move.l (sp)+,d3
  move.l (sp)+,d4
  move.l (sp)+,d5
  move.l (sp)+,d6
  move.l (sp)+,d7
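
The two cycle counts above can be reproduced with a very crude issue model. The sketch below is an assumption for illustration, not the real N050 scheduler: it treats the instructions as independent and says a block is limited either by the number of pipes or by the number of data cache read ports, whichever is worse. `estimate_cycles` and `ceil_div` are made-up names.

```c
/* Ceiling division for a >= 0 and b > 0. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Crude model (an assumption): a block of independent instructions needs
 * at least instructions/pipes cycles to issue and at least
 * data_reads/read_ports cycles to fetch its data operands. */
int estimate_cycles(int instructions, int data_reads, int pipes, int read_ports)
{
    int by_issue = ceil_div(instructions, pipes);
    int by_reads = ceil_div(data_reads, read_ports);
    return by_issue > by_reads ? by_issue : by_reads;
}
```

Eight `move.l #0,dN` with 2 pipes and no data reads come out at 4 cycles, and eight `move.l (sp)+,dN` with a single read port at 8 cycles. With a second read port the model gives 4 cycles for the stack version as well, which is the case where the SMC trick stops paying off.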
   
 

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
09 Apr 2011 12:06


S P wrote:

With the limit of one memory read per clock, this code will run in 4 cycles with 2 pipes:

OK, now I see what you mean.
Currently we have 1 read port and 1 ALU.

The 2nd ALU will come together with the 2nd read port.
Maybe we will even enable the 2nd read port first, because it will allow some of the complex 68k memory instructions, e.g. CMP2 or CMPM, to execute in 1 cycle.


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
09 Apr 2011 15:46


Take a look at the code in my latency-free 4x4 matrix multiply. Every 3D game needs a matrix multiply...

In the sysinfo topic I managed to optimize it down to 156 clocks. With SMC I can get it down to 148.

If the N070 has two read ports, I can optimize it with SMC like this:


      fmove.s (sp)+,fp0 ;fused
      fadd.s  #xxx,fp0  ;1 P1
      fmove.s (sp)+,fp1 ;fused
      fadd.s  #xxx,fp1  ;1 P2
      fmove.s (sp)+,fp2 ;fused
      fadd.s  #xxx,fp2  ;2 P1
      fmove.s (sp)+,fp3 ;fused
      fadd.s  #xxx,fp3  ;2 P2
      fmove.s (sp)+,fp4 ;fused
      fadd.s  #xxx,fp4  ;3 P1
      fmove.s (sp)+,fp5 ;fused
      fadd.s  #xxx,fp5  ;3 P2
      fmove.s (sp)+,fp6 ;fused
      fadd.s  #xxx,fp6  ;4 P1
      fmove.s (sp)+,fp7 ;fused
      fadd.s  #xxx,fp7  ;4 P2

here is the old version:

      fmove.s (sp)+,fp0  ;fused
      fadd.s  (sp)+,fp0  ;1
      fmove.s (sp)+,fp1  ;fused
      fadd.s  (sp)+,fp1  ;2
      fmove.s (sp)+,fp2  ;fused
      fadd.s  (sp)+,fp2  ;3
      fmove.s (sp)+,fp3  ;fused
      fadd.s  (sp)+,fp3  ;4
      fmove.s (sp)+,fp4  ;fused
      fadd.s  (sp)+,fp4  ;5
      fmove.s (sp)+,fp5  ;fused
      fadd.s  (sp)+,fp5  ;6
      fmove.s (sp)+,fp6  ;fused
      fadd.s  (sp)+,fp6  ;7
      fmove.s (sp)+,fp7  ;fused
      fadd.s  (sp)+,fp7  ;8




Gunnar von Boehn
Germany
(Moderator)
Posts 5775
10 Apr 2011 06:42



  #  len (bytes)
  1  4    fmove.s (sp)+,fp0 ;fused
  2  8    fadd.s  #xxx,fp0  ;1 P1
  3  4    fmove.s (sp)+,fp1 ;fused
  4  8    fadd.s  #xxx,fp1  ;1 P2
 

 
  Sorry, but fusing 4 FPU instructions does not work,
  for two reasons:
  a) The 4 fused instructions are 24 bytes long.
  24 bytes is more than the 16 bytes our CPU can read and execute per clock.
  b) We have 1 FPU core.
  Therefore we cannot execute this many FPU instructions per cycle.
  2 FPUs would need a really large amount of chip space, which we cannot spend on such a rarely used unit.
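
The fetch-width argument in a) is plain arithmetic, sketched here in C. The function name `fits_fetch_window` is made up; the instruction lengths (4 and 8 bytes) and the 16-byte limit are the ones quoted in the post above.

```c
/* Add up the byte lengths of a group of instructions and report whether
 * the whole group fits into one fetch window. Returns 1 if it fits. */
int fits_fetch_window(const int *lengths, int count, int window_bytes)
{
    int total = 0;
    for (int k = 0; k < count; k++)
        total += lengths[k];
    return total <= window_bytes;
}
```

For the lengths 4, 8, 4, 8 the total is 24 bytes, which does not fit into 16. One fmove/fadd pair alone is 12 bytes and does fit.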
 

Deep Sub Micron
Germany
(MX-Board Owner)
Posts 567
10 Apr 2011 09:52


S P wrote:

    I would be happy to implement support in GCC for an SMC-style calling convention to save data cache and time. :D
 

 
  Hi, I'd just like to mention that SMC is not reentrant. So GCC must somehow be aware of whether this is OK or not.
 
  Implementing SMC support without adding penalty cycles is not trivial; in fact it sounds nearly impossible. For example, there are two cases. While the data is still in the pipeline, it might be forwarded. But once the data has been written to the data cache, it takes some cycles until instruction fetch can access it. There might be a gap between the two cases. If the data is still in the pipeline, there must be a check whether the whole immediate value or just a part of it was affected (and whether it is aligned?), and a check whether the opcode was affected. I guess there are dozens of other things that need to be considered, too.
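
The reentrancy point can be shown with a small C analogy. This is only a sketch under an assumption: a `static` variable stands in for an immediate operand patched into the code, since both exist exactly once no matter how many calls are active. The function names are made up.

```c
/* Stands in for an immediate operand patched into the code: there is
 * exactly one copy, shared by every active call. */
static int patched_slot;

/* Reentrant: the saved value lives in this call's own stack frame. */
int sum_stack(int n)
{
    if (n == 0)
        return 0;
    int saved = n;
    int rest = sum_stack(n - 1);
    return saved + rest;
}

/* Not reentrant: the nested call overwrites the shared slot before
 * the outer call reads it back. */
int sum_slot(int n)
{
    if (n == 0)
        return 0;
    patched_slot = n;
    int rest = sum_slot(n - 1);
    return patched_slot + rest;
}
```

`sum_stack(3)` returns 6, while `sum_slot(3)` returns 3, because every outer call reads back the 1 that the innermost call left in the slot. A recursive routine that saved its registers into patched immediates would go wrong in the same way, and so would code that can be interrupted and re-entered.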


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
10 Apr 2011 23:45


I have added two more tests in the benchmark program. Here is a recursive qsort implementation. I compiled it with VBCC and then I wrote my own optimized version.
 
  In WinUAE my handwritten ASM version is 2.3 times faster. :D
 
 

  void quickSort(int arr[], int left, int right)
  {
      int i, j, tmp, pivot;

      i = left;
      j = right;
      pivot = arr[(left + right) >> 1];

      /* partition */
      while (i <= j) {
          while (arr[i] < pivot)
              i++;
          while (arr[j] > pivot)
              j--;
          if (i <= j) {
              tmp = arr[i];
              arr[i] = arr[j];
              arr[j] = tmp;
              i++;
              j--;
          }
      }

      /* recursion */
      if (left < j)
          quickSort(arr, left, j);
      if (i < right)
          quickSort(arr, i, right);
  }
 

 

TEST_QSORT:
 
  lea data,a0
  move.l a0,a3
 
  move.l #quicksortnumbers,-(sp)
  move.l #0,-(sp)
  bsr.b SP_QSORT
  addq.l #8,sp
  rts
 
SP_QSORT:
  move.l 4(sp),d0 ;1 P1
  move.l 8(sp),d1 ;1 P2
  move.l d0,d2    ;2 P1 fused
  add.l  d1,d2    ;2 P1
  move.l d1,d4    ;2 P2
  move.l d0,d3    ;3 P1
  asr.l  #1,d2  ;3 P2
 
  lea.l  (a0,d0.l*4),a1  ;4 P1
  lea.l  (a3,d1.l*4),a2  ;4 P2 a0=a3
  move.l (a0,d2.l*4),d2  ;5 P1
  .outer
  .o
  cmp.l  a1,a2    ;5 P2
  beq.b .next    ;free
  cmp.l (a1)+,d2  ;6 P2  ;arr<pivot
  ble.b .o  ;?
  subq.l #4,a1    ;7 P1
  .next
      addq.l #4,a2  ;7 P2
  .o2
  cmp.l a1,a2      ;8 P1
  beq.b .s        ;free
  cmp.l -(a2),d2  ;9 p
  bgt.b .o2        ;? ;arr[ j ]>pivot
  .s
  cmp.l a1,a2      ;10 P1 ;  if (i <= j)
  blt.b .o3        ;free ;if a1>a2 then skip
 
  sub.l  #4,a2        ;11 P1
  move.l (a1),d5      ;11 P2
  move.l -4(a2),(a1)+ ;12 P1
  move.l d5,(a2)      ;13 P1
 
  cmp.l a1,a2        ;14 P1
  bge.b .outer        ;free
  .o3
  sub.l a0,a1        ;14 P2
  sub.l a3,a2        ;15 P1
 
  move.l a2,d7        ;15 P2 fused
  asr.l  #2,d7        ;15 P2
  move.l a1,d6        ;16 P1 fused
  asr.l  #2,d6        ;16 P1
 
  .k cmp.l d3,d7      ;16 P2 ;left=d3, j=d7 if (d3 >= d7) then skip
  ble.b  .skip  ;?
  movem.l d4/d6,-(sp)  ;18
  move.l  d7,-(sp)    ;19
  move.l  d3,-(sp)    ;20
  bsr.b  SP_QSORT    ;21
  addq.l  #8,sp        ;22
  movem.l (sp)+,d4/d6  ;24
  .skip
  cmp.l  d6,d4      ;25 ;i=d6, right=d4 if (d6>=d4 ) then skip
  ble.b  .skip2  ;?
  .kk
  move.l  d4,-(sp) ;26
  move.l  d6,-(sp) ;27
  bsr.b  SP_QSORT ;28
  addq.l  #8,sp  ;29
  .skip2
  rts    ;30
 
 
  Here is the VBCC compiled version with cpu=68060 and speedup=true
 
  TEST_QSORT_VBCC:
 
  lea data,a0
 
  subq.w #4,a7
  move.l #quicksortnumbers,-(a7)
  move.l #0,-(a7)
  move.l a0,-(a7)
 
  jsr _quickSort
  add.w #12,a7
  addq.w #4,a7
 
  rts
  _quickSort
  sub.w #16,a7
  ; movem.l .l21,-(a7)
  ; movem.l d0-d6/a0-a6,-(a7)
  move.l (24+.l23,a7),(0+.l23,a7)
  move.l (28+.l23,a7),(4+.l23,a7)
  move.l (24+.l23,a7),d0
  add.l (28+.l23,a7),d0
  asr.l #1,d0
  move.l (20+.l23,a7),a0
  move.l (0,a0,d0.l*4),(12+.l23,a7)
  bra .l7
  .l6
  bra .l10
  .l9
  addq.l #1,(0+.l23,a7)
  .l10
  move.l (0+.l23,a7),d0
  move.l (20+.l23,a7),a0
  move.l (0,a0,d0.l*4),d1
  cmp.l (12+.l23,a7),d1
  blt .l9
  .l11
  bra .l13
  .l12
  subq.l #1,(4+.l23,a7)
  .l13
  move.l (4+.l23,a7),d0
  move.l (20+.l23,a7),a0
  move.l (0,a0,d0.l*4),d1
  cmp.l (12+.l23,a7),d1
  bgt .l12
  .l14
  move.l (0+.l23,a7),d0
  cmp.l (4+.l23,a7),d0
  bgt .l16
  .l15
  move.l (0+.l23,a7),d0
  move.l (20+.l23,a7),a0
  move.l (0,a0,d0.l*4),(8+.l23,a7)
  move.l (4+.l23,a7),d0
  lsl.l #2,d0
  move.l (20+.l23,a7),a0
  add.l d0,a0
  move.l (0+.l23,a7),d0
  move.l (20+.l23,a7),a1
  move.l (a0),(0,a1,d0.l*4)
  move.l (4+.l23,a7),d0
  move.l (20+.l23,a7),a0
  move.l (8+.l23,a7),(0,a0,d0.l*4)
  addq.l #1,(0+.l23,a7)
  subq.l #1,(4+.l23,a7)
  .l16
  .l7
  move.l (0+.l23,a7),d0
  cmp.l (4+.l23,a7),d0
  ble .l6
  .l8
  move.l (24+.l23,a7),d0
  cmp.l (4+.l23,a7),d0
  bge .l18
  .l17
  move.l (4+.l23,a7),-(a7)
  move.l (28+.l23,a7),-(a7)
  move.l (28+.l23,a7),-(a7)
  jsr _quickSort
  add.w #12,a7
  .l18
  move.l (0+.l23,a7),d0
  cmp.l (28+.l23,a7),d0
  bge .l20
  .l19
  move.l (28+.l23,a7),-(a7)
  move.l (4+.l23,a7),-(a7)
  move.l (28+.l23,a7),-(a7)
  jsr _quickSort
  add.w #12,a7
  .l20
  .l5
  .l21 ;movem.l (a7)+,d0-d6/a0-a6
  ;reg
  add.w #16,a7
  .l23 equ 0
  rts
 

 
 

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
10 Apr 2011 23:59


deep sub micron wrote:
 
Implementing SMC support without adding penalty cycles is not trivial; in fact it sounds nearly impossible. For example, there are two cases. While the data is still in the pipeline, it might be forwarded. But once the data has been written to the data cache, it takes some cycles until instruction fetch can access it. There might be a gap between the two cases. If the data is still in the pipeline, there must be a check whether the whole immediate value or just a part of it was affected (and whether it is aligned?), and a check whether the opcode was affected. I guess there are dozens of other things that need to be considered, too.

A penalty is OK. In my code I want to create instructions, for example, 100 cycles ahead. If the cache controller needs 50 cycles to support this, that's OK, as long as it's done in parallel. If it's difficult to make, I am sure you can find a solution. :D

Example:

move.l #333,.o+2(pc)
.o add.l #222,d0      ;50 cycle stall.

move.l #333,.d9+2(pc)
(50 cycles of code)
.d9 add.l #222,d0      ;no stall.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
11 Apr 2011 06:32


Hi SP,
 
I think it would most likely look like this:
 
S P wrote:

  move.l #333,.o+2(pc)
  .o add.l #222,d0      ;50 cycle stall.

 
Silently creating a wrong result.
 
 
S P wrote:
 
  move.l #333,.d9+2(pc)
  (50 cycles of code)
  .d9 add.l #222,d0      ;no stall.
 

 
Works fine.
 
 
The reason is simple:
  A CPU pipeline consists of several stages:
  1) Fetch Instruction
  2) Decode Instruction
  3) Fetch Registers
  4) Calculate EA
  5) Data Operand Fetch
  6) ALU Operation
  7) Store Result
 
Your instructions get fetched in stage 1.
Your cache gets updated in stage 7.
 
This means that by the time the instruction alters the cached content, 6 new instructions have already been fetched and partly executed in the pipeline stages! Even if the ICache is snooping, the CPU itself will never notice that these already processed instructions are wrong.
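
The timing argument can be written down as a toy model in C. It is a sketch under simplifying assumptions that are not from the real core: exactly one instruction enters stage 1 per cycle, nothing stalls, and a fetch only sees a store that completed in an earlier cycle. The function name is made up.

```c
/* Toy model: instruction k is fetched in cycle k and its store reaches
 * the last pipeline stage in cycle k + (stages - 1).
 * Returns 1 if the target instruction is fetched after the store landed. */
int smc_target_sees_new_value(int store_index, int target_index, int stages)
{
    int store_cycle = store_index + (stages - 1);
    int fetch_cycle = target_index;
    /* Only a fetch strictly after the store picks up the new bytes. */
    return fetch_cycle > store_cycle;
}
```

With 7 stages, the 6 instructions directly behind the store get the old bytes and the 7th is the first one that is safe. The example that patches the very next instruction falls inside this window, while the one with 50 cycles of code in between is far outside it. A core that fetches several instructions per clock would have a wider window than this model shows.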
 
 
But this will normally not be a problem.

As you certainly know, even the 68000 could fail on self-modifying code. The 68000 had no cache, but it prefetched instructions by 6 bytes. If you did self-modification inside such a window, it could fail unnoticed on the 68000 too!

On the 68K cores with a cache (020 and up) the problem was bigger, because a cache could make self-modification fail even a thousand instructions later.

If you disabled the cache, the CPU still always tried to prefetch 16 bytes. This means there was always a NO-GO area of 16 bytes plus the pipeline length for self-modifying code, even if you disabled or flushed the caches.

This means your code would have failed on the old pipelined 68K CPUs even with disabled caches!

And if you had written self-modifying code that was shorter in bytes, it could have failed on the 68000 too.

I assume you agree with this.

I think surviving self-modifying code which alters code that is not already in the pipeline could be a nice feature, as it would let the CPU execute, with caches enabled, code that other 68K cores could only execute with caches disabled. This would be an advantage when running old code that was written in an incompatible way.

Making the CPU survive modification of code that is already being processed in the pipeline would mean adding an extra trace unit to the CPU core as well, which IMHO is a significant increase in CPU cost that we should rather avoid.

But isn't the main question really: "Is self-modifying code needed today?"

Maybe we should answer this first.

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
11 Apr 2011 08:43


Gunnar von Boehn wrote:

  "Is self-modifying code needed today?"
 

 
  The most common instruction to change with SMC is this:
 
  move.b 0000(a0),d0
 
  Instead of writing this:
 
  move.w (a1)+,d2
  move.b  (a0,d2.w),d0
 
  This is one cycle faster. It frees 2 registers and a data cache access, and removes the ALU stall.
  Here is a demo that uses SMC (optimized for the MC68030):
  EXTERNAL LINK
  Another example is to do this in 1 cycle: 4 memory reads with fusion. The current N050 will use 4 cycles on this. With SMC, 2 variables can be stored in the instruction cache and 2 in the data cache.
 
 

  move.l (sp)+,d0 ;1
  add.l (sp)+,d0 ;2
  move.l (sp)+,d1        ;3
  add.l (sp)+,d1        ;4
 
  With SMC (2 cycles on N050 and one cycle on N070)
 
  move.l #xxx,d0  ;fused
  add.l (sp)+,d0 ;1 P1
  move.l #xxx,d1  ;fused
  add.l (sp)+,d1 ;1 P2
 

 
  I can create more examples..
 
