Welcome to the Natami / Amiga Forum

Sysinfo FPU Question
Megol .

Posts 676
30 Mar 2011 11:25


Shouldn't it be possible to treat FNOPs as real null operations? I understand that it is defined as a serializing instruction, but as long as FPU execution has no side effects on the integer side (no exceptions, no load/store collisions, anything more?), they could be skipped or fused, right?

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
30 Mar 2011 13:00


Megol . wrote:

Shouldn't it be possible to treat FNOPs as real null operations? I understand that it is defined as a serializing instruction, but as long as FPU execution has no side effects on the integer side (no exceptions, no load/store collisions, anything more?), they could be skipped or fused, right?

FPU and integer instructions can run in parallel on the 68K.
The purpose of FNOP is to create a barrier in your code for cases where you want to be sure that a possible late exception is fully handled before the next instruction is started, for example an overflow triggered in the last cycle of your FPU instruction.

Megol .

Posts 676
30 Mar 2011 14:00


Gunnar von Boehn wrote:

Megol . wrote:

  Shouldn't it be possible to treat FNOPs as real null operations? I understand that it is defined as a serializing instruction, but as long as FPU execution has no side effects on the integer side (no exceptions, no load/store collisions, anything more?), they could be skipped or fused, right?
 

 
  FPU and integer instructions can run in parallel on the 68K.
  The purpose of FNOP is to create a barrier in your code for cases where you want to be sure that a possible late exception is fully handled before the next instruction is started, for example an overflow triggered in the last cycle of your FPU instruction.

Yes, but it should be possible to emulate that behavior without needlessly stalling the processor. A checkpoint scheme of some sort should do, and could improve performance in other cases too. But I guess that would be something for a 68080+. :)


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
30 Mar 2011 23:08


Megol . wrote:

Gunnar von Boehn wrote:

 
Megol . wrote:

  Shouldn't it be possible to treat FNOPs as real null operations? I understand that it is defined as a serializing instruction, but as long as FPU execution has no side effects on the integer side (no exceptions, no load/store collisions, anything more?), they could be skipped or fused, right?
 

 
  FPU and integer instructions can run in parallel on the 68K.
  The purpose of FNOP is to create a barrier in your code for cases where you want to be sure that a possible late exception is fully handled before the next instruction is started, for example an overflow triggered in the last cycle of your FPU instruction.
 

  Yes, but it should be possible to emulate that behavior without needlessly stalling the processor. A checkpoint scheme of some sort should do, and could improve performance in other cases too. But I guess that would be something for a 68080+. :)
 

In real life you will only use FNOP where you want to stall the CPU.
Let's say you launch an FPU instruction and as the next instruction you start the Blitter. But you want to make sure that the Blitter does NOT get started if the FPU instruction fails with an exception ...
As the FPU runs in parallel to the Integer Unit, you need to SYNC both units, which means you need to force the Integer Unit to wait for the FPU to fully finish.

The FNOP instruction is there for this rare use case.



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
30 Mar 2011 23:13


S P wrote:

Here is a small test loop for you
    Renders a mandelbrot.
   
    Here is an exe file which runs this code with c2p in 256 colors.
  (256x256) 1x1
 
    (Realtime mandelbrot zoomer on amiga (060))
   
    EXTERNAL LINK   
     
     

    ;d6 Number of ylines
    ;d7 ystartpos
    MANDELBROT:
      move.l  txtureptr+8,a0
      move.l d7,d0
      muls.w bredde+2,d0
      add.l d0,a0
     
      fmove.x .xmin(pc),fp0
      fmove.x .xmax(pc),fp1
   
      fsub.x fp0,fp1
      fmove.l width,fp7
      fdiv.x fp7,fp1
   
      fmove.x .ymin(pc),fp2
      fmove.x .ymax(pc),fp3
   
      fsub.x fp2,fp3
      fmove.l height,fp7
      fdiv.x fp7,fp3
   
      fmove.x fp2,fp7 ;cy
      fadd.x .ypos(pc),fp7
   
      fmove.l d7,fp4
      fmul.x fp3,fp4
      fadd.x fp4,fp7  ;start at correct raster
   
      fmove.x fp3,fp2
   
      fmove.x #256,fp0
    .yloop
      move.l width,d7
      subq.l #1,d7
      fmove.x .xmin(pc),fp6 ;cx
      fadd.x .xpos(pc),fp6
    .xloop
      moveq.l #0,d5
   
      fmove.x fp6,fp3  ;x
      fmove.x fp7,fp4  ;y
    .cloop
      fmove.x fp4,fp5
      fmul.x fp3,fp4
      fadd.x fp4,fp4
      fadd.x fp7,fp4  ;y1 = 2*x*y + cy
     
      fmul.x fp5,fp5
      fmul.x fp3,fp3
   
      fsub.x fp5,fp3
      fadd.x fp6,fp3  ;x1 = x*x - y*y + cx
   
      fadd.x fp3,fp5
   
      fcmp.x fp0,fp5
      fbnlt .ut
   
    ;x1 = x*x - y*y + cx;
    ;y1 = 2*x*y + cy;
    ;x = x1;
    ;y = y1;
    ; count++;
   
      addq.b #1,d5
      cmp.w #255,d5
      blt.b .cloop
    .ut
      move.b d5,(a0)+
   
      fadd.x fp1,fp6
   
      dbf d7,.xloop
      fadd.x fp2,fp7
      dbf d6,.yloop
   
      rts
   
    .xpos: ;dc.x -0.74449379998
      dc.x -1.253100001
   
    .ypos: ;dc.x -0.0969500010092 
      dc.x -0.344125
     
    .xmin: dc.x -0.5
      ;dc.x -0.000000001
      ;dc.x -0.000000000001
    .xmax: dc.x 0.5
      ;dc.x 0.000000001
      ;dc.x 0.000000000001
    .ymin: dc.x -0.5
      ;dc.x -0.000000001
      ;dc.x -0.000000000001
    .ymax: dc.x 0.5
      ;dc.x 0.000000001
      ;dc.x 0.000000000001
   
    ; for(i = 0;i < 720;i++)
    ;    for(j = 0;j < 540;j++)
    ;      {
    ;          /* generate complex numbers in the window [-2,1] x [-1.2, 1+] to
    ;          locate picture on screen for optimal viewing  */
    ;    cx = ((thiss->xmax - thiss->xmin)/720.0) * i - fabs(thiss->xmin);
    ;    cy = fabs(thiss->ymax) - ((thiss->ymax - thiss->ymin)/540.0) * j;     
    ; 
    ;  x = cx;
    ;  y = cy;   
    ;       
    ;    // find orbit to a max of 255 iterations or when orbit tends to get huge
    ;  count = 0;
    ;  while(count <= 255 && x*x + y*y < 100000000)
    ;      {
    ;        x1 = x*x - y*y + cx;
    ;        y1 = 2*x*y + cy;
    ;        x = x1;
    ;        y = y1;
    ;        count++;
    ;      }   
   
   

     

Hi SP,

Cool Routine!
Thanks!

Would you like to tune it for performance on the 68050 FPU?

From the timing, the 68050 looks somewhat similar to a 68040.

Timing rules:
You can issue one instruction per cycle.
You can issue 2 instructions if you have a MOVE+OP pair which can be fused.
FPU instructions can also be issued 1 per cycle.
FPU instructions have a latency before their result is available.
This means you need to avoid dependent instruction streams.

Latency:
FADD = 8
FMUL = 8

For best performance we need to change the instruction order to remove the dependencies. If you manage to issue instructions without dependencies then you can execute 1 FPU instruction per cycle.

BTW, the 68040 has similar latency rules. Such a code reorder will also improve the performance on the 68040 significantly.
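
The effect of these timing rules can be illustrated with a small model. The Python sketch below is not the 68050 scheduler. It assumes in-order issue of at most one instruction per cycle and that a result becomes readable 8 cycles after issue, as in the latency figures above. The function name `cycles` and the tuple encoding of instructions are made up for this illustration.

```python
FPU_LATENCY = 8  # cycles for FADD and FMUL, from the figures above


def cycles(program, latency=FPU_LATENCY):
    """Return the cycle in which the last result of `program` is available.

    `program` is a list of (destination, sources) tuples. Instructions issue
    in order, at most one per cycle, and an instruction waits until every
    register it reads has been written (assumed model, not measured).
    """
    ready = {}       # register -> cycle in which its value becomes readable
    next_issue = 0   # earliest cycle in which the next instruction can issue
    for dest, sources in program:
        issue = max([next_issue] + [ready.get(reg, 0) for reg in sources])
        ready[dest] = issue + latency
        next_issue = issue + 1
    return max(ready.values())


# Four adds that all accumulate into fp0: each one waits for the previous.
dependent = [("fp0", ["fp0", "fp1"]), ("fp0", ["fp0", "fp2"]),
             ("fp0", ["fp0", "fp3"]), ("fp0", ["fp0", "fp4"])]

# Four adds on separate register pairs: nothing has to wait.
independent = [("fp0", ["fp0", "fp1"]), ("fp2", ["fp2", "fp3"]),
               ("fp4", ["fp4", "fp5"]), ("fp6", ["fp6", "fp7"])]

print(cycles(dependent), cycles(independent))
```

Under this model the dependent chain finishes in cycle 32 and the independent one in cycle 11, which is the reason for interleaving several independent calculations.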

Would you like to re-order the code for us for best performance?

Cheers

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
01 Apr 2011 22:49


OK, here is the start of a 050 implementation of the Mandelbrot.
This routine calculates 4 pixels in parallel.

  I got it down to 40 (mul, add, sub) operations in 52 clocks (including latency and with moves).
  There are probably bugs in the code. (Late night Friday :D )
 
  If you optimize the latency to 7 cycles it will probably run without latency stalls, or you can add some more
  registers...
 
  6*4 (24) muls
  4*4 (16) add/sub
 
 


        ;  while(count <= 255 && x*x + y*y < 100000000)
        ;      {
        ;        x1 = x*x - y*y + cx;
        ;        y1 = 2*x*y + cy;
        ;        x = x1;
        ;        y = y1;
        ;        count++;
        ;      }
 
  ; (x1) x*x,  y*y;
 
    fmove.x .fp0(pc),fp0 ;fused
    fmul.x fp0,fp0  ;1
    fmove.x .fp1(pc),fp1 ;fused
    fmul.x fp1,fp1  ;2
    fmove.x .fp2(pc),fp2 ;fused
    fmul.x fp2,fp2  ;3
    fmove.x .fp3(pc),fp3 ;fused
    fmul.x fp3,fp3  ;4
    fmove.x .fp4(pc),fp4 ;fused
    fmul.x fp4,fp4  ;4
    fmove.x .fp5(pc),fp5 ;fused
    fmul.x fp5,fp5  ;5
    fmove.x .fp5(pc),fp5 ;fused
    fmul.x fp5,fp5  ;6
    fmove.x .fp6(pc),fp6 ;fused
    fmul.x fp7,fp7  ;7
    fmove.x .fp7(pc),fp7 ;fused
    fmul.x fp7,fp7  ;8
  ;(x1) (x*x - y*y)
    fsub.x fp1,fp0  ;9
    fsub.x fp3,fp2  ;10
    fsub.x fp5,fp4  ;11
    fsub.x fp7,fp5  ;12
 
  ; (y1) y1=x*y;
    fmove.x .fp0(pc),fp1 ;fused
    fmul.x .fp1(pc),fp1 ;13
    fmove.x .fp2(pc),fp3 ;fused
    fmul.x .fp3(pc),fp3 ;14
    fmove.x .fp4(pc),fp5 ;fused
    fmul.x .fp5(pc),fp5 ;15
    fmove.x .fp6(pc),fp7 ;fused
    fmul.x .fp7(pc),fp7 ;16
 
  ;(x1)x1=x1+cx
    fadd.x .cx(pc),fp0 ;18 (1 cycle Latency)
    fadd.x .cx(pc),fp2 ;19
    fadd.x .cx(pc),fp4 ;20
    fadd.x .cx(pc),fp6 ;21
 
  ; y1=2*x*y
    fadd.x fp1,fp1  ;22
    fadd.x fp3,fp3  ;23
    fadd.x fp4,fp4  ;24
    fadd.x fp5,fp5  ;25
 
  ; x = x1
 
    fmove.x fp0,.fp0(pc) ;27(1 cycle latency)
    fmove.x fp2,.fp2(pc) ;28
    fmove.x fp4,.fp4(pc) ;29
    fmove.x fp6,.fp6(pc) ;30
 
  ;(y1)y1=y1+cy
    fadd.x .cy(pc),fp1 ;31
    fadd.x .cy(pc),fp3 ;32
    fadd.x .cy(pc),fp5 ;33
    fadd.x .cy(pc),fp7 ;34
 
  ;(test) x*x
    fmul.x fp0,fp0  ;36(1 cycle latency)
    fmul.x fp2,fp2  ;37
    fmul.x fp4,fp4  ;38
    fmul.x fp6,fp6  ;39
 
  ;y=y1
    fmove.x fp1,.fp1(pc) ;40
    fmove.x fp3,.fp3(pc) ;41
    fmove.x fp5,.fp5(pc) ;42
    fmove.x fp7,.fp7(pc) ;43
 
  ;(test) y*y
    fmul.x fp1,fp1  ;44
    fmul.x fp3,fp3  ;45
    fmul.x fp5,fp5  ;46
    fmul.x fp7,fp7  ;47
 
  ;(test) y*y+x*x
 
    fadd.x fp1,fp0  ;49(4cycle latency)
    fadd.x fp3,fp2  ;50
    fadd.x fp5,fp4  ;51
    fadd.x fp7,fp6  ;52
 
    (cmp logic here)
    loop
 
    ,,,
 
  .fp0 dc.x 0
  .fp1 dc.x 0
  .fp2 dc.x 0
  .fp3 dc.x 0
  .fp4 dc.x 0
  .fp5 dc.x 0
  .fp6 dc.x 0
  .fp7 dc.x 0
 
  .cx dc.x 0
  .cy dc.x 0
   
 

  %)
 

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
02 Apr 2011 07:47


Wow cool!!

This looks like a really nice testcase.
I'm looking forward to seeing the fully finished version.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
02 Apr 2011 07:58


It would be nice to create maybe 3 or 4 real-life testcases which test workloads that are relevant to real work.

The Mandelbrot would be a good one.
Another good one for the FPU would be a matrix mul, as this is the core of 3D games.

For the integer CPU, I think four cases are worth measuring:
- Something like a bubble sort.
  * Bubble sort works a little bit on memory. This is good, as doing this is important in real life.
  * Bubble sort has loop branches which are easy to predict,
    and it has conditional code / branches which are very difficult to predict.
    This is a very good testcase as it measures branch prediction performance very well.
    Branch prediction performance is one of the most important things for a real CPU.

- Another very good real-life testcase would do a lot of unpredictable subroutine calls. In real-life software, subroutine calls or OS calls are very common, so measuring how well they are done would be VERY important. This testcase could be mixed with some arithmetic operations.

What do you think about this small bubble sort:
(Written off the top of my head and untested)


  lea data,A0    ; pointer to data block
  move.w #n-2,D0  ; outer dbf counter: n-1 passes for n elements
loopout
  move.l A0,A1 
  move.l D0,D1
loopin
  move.l (4,A1),D2
  move.l (A1)+,D3
  cmp.l  D2,D3
  ble    noswap
  movem.l D2/D3,(-4,A1)
noswap
  dbf    D1,loopin
  dbf    D0,loopout

What do you think of this routine?
Could you test/fix it so that it works?
Could you then make a real testcase out of it, including 1024 unsorted data elements and the outer code to measure the time?
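
One way to check the assembly once it runs is to compare its output against a reference. The Python sketch below mirrors the loop structure of the routine above (an outer counter starting at n-2 and an inner counter copied from it); it is only a model for validating results, not a timing harness. The helper `make_test_data` and its seed are hypothetical.

```python
import random


def bubble_sort(data):
    """Sort `data` in place with the same loop structure as the routine above."""
    outer = len(data) - 2            # D0: dbf counter, gives n-1 passes
    while outer >= 0:
        i = 0                        # A1 restarts at the beginning of the block
        inner = outer                # D1: this pass makes inner+1 comparisons
        while inner >= 0:
            if data[i] > data[i + 1]:                        # cmp.l / ble noswap
                data[i], data[i + 1] = data[i + 1], data[i]  # movem.l swap
            i += 1
            inner -= 1
        outer -= 1
    return data


def make_test_data(count=1024, seed=1):
    """Hypothetical helper: `count` pseudo-random signed 32-bit values."""
    rng = random.Random(seed)
    return [rng.randrange(-2**31, 2**31) for _ in range(count)]


data = make_test_data()
print(bubble_sort(list(data)) == sorted(data))
```

Unlike the assembly, this model simply skips the loops for fewer than two elements, a case the dbf counters would not handle.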

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
02 Apr 2011 08:43


Maybe a second good testcase could do something like this:
 

    move.w #1024-1,D7  ; loop counter (moveq cannot encode 1024)
    moveq  #0,D0
    moveq  #0,D1
    moveq  #0,D2
    lea    routine1,A0
  loop
    move.l D7,D6
    and.l  #$30,D6
    beq    test0
  return:
    jsr    (A0,D6)
    dbf    D7,loop
 
  test0
    addq.l #1,d0
    bra return
 
  routine1:          ; immediate arithmetic
    moveq  #$1,D3    ; 2
    eor.l  D3,D1      ; 2
    and.l  #$1234,D2  ; 6
    addi.w #4567,D1  ; 4
    rts              ; 2
 
 
  routine2:              ; memory arithmetic
    move.l  data1(pc),D3  ; 4
    add.l  D1,D3        ; 2
    or.l    data2(pc),D2  ; 4
    mulu.l  D3,D1        ; 2
    move.l  D1,data1      ; 6
    rts
 
  routine3:
    some shift instructions and a div.

 
  This is just an idea ... and needs some more refinement.
  The test case is very similar to the test that SYSINFO actually does.
  For good application performance it is important to be able to do arithmetic on memory, and it is VERY important to do function calls fast.
 
  With some tweaking this testcase could measure this well.
 
 
  What I like about SYSINFO is that it provides 1 simple number.
  The problem with SYSINFO is that this number does not say much, as the test is too narrow.
 
  Of our CPU core, the 68050/070, we have so far produced dozens of internal milestone versions.
  It would be nice to have a small test suite which covers tests that are really important, like the JSR test and the bubble sort,
  and to rate the releases with a performance number.
 
  I think this will also help users to see the difference between the 050 cores.
 
  If we produce a new core version 68050-A, then B, then C, then ... L ... O, this will not say much about the value of the core.
 
 
  But if you can number them with a virtual clock rate that, for example, a 68030 would need based on the testcases, then this would look more like this:

        INT  FPU
  68050- 500    0
  68050- 570    0
  68050- 600    0
  68050- 620 2400
  68050- 850 3600
  68070-1250 5000
  68070-1400 7500
 
This could make the progress much clearer and simpler to see.
What do you think?
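
How such a virtual clock rate could be derived is not spelled out in the post. One plausible reading, sketched below in Python, is to scale the clock of a reference 68030 by the ratio of its run time to the run time of the new core. The function name, the 50 MHz reference clock and all timings are invented for illustration, and the sketch assumes the reference CPU's run time scales inversely with its clock.

```python
def virtual_clock_mhz(reference_mhz, reference_seconds, measured_seconds):
    """Clock a reference CPU would need to match the measured run time,
    assuming its run time scales inversely with its clock rate."""
    return reference_mhz * reference_seconds / measured_seconds


# Hypothetical numbers: a 50 MHz 68030 versus one core milestone.
reference_mhz = 50.0
timings = {                 # test name -> (68030 seconds, new core seconds)
    "bubble sort": (10.0, 1.0),
    "jsr test": (8.0, 1.0),
}
ratings = {name: virtual_clock_mhz(reference_mhz, ref, new)
           for name, (ref, new) in timings.items()}
print(ratings)
```

With these made-up timings the core would rate as a 500 MHz 68030 on the bubble sort and a 400 MHz 68030 on the JSR test; how to fold several tests into the single INT or FPU number is left open here.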

Marcel Verdaasdonk
Netherlands

Posts 3976
02 Apr 2011 11:50


Gunnar, please refrain from using virtual clock speeds as a performance measurement; it can confuse people.
It would also draw an inconsistent line in performance tests.

Please use MIPS; this was at the time a well-established synthetic measurement.
And even though it's known to be a flawed method, it is easier to compare CPUs based on MIPS than on raw clock speed, since some instructions are multi-cycle.

AFAIK Moto supplied MIPS figures for the whole 68K line, so let's stay consistent in measuring methods.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
02 Apr 2011 11:56


Marcel Verdaasdonk wrote:

  Gunnar, please refrain from using virtual clock speeds as a performance measurement; it can confuse people.
  It would also draw an inconsistent line in performance tests.
 
  Please use MIPS; this was at the time a well-established synthetic measurement.
  And even though it's known to be a flawed method, it is easier to compare CPUs based on MIPS than on raw clock speed, since some instructions are multi-cycle.
 
  AFAIK Moto supplied MIPS figures for the whole 68K line, so let's stay consistent in measuring methods.
 

The idea behind MIPS is the same as our idea.
But using MIPS has some flaws.

1st problem: MIPS is meaningless.
The testcase is too limited to cover real-world needs.

Also, the individual MIPS implementation has too much effect on the result. Thus your compiler can make the MIPS code execute 2 or 3 times faster than another compiler.
This makes using MIPS to compare different systems pointless.
The Dhrystone MIPS code in SysInfo, for example, is very slow and scores very low results. Other Dhrystone implementations on other systems score much higher numbers.

We should not try to compare ourselves with a foreign system using an artificial number.

What is really important for a CPU is how well it can execute control flow; this means conditional arithmetic and memory operations mixed with subroutine calls.
 
This is the key point.
MIPS is really not important for the real world or for running AMIGA apps.
 
Showing MIPS numbers is like advertising cars with their weight.
It does not tell you what you want to know. :-D

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
02 Apr 2011 20:39


Here is a new version of the optimized Mandelbrot (zoom). It's not working yet, but should have all the logic needed.
 
  I will create an exe file when I have debugged it.
 
  Gunnar,
 
  Can you verify my latency and cycle calculations?
 
 
 

 

  ;Written by SP (2-april-2011). Optimized for the N050 CPU 
  ;Renders 4 pixels in parallel to hide the 8 cycle latency.
  ;The pixels are not plotted linearly. (When a pixel is finished
  ;the pipe will move to the next pixel in the queue.) The calculation time for each
  ;pixel is not constant.
  ;The method uses (d0-d7,a0-a6,fp0-fp7)
  ; (color,plotaddress,x,y)
  ;pipeline1: d2,a2,fp0,fp1
  ;pipeline2: d3,a3,fp2,fp3
  ;pipeline3: d4,a4,fp4,fp5
  ;pipeline4: d5,a5,fp6,fp7
 
  ;d6 Number of ylines
  ;d7 ystartpos
  SP_MANDELBROT_N050:
    lea .var,a1
    move.l  txtureptr+8,a0
    move.l d7,d0
    muls.w bredde+2,d0
    add.l d0,a0
   
    fmove.x .xmin(pc),fp0
    fmove.x .xmax(pc),fp1
 
    fsub.x fp0,fp1
    fmove.l bredde,fp7
    fdiv.x fp7,fp1
 
    fmove.x .ymin(pc),fp2
    fmove.x .ymax(pc),fp3
 
    fsub.x fp2,fp3
    fmove.l hoyde,fp7
    fdiv.x fp7,fp3
 
    fmove.x fp2,fp7 ;cy
    fadd.x .ypos(pc),fp7
 
    fmove.l d7,fp4
    fmul.x fp3,fp4
    fadd.x fp4,fp7  ;start at correct raster
 
    fmove.x fp3,fp2
 
    fmove.x fp1,.deltax-.var(a1)
    fmove.x fp2,.deltay-.var(a1)
 
    fmove.x .xmin(pc),fp6 ;cx
    fadd.x .xpos(pc),fp6
    fmove.x fp6,.cx-.var(a1)
 
    fmove.x fp7,.cy-.var(a1)
  .yloop
    move.l bredde,d7
    subq.l #1,d7
    moveq.l #4,d1
    moveq.l #0,d2
    moveq.l #0,d3
    moveq.l #0,d4
    moveq.l #0,d5
    move.l a0,a2
    lea.l 1(a0),a3
    lea.l 2(a0),a4
    lea.l 3(a0),a5
 
    fmove.x #0,fp0
    fmove.x fp0,.deltaxsum-.var(a1)
 
    fmove.x .deltax-.var(a1),fp1
    fmove.x .deltay-.var(a1),fp2
 
    fmove.x fp0,.sp0-.var(a1)
    fmove.x fp0,.sp1-.var(a1)
    fmove.x fp1,.sp2-.var(a1)
    fmove.x fp2,.sp3-.var(a1)
    fmove.x fp1,fp4
    fmove.x fp2,fp5
    fadd.x fp1,fp4
    fadd.x fp2,fp5
    fmove.x fp4,.sp4-.var(a1)
    fmove.x fp5,.sp5-.var(a1)
    fadd.x fp1,fp4
    fadd.x fp2,fp5
    fmove.x fp4,.sp6-.var(a1)
    fmove.x fp5,.sp7-.var(a1)
 
  .cxloop
    ; (x1) x*x,  y*y;
      fmove.x .sp0(pc),fp0  ;fused
      fmul.x  fp0,fp0  ;1
        fmove.x .sp1(pc),fp1  ;fused
        fmul.x  fp1,fp1  ;2
        fmove.x .sp2(pc),fp2  ;fused
        fmul.x  fp2,fp2  ;3
        fmove.x .sp3(pc),fp3  ;fused
        fmul.x  fp3,fp3  ;4
        fmove.x .sp4(pc),fp4  ;fused
        fmul.x  fp4,fp4  ;4
        fmove.x .sp5(pc),fp5  ;fused
        fmul.x  fp5,fp5  ;5
        fmove.x .sp5(pc),fp5  ;fused
        fmul.x  fp5,fp5  ;6
        fmove.x .sp6(pc),fp6  ;fused
        fmul.x  fp7,fp7  ;7
        fmove.x .sp7(pc),fp7  ;fused
        fmul.x  fp7,fp7  ;8
    ;(x1) (x*x - y*y)
        fsub.x fp1,fp0  ;9
        fsub.x fp3,fp2  ;10
        fsub.x fp5,fp4  ;11
        fsub.x fp7,fp5  ;12
     
    ; (y1) y1=x*y;
        fmove.x .sp0(pc),fp1 ;fused
        fmul.x .sp1(pc),fp1 ;13
        fmove.x .sp2(pc),fp3 ;fused
        fmul.x .sp3(pc),fp3 ;14
        fmove.x .sp4(pc),fp5 ;fused
        fmul.x .sp5(pc),fp5 ;15
        fmove.x .sp6(pc),fp7 ;fused
        fmul.x .sp7(pc),fp7 ;16
     
    ;(x1)x1=x1+cx
        fadd.x .cx(pc),fp0 ;18 (1 cycle Latency)
        fadd.x .cx(pc),fp2 ;19
        fadd.x .cx(pc),fp4 ;20
        fadd.x .cx(pc),fp6 ;21
     
    ; y1=2*x*y
        fadd.x fp1,fp1  ;22
        fadd.x fp3,fp3  ;23
        fadd.x fp4,fp4  ;24
        fadd.x fp5,fp5  ;25
     
    ; x = x1
        fmove.x fp0,.sp0-.var(a1)  ;27(1 cycle latency)
        fmove.x fp2,.sp2-.var(a1)  ;28
        fmove.x fp4,.sp4-.var(a1)  ;29
        fmove.x fp6,.sp6-.var(a1)  ;30
     
    ;(y1)y1=y1+cy
        fadd.x .cy(pc),fp1  ;31
        fadd.x .cy(pc),fp3  ;32
        fadd.x .cy(pc),fp5  ;33
        fadd.x .cy(pc),fp7  ;34
     
    ;(test) x*x
        fmul.x fp0,fp0  ;36(1 cycle latency)
        fmul.x fp2,fp2  ;37
        fmul.x fp4,fp4  ;38
        fmul.x fp6,fp6  ;39
     
    ;y=y1
        fmove.x fp1,.sp1-.var(a1)  ;40
        fmove.x fp3,.sp3-.var(a1)  ;41
        fmove.x fp5,.sp5-.var(a1)  ;42
        fmove.x fp7,.sp7-.var(a1)  ;43
     
    ;(test) y*y
        fmul.x fp1,fp1  ;44
        fmul.x fp3,fp3  ;45
        fmul.x fp5,fp5  ;46
        fmul.x fp7,fp7  ;47
     
    ;(test) y*y+x*x
     
        fadd.x fp1,fp0  ;49(4cycle latency)
        fadd.x fp3,fp2  ;50
        fadd.x fp5,fp4  ;51
        fadd.x fp7,fp6  ;52
    cmp.w #255,d2  ;52
    beq.b .plot1
    fcmp.x #256,fp0  ;53
    fbngt .spixel1 
  .plot1 
    fmove.x .deltaxsum(pc),fp0 ;fused
    fadd.x .deltax(pc),fp0  ;1
    move.b d2,(a2)  ;2
    moveq.l #0,d2  ;3
    addq.l #1,d1  ;4
    lea (a0,d1.w),a2  ;5
    fmove.x .deltaysum-.var(a1),fp1 ;5
    fmove.x fp1,.sp1-.var(a1)  ;6
    fmove.x fp0,.deltaxsum-.var(a1) ;9 (3 cycle latency)
    fmove.x fp0,.sp0-.var(a1)  ;10
 
  .spixel1
    cmp.w #255,d3  ;54
    beq.b .plot2
    fcmp.x #256,fp2 ;54
    fbngt .spixel2
  .plot2 
    fmove.x .deltaxsum(pc),fp2 ;fused
    fadd.x .deltax(pc),fp2  ;1
    move.b d3,(a3)  ;2
    moveq.l #0,d3  ;3
    addq.l #1,d1  ;4
    lea (a0,d1.w),a3  ;5
    fmove.x .deltaysum(pc),fp3 ;5
    fmove.x fp3,.sp3-.var(a1)  ;6
    fmove.x fp2,.deltaxsum-.var(a1) ;9 (3 cycle latency) 
    fmove.x fp2,.sp2-.var(a1)  ;10
  .spixel2
    cmp.w #255,d4    ;55
    beq.b .plot3
    fcmp.x #256,fp4  ;55
    fbngt .spixel3
  .plot3 
    fmove.x .deltaxsum(pc),fp4 ;fused
    fadd.x .deltax(pc),fp4  ;1
    move.b d4,(a4)  ;2
    moveq.l #0,d4  ;3
    addq.l #1,d1  ;4
    lea (a0,d1.w),a4  ;5
    fmove.x .deltaysum(pc),fp5 ;5
    fmove.x fp5,.sp5-.var(a1)  ;6
    fmove.x fp4,.deltaxsum-.var(a1) ;9 (3 cycle latency)
    fmove.x fp4,.sp4-.var(a1)  ;10
  .spixel3
    cmp.w #255,d5  ;56
    beq.b .plot4
    fcmp.x #256,fp6 ;56
    fbngt .spixel4
  .plot4 
    fmove.x .deltaxsum(pc),fp6 ;fused
    fadd.x .deltax(pc),fp6  ;1
    move.b d5,(a5)  ;2
    moveq.l #0,d5  ;3
    addq.l #1,d1  ;4
    lea (a0,d1.w),a5  ;5
    fmove.x .deltaysum(pc),fp7 ;5
    fmove.x fp7,.sp7-.var(a1)  ;6
    fmove.x fp6,.deltaxsum-.var(a1) ;9 (3 cycle latency)
    fmove.x fp6,.sp6-.var(a1)  ;10
  .spixel4
 
    addq.w #1,d2  ;57
    addq.w #1,d3  ;58
    addq.w #1,d4  ;59
    addq.w #1,d5  ;60
    cmp.w #256,d1  ;61
    blt.w .cxloop
 
    add.w #256,a0
    fmove.x .deltaysum(pc),fp7
    fadd.x .deltay(pc),fp7
    fmove.x fp7,.deltaysum-.var(a1)
    dbf d6,.yloop
 
    rts
 
  .var:
     
  .sp0 dc.x 0
  .sp1 dc.x 0
  .sp2 dc.x 0
  .sp3 dc.x 0
  .sp4 dc.x 0
  .sp5 dc.x 0
  .sp6 dc.x 0
  .sp7 dc.x 0
     
  .cx  dc.x 0
  .cy  dc.x 0
  .deltax: dc.x 0
  .deltay: dc.x 0
  .deltaxsum: dc.x 0
  .deltaysum: dc.x 0
 
  .xpos: 
    dc.x -1.253100001
 
  .ypos: 
    dc.x -0.344125
   
  .xmin: dc.x -0.5
  .xmax: dc.x 0.5
  .ymin: dc.x -0.5
  .ymax: dc.x 0.5
 

 

 

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
02 Apr 2011 21:01


Gunnar von Boehn wrote:

  It would be nice to create maybe 3 or 4 real-life testcases which test workloads that are relevant to real work.
 
  The Mandelbrot would be a good one.
  Another good one for the FPU would be a matrix mul, as this is the core of 3D games.
 

 
  Good idea. I found an old 3x3 matrix multiplier I used on the MC68030. It uses words (fixed point 7:9).

Can you use it?
..

could this be fused?
  asr.l #8,d6
  asr.l #6,d6
to
  asr.l #14,d6 in 1 cycle?
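
On the 68000 family the immediate form of the shift instructions encodes a count of 1 to 8, which is presumably why the routine uses two shifts. Whether the core can fuse the pair is a hardware question, but the arithmetic result is the same either way. The Python sketch below checks that on a few signed 32-bit values; the function names are made up for this illustration.

```python
def asr_l(value, count):
    """Arithmetic shift right of a signed 32-bit value, like asr.l."""
    assert -2**31 <= value < 2**31
    return value >> count    # Python's >> keeps the sign of negative ints


def two_step(value):
    """asr.l #8 followed by asr.l #6, as in the routine."""
    return asr_l(asr_l(value, 8), 6)


def one_step(value):
    """A single arithmetic shift by 14."""
    return asr_l(value, 14)


samples = [0, 1, -1, 16383, 16384, -16384, -16385, 2**31 - 1, -2**31]
print(all(two_step(v) == one_step(v) for v in samples))
```

Only the register result is compared here; condition codes are not modelled.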

 
      STRUCTURE matrix_3x3,0
  WORD M11
  WORD M12
  WORD M13
  WORD M21
  WORD M22
  WORD M23
  WORD M31
  WORD M32
  WORD M33
 
  MULSMATRIX_3x3:
 
  ; lea objmatrix1,a0
  ; lea objmatrix2,a1
  ; lea destmatrix,a2
 
  move.l #3-1,d7
  .loop
 
  move.w (a0)+,d0
  move.w (a0)+,d1
  move.w (a0)+,d2
 
  move.w M11(a1),d4
  muls.w d0,d4
  move.w M21(a1),d5
  muls.w d1,d5
  move.w M31(a1),d6
  muls.w d2,d6
 
  add.l d4,d6
  add.l d5,d6
 
  asr.l #8,d6
  asr.l #6,d6
 
  move.w d6,(a2)+
 
  move.w M12(a1),d4
  muls.w d0,d4
  move.w M22(a1),d5
  muls.w d1,d5
  move.w M32(a1),d6
  muls.w d2,d6
 
  add.l d4,d6
  add.l d5,d6
   
  asr.l #8,d6
  asr.l #6,d6
 
  move.w d6,(a2)+
 
  move.w M13(a1),d4
  muls.w d0,d4
  move.w M23(a1),d5
  muls.w d1,d5
  move.w M33(a1),d6
  muls.w d2,d6
 
  add.l d4,d6
  add.l d5,d6
 
  asr.l #8,d6
  asr.l #6,d6
  move.w d6,(a2)+
 
  dbf d7,.loop
 
  rts
 

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
02 Apr 2011 21:02



   

 
    .cxloop
      ; (x1) x*x,  y*y;
One question: does this code now compute X*X and Y*Y?
Is Y*Y different for each pixel or constant for the row?

        fmove.x .sp0(pc),fp0  ;fused
        fmul.x  fp0,fp0  ;1
        fmove.x .sp1(pc),fp1  ;fused
        fmul.x  fp1,fp1  ;2
        fmove.x .sp2(pc),fp2  ;fused
        fmul.x  fp2,fp2  ;3
        fmove.x .sp3(pc),fp3  ;fused
        fmul.x  fp3,fp3  ;4
        fmove.x .sp4(pc),fp4  ;fused
        fmul.x  fp4,fp4  ;4
        fmove.x .sp5(pc),fp5  ;fused
        fmul.x  fp5,fp5  ;5
        fmove.x .sp5(pc),fp5  ;fused
        fmul.x  fp5,fp5  ;6
        fmove.x .sp6(pc),fp6  ;fused
        fmul.x  fp7,fp7  ;7
should this be mul fp6,fp6?
        fmove.x .sp7(pc),fp7  ;fused
        fmul.x  fp7,fp7  ;8

      ;(x1) (x*x - y*y)
        fsub.x fp1,fp0  ;9
        fsub.x fp3,fp2  ;10
        fsub.x fp5,fp4  ;11
Fp5 is not ready in this cycle.
Fp5 will be ready in cycle 13.

        fsub.x fp7,fp5  ;12
Fp7 is not ready in this cycle.
Fp7 will be ready in cycle 15.
Do we want to sub from FP5 or FP6?

     
      ; (y1) y1=x*y;
        fmove.x .sp0(pc),fp1 ;fused
        fmul.x .sp1(pc),fp1 ;13
        fmove.x .sp2(pc),fp3 ;fused
        fmul.x .sp3(pc),fp3 ;14
        fmove.x .sp4(pc),fp5 ;fused
        fmul.x .sp5(pc),fp5 ;15
        fmove.x .sp6(pc),fp7 ;fused
        fmul.x .sp7(pc),fp7 ;16
     
      ;(x1)x1=x1+cx
        fadd.x .cx(pc),fp0 ;18 (1 cycle Latency)
I think FP0 should be ready already in cycle 16. So no problem here.
        fadd.x .cx(pc),fp2 ;19
        fadd.x .cx(pc),fp4 ;20
        fadd.x .cx(pc),fp6 ;21
     
      ; y1=2*x*y
        fadd.x fp1,fp1  ;22
        fadd.x fp3,fp3  ;23
        fadd.x fp4,fp4  ;24
        fadd.x fp5,fp5  ;25
Do we want here to mul FP4 and FP5 or 5 and 7?
I'm getting confused with your registers :-/

     

Cheers
Gunnar

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
02 Apr 2011 21:03


S P wrote:

Good Idea. I found an old 3x3 matrix multiplier i used on the Mc68030. It uses words. Can u use it?

I was thinking of a MATRIX MUL suitable for today's normal games.
This means a MATRIX MUL operating on SINGLE or DOUBLE FLOAT.


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
02 Apr 2011 21:23


Gunnar von Boehn wrote:

   
       

       
        .cxloop
          ; (x1) x*x,  y*y;
      One question: does this code now compute X*X and Y*Y?
      Is Y*Y different for each pixel or constant for the row?

   

   

    fp0=Pipe1 x
    fp1=Pipe1 y
    fp2=pipe2 x
    fp3=pipe2 y
    fp4=pipe3 x
    fp5=pipe3 y
    fp6=pipe4 x
    fp7=pipe4 y
   
    Y*Y and X*X are different for each pixel but they are not dependent on each other, so they can be run in parallel.
   
Gunnar von Boehn wrote:

    should this be mul fp6,fp6?
   

    One bug found! One beer for you :D
   
   
Gunnar von Boehn wrote:
 
    Do we want here to mul FP4 and FP5 or 5 and 7?
    I'm getting confused with your registers :-/
   

   
    One more bug found! two beers for you :D
   
    fp1,fp3,fp5,fp7 contain code to calculate y
    fp0,fp2,fp4,fp6 contain code to calculate x
   
    corrected to:
   

          ; y1=2*x*y
              fadd.x fp1,fp1  ;22
              fadd.x fp3,fp3  ;23
              fadd.x fp5,fp5  ;24
              fadd.x fp7,fp7  ;25
   

   

My code is trying to solve the code below.
But with 4 x's and 4 y's and 4 while statements.
Each loopcounter is different. Some pixels use 3 iterations  while others use 256.

;  while(count <= 255 && x*x + y*y < 256)
;      {
;        x1 = x*x - y*y + cx;
;        y1 = 2*x*y + cy;
;        x = x1;
;        y = y1;
;        count++;
;      }

    cheers.
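
The four-pipe scheme can be checked against the plain loop in a high-level model. The Python sketch below is not a translation of the assembly; it only shows the idea of iterating several pixels in lock-step and handing a finished lane the next pixel from the queue. It assumes the start values x = cx and y = cy from the earlier pseudocode, and the names `iterate` and `render_row` are made up here.

```python
MAX_COUNT = 255     # iteration limit from the while condition above
ESCAPE = 256.0      # escape threshold from the while condition above


def iterate(cx, cy):
    """Plain per-pixel loop, one pixel at a time."""
    x, y, count = cx, cy, 0
    while count <= MAX_COUNT and x * x + y * y < ESCAPE:
        x, y = x * x - y * y + cx, 2 * x * y + cy
        count += 1
    return count


def render_row(cxs, cy, lanes=4):
    """Iterate up to `lanes` pixels in lock-step; a lane that finishes
    stores its count and the next pixel in the queue takes its place."""
    result = [0] * len(cxs)
    next_pixel = 0
    active = []                         # entries: [pixel index, x, y, count]
    while next_pixel < len(cxs) or active:
        while len(active) < lanes and next_pixel < len(cxs):
            active.append([next_pixel, cxs[next_pixel], cy, 0])
            next_pixel += 1
        still_running = []
        for lane in active:
            index, x, y, count = lane
            if count <= MAX_COUNT and x * x + y * y < ESCAPE:
                lane[1] = x * x - y * y + cxs[index]
                lane[2] = 2 * x * y + cy
                lane[3] = count + 1
                still_running.append(lane)
            else:
                result[index] = count   # "plot" the finished pixel
        active = still_running
    return result


cxs = [-2.0 + 3.0 * i / 63 for i in range(64)]
print(render_row(cxs, 0.1) == [iterate(cx, 0.1) for cx in cxs])
```

Because each lane performs the same operations in the same order as the plain loop, both versions should produce identical counts even though pixels finish out of order.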

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
02 Apr 2011 23:10


Here is the 030 3x3 multiplier converted to FPU code. Unoptimized (double).
 
  It uses 88*3=264 cycles to perform a 3x3 multiply.
  (27 multiplications and 18 adds.)
 
 
 

 

  sizeofdouble=8
 
  M11=0
  M12=1*sizeofdouble
  M13=2*sizeofdouble
  M21=3*sizeofdouble
  M22=4*sizeofdouble
  M23=5*sizeofdouble
  M31=6*sizeofdouble
  M32=7*sizeofdouble
  M33=8*sizeofdouble
 
  MULSMATRIX_3x3_fpu:
 
  ; lea objmatrix1,a0
  ; lea objmatrix2,a1
  ; lea destmatrix,a2
 
    move.l #3-1,d7
  .loop
 
    fmove.d (a0)+,fp0 ;1
    fmove.d (a0)+,fp1 ;2
    fmove.d (a0)+,fp2 ;3
 
    fmove.d M11(a1),fp4 ;fused
    fmul.x fp0,fp4  ;9 (6 cycle latency)
    fmove.d M21(a1),fp5 ;fused
    fmul.x fp1,fp5  ;10
    fmove.d M31(a1),fp6 ;fused
    fmul.x fp2,fp6  ;11
 
    fadd.x fp4,fp6  ;19 (8 cycle latency)
    fadd.x fp5,fp6  ;27 (8 cycle latency)
 
    fmove.d fp6,(a2)+ ;35 (8 cycle latency)
 
    fmove.d M12(a1),fp4 ;fused
    fmul.x fp0,fp4  ;36
    fmove.d M22(a1),fp5 ;fused
    fmul.x fp1,fp5  ;37
    fmove.d M32(a1),fp6 ;fused
    fmul.x fp2,fp6  ;38
 
    fadd.x fp4,fp6  ;46 (8 cycle latency)
    fadd.x fp5,fp6  ;54 (8 cycle latency)
     
    fmove.d fp6,(a2)+ ;60 (8 cycle latency)
 
    fmove.d M13(a1),fp4 ;fused
    fmul.x fp0,fp4  ;61
    fmove.d M23(a1),fp5 ;fused
    fmul.x fp1,fp5  ;62
    fmove.d M33(a1),fp6 ;fused
    fmul.x fp2,fp6  ;63
 
    fadd.x fp4,fp6  ;71 (8 cycle latency)
    fadd.x fp5,fp6  ;79 (8 cycle latency)
 
    fmove.d fp6,(a2)+ ;87 (8 cycle latency)
 
    dbf d7,.loop ;88
 
    rts
 

 

 

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
02 Apr 2011 23:26


Hi Sp,
 
Many thanks for the FPU Matrix version..
Looks nice but the latency hits us badly. :-/
 
What do you think about the option to keep the three (A0) values
in memory and not moving them into FPU registers?
This would remove 3 instructions and free 3 registers.
 
With more registers free we could do 8 out of the 9 MULs and then the ADDs. Maybe this helps to remove some latency?
 
What do you think?
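
To get a feel for how much such a reorder could help, the two orderings can be compared in a simple model. The Python sketch below assumes in-order issue of one instruction per cycle, a latency of 8 cycles for every FMUL and FADD, free memory operands, and a store that waits for its source register. These are assumptions for illustration, not measured 68050 behavior, and the register allocation in `grouped_row` is just one possible choice.

```python
def cycles(program, latency=8):
    """Cycles needed to issue `program`; an instruction waits until all
    source registers are ready (assumed model, not measured)."""
    ready = {}
    next_issue = 0
    for dest, sources in program:
        issue = max([next_issue] + [ready.get(reg, 0) for reg in sources])
        if dest is not None:            # stores write memory, not a register
            ready[dest] = issue + latency
        next_issue = issue + 1
    return next_issue


def mul(dest):
    """Fused fmove+fmul with a memory operand: no register sources."""
    return (dest, [])


def add(src, dest):
    return (dest, [src, dest])


def store(src):
    return (None, [src])


# One result row in the posted order: per element 3 muls, 2 adds, 1 store.
posted_row = [mul("fp4"), mul("fp5"), mul("fp6"),
              add("fp4", "fp6"), add("fp5", "fp6"), store("fp6")] * 3

# One result row with 8 of the 9 muls issued first. The 9th mul reuses fp1,
# which assumes the preceding add has already read it.
grouped_row = ([mul(f"fp{i}") for i in range(8)]
               + [add("fp1", "fp0"), mul("fp1"),
                  add("fp4", "fp3"), add("fp7", "fp6"),
                  add("fp2", "fp0"), add("fp5", "fp3"), add("fp1", "fp6"),
                  store("fp0"), store("fp3"), store("fp6")])

print(cycles(posted_row), cycles(grouped_row))
```

In this model one row takes 81 cycles in the posted order and 32 cycles in the grouped order. Both use 9 multiplies and 6 adds per row, that is 27 and 18 for the full 3x3 product.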
 

Marcel Verdaasdonk
Netherlands

Posts 3976
03 Apr 2011 02:24


MIPS are meaningless, I agree, but they have some grounding by comparison: you are better off stating "20 times more MIPS than a 68030", since that has more meaning to the people looking at your product.
Virtual clock rates are something you came up with; MIPS is something the industry used back in '92 to compare results.
This gives MIPS more value than your virtual clock cycles.
Besides, your virtual clock cycle comparison is much more prone to code rot than MIPS.

Be honest and use a known bad standard for performance measurement instead of something that has less meaning. A 500 MHz 68030 is what, exactly? What sort of code are we looking at? (I bet SP would get more out of a 68030 than your tests show you.)

How you run your test is very subjective. Sure, it would be nice to use optimized code, but that would create a VERY flawed testing method by default.

Hence MIPS: this gives working code that is not optimized but would still compile (perhaps even run).
This gives a level playing field and a real comparison to the older hardware,
reducing the apples-and-eggs comparison factor for the consumer.

We both know and agree it's wrong, but at the time the 68060 came out it was a de facto standard, and that is something you cannot swap out until your CPU has a given performance measured against it. After that you can compare the 68070 to the 68050 in whichever way you like. (Proof against proven: your test cannot be proven on a system not yet released, and you cannot simply state something like clocks and cycles as a performance measurement.)

Marcel Verdaasdonk
Netherlands

Posts 3976
03 Apr 2011 02:30


Sorry for this. What I am trying to say is that you cannot state something like this for a not yet released system, as you are doing.

Your testing method needs to have some solid ground, meaning that when you release the 68050 as a stable release we can start comparing; everything up until then is guesswork on either side.
That is why I like MIPS in this case for the 68050 testing.
We know for the whole family what they scored.

What you can do after the 68050 and before the 68070 is introduce a new testing method; because your system has results under both, you can draw an indirect line between your clock cycle test and the MIPS test.
