
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



Do you have ideas and feature wishes? Post them here and discuss your ideas.

N68k Enhancements Revisited
Gunnar von Boehn
Germany
(Moderator)
Posts 5775
24 Mar 2011 06:00


Thomas Richter wrote:

  Thus, I don't quite get why it makes sense to analyze this algorithm in particular, or why you bother about counting cycles?
 

 
  I see two things here.
 
  A) Analyzing some old code shows that the 68070 does run the 68K code quite well.
 
  We saw that a 100 MHz 68070 design is already able to run the original code as fast as a 750 MHz 68030.
  I think this is very good.
 
  And I think this test showed that our coders have no problem coding for the 68070 - getting the performance to 950 MHz 68030 speed with a few instruction swaps is a very good result IMHO.
 
  But I also agree with you that the best code examples will be routines which make sense to use for all of us today.
 
  Looking at common code or common datatypes which we all will need will make sense.
  Maybe good examples would be modern, CPU-heavy algorithms like:
  decompression, sorting, video and audio codecs?

Phil "meynaf" G.
France
(Natami Team)
Posts 393
24 Mar 2011 11:26


If it's about tasks like MP3, new instructions such as x86's shld/shrd will come back to your face ;-)
That is, provided you've made the choice to use 4:28 fixed-point (which is what the libmad library does), how do you make 4:28 * 4:28 muls ? This kind of operation is incredibly common there.


68k
  move.l x,d1
  muls.l y,d0:d1
  lsl.l #4,d0
  clr.w d1
  rol.l #4,d1
  or.w d1,d0  ; return d0
(a faster one can be done, but you will lose accuracy)

x86
  mov eax, x
  imul y
  shrd eax, edx, 28  ; return eax

Note that for me, the fact shrd is fast or not on x86 today is irrelevant.
What's relevant is the speed we can get out of the 680x0.
And this kind of instruction seems useful for that, provided it's faster than the above 68k code.

What do people here think about it ?
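
To make the comparison concrete, here is a minimal Python sketch of the arithmetic only (no timing). It assumes x and y are signed 32-bit values in 4:28 format; the function names are made up for this illustration. It mirrors both instruction sequences step by step and checks them against the plain definition "keep bits 59..28 of the 64-bit product".

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(v):
    """Interpret the low 32 bits of v as a signed two's-complement value."""
    v &= MASK32
    return v - (1 << 32) if v & 0x80000000 else v

def mul_4_28_reference(x, y):
    """4:28 * 4:28 -> 4:28: keep bits 59..28 of the signed 64-bit product."""
    return to_signed32((x * y) >> 28)

def mul_4_28_68k_style(x, y):
    """Mirror of the 68k sequence, one line per instruction."""
    product = (x * y) & MASK64               # muls.l y,d0:d1
    d0 = (product >> 32) & MASK32            #   d0 = high longword
    d1 = product & MASK32                    #   d1 = low longword
    d0 = (d0 << 4) & MASK32                  # lsl.l #4,d0
    d1 &= 0xFFFF0000                         # clr.w d1
    d1 = ((d1 << 4) | (d1 >> 28)) & MASK32   # rol.l #4,d1
    d0 |= d1 & 0xFFFF                        # or.w d1,d0
    return to_signed32(d0)

def mul_4_28_shrd_style(x, y):
    """Mirror of the x86 sequence."""
    product = (x * y) & MASK64               # imul y -> edx:eax
    edx = (product >> 32) & MASK32
    eax = product & MASK32
    eax = ((eax >> 28) | (edx << 4)) & MASK32  # shrd eax, edx, 28
    return to_signed32(eax)

# 1.5 * -2.25 in 4:28 (1.0 is 1 << 28)
x = 0x18000000
y = -0x24000000
result = mul_4_28_reference(x, y)
```

Under these assumptions all three functions return the same value, so the two sequences differ only in instruction count and timing, not in the result.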


Marcel Verdaasdonk
Netherlands

Posts 3976
24 Mar 2011 11:47


Phil, it nearly makes me want to split the ALU into two pipeline stages to make fusion possible even between arithmetic and logic instructions.
  But I know this is silly.

Well, two stages consisting of 1 full arithmetic stage and two half logic stages before and after the arithmetic stage, but like I said, a silly idea.

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
24 Mar 2011 12:18


Phil G. wrote:

  If it's about tasks like MP3, new instructions such as x86's shld/shrd will come back to your face ;-)
    That is, provided you've made the choice to use 4:28 fixed-point (which is what the libmad library does), how do you make 4:28 * 4:28 muls ? This kind of operation is incredibly common there.
   
   

    68k
    move.l x,d1
    muls.l y,d0:d1
    lsl.l #4,d0
    clr.w d1
    rol.l #4,d1
    or.w d1,d0  ; return d0
    (a faster one can be done, but you will lose accuracy)
   

 

I think this will take 5 clock cycles on the N070.

3 cycles are used in the muls and 2 to convert.
 


    move.l x,d1      ;1 p1 fused
    muls.l y,d0:d1   ;2 p1-p2
    lsl.l  #4,d0     ;4 p1
    clr.w  d1        ;4 p2
    rol.l  #4,d1     ;5 p1
    or.w   d1,d0     ;5 p2 forwarded
 

 

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
24 Mar 2011 12:40


Gunnar von Boehn wrote:

 
A) Analyzing some old code shows that the 68070 does run the 68K code quite well.
  We saw that a 100 MHz 68070 design is already able to run the original code as fast as a 750 MHz 68030.
I think this is very good.

I also think this is very good. The N050/N070 is very impressive when you consider that this is a hobby project you guys do in your spare time after work. I think it will be fast enough to decode MP3, play YouTube movies, render web pages, etc.

Thomas Richter...

You believe that assembly coders are obsolete, and that a good compiler will always beat handmade code. Why don't you provide me with some important source code you want me to implement fast?
I always beat the compiler when I code MC680x0 assembly.
Because I count cycles...

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
24 Mar 2011 13:21


Phil G. wrote:

  If it's about tasks like MP3, new instructions such as x86's shld/shrd will come back to your face ;-)
  That is, provided you've made the choice to use 4:28 fixed-point (which is what the libmad library does), how do you make 4:28 * 4:28 muls ? This kind of operation is incredibly common there.
 
 

  68k
    move.l x,d1
    muls.l y,d0:d1
    lsl.l #4,d0
    clr.w d1
    rol.l #4,d1
    or.w d1,d0  ; return d0
  (a faster one can be done, but you will lose accuracy)
 
  x86
    mov eax, x
    imul y
    shrd eax, edx, 28  ; return eax
 

 
  Note that for me, the fact shrd is fast or not on x86 today is irrelevant.
  What's relevant is the speed we can get out of the 680x0.
  And this kind of instruction seems useful for that, provided it's faster than the above 68k code.
 
  What do people here think about it ?
 

 
I have five questions:
 
A) Is the best concept to decode MP3 with Integer?
Or would it make sense to use a FPU for this?
How sensible would using the FPU be if your FMUL needs 1 cycle?
 
B) Even the most optimal SHRD without any Microcode, needs 2 Register write-ports. This means it would count like 2 instructions.
 
C) You used 4 68k-instructions to implement "SHRD".
The 4 instructions would take 2 cycles on the 070.
Is 2 Cycles a problem?
 
D) Using Bitfield you can do the same in 3 instructions.
So it's just 1 instruction more than the SHRD - is this a problem?
 
E) Assuming that we wanted to add SHRD to 68K.
What encoding would you propose?

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
24 Mar 2011 13:23


S P wrote:

Phil G. wrote:

    If it's about tasks like MP3, new instructions such as x86's shld/shrd will come back to your face ;-)
    That is, provided you've made the choice to use 4:28 fixed-point (which is what the libmad library does), how do you make 4:28 * 4:28 muls ? This kind of operation is incredibly common there.
   
   

    68k
      move.l x,d1
      muls.l y,d0:d1
      lsl.l #4,d0
      clr.w d1
      rol.l #4,d1
      or.w d1,d0  ; return d0
    (a faster one can be done, but you will lose accuracy)
   

   

  I think this will take 5 clock cycles on the N070.
 
  3 cycles are used in the muls and 2 to convert.
   

      move.l x,d1      ;1 p1 fused
      muls.l y,d0:d1   ;2 p1-p2
      lsl.l  #4,d0     ;4 p1
      clr.w  d1        ;4 p2
      rol.l  #4,d1     ;5 p1
      or.w   d1,d0     ;5 p2 forwarded  <-- Does not work
   

   

To prevent any misunderstanding.
The Forwarding has a precondition.
Forwarding between the pipes is only possible if only 1 Pipe does an ALU operation. This means you can forward between 1 MOVE and 1 OPERATION - but not between 2 OPERATIONS.

Does this make sense?

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
24 Mar 2011 13:58


.

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
24 Mar 2011 15:11


Phil G. wrote:

If it's about tasks like MP3, new instructions such as x86's shld/shrd will come back to your face ;-)
  That is, provided you've made the choice to use 4:28 fixed-point (which is what the libmad library does), how do you make 4:28 * 4:28 muls ?

This is a considerably better approach and a good question. The same type of operation would also be tremendously useful for image and video processing, i.e. a 32x32 -> 64 multiply followed by a right shift by a specifiable amount. Together with the multiply-add instruction, it is the core of FIR filters found in signal processing applications.

Basically, filters can be implemented by "lifting". An elementary "lifting step" is an addition, a multiplication, a shift and an addition:

(((a+b)*c + d) >> e) + f -> g

As many of these steps as possible should be fused in the CPU to make such common algorithms fast. For example the multiply-shift step, or the multiply-shift-add step if possible. The addition of "d" is just rounding (thus, d is typically (1 << e) >> 1).
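
The lifting step above can be sketched in a few lines of Python. This is only an illustration of the formula, assuming plain integers and an arithmetic (sign-preserving) right shift; the function name lifting_step is made up here.

```python
def lifting_step(a, b, c, e, f):
    """One integer lifting step: (((a + b) * c + d) >> e) + f.

    d is the rounding offset (1 << e) >> 1, i.e. half of the
    divisor 2**e, so the shift rounds to nearest instead of
    rounding down.
    """
    d = (1 << e) >> 1
    return (((a + b) * c + d) >> e) + f

# (3 + 4) * 1 = 7; 7 / 4 = 1.75, which rounds to 2 (plain truncation gives 1)
rounded = lifting_step(3, 4, 1, 2, 0)
truncated = ((3 + 4) * 1) >> 2
```

The multiply, the add of d and the shift are the part that a fused multiply-shift (or multiply-shift-add) instruction would cover.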

So long,
Thomas


Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
24 Mar 2011 15:19


Gunnar von Boehn wrote:

  I have five questions:
 
  A) Is the best concept to decode MP3 with Integer?

It is an implementation strategy. Whether that strategy works depends on the CPU. My J2K library offers both fixpoint and floating point implementations, and on Intel and AMD CPUs they are equally fast; float is probably a bit faster on AMD. On PPC and SPARC it's just the reverse, and the fixpoint implementation (aka integer implementation) is faster. Thus, if you can fuse floating point instructions, this would also help.

Gunnar von Boehn wrote:

  Or would it make sense to use a FPU for this?
  How sensible would using the FPU be if your FMUL needs 1 cycle?

The mul itself is not the whole story. It is a multiply-add that is the important step. The mul+shift is just a fixpoint implementation of a multiplication.

Gunnar von Boehn wrote:

  B) Even the most optimal SHRD without any Microcode, needs 2 Register write-ports. This means it would count like 2 instructions.
 
  C) You used 4 68k-instructions to implement "SHRD".
  The 4 instructions would take 2 cycles on the 070.
  Is 2 Cycles a problem?

Not even 100 cycles would be a "problem". (-; The point is: Make it as fast as you can, because this is a "heavy duty" operation that appears in all common signal processing algorithms. Other heavy duty operations are (surprisingly) round to integer - thus a separate rounding mode specification in fmove.l fpx,dx would be very helpful here as well. This is the core of any quantization algorithm, the heart of all lossy coding algorithms.

Gunnar von Boehn wrote:

  D) Using Bitfield you can do the same in 3 instructions.
  So it's just 1 instruction more than the SHRD - is this a problem?

Bitfields used to be slow, and they are not exactly what is needed here.

Gunnar von Boehn wrote:

  E) Assuming that we wanted to add SHRD to 68K.
  What encoding would you propose?

I would rather use a coprocessor encoding for all signal-processing type instructions, probably leaving the option of "vectorizing" these instructions later. That is, multiply+add, vector+add, vector round to int: That might go into a separate coprocessor slot.

Line A is somewhat used up because the old Mac OS has its OS calls in there, and ShapeShifter should probably continue to work.

Greetings,
Thomas


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
24 Mar 2011 18:23


Thomas Richter wrote:

 
Gunnar von Boehn wrote:

  D) Using Bitfield you can do the same in 3 instructions.
  So it's just 1 instruction more than the SHRD - is this a problem?
 

  Bitfields used to be slow, and they are not exactly what is needed here.

Can you explain what is needed here?

The example that was used does this:
Out[31 to 0] = In[59 to 28]

So what is it: is it a SHIFT that we need or a BITFIELD-EXT?
Or could either do the job?

In this example we needed a 64-bit source but only one 32-bit result was used.
This means the instruction needed only to update 1 32bit destination.
Whether the instruction has to update 1 or 2 destinations makes a big difference.

Which form do we actually need?
In this example only 1 destination was needed - is this always the case - or was this an exception?
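
As a minimal sketch of why both views describe the same result: the Python below treats the d0:d1 pair as one 64-bit value (an assumption made for illustration; the existing 68k bitfield instructions do not work on a register pair) and extracts the 32 result bits once as a shift and once as a bitfield with 68k-style numbering, where the offset counts from the most significant bit. The function names are hypothetical.

```python
def shift_extract(value64):
    """64-bit right shift by 28, then keep the low 32 bits."""
    return (value64 >> 28) & 0xFFFFFFFF

def bitfield_extract(value, offset, width, total_bits=64):
    """Unsigned bitfield extract; offset 0 is the most significant bit."""
    shift = total_bits - offset - width
    return (value >> shift) & ((1 << width) - 1)

product = 0x0123456789ABCDEF
by_shift = shift_extract(product)
by_field = bitfield_extract(product, offset=4, width=32)
```

Either way only one 32-bit destination is written; the 64-bit source is read but not updated.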



Rune Stensland
Norway
(MX-Board Owner)
Posts 871
25 Mar 2011 09:20


2 fixed point numbers multiplied
   
    4:28
    4:28
   
    then you get a new 64bit fixed point number
   
   

    8:56 (0xIIFFFFFF ffffffff)
   
    d0:
    0xIIFFFFFF
   
    D1:
    0xffffffff
   
    your code converts to this:
   
    d0:
    0xIFFFFFFf
   

   
    To convert back to 4:28 you remove the 4 left bits and add the 4 last bits.

In the x86 version the High longword is ignored.


  8:56 (0xIIFFFFFF ffffffff) >> 28
  (0x0000000I IFFFFFF f)


   

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
25 Mar 2011 10:27


This can be optimized down to 3 cycles like this:
     
     

      move.l  x,d1    ;1 P1 fused
      asr.l  #2,d1  ;1 P1
      move.l  y,d2    ;2 P2 fused
      asr.l  #2,d2  ;2 P2
      muls.l  d2,d0:d1 ;3 (P1/P2) (return d0)
     

   
    To improve accuracy you should use 2:30 fixed point instead
    (In your 8:56 --->> 4:28 conversion the 4 upper bits are removed after the muls. You should use them to improve accuracy instead)
   
     

      move.l  x,d1    ;1 P1 fused
      asr.l  #1,d1  ;1 P1
      move.l  y,d2    ;2 P2 fused
      asr.l  #1,d2  ;2 P2
      muls.l  d2,d0:d1 ;3 (P1/P2) (return d0)
     

 
  Both of these routines are more accurate than yours. For every muls you do, you lose 4 bits of information. In my routine I also lose 4 bits in the first example and 2 bits in the second example.

But my bits are the least significant ones. So accuracy will be improved a lot.
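
A minimal Python sketch for checking the scaling of the pre-shifted variant. It models only the arithmetic and assumes that x and y are 4:28 values and that d0 is expected to come back as 4:28; the helper name high32_signed is made up for this illustration.

```python
def high32_signed(a, b):
    """High longword of the signed 64-bit product (what muls.l leaves in d0)."""
    return (a * b) >> 32

ONE = 1 << 28                                  # 1.0 in 4:28

exact = (ONE * ONE) >> 28                      # 1.0 * 1.0, result in 4:28
pre_asr = high32_signed(ONE >> 2, ONE >> 2)    # both operands after asr.l #2
pre_lsl = high32_signed(ONE << 2, ONE << 2)    # both operands shifted left by 2
```

Under these assumptions the asr variant leaves 1 << 20 in d0, i.e. the result is scaled as 12:20 rather than 4:28, so the caller would have to work in that format. Shifting both operands left by 2 before the multiply gives 1 << 28 directly, but only as long as |x| and |y| stay below 2.0 so the shifted operands do not overflow 32 bits.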

Matt Hey
USA

Posts 734
25 Mar 2011 10:54


Phil G. wrote:

 

  N68k
  move.l x,d1
  muls.l y,d0:d1
  bfextu d1{0:4},d1
  lsl.l #4,d0
  or.l d1,d0  ; return d0

  x86
  ; Shuffle some registers since registers not general purpose!
  mov eax, x
  imul y
  shrd eax, edx, 28  ;shrd=slow, return eax
  ; Shuffle some registers since registers not general purpose!
 

  What do people here think about it ?

There I fixed it! Hmmm, I don't see any x86 advantage now ;).


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
25 Mar 2011 11:24


That is nice Matt. Your routine will run in 3-4 cycles on the N070.
  (If Gunnar and co. can do some magic and make the Bfext in 1 cycle and 1 pipe)
 
  My routine will run in 2 cycles:
 
 

  move.l  x,d1    ;1 P1 fused     
  asr.l  #2,d1  ;1 P1     
  move.l  y,d2    ;1 P2 fused     
  asr.l  #2,d2  ;1 P2     
  muls.l  d2,d0:d1 ;2 (P1/P2) (return d0)
 

 

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
25 Mar 2011 11:51


A Pentium 4 will use 22 clocks on this!!
  Intel P4 F4 
   

      mov  eax, x      ;1  (free)
      imul y            ;10
      shrd eax, edx, 28 ;12
   

    22 clocks
 
    Intel ATOM
   

      mov  eax, x      ;1 (free)
      imul y            ;18
      shrd eax, edx, 28 ;9
   

   
    27 clocks
   
    AMD K10
   

      mov  eax, x      ;1 (free)
      imul y            ;5
      shrd eax, edx, 28 ;3
   

    8 clocks
   

Matt Hey
USA

Posts 734
25 Mar 2011 12:03


S P wrote:

That is nice Matt. Your routine will run in 3-4 cycles on the N070.
  (If Gunnar and co. can do some magic and make the Bfext in 1 cycle and 1 pipe)
 
  My routine will run in 2 cycles:

My post was meant as a joke. The original shorted the 68k in regard to data movement. It is more efficient to do the shifts before the multiply if all that is needed is d0. I wonder if a mulhs.l d2,d0 (mul high signed) could be made faster than muls.l d2,d0:d1. It would keep from trashing a register even if it wasn't faster and it's not unusual that just the upper 32 bits are used. A combination multiply shift as ThoR suggested would also be nice but it would require a 64 bit shift and write 2 registers.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
25 Mar 2011 12:33


Matt Hey wrote:

I wonder if a mulhs.l d2,d0 (mul high signed) could be made faster. 

What happens again when you do this:
muls.l d2,d0:d0

When both targets are the same?

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
25 Mar 2011 13:13


Intel Core 2
       
       

        mov  eax, x      ;1  (free)     
        imul y            ;8       
        shrd eax, edx, 28 ;2
       

       
        10 cycles total
       
        Here is an updated x86 cycle diagram
       
        Instruction latencies and throughput for
        AMD and Intel x86 processors

  http://www.gmplib.org/~tege/x86-timing.pdf
 

Megol .

Posts 678
25 Mar 2011 13:49


Matt Hey wrote:

Phil G. wrote:

 

  N68k
    move.l x,d1
    muls.l y,d0:d1
    bfextu d1{0:4},d1
    lsl.l #4,d0
    or.l d1,d0  ; return d0
 
  x86
  ; Shuffle some registers since registers not general purpose!
    mov eax, x
    imul y
    shrd eax, edx, 28  ;shrd=slow, return eax
  ; Shuffle some registers since registers not general purpose!
 

 
  What do people here think about it ?
 

 
  There I fixed it! Hmmm, I don't see any x86 advantage now ;).

If you code assembly in that manner or have a compiler that generates that lousy code you should do something else for a living/hobby. But yes, this is one of the few cases where x86 registers aren't orthogonal. But if we use today's standard (first x86-64/AMD64 processor released 8 years ago):

; assumes esi/ebp (or any two register) are used as 32 bit registers earlier.
; 32 bit operations are zero-extend to 64 bit for free.
imul rsi, rbp ; any registers. 64*64->64 bit multiply
shr rsi, 28  ; any register

Which for my processor has a latency of 4 (3+1) and throughput of 1.
(I did see your smiley ;)
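
One detail worth checking with the 64-bit variant is how the 32-bit operands are extended. The Python sketch below models only the arithmetic of a 64x64->64 multiply followed by a right shift of 28, assuming the operands are signed 4:28 values and the result is the low 32 bits; all names are made up for this illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def mul_shift_64(x_reg, y_reg):
    """Low 64 bits of the product, shifted right by 28; return the low 32 bits."""
    product = (x_reg * y_reg) & MASK64
    return (product >> 28) & MASK32

def zero_extend(v):
    """32-bit pattern placed in a 64-bit register with the upper half cleared."""
    return v & MASK32

def sign_extend(v):
    """32-bit pattern placed in a 64-bit register with the sign bit copied up."""
    v &= MASK32
    return (v | 0xFFFFFFFF00000000) if v & 0x80000000 else v

minus_one = 0xF0000000   # -1.0 in 4:28
half = 0x08000000        #  0.5 in 4:28

signed_result = mul_shift_64(sign_extend(minus_one), sign_extend(half))
zeroed_result = mul_shift_64(zero_extend(minus_one), zero_extend(half))
```

With sign-extended operands the result is 0xF8000000 (-0.5 in 4:28); with zero-extended operands it is 0x78000000. For non-negative operands the two agree, so the zero-extension shortcut is only safe when the values are known to be non-negative.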
