
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Do you have ideas and feature wishes? Post them here and discuss your ideas.

N68k Enhancements Revisited
Cesare Di Mauro
Italy

Posts 528
19 Mar 2011 09:11


One of the great features of 68K is to have separate data and address registers. ;)

Matt Hey
USA

Posts 737
19 Mar 2011 14:39


Cesare Di Mauro wrote:

  What does it mean "direct mode"?

This is Gunnar's idea (based on Megol's idea) where the EA is calculated and used directly as a value, without being used as an address into memory. Read back in the thread a little bit if you missed it. It opens up some nice possibilities for parallel integer operations, but it is also difficult to use because of a potentially sizable change/use stall before it. We have looked at a syntax like this...

  add.l {1234,a0,d0*8},d1 ;add.l 1234+a0+d0*8,d1

The instruction above would do a shift and 3 adds in 1 cycle.
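
As a rough illustration of what such a direct mode would compute, here is a minimal C sketch. It only models the arithmetic of the example line (one shift plus three 32-bit adds); the function names `direct_ea` and `add_direct` are made up for this sketch and are not part of any proposal in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: disp + base + (index << scale_shift), with 32-bit wraparound.
   In direct mode this value is used as an operand, not as a memory address. */
uint32_t direct_ea(uint32_t disp, uint32_t base, uint32_t index, unsigned scale_shift)
{
    assert(scale_shift < 32);
    return disp + base + (index << scale_shift);
}

/* Models add.l {disp,a0,d0*scale},d1 : d1 = d1 + disp + a0 + d0*scale. */
uint32_t add_direct(uint32_t d1, uint32_t disp, uint32_t a0, uint32_t d0, unsigned scale_shift)
{
    return d1 + direct_ea(disp, a0, d0, scale_shift);
}
```

Calling `add_direct(d1, 1234, a0, d0, 3)` corresponds to the `*8` example; a shift count of 7 corresponds to a scale of 128.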

Cesare Di Mauro wrote:

  3) Allow "<<" and a shift size as well as '*' for index scale
    Example: move.l (1234+a0+d0<<7)
    Reason: <<7 might be easier than *128 for some people/applications

  I prefer multiplication, which is easier to read.

It depends on what I'm doing as to which is easiest. Sometimes I want to shift and sometimes I want to multiply ;). Experienced programmers would probably know the powers of 2 easily enough (2,4,8,16,32,64,128,256) but they could be difficult for a beginner.
It's nice to allow both '*' and "<<".

Gunnar von Boehn wrote:
 
  Regarding using DN as base:
  This is not good IMHO because of two reasons.
  We would then need another new DN Read-Port: This drives Chip-costs higher.
  We would also create a new hazard point.

I did NOT suggest updating the index register. I suggested moving the bit indicating if the index register is suppressed to the base register suppress bit location. This is possible because base register update does not make sense with the base register suppressed. It is necessary because there are no more free I/IS encodings with an index register if the 1 free slot is used for direct mode with index register. My suggestion updates the base register only.

Gunnar von Boehn wrote:

I'm a little bit afraid of making the EA too complex.
 
  If we make the calculation in the EA more complex,
  then the timing will get worse and we will reduce our maximum clockrate.

We don't want the EA calculation slowing down, as it's too commonly used. I think base register update is the most important of the 3 EA enhancements. Direct mode has synergies with a larger shift size, as it's more general purpose, often trying to use the EA for what it was not designed for. If it's free, fine. Otherwise, we need to take a harder look.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
19 Mar 2011 14:54


Matt Hey wrote:

We don't want the EA calculation slowing down, as it's too commonly used. I think base register update is the most important of the 3 EA enhancements. Direct mode has synergies with a larger shift size, as it's more general purpose, often trying to use the EA for what it was not designed for.

Regarding costs, I'll quickly try to estimate the costs of all the proposals I can remember:

* Not doing the "Cache Fetch" but passing the EA to the ALU has no implication for our speed. This means adding such a mode would not slow the EA unit down.

* A mode which updates the base register, like (An)+ does, has no influence on our timing and will not slow the EA unit down.

* Allowing data registers as base registers would open a Pandora's box.
It would require a new extra data read port and it would open new hazards which would need to be handled. Such a change is expensive and full of drawbacks.

* Adding more scaling modes is tempting but not free.
Right now we have a 4-way MUX for *1, *2, *4, *8.
We would need to enhance this to a 5-way MUX to add *16.
This has some cost, but I like the idea, as *16 looks sweet for SIMD or FPU.
Even if we could create encodings for more scale values, I would not want to add more than *16, because the more we beef this up, the more stumbling blocks we put in the time-critical path.

* Adding extra calculations, such as An-D0 instead of An+D0, or other operations like AND, OR, NEG, NOT, will increase the complexity and in the end limit our clock rate needlessly.

Were these all the ideas, or did I forget something?


Megol .

Posts 690
19 Mar 2011 15:04


Gunnar von Boehn wrote:

I'm a little bit afraid of making the EA too complex.
 
  If we make the calculation in the EA more complex,
  then the timing will get worse and we will reduce our maximum clockrate.
 
  Regarding using DN as base:
  This is not good IMHO because of two reasons.
  We would then need another new DN Read-Port: This drives Chip-costs higher.
  We would also create a new hazard point.
  Instructions working with the DN register in the two previous cycles would create a bubble. To correctly recognise these hazards we would need to add extra tracking logic.
  Also we can NOT update Dn Base registers as the EA does not have a write Port to the Dn Registers.
  These are a lot of drawbacks...
 
  Regarding the extra Scale modes.
  Adding an extra Bit for the scale mode looks tempting from the software point of view. But it will double the MUX area needed to do the SHIFT in the EA-Unit. This looks bad from a HW designer's point of view because it's costly and could reduce our clockrate.
  I'll have to measure this to see how much room we have here before we lose performance...

Using Dn as base isn't needed, but it would be nice if it could be done inexpensively or for free.
One alternative would be to only support the extended scaling for instructions that don't access memory (LEA and your proposal to pass the calculated EA to the execution stage).
The negation/inverse idea could be inexpensive depending on the adder implementation. It could be used for misc. computations and also some odd things like simulating a little-endian memory space.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
19 Mar 2011 15:18


Megol . wrote:

  The negation/inverse idea could be inexpensive depending on the adder implementation.
 

 
Adding another source option (or the Mux for it) could already greatly reduce your clockrate.
Therefore I'll stay sceptical until you show me the VHDL! ;-D

Megol .

Posts 690
19 Mar 2011 18:47


Gunnar von Boehn wrote:

Megol . wrote:

  The negation/inverse idea could be inexpensive depending on the adder implementation.
 

 
  Adding another source option (or the Mux for it) could already greatly reduce your clockrate.
  Therefore I'll stay sceptical until you show me the VHDL! ;-D

For Cyclone, Stratix II+, Arria, Cyclone V and at least Virtex 5+ and Spartan 6, an add/sub unit uses the same amount of logic as an adder.
The Cyclone III/IV appear to require another logic layer for the same functionality.
No extra mux is needed for an add/subtractor? If the address generator is routing limited it could slow it down, but shouldn't the carry chain be more critical?

EA=An+((da?Dm:Am)<<scale)+displacement+0
compared to:
EA=An+(((da?Dm:Am)<<scale)^-negate)+displacement+negate

The last term is simply the carry input, the ^-negate inverts the scaled index if set.
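
The identity behind the second formula can be checked in C. In two's complement, -x equals (~x) + 1, so XOR-ing the scaled index with an all-ones mask and feeding 1 into the carry input subtracts it. This is only a software sketch of the arithmetic, not of any hardware implementation; the name `ea_addsub` and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of EA = base + ((index << scale) ^ -negate) + disp + negate.
   negate = 0 adds the scaled index, negate = 1 subtracts it. */
uint32_t ea_addsub(uint32_t base, uint32_t index, unsigned scale,
                   uint32_t disp, unsigned negate)
{
    assert(negate <= 1 && scale < 32);
    uint32_t mask = (uint32_t)0 - negate;   /* 0x00000000 or 0xFFFFFFFF, i.e. "-negate" */
    return base + ((index << scale) ^ mask) + disp + negate;  /* last term acts as carry-in */
}
```

With `negate` set, the result equals `base - (index << scale) + disp` for all inputs, modulo 2^32.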

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
22 Mar 2011 20:23


Ok we need some more examples of code to be optimized. This routine is used for a streaming music player. (realtime decompression)

"Norwegian Kindness" AMiGA demo by Spaceballs (2011)
Here is the music decoder, optimized for the MC68030 (written by me in 1998). The algorithm, a delta bit rate reduction variation, was written by me and The Nightraver. On a 50 MHz MC68030 a chip-mem write could pipeline 22 cycles. Most of the cycles in this loop are performed while the CPU is writing to memory (copy speed), but the writes are not longwords, so on the 060 the speedup can be significant. On the Natami the latency on writes to chip mem will not be as extreme as on the Amiga (5.6 MB/second). I think 1 minute of music compressed down to around 1 MB (lossy) at 28 kHz.


; NTSP DECODER V2.0
NTSP_DECODE
  move.l .First(PC),a0  ; a0 = offset
  move.l AUDIO_LOGICAL(PC),a1 ; a1 = destination
  move.w #NTSP_BUFFERSIZE/9-1,d5 ;Bytes to be converted...
  moveq #0,d0
.ytre
  moveq.l #0,d6
  move.b (a0)+,d6
  move.l d6,d0
  lsr.b #5,d6
  move.b d6,d4

  subq.b #1,d4
  bpl.b .pll
  moveq.l #0,d4
.pll
  lsl.b #3,d0
  asr.b #3,d0
  lsl.b d4,d0
  move.b d0,(a1)+
 
  move.b (a0)+,d0
  move.l d0,d1
  asr.b #4,d0
  and.l #$f,d1
  lsl.l d6,d0
  lsl.l #8,d0
  lsl.b #4,d1
  asr.b #4,d1
  lsl.l d6,d1
  move.b d1,d0

  move.b (a0)+,d1
  move.w d0,(a1)+

  move.l d1,d0
  asr.b #4,d1
  and.l #$f,d0
  lsl.l d6,d1
  lsl.l #8,d1
  lsl.b #4,d0
  asr.b #4,d0
  lsl.l d6,d0
  move.b d0,d1

  move.b (a0)+,d0
  move.w d1,(a1)+
  move.l d0,d1
  asr.b #4,d0
  and.l #$f,d1
  lsl.l d6,d0
  lsl.l #8,d0
  lsl.b #4,d1
  asr.b #4,d1
  lsl.l d6,d1
  move.b d1,d0

  move.b (a0)+,d1
  move.w d0,(a1)+

  move.l d1,d0
  asr.b #4,d1
  and.l #$f,d0
  lsl.l d6,d1
  lsl.l #8,d1
  lsl.b #4,d0
  asr.b #4,d0
  lsl.l d6,d0
  move.b d0,d1
  move.w d1,(a1)+

  dbf d5,.ytre

  add.l #NTSP_BUFFERSIZE*5/9,.First

  rts

.First  dc.l NTSP_SAMPLE+8+NTSP_FILSTART
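
For readers who do not follow 68k assembly, one pass of the loop above can be modelled in C roughly as follows. This sketch is derived only from reading the listing: each pass consumes 5 input bytes and produces 9 output bytes, the top 3 bits of the first byte are a shift count, its low 5 bits are a signed sample shifted by (count - 1, clamped at 0), and each of the next 4 bytes holds two signed 4-bit samples shifted by the count. The names `ntsp_decode_block`, `sign_extend` and `shl8` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of v (portable, no signed right shift). */
int sign_extend(unsigned v, unsigned bits)
{
    assert(bits >= 1 && bits <= 16);
    unsigned sign = 1u << (bits - 1);
    v &= (sign << 1) - 1u;
    return (int)(v ^ sign) - (int)sign;
}

/* Shift left and keep only the low 8 bits, like a byte-sized store. */
uint8_t shl8(int v, unsigned n)
{
    return (uint8_t)((unsigned)v << n);
}

/* Decode one 5-byte block into 9 sample bytes (assumed layout, see lead-in). */
void ntsp_decode_block(const uint8_t in[5], uint8_t out[9])
{
    unsigned shift = in[0] >> 5;                  /* lsr.b #5,d6          */
    unsigned first = shift ? shift - 1 : 0;       /* subq/bpl/moveq clamp */
    out[0] = shl8(sign_extend(in[0], 5), first);
    for (int k = 0; k < 4; k++) {
        uint8_t b = in[1 + k];
        out[1 + 2 * k] = shl8(sign_extend(b >> 4, 4), shift);  /* high nibble first */
        out[2 + 2 * k] = shl8(sign_extend(b, 4), shift);       /* then low nibble   */
    }
}
```

The output bytes are raw two's-complement sample values; a caller would reinterpret them as signed 8-bit audio.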




Gunnar von Boehn
Germany
(Moderator)
Posts 5775
22 Mar 2011 23:27


Hi SP,

Thanks for the code example.
Do you want to discuss its execution on the current 050 or how the 070 would execute it?

Marcel Verdaasdonk
Netherlands

Posts 3991
23 Mar 2011 00:52


This might be out of line, but judging from SP's post I would assume he would prefer the 070 over the 050.
And Gunnar, you pointing out the power of superscalar doesn't reduce that zeal.

So just give the 070 example, without making us feel any further down about the 050 not being superscalar. ;)

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
23 Mar 2011 07:14


OK, then let us see how the 070 design will execute this example code:

S P wrote:

 

  ; NTSP DECODER V2.0
  NTSP_DECODE
    move.l .First(PC),a0  ; a0 = offset
    move.l AUDIO_LOGICAL(PC),a1 ; a1 = destination
    move.w #NTSP_BUFFERSIZE/9-1,d5 ;Bytes to be converted...
    moveq #0,d0
  .ytre                        -  Cycle
    moveq.l #0,d6              -  1 p1 fused
    move.b (a0)+,d6            -  1 p1
    move.l d6,d0              -  1 p2 forwarded
    lsr.b #5,d6                -  2 p1
    move.b d6,d4              -  3 p1 (depending code)
 
    subq.b #1,d4              -  4 p1 (depending code)
    bpl.b .pll                -  5 p1 fused
    moveq.l #0,d4              -  5 p1
  .pll
    lsl.b #3,d0                -  5 p2
    asr.b #3,d0                -  6 p1
    lsl.b d4,d0                -  7 p1 (depending code)
    move.b d0,(a1)+            -  7 p2 forwarded
 
    move.b (a0)+,d0            -  8 p1
    move.l d0,d1              -  8 p2 forwarded
    asr.b #4,d0                -  9 p1
    and.l #$f,d1              -  9 p2
    lsl.l d6,d0                - 10 p1
    lsl.l #8,d0                - 11 p1 (depending code)
    lsl.b #4,d1                - 11 p2
    asr.b #4,d1                - 12 p1 
    lsl.l d6,d1                - 13 p1 (depending code)
    move.b d1,d0              - 13 p2 forwarded
 
    move.b (a0)+,d1            - 14 p1
    move.w d0,(a1)+            - 14 p2
 
    move.l d1,d0              - 15 p1
    asr.b #4,d1                - 15 p2
    and.l #$f,d0              - 16 p1
    lsl.l d6,d1                - 16 p2
    lsl.l #8,d1                - 17 p1
    lsl.b #4,d0                - 17 p2
    asr.b #4,d0                - 18 p1 
    lsl.l d6,d0                - 19 p1 (depending code)
    move.b d0,d1              - 19 p2 forwarded 
 
    move.b (a0)+,d0            - 20 p1
    move.w d1,(a1)+            - 20 p2
    move.l d0,d1              - 21 p1
    asr.b #4,d0                - 21 p2
    and.l #$f,d1              - 22 p1
    lsl.l d6,d0                - 22 p2
    lsl.l #8,d0                - 23 p1
    lsl.b #4,d1                - 23 p2
    asr.b #4,d1                - 24 p1
    lsl.l d6,d1                - 25 p1 (depending)
    move.b d1,d0              - 25 p2 forwarded
 
    move.b (a0)+,d1            - 26 p1
    move.w d0,(a1)+            - 26 p2
 
    move.l d1,d0              - 27 p1
    asr.b #4,d1                - 27 p2
    and.l #$f,d0              - 28 p1
    lsl.l d6,d1                - 28 p2
    lsl.l #8,d1                - 29 p1
    lsl.b #4,d0                - 29 p2
    asr.b #4,d0                - 30 p1
    lsl.l d6,d0                - 31 p1 (depending)
    move.b d0,d1              - 31 p2 forwarded
    move.w d1,(a1)+            - 32 p1
 
    dbf d5,.ytre              - 32 p2
 
    add.l #NTSP_BUFFERSIZE*5/9,.First
 
    rts
 
  .First  dc.l NTSP_SAMPLE+8+NTSP_FILSTART
 
 

 

The 68070 needs 32 clocks for the complete work loop.

Eight instructions are dependent and therefore limit superscalar execution slightly. Also, there are a few cases where instructions could be fused if they were rearranged.
Example:


  move.l d0,d1
    asr.b #4,d0
  and.l #$f,d1

1 and 3 could be fused together on the 050 and 070 if 3 swaps location with 2.

I think 32 clocks for the workloop is not bad.
How many did the 68030 need?

I assume that a very clever coder could re-arrange the code slightly to improve the fusing and double issue rate.
Maybe this would then go down to 25 clocks or so?

Who can get the code down to the lowest clock number?


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
23 Mar 2011 10:02


This was impressive..
 
  moveq.l #0,d6              -  1 p1 fused   
  move.b (a0)+,d6            -  1 p1   
  move.l d6,d0              -  1 p2 forwarded
 
  The forwarding is new. The Mc68060 will use 3 clocks for this.
  ..
 
  Cycle Timings Mc68030 (50mhz)
 
 

  .ytre
    moveq.l #0,d6 ;2
    move.b (a0)+,d6 ;4
    move.l d6,d0  ;2
    lsr.b #5,d6  ;4
    move.b d6,d4  ;4
 
    subq.b #1,d4  ;2
    bpl.b .pll  ;6
    moveq.l #0,d4 ;2
  .pll
    lsl.b #3,d0  ;4
    asr.b #3,d0  ;4
    lsl.b d4,d0  ;6
    move.b d0,(a1)+ ;4
   
    move.b (a0)+,d0 ;4
    move.l d0,d1  ;4
    asr.b #4,d0  ;4
    and.l #$f,d1  ;6
    lsl.l d6,d0  ;6
    lsl.l #8,d0  ;4
    lsl.b #4,d1  ;4
    asr.b #4,d1  ;4
    lsl.l d6,d1  ;6
    move.b d1,d0  ;2
 
    move.b (a0)+,d1 ;6
    move.w d0,(a1)+ ;6
 
    move.l d1,d0  ;2
    asr.b #4,d1  ;4
    and.l #$f,d0  ;6
    lsl.l d6,d1  ;6
    lsl.l #8,d1  ;4
    lsl.b #4,d0  ;4
    asr.b #4,d0  ;4
    lsl.l d6,d0  ;6
    move.b d0,d1  ;2
 
    move.b (a0)+,d0 ;4
    move.w d1,(a1)+ ;4
    move.l d0,d1  ;2
    asr.b #4,d0  ;4
    and.l #$f,d1  ;6
    lsl.l d6,d0  ;6
    lsl.l #8,d0  ;4
    lsl.b #4,d1  ;4
    asr.b #4,d1  ;4
    lsl.l d6,d1  ;6
    move.b d1,d0  ;4
 
    move.b (a0)+,d1 ;4
    move.w d0,(a1)+ ;4
 
    move.l d1,d0  ;2
    asr.b #4,d1  ;4
    and.l #$f,d0  ;6
    lsl.l d6,d1  ;4
    lsl.l #8,d1  ;4
    lsl.b #4,d0  ;4
    asr.b #4,d0  ;4
    lsl.l d6,d0  ;6
    move.b d0,d1  ;2
    move.w d1,(a1)+ ;4
 
    dbf d5,.ytre  ;6
 
  Total cycles:            240
 

 
  The 4 chip-mem writes will stall for an additional 104 cycles (26 cycles per write), but in this loop the writes will pipeline with the CPU. The fast-RAM reads will also stall for 4(?) cycles if cached and 12(?) if not cached.
 
 

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
23 Mar 2011 12:50


Wow, 240 clocks for the 68030 - that's a big difference.

This means the routine would run on the 68070 as fast as on a 68030 clocked at 750 MHz.

Not counting the performance difference from the bigger cache - which would make the 68070 go even faster than the 750 MHz 68030....

The code runs already quite nice on superscalar.
But would you be able to tweak it more to run even better?

Can you get the code to the speed of a 68030 at 1000 MHz?
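
The 750 MHz comparison is a ratio of cycle counts scaled by the clock of the newer core. A small C sketch of that arithmetic is given below. The helper name `equivalent_mhz` is made up, and the 100 MHz clock used in the usage note is an assumption inferred from the figures (240 / 32 = 7.5, and 750 / 7.5 = 100); the post itself does not state the 68070 clock.

```c
#include <assert.h>

/* Clock (in MHz) a reference CPU would need in order to match a newer CPU
   on one loop, given the cycles per loop pass on each. Hypothetical helper. */
double equivalent_mhz(unsigned ref_cycles, unsigned new_cycles, double new_mhz)
{
    assert(new_cycles > 0);
    return (double)ref_cycles / (double)new_cycles * new_mhz;
}
```

With 240 cycles on the 68030, 32 cycles on the 68070 and the assumed 100 MHz clock, `equivalent_mhz(240, 32, 100.0)` gives 750. Memory stalls and cache effects are ignored here, as in the comparison itself.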

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
23 Mar 2011 13:08


Remember that most Amigas were equipped with an MC68000, MC68020 or MC68030. The MC68060 from 1994 was too expensive for the common user, so people bought PCs instead.

I will write a new version when I come home from work. It will fetch 32bit and work with longwords instead. I will optimize it for the N070 with superscalar.

Angel of Paradise
Germany

Posts 61
23 Mar 2011 14:46


How about this?


NTSP_DECODE
    move.l .First(PC),a0  ; a0 = offset
    move.l AUDIO_LOGICAL(PC),a1 ; a1 = destination
    move.w #NTSP_BUFFERSIZE/9-1,d5 ;Bytes to be converted...
    moveq #0,d0
  .ytre                        -  Cycle
    moveq.l #0,d6              -  1 p1 fused
    move.b (a0)+,d6            -  1 p1
    move.l d6,d0              -  1 p2 forwarded

    lsr.b #5,d6                -  2 p1
    lsl.b #3,d0                -  2 p2

    move.l d6,d4              -  3 p1 fused
    subq.b #1,d4              -  3 p1
    asr.b #3,d0                -  3 p2

    bpl.b .pll                -  4 p1 fused
    moveq.l #0,d4              -  4 p1
  .pll

    lsl.b d4,d0                -  5 p1
    move.b d0,(a1)+            -  5 p2 forwarded

    move.b (a0)+,d0            -  6 p1
    move.l d0,d1              -  6 p2 forwarded
    and.l #$f,d1              -  6 p2 fused

    asr.b #4,d0                -  7 p1
    lsl.b #4,d1                -  7 p2

    lsl.l d6,d0                -  8 p2
    asr.b #4,d1                -  8 p1

    lsl.l #8,d0                -  9 p1
    lsl.l d6,d1                -  9 p2

    move.b d1,d0              - 10 p1         
    move.b (a0)+,d1            - 10 p2

    move.w d0,(a1)+            - 11 p1
    move.l d1,d0              - 11 p2 fused
    and.l #$f,d0              - 11 p2
    asr.b #4,d1                - 12 p1
    lsl.b #4,d0                - 12 p2
   
    lsl.l d6,d1                - 13 p1
    asr.b #4,d0                - 13 p2
   
    lsl.l #8,d1                - 14 p1
    lsl.l d6,d0                - 14 p2

    move.b d0,d1              - 15 p1         
    move.b (a0)+,d0            - 15 p2

    move.w d1,(a1)+            - 16 p1
    move.l d0,d1              - 16 p2 fused
    and.l #$f,d1              - 16 p2

    asr.b #4,d0                - 17 p1
    lsl.b #4,d1                - 17 p2

    lsl.l d6,d0                - 18 p1
    asr.b #4,d1                - 18 p2

    lsl.l #8,d0                - 19 p1
    lsl.l d6,d1                - 19 p2

    move.b d1,d0              - 20 p1
    move.b (a0)+,d1            - 20 p2

    move.w d0,(a1)+            - 21 p1
    move.l d1,d0              - 21 p2 fused
    and.l #$f,d0              - 21 p2

    asr.b #4,d1                - 22 p1
    lsl.b #4,d0                - 22 p2

    lsl.l d6,d1                - 23 p1
    asr.b #4,d0                - 23 p2

    lsl.l #8,d1                - 24 p1
    lsl.l d6,d0                - 24 p2

    move.b d0,d1              - 25 p1

    move.w d1,(a1)+            - 26 p1
    dbf d5,.ytre              - 26 p2

This would be 57 instructions in 26 clocks.


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
23 Mar 2011 16:49


I got it down to 11 (10.5 cycles), but there is one catch.
   
    The byte order in the output buffer is scrambled: 0 1 3 5 7 2 4 6 8
    instead of 0 1 2 3 4 5 6 7 8. This scrambling can be done in the encoder, or you need some more code to unscramble.
   
    C2pMerge d0,d1,#8
    C2pMerge d0,d1,#16
 
    This will use 5.5 more cycles (increase the loop to write 18 bytes and run two parallel c2p merges for the last 16 bytes).
 
  This will give a 100% compatible output in 16 cycles for 9 bytes written.
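
To make the reordering concrete, here is a plain C sketch that puts each 9-byte block back into sample order. It is a simple table-driven copy, not the C2pMerge approach mentioned above, and it assumes the scrambled order is exactly 0 1 3 5 7 2 4 6 8 within every block. The names `ntsp_unscramble` and `NTSP_BLOCK` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NTSP_BLOCK 9  /* output bytes per decoded block */

/* Copy nblocks scrambled blocks from src to dst in natural sample order.
   Position i of a scrambled block holds sample number order[i]. */
void ntsp_unscramble(const uint8_t *src, uint8_t *dst, size_t nblocks)
{
    static const uint8_t order[NTSP_BLOCK] = {0, 1, 3, 5, 7, 2, 4, 6, 8};
    assert(src != dst);  /* not an in-place algorithm */
    for (size_t b = 0; b < nblocks; b++)
        for (size_t i = 0; i < NTSP_BLOCK; i++)
            dst[b * NTSP_BLOCK + order[i]] = src[b * NTSP_BLOCK + i];
}
```

Doing the same permutation once in the encoder, as suggested, would let the decoder skip this pass entirely.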
 
   
   

   

    ; NTSP DECODER V2.1 (N070)
    NTSP_DECODE
      move.l .first(PC),a0  ; a0 = offset
      move.l AUDIO_LOGICAL(PC),a1 ; a1 = destination
      move.w #NTSP_BUFFERSIZE/9-1,d5 ;to be converted...
      moveq #0,d0
   
    .loop
      moveq.l #0,d6                ;1 P1 fused
      move.b (a0)+,d6              ;1 P1
      move.l d6,d0                ;1 P2 forwarded
   
      lsr.b #5,d6                  ;2 p1
      lsl.b #3,d0                  ;2 p2
     
      move.b d6,d4                ;3 p1
      asr.b #3,d0                  ;3 p2
     
      lsr.b #4,d6                  ;4 p1
      subq.b #1,d4                ;4 p2
   
      bpl.b .pll                  ;5 p1 fused
      moveq.l #0,d4                ;5 p1
    .pll
      move.l (a0),d1              ;5 p2 fused
      and.l #$f0f0f0f0,d1          ;5 p2
   
      lsl.b d4,d0                  ;6 p1
      move.b d0,(a1)+              ;6 p2 forwarded
     
      move.l (a0)+,d2              ;7 p1 fused
      and.l #$0f0f0f0f,d2          ;7 p1
   
      lsr.l #4,d1                  ;7 p2
   
      sub.l .submask(pc,d6.w*4),d1 ;8 p1
      sub.l .submask(pc,d6.w*4),d2 ;8 p2
   
      lsl.l d6,d1                  ;9 p1
      move.l d1,(a1)+              ;9 p2 forwarded
     
      lsl.l d6,d2                  ;10 p1
      move.l d2,(a1)+              ;10 p2 forwarded
   
      dbf d5,.loop                ;11 p1
   
      add.l #NTSP_BUFFERSIZE*5/9,.first
   
      rts
    .first dc.l NTSP_SAMPLE+8+NTSP_FILSTART
   
      CNOP 0,16
    .submask:
      dc.l 0
      dc.l $02020202
      dc.l $04040404
      dc.l $08080808
   

   

   

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
23 Mar 2011 19:33


Here is the code in C. (Go and beat 10.5 cycles compiler junkies) :D
 

void DecodeNTSP(int length,unsigned char *input, signed char *output)
{
  int inputindex=0;
  int outputindex=0;
  unsigned char tempbyte;
  unsigned int shift;
  signed int highnible,lownible,tempint;

  for(int i=0;i<length;i++)  //each pass reads 5 input bytes and writes 9 output bytes
  {
    tempbyte=input[inputindex++];  //tempbyte=%SSbbbbbb (2bit shift 6bits of data)
    shift=((unsigned int)tempbyte)>>6;
    tempint=(signed int)((tempbyte&0x3f)^0x20)-0x20; //extend the 6 bits to a signed value
    tempint*=1<<(shift>0?shift-1:0); //shift left by shift-1, clamped at 0 as in the asm version
    output[outputindex++]=(signed char)tempint;

    for(int k=0;k<4;k++)
    {
      tempbyte=input[inputindex++];
      lownible=((signed int)((tempbyte&0x0f)^8)-8)*(1<<shift); //sign extend low 4 bits, then scale
      highnible=((signed int)((tempbyte>>4)^8)-8)*(1<<shift); //sign extend high 4 bits, then scale
      output[outputindex++]=(signed char)highnible;
      output[outputindex++]=(signed char)lownible;
    }
  }
}


   

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
23 Mar 2011 21:30


S P wrote:

Here is the code in C. (Go and beat 10.5 cycles compiler junkies) :D

Why? Or, what is the code supposed to do in the first place?

If it is audio decompression, then why compress it in the first place on a machine with that much RAM? If you want to compress it, then a mainstream compression like mp3 would be more useful. And if it is only supposed to be fast, then Natami will be faster in the first place, so why do you need to optimize it?

Thus, I don't quite get why it makes sense to analyze this algorithm in particular, or why you bother about counting cycles.

Greetings,

Thomas


Marcel Verdaasdonk
Netherlands

Posts 3991
23 Mar 2011 21:48


Algorithm analysis is a very important process in optimization.
ThoR, your comment can be seen as one of those nice things like "640K is enough".

After all, why do we need new software? The old one still works, right?

Well, why do we need a Natami then?
If people are willing to invest time in optimal algorithms, we will not fall into the slums of bloated and slow software, since the knowledge is retained.

If this is not done, it would be more economical/sane to buy a PC to emulate.

I have said it before: an important part of this forum is to get a wave of new developers into good software development.
Good software, for one, would be quick and lean.
This doesn't prescribe the need for assembler; what it does is create an understanding of how things actually work, and having a clear and readable assembler language helps here!

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
23 Mar 2011 22:12


Marcel Verdaasdonk wrote:

Algorithm analysis is a very important process in optimization.
  ThoR, your comment can be seen as one of those nice things like "640K is enough".
 
  After all, why do we need new software? The old one still works, right?
 
  Well, why do we need a Natami then?
  If people are willing to invest time in optimal algorithms, we will not fall into the slums of bloated and slow software, since the knowledge is retained.
 
  If this is not done, it would be more economical/sane to buy a PC to emulate.
 
  I have said it before: an important part of this forum is to get a wave of new developers into good software development.
  Good software, for one, would be quick and lean.
  This doesn't prescribe the need for assembler; what it does is create an understanding of how things actually work, and having a clear and readable assembler language helps here!

You don't understand. This *specific* algorithm is tuned to a specific machine. Old, obsolete. Tuning the machine for this specific algorithm doesn't provide a future direction or development - it is oriented backwards instead of forwards.

Instead, either analyze algorithms that are of some importance and relevant for general purpose - and mp3 would be one - or rather ask yourself which *new* algorithms you could compile with the powers of the new machine available. For example, compression algorithms that are a bit less naive than the above, though require a bit more power. Power that is now available.

But looking at *old* special purpose algorithms once designed with the limitations of the old machine in mind really makes no sense.

Greetings,
Thomas


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
24 Mar 2011 05:51


Angel of Paradise wrote:

How about this?
 
  .....
 
This would be 57 instructions in 26 clocks.

I have not checked it 100% but it looks correct.
You only swapped instructions to optimize the pipeline utilization, right?

If I count this correctly, the 68070 would now execute the code with the performance of a 950 MHz 68030 CPU.
