 |
Welcome to the Natami / Amiga Forum. This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.
|
Do you have ideas and feature wishes? Post them here and discuss your ideas. |
|
|---|
Cesare Di Mauro Italy
| | Posts 528 19 Mar 2011 09:11
| One of the great features of 68K is to have separate data and address registers. ;)
| |
Matt Hey USA
| | Posts 737 19 Mar 2011 14:39
| Cesare Di Mauro wrote:
| What does it mean "direct mode"?
|
This is Gunnar's idea (based on Megol's idea) where the EA is calculated and used directly as a value instead of as a memory address. Read back in the thread a little bit if you missed it. It opens up some nice possibilities for parallel integer operations, but it is also difficult to use because of a potentially sizable change/use stall before it. We have looked at a syntax like this...
    add.l {1234,a0,d0*8},d1 ;add.l 1234+a0+d0*8,d1
The instruction above would do a shift and 3 adds in 1 cycle.
Cesare Di Mauro wrote:
| 3) Allow "<<" and a shift size as well as '*' for index scale Example: move.l (1234+a0+d0<<7) Reason: <<7 might be easier than *128 for some people/applications |
I prefer multiplication, which is easier to read.
|
It depends on what I'm doing as to which is easiest. Sometimes I want to shift and sometimes I want to multiply ;). Experienced programmers would probably know the powers of 2 easily enough (2,4,8,16,32,64,128,256) but they could be difficult for a beginner. It's nice to allow both '*' and "<<". Gunnar von Boehn wrote:
| Regarding using DN as base: This is not good IMHO because of two reasons. We would then need another new DN Read-Port: This drives Chip-costs higher. We would also create a new hazard point.
|
I did NOT suggest updating the index register. I suggested moving the bit indicating if the index register is suppressed to the base register suppress bit location. This is possible because base register update does not make sense with the base register suppressed. It is necessary because there are no more free I/IS encodings with an index register if the 1 free slot is used for direct mode with index register. My suggestion updates the base register only. Gunnar von Boehn wrote:
| I'm a little bit afraid of making the EA too complex. If we make the calculation in the EA more complex, then the timing will get worse and we will reduce our maximum clockrate.
|
We don't want the EA calculation slowing down, as it's too commonly used. I think base register update is the most important of the 3 EA enhancements. Direct mode has synergies with a larger shift size, since it is more general purpose and often uses the EA for something it was not designed for. If it's free, fine. Otherwise, we need to take a harder look.
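As a rough illustration of the direct mode discussed in this post, the following C sketch models the EA calculation as a plain value: base plus index scaled by a shift plus displacement. It also shows why `d0*8` and `d0<<3` are the same thing to the hardware. The names `ea_direct` and `add_direct` are hypothetical; this only models the arithmetic and is not Gunnar's design or any assembler's actual syntax.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the EA adder: base + (index << scale_shift) + displacement.
 * In direct mode the result would be used as an operand instead of an address.
 * The arithmetic wraps modulo 2^32, like a 32-bit address adder. */
static uint32_t ea_direct(uint32_t base, uint32_t index,
                          unsigned scale_shift, int32_t displacement)
{
    assert(scale_shift <= 3);  /* the existing scales: *1, *2, *4, *8 */
    return base + (index << scale_shift) + (uint32_t)displacement;
}

/* Model of: add.l {1234,a0,d0*8},d1 */
static uint32_t add_direct(uint32_t d1, uint32_t a0, uint32_t d0)
{
    return d1 + ea_direct(a0, d0, 3, 1234);
}
```

A scale written as `*8` corresponds to `scale_shift = 3`, so allowing both notations in the assembler would not change what the EA unit has to compute.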
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 19 Mar 2011 14:54
| Matt Hey wrote:
| We don't want the EA calculation slowing down, as it's too commonly used. I think base register update is the most important of the 3 EA enhancements. Direct mode has synergies with a larger shift size, since it is more general purpose and often uses the EA for something it was not designed for.
|
Regarding costs, I'll quickly try to estimate the costs for all proposals I can remember:
* Not doing the "Cache Fetch" but passing the EA to the ALU has no implication on our speed. This means adding such a mode would not slow the EA-Unit down.
* A mode which updates the Base-Register, like (An)+ does, has no influence on our timing and will not slow the EA-Unit down.
* Allowing Data registers as Base register will open a Pandora's box. It would require a new extra Data-Read-Port and it would open new hazards which would need to be handled. Such a change is expensive and full of drawbacks.
* Adding more scaling modes is tempting but not free. Right now we have a 4-way MUX for *1,*2,*4,*8. We would need to enhance this to a 5-way MUX to add (*16). This has some cost, but I like the idea as *16 looks sweet for SIMD or FPU. Even if we could create encodings for more scale values, I would not want to add more than the *16, because the more we beef this up, the more stumbling blocks we put in the time-critical path.
* Adding extra calculations, such as An-D0 instead of An+D0, or other operations like AND, OR, NEG, NOT, will increase the complexity and in the end limit our clockrate unnecessarily.
Were these all the ideas, or did I forget something?
| |
Megol .
| | Posts 690 19 Mar 2011 15:04
| Gunnar von Boehn wrote:
| I'm a little bit afraid of making the EA too complex. If we make the calculation in the EA more complex, then the timing will get worse and we will reduce our maximum clockrate. Regarding using DN as base: This is not good IMHO because of two reasons. We would then need another new DN Read-Port: This drives Chip-costs higher. We would also create a new hazard point. Instructions working with the DN register in the two previous cycles would create a bubble. To correctly recognise these hazards we will need to add extra tracking logic. Also we can NOT update Dn Base registers as the EA does not have a write Port to the Dn Registers. These are a lot of drawbacks... Regarding the extra Scale modes. Adding an extra Bit for the scale mode looks tempting from the software point of view. But it will double our MUX area needed to do the SHIFT in the EA-Unit. This looks bad from a HW designer's point of view because it's costly and could reduce our clockrate. I'll have to measure this to see how much room we have here before we lose performance...
|
Using Dn as base isn't needed but would be nice if it could be done inexpensively or for free. One alternative would be to only support the extended scaling for instructions that don't access memory (LEA and your proposal to pass the calculated EA to the execution stage). The negation/inverse idea could be inexpensive depending on the adder implementation. It could be used for misc. computations and also some odd things like simulating a little-endian memory space.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 19 Mar 2011 15:18
| Megol . wrote:
| The negation/inverse idea could be inexpensive depending on the adder implementation. |
Adding another source option (or the Mux for it) could already greatly reduce your clockrate. Therefore I'll stay sceptical until you show me the VHDL! ;-D
| |
Megol .
| | Posts 690 19 Mar 2011 18:47
| Gunnar von Boehn wrote:
|
Megol . wrote:
| The negation/inverse idea could be inexpensive depending on the adder implementation. |
Adding another source option (or the Mux for it) could already greatly reduce your clockrate. Therefore I'll stay sceptical until you show me the VHDL! ;-D
|
For Cyclone, Stratix II+, Arria, Cyclone V and at least Virtex 5+, Spartan 6 an add/sub unit uses the same amount of logic as an adder. The Cyclone III/IV appear to require another logic layer for the same functionality. No extra mux is needed for an adder/subtractor? If the address generator is routing limited it could slow it down, but shouldn't the carry chain be more critical?
    EA=An+((da?Dm:Am)<<scale)+displacement+0
compared to:
    EA=An+(((da?Dm:Am)<<scale)^-negate)+displacement+negate
The last term is simply the carry input; the ^-negate inverts the scaled index if negate is set.
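The second formula relies on the two's complement identity a - b = a + ~b + 1: inverting the scaled index and feeding a 1 into the carry input turns the adder into a subtractor. A minimal C sketch of that arithmetic follows; the function name `ea_addsub` and its parameters are hypothetical and say nothing about how the logic maps onto any particular FPGA.

```c
#include <assert.h>
#include <stdint.h>

/* negate = 0: EA = base + (index << scale) + displacement
 * negate = 1: EA = base - (index << scale) + displacement
 * The subtraction is done as in the post: XOR the scaled index with
 * all ones (-negate) and feed negate in as the carry input. */
static uint32_t ea_addsub(uint32_t base, uint32_t index, unsigned scale,
                          uint32_t displacement, unsigned negate)
{
    assert(negate <= 1 && scale < 32);
    uint32_t mask = 0u - negate;              /* 0x00000000 or 0xFFFFFFFF */
    uint32_t scaled = (index << scale) ^ mask;
    return base + scaled + displacement + negate;
}
```

With `negate = 0` the mask and the carry input are both zero, so the result is identical to the first formula.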
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 22 Mar 2011 20:23
| OK, we need some more examples of code to be optimized. This routine is used for a streaming music player (realtime decompression): "Norwegian Kindness" AMiGA demo by Spaceballs (2011) EXTERNAL LINK
Here is the music decoder optimized for the Mc68030 (written by me in 1998; the algorithm is by me and The Nightraver, a delta bit rate reduction variation). On the Mc68030 at 50 MHz a chipmem write could pipeline 22 cycles. Most of the cycles in this loop are performed while the CPU is writing to memory (copy speed), but the writes are not longwords, so on the 060 the speedup can be significant. On the Natami the latency on writes to chipmem will not be as extreme as on the Amiga (5.6 meg/second). I think 1 minute of music compresses down to around 1 MB (lossy) at 28 kHz.
; NTSP DECODER V2.0
NTSP_DECODE
    move.l .First(PC),a0 ; a0 = offset
    move.l AUDIO_LOGICAL(PC),a1 ; a1 = destination
    move.w #NTSP_BUFFERSIZE/9-1,d5 ;Bytes to be converted...
    moveq #0,d0
.ytre
    moveq.l #0,d6
    move.b (a0)+,d6
    move.l d6,d0
    lsr.b #5,d6
    move.b d6,d4
    subq.b #1,d4
    bpl.b .pll
    moveq.l #0,d4
.pll
    lsl.b #3,d0
    asr.b #3,d0
    lsl.b d4,d0
    move.b d0,(a1)+
    move.b (a0)+,d0
    move.l d0,d1
    asr.b #4,d0
    and.l #$f,d1
    lsl.l d6,d0
    lsl.l #8,d0
    lsl.b #4,d1
    asr.b #4,d1
    lsl.l d6,d1
    move.b d1,d0
    move.b (a0)+,d1
    move.w d0,(a1)+
    move.l d1,d0
    asr.b #4,d1
    and.l #$f,d0
    lsl.l d6,d1
    lsl.l #8,d1
    lsl.b #4,d0
    asr.b #4,d0
    lsl.l d6,d0
    move.b d0,d1
    move.b (a0)+,d0
    move.w d1,(a1)+
    move.l d0,d1
    asr.b #4,d0
    and.l #$f,d1
    lsl.l d6,d0
    lsl.l #8,d0
    lsl.b #4,d1
    asr.b #4,d1
    lsl.l d6,d1
    move.b d1,d0
    move.b (a0)+,d1
    move.w d0,(a1)+
    move.l d1,d0
    asr.b #4,d1
    and.l #$f,d0
    lsl.l d6,d1
    lsl.l #8,d1
    lsl.b #4,d0
    asr.b #4,d0
    lsl.l d6,d0
    move.b d0,d1
    move.w d1,(a1)+
    dbf d5,.ytre
    add.l #NTSP_BUFFERSIZE*5/9,.First
    rts
.First dc.l NTSP_SAMPLE+8+NTSP_FILSTART
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 22 Mar 2011 23:27
| Hi SP, Thanks for the code example. Do you want to discuss its execution on the current 050 or how the 070 would execute it?
| |
Marcel Verdaasdonk Netherlands
| | Posts 3991 23 Mar 2011 00:52
| This might be out of line, but judging from SP's post I would assume he would prefer the 070 over the 050. And Gunnar, you pointing out the power of superscalar doesn't reduce that zeal. So just give the 070 example, and don't make us feel a little down that the 050 isn't superscalar. ;)
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 23 Mar 2011 07:14
| OK, then let us try how the 070 design will execute this example code:
S P wrote:
| ; NTSP DECODER V2.0
NTSP_DECODE
    move.l .First(PC),a0 ; a0 = offset
    move.l AUDIO_LOGICAL(PC),a1 ; a1 = destination
    move.w #NTSP_BUFFERSIZE/9-1,d5 ;Bytes to be converted...
    moveq #0,d0
.ytre                - Cycle
    moveq.l #0,d6    - 1 p1 fused
    move.b (a0)+,d6  - 1 p1
    move.l d6,d0     - 1 p2 forwarded
    lsr.b #5,d6      - 2 p1
    move.b d6,d4     - 3 p1 (depending code)
    subq.b #1,d4     - 4 p1 (depending code)
    bpl.b .pll       - 5 p1 fused
    moveq.l #0,d4    - 5 p1
.pll
    lsl.b #3,d0      - 5 p2
    asr.b #3,d0      - 6 p1
    lsl.b d4,d0      - 7 p1 (depending code)
    move.b d0,(a1)+  - 7 p2 forwarded
    move.b (a0)+,d0  - 8 p1
    move.l d0,d1     - 8 p2 forwarded
    asr.b #4,d0      - 9 p1
    and.l #$f,d1     - 9 p2
    lsl.l d6,d0      - 10 p1
    lsl.l #8,d0      - 11 p1 (depending code)
    lsl.b #4,d1      - 11 p2
    asr.b #4,d1      - 12 p1
    lsl.l d6,d1      - 13 p1 (depending code)
    move.b d1,d0     - 13 p2 forwarded
    move.b (a0)+,d1  - 14 p1
    move.w d0,(a1)+  - 14 p2
    move.l d1,d0     - 15 p1
    asr.b #4,d1      - 15 p2
    and.l #$f,d0     - 16 p1
    lsl.l d6,d1      - 16 p2
    lsl.l #8,d1      - 17 p1
    lsl.b #4,d0      - 17 p2
    asr.b #4,d0      - 18 p1
    lsl.l d6,d0      - 19 p1 (depending code)
    move.b d0,d1     - 19 p2 forwarded
    move.b (a0)+,d0  - 20 p1
    move.w d1,(a1)+  - 20 p2
    move.l d0,d1     - 21 p1
    asr.b #4,d0      - 21 p2
    and.l #$f,d1     - 22 p1
    lsl.l d6,d0      - 22 p2
    lsl.l #8,d0      - 23 p1
    lsl.b #4,d1      - 23 p2
    asr.b #4,d1      - 24 p1
    lsl.l d6,d1      - 25 p1 (depending)
    move.b d1,d0     - 25 p2 forwarded
    move.b (a0)+,d1  - 26 p1
    move.w d0,(a1)+  - 26 p2
    move.l d1,d0     - 27 p1
    asr.b #4,d1      - 27 p2
    and.l #$f,d0     - 28 p1
    lsl.l d6,d1      - 28 p2
    lsl.l #8,d1      - 29 p1
    lsl.b #4,d0      - 29 p2
    asr.b #4,d0      - 30 p1
    lsl.l d6,d0      - 31 p1 (depending)
    move.b d0,d1     - 31 p2 forwarded
    move.w d1,(a1)+  - 32 p1
    dbf d5,.ytre     - 32 p2
    add.l #NTSP_BUFFERSIZE*5/9,.First
    rts
.First dc.l NTSP_SAMPLE+8+NTSP_FILSTART
|
The 68070 needs 32 clocks for the complete work loop. 8 instructions are dependent and therefore limit superscalar execution slightly. Also, there are a few cases where instructions could be fused if they were rearranged. Example:
    move.l d0,d1
    asr.b #4,d0
    and.l #$f,d1
Instructions 1 and 3 could be fused together on the 050 and 070 if 3 swaps location with 2.
I think 32 clocks for the work loop is not bad. How many did the 68030 need? I assume that a very clever coder could rearrange the code slightly to improve the fusing and double-issue rate. Maybe this would then go down to 25 clocks or so? Who can get the code down to the lowest clock number?
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 23 Mar 2011 10:02
| This was impressive..
    moveq.l #0,d6    - 1 p1 fused
    move.b (a0)+,d6  - 1 p1
    move.l d6,d0     - 1 p2 forwarded
The forwarding is new. The Mc68060 will use 3 clocks for this.
Cycle timings, Mc68030 (50 MHz):
.ytre
    moveq.l #0,d6    ;2
    move.b (a0)+,d6  ;4
    move.l d6,d0     ;2
    lsr.b #5,d6      ;4
    move.b d6,d4     ;4
    subq.b #1,d4     ;2
    bpl.b .pll       ;6
    moveq.l #0,d4    ;2
.pll
    lsl.b #3,d0      ;4
    asr.b #3,d0      ;4
    lsl.b d4,d0      ;6
    move.b d0,(a1)+  ;4
    move.b (a0)+,d0  ;4
    move.l d0,d1     ;4
    asr.b #4,d0      ;4
    and.l #$f,d1     ;6
    lsl.l d6,d0      ;6
    lsl.l #8,d0      ;4
    lsl.b #4,d1      ;4
    asr.b #4,d1      ;4
    lsl.l d6,d1      ;6
    move.b d1,d0     ;2
    move.b (a0)+,d1  ;6
    move.w d0,(a1)+  ;6
    move.l d1,d0     ;2
    asr.b #4,d1      ;4
    and.l #$f,d0     ;6
    lsl.l d6,d1      ;6
    lsl.l #8,d1      ;4
    lsl.b #4,d0      ;4
    asr.b #4,d0      ;4
    lsl.l d6,d0      ;6
    move.b d0,d1     ;2
    move.b (a0)+,d0  ;4
    move.w d1,(a1)+  ;4
    move.l d0,d1     ;2
    asr.b #4,d0      ;4
    and.l #$f,d1     ;6
    lsl.l d6,d0      ;6
    lsl.l #8,d0      ;4
    lsl.b #4,d1      ;4
    asr.b #4,d1      ;4
    lsl.l d6,d1      ;6
    move.b d1,d0     ;4
    move.b (a0)+,d1  ;4
    move.w d0,(a1)+  ;4
    move.l d1,d0     ;2
    asr.b #4,d1      ;4
    and.l #$f,d0     ;6
    lsl.l d6,d1      ;4
    lsl.l #8,d1      ;4
    lsl.b #4,d0      ;4
    asr.b #4,d0      ;4
    lsl.l d6,d0      ;6
    move.b d0,d1     ;2
    move.w d1,(a1)+  ;4
    dbf d5,.ytre     ;6
Total cycles: 240
The 4 chipmem writes will stall for an additional 104 cycles (26 cycles per write), but in this loop the writes will pipeline with the CPU. The fastram reads will also stall for 4(?) cycles if cached and 12(?) if not cached.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 23 Mar 2011 12:50
| Wow, 240 clocks for the 68030 - that's a big difference. This means the routine would run on the 68070 as fast as on a 68030 clocked at 750 MHz. Not counting the performance difference from the bigger cache - which would make the 68070 go even faster than the 750 MHz 68030.... The code already runs quite nicely on superscalar. But would you be able to tweak it more to run even better? Will you get the code to the speed of a 68030 at 1000 MHz?
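The 750 MHz figure can be reproduced from the two cycle counts if the 68070 is assumed to run at 100 MHz. That clock is an assumption inferred from the numbers (240 / 32 = 7.5), not something stated in the thread. The helper name `equivalent_mhz` is hypothetical, and the sketch counts cycles only, ignoring cache and memory effects.

```c
#include <assert.h>

/* Clock (in MHz) that a reference CPU would need in order to match a
 * target CPU on one loop, comparing cycle counts only. */
static double equivalent_mhz(double target_mhz, double target_cycles,
                             double reference_cycles)
{
    assert(target_cycles > 0.0);
    return target_mhz * reference_cycles / target_cycles;
}
```

For example, `equivalent_mhz(100.0, 32.0, 240.0)` evaluates to 750, matching the figure in the post under the 100 MHz assumption.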
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 23 Mar 2011 13:08
| Remember that most Amigas were equipped with Mc68000, Mc68020 and MC68030. The Mc68060 from 1994 was too expensive for the common user, so ppl bought PCs instead. I will write a new version when I come home from work. It will fetch 32 bits and work with longwords instead. I will optimize it for the superscalar N070.
| |
Angel of Paradise Germany
| | Posts 61 23 Mar 2011 14:46
| How about this?
NTSP_DECODE
    move.l .First(PC),a0 ; a0 = offset
    move.l AUDIO_LOGICAL(PC),a1 ; a1 = destination
    move.w #NTSP_BUFFERSIZE/9-1,d5 ;Bytes to be converted...
    moveq #0,d0
.ytre                - Cycle
    moveq.l #0,d6    - 1 p1 fused
    move.b (a0)+,d6  - 1 p1
    move.l d6,d0     - 1 p2 forwarded
    lsr.b #5,d6      - 2 p1
    lsl.b #3,d0      - 2 p2
    move.l d6,d4     - 3 p1 fused
    subq.b #1,d4     - 3 p1
    asr.b #3,d0      - 3 p2
    bpl.b .pll       - 4 p1 fused
    moveq.l #0,d4    - 4 p1
.pll
    lsl.b d4,d0      - 5 p1
    move.b d0,(a1)+  - 5 p2 forwarded
    move.b (a0)+,d0  - 6 p1
    move.l d0,d1     - 6 p2 forwarded
    and.l #$f,d1     - 6 p2 fused
    asr.b #4,d0      - 7 p1
    lsl.b #4,d1      - 7 p2
    lsl.l d6,d0      - 8 p2
    asr.b #4,d1      - 8 p1
    lsl.l #8,d0      - 9 p1
    lsl.l d6,d1      - 9 p2
    move.b d1,d0     - 10 p1
    move.b (a0)+,d1  - 10 p2
    move.w d0,(a1)+  - 11 p1
    move.l d1,d0     - 11 p2 fused
    and.l #$f,d0     - 11 p2
    asr.b #4,d1      - 12 p1
    lsl.b #4,d0      - 12 p2
    lsl.l d6,d1      - 13 p1
    asr.b #4,d0      - 13 p2
    lsl.l #8,d1      - 14 p1
    lsl.l d6,d0      - 14 p2
    move.b d0,d1     - 15 p1
    move.b (a0)+,d0  - 15 p2
    move.w d1,(a1)+  - 16 p1
    move.l d0,d1     - 16 p2 fused
    and.l #$f,d1     - 16 p2
    asr.b #4,d0      - 17 p1
    lsl.b #4,d1      - 17 p2
    lsl.l d6,d0      - 18 p1
    asr.b #4,d1      - 18 p2
    lsl.l #8,d0      - 19 p1
    lsl.l d6,d1      - 19 p2
    move.b d1,d0     - 20 p1
    move.b (a0)+,d1  - 20 p2
    move.w d0,(a1)+  - 21 p1
    move.l d1,d0     - 21 p2 fused
    and.l #$f,d0     - 21 p2
    asr.b #4,d1      - 22 p1
    lsl.b #4,d0      - 22 p2
    lsl.l d6,d1      - 23 p1
    asr.b #4,d0      - 23 p2
    lsl.l #8,d1      - 24 p1
    lsl.l d6,d0      - 24 p2
    move.b d0,d1     - 25 p1
    move.w d1,(a1)+  - 26 p1
    dbf d5,.ytre     - 26 p2
This would be 57 instructions in 26 clocks.
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 23 Mar 2011 16:49
| I got it down to 11 (10.5 cycles) but there is one catch. The byte order in the output buffer is scrambled: 0 1 3 5 7 2 4 6 8 instead of 0 1 2 3 4 5 6 7 8. This scrambling can be done in the encoder. Or you need some more code to unscramble:
    C2pMerge d0,d1,#8
    C2pMerge d0,d1,#16
This will use 5.5 more cycles. (Increase the loop to write 18 bytes and run two parallel c2p merges for the last 16 bytes.) This will give a 100% compatible output in 16 cycles for 9 bytes written.
; NTSP DECODER V2.1 (N070)
NTSP_DECODE
    move.l .first(PC),a0 ; a0 = offset
    move.l AUDIO_LOGICAL(PC),a1 ; a1 = destination
    move.w #NTSP_BUFFERSIZE/9-1,d5 ;to be converted...
    moveq #0,d0
.loop
    moveq.l #0,d6                 ;1 P1 fused
    move.b (a0)+,d6               ;1 P1
    move.l d6,d0                  ;1 P2 forwarded
    lsr.b #5,d6                   ;2 p1
    lsl.b #3,d0                   ;2 p2
    move.b d6,d4                  ;3 p1
    asr.b #3,d0                   ;3 p2
    lsr.b #4,d6                   ;4 p1
    subq.b #1,d4                  ;4 p2
    bpl.b .pll                    ;5 p1 fused
    moveq.l #0,d4                 ;5 p1
.pll
    move.l (a0),d1                ;5 p2 fused
    and.l #$f0f0f0f0,d1           ;5 p2
    lsl.b d4,d0                   ;6 p1
    move.b d0,(a1)+               ;6 p2 forwarded
    move.l (a0)+,d2               ;7 p1 fused
    and.l #$0f0f0f0f,d2           ;7 p1
    lsr.l #4,d1                   ;7 p2
    sub.l .submask(pc,d6.w*4),d1  ;8 p1
    sub.l .submask(pc,d6.w*4),d2  ;8 p2
    lsl.l d6,d1                   ;9 p1
    move.l d1,(a1)+               ;9 p2 forwarded
    lsl.l d6,d2                   ;10 p1
    move.l d2,(a1)+               ;10 p2 forwarded
    dbf d5,.loop                  ;11 p1
    add.l #NTSP_BUFFERSIZE*5/9,.first
    rts
.first dc.l NTSP_SAMPLE+8+NTSP_FILSTART
    CNOP 0,16
.submask:
    dc.l 0
    dc.l $02020202
    dc.l $04040404
    dc.l $08080808
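The scrambled order comes from handling four packed bytes at once: all high nibbles end up in one longword and all low nibbles in another, so writing the two longwords back to back yields output positions 1 3 5 7 followed by 2 4 6 8. The C sketch below illustrates that split with nibbles sign-extended per byte lane. It is only an illustration under stated assumptions: it uses a multiply-based sign extension rather than the `.submask` subtraction from the listing, it leaves out the shift step, and the names `sext_nibbles` and `split_nibbles` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low nibble of every byte lane at once.
 * Bit 3 of a lane (0x08) times 0x1E gives 0xF0 in the same lane,
 * which never carries into the neighbouring lane. */
static uint32_t sext_nibbles(uint32_t lanes)
{
    assert((lanes & 0xF0F0F0F0u) == 0);   /* only low nibbles may be set */
    return lanes | ((lanes & 0x08080808u) * 0x1Eu);
}

/* Split four packed bytes into a longword of high nibbles and a longword
 * of low nibbles, each nibble sign-extended to a full byte. */
static void split_nibbles(uint32_t packed, uint32_t *high, uint32_t *low)
{
    *high = sext_nibbles((packed & 0xF0F0F0F0u) >> 4);
    *low  = sext_nibbles(packed & 0x0F0F0F0Fu);
}
```

For the packed input 0x7F8192A3 the high longword holds the values 7, -8, -7, -6 and the low longword holds -1, 1, 2, 3, which is exactly the interleaving that either the encoder or a merge step has to undo.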
|
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 23 Mar 2011 19:33
| Here is the code in C. (Go and beat 10.5 cycles compiler junkies) :D
void DecodeNTSP(int length, unsigned char *input, signed char *output)
{
    int inputindex = 0;
    int outputindex = 0;
    unsigned char tempbyte;
    unsigned int shift;
    signed int highnible, lownible, tempint;

    for (int i = 0; i < length; i++)
    {
        tempbyte = input[inputindex++];              // tempbyte=%SSbbbbbb (2bit shift 6bits of data)
        shift = ((unsigned int)tempbyte) >> 6;
        tempint = ((tempbyte & 0x3f) ^ 0x20) - 0x20; // extend the 6 bits to a signed value
        if (shift > 1)
            tempint *= 1 << (shift - 1);             // shift left by (shift-1), but never by less than 0
        output[outputindex++] = (signed char)tempint;
        for (int k = 0; k < 4; k++)
        {
            tempbyte = input[inputindex++];
            lownible = (((tempbyte & 0x0f) ^ 8) - 8) * (1 << shift);  // sign-extended low nibble, shifted
            highnible = (((tempbyte >> 4) ^ 8) - 8) * (1 << shift);   // sign-extended high nibble, shifted
            output[outputindex++] = (signed char)highnible;
            output[outputindex++] = (signed char)lownible;
        }
    }
}
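The decoder has to turn small two's complement fields (for example 4-bit nibbles) into signed values. In C, shifting a byte left and then right again does not do this on its own, because the value is promoted to a wider `int` first. One portable alternative is the XOR-and-subtract idiom, sketched below as a hypothetical helper `sign_extend`; it is an illustration, not part of the original routine.

```c
#include <assert.h>

/* Sign-extend the lowest `bits` bits of `value`, treating them as a
 * two's complement field. Higher bits of `value` are ignored. */
static int sign_extend(unsigned value, unsigned bits)
{
    assert(bits >= 1 && bits <= 16);
    unsigned sign = 1u << (bits - 1);            /* the field's sign bit */
    unsigned field = value & ((sign << 1) - 1u); /* keep only the field */
    return (int)(field ^ sign) - (int)sign;
}
```

A nibble of $F becomes -1 and $8 becomes -8, while $7 stays 7.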
| |
Thomas Richter Germany
| | (MX-Board Owner) Posts 1425 23 Mar 2011 21:30
| S P wrote:
| Here is the code in C. (Go and beat 10.5 cycles compiler junkies) :D
|
Why? Or, what is the code supposed to do in the first place? If it is audio decompression, then why compress it in the first place on a machine with that much RAM? If you want to compress it, then a mainstream compression like mp3 would be more useful. And if it is only supposed to be fast, then Natami will be faster in the first place, so why do you need to optimize it? Thus, I don't quite get why it makes sense to analyze this algorithm in particular, or why you bother about counting cycles? Greetings, Thomas
| |
Marcel Verdaasdonk Netherlands
| | Posts 3991 23 Mar 2011 21:48
| Algorithm analysis is a very important process in optimization. ThoR, your comment can be seen as one of those nice things like "640K is enough". Since why do we need new software, the old one still works, right? Well, why do we need a Natami then? If people are willing to invest time in optimal algorithms, we will not fall down into the slums of bloated and slow software, since the knowledge is retained. If this is not done, it would be more economical/sane to buy a PC to emulate. I have said it before: an important part of this forum is to get a wave of new developers into good software development. Good software, for one, would be quick and lean. This doesn't prescribe the need for Assembler; what it does is create understanding of how things actually work, and having a clear and readable Assembler language helps here!
| |
Thomas Richter Germany
| | (MX-Board Owner) Posts 1425 23 Mar 2011 22:12
| Marcel Verdaasdonk wrote:
| Algorithm analysis is a very important process in optimization. ThoR, your comment can be seen as one of those nice things like "640K is enough". Since why do we need new software, the old one still works, right? Well, why do we need a Natami then? If people are willing to invest time in optimal algorithms, we will not fall down into the slums of bloated and slow software, since the knowledge is retained. If this is not done, it would be more economical/sane to buy a PC to emulate. I have said it before: an important part of this forum is to get a wave of new developers into good software development. Good software, for one, would be quick and lean. This doesn't prescribe the need for Assembler; what it does is create understanding of how things actually work, and having a clear and readable Assembler language helps here!
|
You don't understand. This *specific* algorithm is tuned to a specific machine. Old, obsolete. Tuning the machine for this specific algorithm doesn't provide a future direction or development - it is oriented backwards instead of forwards. Instead, either analyze algorithms that are of some importance and relevant for general purpose - and mp3 would be one - or rather ask yourself which *new* algorithms you could compile with the powers of the new machine available. For example, compression algorithms that are a bit less naive than the above, though require a bit more power. Power that is now available. But looking at *old* special purpose algorithms once designed with the limitations of the old machine in mind really makes no sense. Greetings, Thomas
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 24 Mar 2011 05:51
| Angel of Paradise wrote:
| How about this? ..... This would be 57 instructions in 26 clocks.
|
I have not checked it 100% but it looks correct. You only swapped instructions to optimize the pipeline utilization, right? If I count this correctly, the 68070 would now execute the code with the performance of a 950 MHz 68030 CPU.
| |
|
|
|
|