 |
Welcome to the Natami / Amiga Forum. This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.
|
Do you have ideas and feature wishes? Post them here and discuss your ideas. |
|
|---|
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 08 Apr 2011 19:27
| On the old computers it was common to use self-modifying code (SMC). Can this be supported on the N050?
2 cycles:
    move.l (sp)+,d0
    add.l (a0),d0
1 cycle (with SMC):
    move.l #xxx,d0
    add.l (a0),d0
Instead of storing temporary variables on the stack, the programmer can put them directly into the code. This would allow 2 memory accesses per clock, but it would require some kind of cache snooping.
| |
Team Chaos Leader USA
| | (Moderator) Posts 2094 08 Apr 2011 19:49
| S P wrote:
| On the old computers it was common to use Self modified code Can this be supported on the N050? 2 cycles: move.l (sp)+,d0 add.l (a0),d0 1 cycle(with SMC): move.l #xxx,d0 add.l (a0),d0
|
Where is the SMC? /me hand u a cup of coffee :)
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 08 Apr 2011 20:01
| Sometimes the programmer is out of registers. Take a look at my latency-free matrix multiply routine in the sysinfo topic for an example. Instead of writing this:
    move.l d0,-(sp)
    ...
    move.l (sp)+,d0
    add.l (a0),d0
you can write this:
    move.l d0,.smc+2(pc)
    ...
.smc move.l #xxx,d0
    add.l (a0),d0
The N050 could reach superscalar performance with this feature: 2 memreads per clock. One read is served from the instruction cache, the other from the data cache.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 08 Apr 2011 20:28
| Marcel Verdaasdonk wrote:
| Doesn't the N68050 already do. load ea opp store In a single pipeline every cycle?
|
Yes it does!
Marcel Verdaasdonk wrote:
| it doesn't matter to me i was asking for the size of a unit with three pipelines and a FPU in a FPGA.
|
But your question was imprecise. I did not want to give you a wrong answer. To answer you correctly, you really need to give some proper examples of what type of instructions (EXAMPLES!!) you want to execute in 1 cycle with your 3 pipes. I already asked you for examples to clarify this, didn't I?
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 08 Apr 2011 20:34
| S P wrote:
| On the old computers it was common to use Self modified code
|
Self-modifying code is quite messy - we certainly all agree on this. We all know that you could shave off a cycle or two with it on the 68000. Self-modifying code makes invalidating the cache necessary - this can be really costly. I know what you mean and where you are coming from - but for the sake of code clarity I really hope that people will refrain from using such hacks ...
| |
One Thousand USA
| | Posts 832 08 Apr 2011 21:41
| @SP I also think SMC is neat. It is that cache stuff that is messy! :) Really! Here is an interesting example: Sun's MAJC processor even had specific instructions to do it.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 09 Apr 2011 09:28
| S P wrote:
| move.l #xxx,d0 add.l (a0),d0 The N050 could reach superscalar performance with this feature.. 2 memreads per clock.. One memread is stored in the instructioncache. the other read is stored in the datacache.
|
Yes in this special case you could win 1 cycle. But there is no compiler which would be able to use this feature. :-/ This means for 99% of all applications this feature will remain unused. Improving the CPU in a way which benefits all code would be the best. General improvements like being able to do 2 memory reads per cycle would benefit all code - it would even accelerate existing legacy code.
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 09 Apr 2011 11:26
| Of course, supporting 2 memreads per clock would be the best. It will probably require twice the number of LEs and a lot of work. My cache snooping idea is much easier to implement, isn't it? In compiled code you often see function calls like this:
    movem.l d0-d7/a0-a6,-(sp)
    jsr method
    movem.l (sp)+,d0-d7/a0-a6
I would be happy to implement support in GCC for an SMC-style calling convention to save data cache and time. :D
    movem.l a0-a6,-(sp)
    move.l d0,.i+2(pc)
    move.l d1,.i+8(pc)
    move.l d2,.i+14(pc)
    move.l d3,.i+20(pc)
    move.l d4,.i+26(pc)
    move.l d5,.i+32(pc)
    move.l d6,.i+38(pc)
    move.l d7,.i+44(pc)
    jsr (a6)
    movem.l (sp)+,a0-a6
.i  move.l #0,d0
    move.l #0,d1
    move.l #0,d2
    move.l #0,d3
    move.l #0,d4
    move.l #0,d5
    move.l #0,d6
    move.l #0,d7
Another nice instruction would be a movem with modulo:
    movem.w #2,d0-d7/a0-a6,(a0)+
Here a0 will point to:
    dc.l d0
    dc.w 0
    dc.l d1
    dc.w 0
    dc.l d2
    ...
With that instruction the call sequence becomes:
    movem.l #2,d0-d7/a0-a6,.i+2(pc)
    jsr (a6)
.i  move.l #0,d0
    move.l #0,d1
    move.l #0,d2
    move.l #0,d3
    move.l #0,d4
    move.l #0,d5
    move.l #0,d6
    move.l #0,d7
    move.l #0,a0
    move.l #0,a1
    move.l #0,a2
    move.l #0,a3
    move.l #0,a4
    move.l #0,a5
    move.l #0,a6
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 09 Apr 2011 11:34
| Gunnar von Boehn wrote:
| Yes in this special case you could win 1 cycle. But there is no compiler which would be able to use this feature. :-/ ...
|
Then we have to make a compiler that supports this feature. :D My latency-free 4x4 matrix multiplier will get 32 cycles faster on the N050 with this. Other inner loops can be improved too. Self-modifying code is messy, but to squeeze more cycles out of our low-clocked chip we need it.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 09 Apr 2011 11:41
| S P wrote:
| Ofcourse to support 2 memreads per clock would be the best.
|
2 memreads per clock are coming. Jens did already redesign the cache to support this.
S P wrote:
| My Cachesnooping idea is much easier to implement isn't it?
|
I'm not so sure about this. Especially the case where you manipulate code very near to your current instruction could be impossible to handle correctly.
S P wrote:
| In compiled code you often see function calls like this: movem.l d0-d7/a0-a6,-(sp) jsr method movem.l (sp)+,d0-d7/a0-a6 I would be happy to implement support in GCC for SMC style calling convention to save datacache and time.. :D movem.l a0-a6,-(sp) move.l d0,.i+2(pc) move.l d1,.i+8(pc) move.l d2,.i+14(pc) move.l d3,.i+20(pc) move.l d4,.i+26(pc) move.l d5,.i+32(pc) move.l d6,.i+38(pc) move.l d7,.i+44(pc) jsr (a6) movem.l (sp)+,a0-a6 .i move.l #0,d0 move.l #0,d1 move.l #0,d2 move.l #0,d3 move.l #0,d4 move.l #0,d5 move.l #0,d6 move.l #0,d7
|
Now you confused me. The lower code option is much longer and slower. Where is the benefit of it in your opinion?
| |
Marcel Verdaasdonk Netherlands
| | Posts 3976 09 Apr 2011 11:48
| Gunnar for some old code caches need to be disabled anyhow. :/ Does anyone know if the OS makes use of self modified code?
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 09 Apr 2011 11:51
| Gunnar von Boehn wrote:
| 2 memreads per clock are coming. Jens did already redesign the cache to support this. |
This is awesome!
Gunnar von Boehn wrote:
| I'm not so sure about this. Especially the case where you manipulate code very near to your current instruction could be impossible to handle correctly.
|
So you need some rules. Use a latency term. For example: SMC is not allowed within the next 16 cycles.
Gunnar von Boehn wrote:
| Now you confused me. The lower code option is much longer and slower. Where is the benefit of it in your opinion?
|
A movem still takes one cycle per longword moved to memory, doesn't it? But if Jens implements 2 reads per clock, can the movem be 2 times faster?
    movem.l (sp)+,a0-a6
This example will save data cache space but use more instruction cache. With the limit of one memread per clock, this code will run in 4 cycles with 2 pipes:
    move.l #0,d0
    move.l #0,d1
    move.l #0,d2
    move.l #0,d3
    move.l #0,d4
    move.l #0,d5
    move.l #0,d6
    move.l #0,d7
and this will run in 8 cycles:
    move.l (sp)+,d0
    move.l (sp)+,d1
    move.l (sp)+,d2
    move.l (sp)+,d3
    move.l (sp)+,d4
    move.l (sp)+,d5
    move.l (sp)+,d6
    move.l (sp)+,d7
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 09 Apr 2011 12:06
| S P wrote:
| With the limit of one memread per clock. this code will run at 4 cycles with 2 pipes:
|
Ok, now I see what you mean. Currently we have 1 read port and 1 ALU. The 2nd ALU will come together with the 2nd read port. Maybe we will even enable the 2nd read port first, because it will allow some of the complex 68k memory instructions, e.g. CMP2 or CMPM, to execute in 1 cycle.
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 09 Apr 2011 15:46
| Take a look at the code in my latency-free 4x4 matrix multiply. Every 3D game needs a matrix multiply. In the sysinfo topic I managed to optimize it down to 156 clocks. With SMC I can get it down to 148. If the N070 has two read ports per cycle, I can optimize it with SMC like this:
    fmove.s (sp)+,fp0 ;fused
    fadd.s #xxx,fp0 ;1 P1
    fmove.s (sp)+,fp1 ;fused
    fadd.s #xxx,fp1 ;1 P2
    fmove.s (sp)+,fp2 ;fused
    fadd.s #xxx,fp2 ;2 P1
    fmove.s (sp)+,fp3 ;fused
    fadd.s #xxx,fp3 ;2 P2
    fmove.s (sp)+,fp4 ;fused
    fadd.s #xxx,fp4 ;3 P1
    fmove.s (sp)+,fp5 ;fused
    fadd.s #xxx,fp5 ;3 P2
    fmove.s (sp)+,fp6 ;fused
    fadd.s #xxx,fp6 ;4 P1
    fmove.s (sp)+,fp7 ;fused
    fadd.s #xxx,fp7 ;4 P2
Here is the old version:
    fmove.s (sp)+,fp0 ;fused
    fadd.s (sp)+,fp0 ;1
    fmove.s (sp)+,fp1 ;fused
    fadd.s (sp)+,fp1 ;2
    fmove.s (sp)+,fp2 ;fused
    fadd.s (sp)+,fp2 ;3
    fmove.s (sp)+,fp3 ;fused
    fadd.s (sp)+,fp3 ;4
    fmove.s (sp)+,fp4 ;fused
    fadd.s (sp)+,fp4 ;5
    fmove.s (sp)+,fp5 ;fused
    fadd.s (sp)+,fp5 ;6
    fmove.s (sp)+,fp6 ;fused
    fadd.s (sp)+,fp6 ;7
    fmove.s (sp)+,fp7 ;fused
    fadd.s (sp)+,fp7 ;8
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 10 Apr 2011 06:42
| len (bytes)  instruction
    4          fmove.s (sp)+,fp0 ;fused
    8          fadd.s #xxx,fp0 ;1 P1
    4          fmove.s (sp)+,fp1 ;fused
    8          fadd.s #xxx,fp1 ;1 P2
Sorry, but fusing 4 FPU instructions does not work, for two reasons:
a) The 4 fused instructions are 24 bytes long. 24 bytes is more than the 16 bytes our CPU can read and execute per clock.
b) We have 1 FPU core. Therefore we cannot execute this many instructions per cycle. 2 FPUs need a whole lot of chip space, which we cannot spend on such a rarely used unit.
| |
Deep Sub Micron Germany
| | (MX-Board Owner) Posts 567 10 Apr 2011 09:52
| S P wrote:
| I would be happy to implement support in GCC for SMC style calling convention to save datacache and time.. :D |
Hi, I just like to mention that SMC is not reentrant. So GCC must be somehow aware of whether this is OK or not. Implementing SMC support without adding penalty cycles is not trivial, well it sounds actually nearly impossible. For example there are two cases. When data is still in the pipeline then it might be forwarded. But if data was written to data cache, then it takes some cycles until instruction fetch can access that data. There might be a gap in between both cases. If data is still in pipeline there must be a check whether the whole or just a part of the immediate value was affected (and aligned?) and a check if the opcode was affected. I guess there are dozens of other things that need to be considered, too.
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 10 Apr 2011 23:45
| I have added two more tests to the benchmark program. Here is a recursive qsort implementation. I compiled it with VBCC and then wrote my own optimized version. In WinUAE my handwritten ASM version is 2.3 times faster. :D
void quickSort(int arr[], int left, int right)
{
    int i, j, tmp, pivot;
    i = left;
    j = right;
    pivot = arr[(left + right) >> 1];
    /* partition */
    while (i <= j) {
        while (arr[i] < pivot) i++;
        while (arr[j] > pivot) j--;
        if (i <= j) {
            tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
            i++;
            j--;
        }
    }
    /* recursion */
    if (left < j) quickSort(arr, left, j);
    if (i < right) quickSort(arr, i, right);
}
TEST_QSORT:
    lea data,a0
    move.l a0,a3
    move.l #quicksortnumbers,-(sp)
    move.l #0,-(sp)
    bsr.b SP_QSORT
    addq.l #8,sp
    rts
SP_QSORT:
    move.l 4(sp),d0 ;1 P1
    move.l 8(sp),d1 ;1 P2
    move.l d0,d2 ;2 P1 fused
    add.l d1,d2 ;2 P1
    move.l d1,d4 ;2 P2
    move.l d0,d3 ;3 P1
    asr.l #1,d2 ;3 P2
    lea.l (a0,d0.l*4),a1 ;4 P1
    lea.l (a3,d1.l*4),a2 ;4 P2 a0=a3
    move.l (a0,d2.l*4),d2 ;5 P1
.outer
.o  cmp.l a1,a2 ;5 P2
    beq.b .next ;free
    cmp.l (a1)+,d2 ;6 P2 ;arr<pivot
    ble.b .o ;?
    subq.l #4,a1 ;7 P1
.next addq.l #4,a2 ;7 P2
.o2 cmp.l a1,a2 ;8 P1
    beq.b .s ;free
    cmp.l -(a2),d2 ;9 p
    bgt.b .o2 ;? ;arr[ j ]>pivot
.s  cmp.l a1,a2 ;10 P1 ; if (i <= j)
    blt.b .o3 ;free ;if a1>a2 then skip
    sub.l #4,a2 ;11 P1
    move.l (a1),d5 ;11 P2
    move.l -4(a2),(a1)+ ;12 P1
    move.l d5,(a2) ;13 P1
    cmp.l a1,a2 ;14 P1
    bge.b .outer ;free
.o3 sub.l a0,a1 ;14 P2
    sub.l a3,a2 ;15 P1
    move.l a2,d7 ;15 P2 fused
    asr.l #2,d7 ;15 P2
    move.l a1,d6 ;16 P1 fused
    asr.l #2,d6 ;16 P1
.k  cmp.l d3,d7 ;16 P2 ;left=d3, j=d7 if (d3 >= d7) then skip
    ble.b .skip ;?
    movem.l d4/d6,-(sp) ;18
    move.l d7,-(sp) ;19
    move.l d3,-(sp) ;20
    bsr.b SP_QSORT ;21
    addq.l #8,sp ;22
    movem.l (sp)+,d4/d6 ;24
.skip cmp.l d6,d4 ;25 ;i=d6, right=d4 if (d6>=d4 ) then skip
    ble.b .skip2 ;?
.kk move.l d4,-(sp) ;26
    move.l d6,-(sp) ;27
    bsr.b SP_QSORT ;28
    addq.l #8,sp ;29
.skip2 rts ;30
Here is the VBCC compiled version with cpu=68060 and speedup=true:
TEST_QSORT_VBCC:
    lea data,a0
    subq.w #4,a7
    move.l #quicksortnumbers,-(a7)
    move.l #0,-(a7)
    move.l a0,-(a7)
    jsr _quickSort
    add.w #12,a7
    addq.w #4,a7
    rts
_quickSort
    sub.w #16,a7
    ; movem.l .l21,-(a7)
    ; movem.l d0-d6/a0-a6,-(a7)
    move.l (24+.l23,a7),(0+.l23,a7)
    move.l (28+.l23,a7),(4+.l23,a7)
    move.l (24+.l23,a7),d0
    add.l (28+.l23,a7),d0
    asr.l #1,d0
    move.l (20+.l23,a7),a0
    move.l (0,a0,d0.l*4),(12+.l23,a7)
    bra .l7
.l6
    bra .l10
.l9
    addq.l #1,(0+.l23,a7)
.l10
    move.l (0+.l23,a7),d0
    move.l (20+.l23,a7),a0
    move.l (0,a0,d0.l*4),d1
    cmp.l (12+.l23,a7),d1
    blt .l9
.l11
    bra .l13
.l12
    subq.l #1,(4+.l23,a7)
.l13
    move.l (4+.l23,a7),d0
    move.l (20+.l23,a7),a0
    move.l (0,a0,d0.l*4),d1
    cmp.l (12+.l23,a7),d1
    bgt .l12
.l14
    move.l (0+.l23,a7),d0
    cmp.l (4+.l23,a7),d0
    bgt .l16
.l15
    move.l (0+.l23,a7),d0
    move.l (20+.l23,a7),a0
    move.l (0,a0,d0.l*4),(8+.l23,a7)
    move.l (4+.l23,a7),d0
    lsl.l #2,d0
    move.l (20+.l23,a7),a0
    add.l d0,a0
    move.l (0+.l23,a7),d0
    move.l (20+.l23,a7),a1
    move.l (a0),(0,a1,d0.l*4)
    move.l (4+.l23,a7),d0
    move.l (20+.l23,a7),a0
    move.l (8+.l23,a7),(0,a0,d0.l*4)
    addq.l #1,(0+.l23,a7)
    subq.l #1,(4+.l23,a7)
.l16
.l7
    move.l (0+.l23,a7),d0
    cmp.l (4+.l23,a7),d0
    ble .l6
.l8
    move.l (24+.l23,a7),d0
    cmp.l (4+.l23,a7),d0
    bge .l18
.l17
    move.l (4+.l23,a7),-(a7)
    move.l (28+.l23,a7),-(a7)
    move.l (28+.l23,a7),-(a7)
    jsr _quickSort
    add.w #12,a7
.l18
    move.l (0+.l23,a7),d0
    cmp.l (28+.l23,a7),d0
    bge .l20
.l19
    move.l (28+.l23,a7),-(a7)
    move.l (4+.l23,a7),-(a7)
    move.l (28+.l23,a7),-(a7)
    jsr _quickSort
    add.w #12,a7
.l20
.l5
.l21
    ;movem.l (a7)+,d0-d6/a0-a6 ;reg
    add.w #16,a7
.l23 equ 0
    rts
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 10 Apr 2011 23:59
| deep sub micron wrote:
| Implementing SMC support without adding penalty cycles is not trivial, well it sounds actually nearly impossible. For example there are two cases. When data is still in the pipeline then it might be forwarded. But if data was written to data cache, then it takes some cycles until instruction fetch can access that data. There might be a gap in between both cases. If data is still in pipeline there must be a check whether the whole or just a part of the immediate value was affected (and aligned?) and a check if the opcode was affected. I guess there are dozens of other things that need to be considered, too.
|
A penalty is OK. In my code I want to create instructions e.g. 100 cycles ahead. If the cache controller takes 50 cycles to support this, that's OK, as long as it's done in parallel. If it's difficult to make, I am sure you can find a solution. :D Example:
    move.l #333,.o+2(pc)
.o  add.l #222,d0 ;50 cycle stall.

    move.l #333,.d9+2(pc)
    (50 cycles of code)
.d9 add.l #222,d0 ;no stall.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 11 Apr 2011 06:32
| Hi SP, I think it would most likely look like this:
S P wrote:
| move.l #333,.o+2(pc) .o add.l #222,d0 ;50 cycle stall.
|
Silently creating a wrong result.
S P wrote:
| move.l #333,.d9+2(pc) (50 cycles of code) .d9 add.l #222,d0 ;no stall. |
Works fine. The reason is simple: a CPU pipeline consists of several stages:
1) Fetch Instruction
2) Decode Instruction
3) Fetch Registers
4) Calculate EA
5) Data-Operands Fetch
6) ALU Operations
7) Store Result
Your instructions get fetched in stage 1. Your cache will be updated in stage 7. This means that by the time the instruction alters the cached content, 6 new instructions were already fetched and partly executed in the pipeline stages! Even if the ICache is snooping, the CPU will by itself never notice that these already processed instructions are wrong. But this will normally not be a problem.
As you certainly know, even the 68000 could fail on self-modifying code. The 68000 had no cache, but it did prefetch instructions by 6 bytes. If you did self-modification within such a window, then it could fail unnoticed on the 68000 too!
On the 68K cores with cache (020 and up) the problem was bigger, because a cache could make self-modification fail even a thousand instructions later. If you disabled the cache, the CPU did always try to prefetch 16 bytes. This means there was always a NO-GO area of 16 bytes for self-modified code, plus the pipeline length, even if you did disable or flush the caches. So your code would have failed on the old pipelined 68K CPUs even with disabled caches! And if you had written self-modifying code that was shorter in bytes, then it could have failed on the 68000 too. I assume you agree with this.
I think surviving self-modifying code which alters code not already in the pipeline could be a nice feature, as it makes the CPU able to execute, with enabled caches, code that other 68K cores could only execute with disabled caches. This would be an advantage in running old code that was written in an incompatible way. Making the CPU survive modification of code that is already being processed in the pipeline would mean adding an extra trace unit to the CPU core as well, which IMHO is a significant CPU cost increase which we should rather not accept.
But isn't the main question really: "Is Selfmodifying code needed today?" Maybe we should answer this first.
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 11 Apr 2011 08:43
| Gunnar von Boehn wrote:
| "Is Selfmodifying code needed today?" |
The most common instruction to change with SMC is this:
    move.b 0000(a0),d0
instead of writing this:
    move.w (a1)+,d2
    move.b (a0,d2.w),d0
This is one cycle faster. It frees 2 registers and a data cache access, and removes the ALU stall. Here is a demo that uses SMC (optimized for the MC68030): EXTERNAL LINK
Another example is to do this in 1 cycle: 4 memreads with fusion. The current N050 will use 4 cycles on this. With SMC, 2 variables can be stored in the instruction cache and 2 in the data cache.
    move.l (sp)+,d0 ;1
    add.l (sp)+,d0 ;2
    move.l (sp)+,d1 ;3
    add.l (sp)+,d1 ;4
With SMC (2 cycles on N050 and one cycle on N070):
    move.l #xxx,d0 ;fused
    add.l (sp)+,d0 ;1 P1
    move.l #xxx,d1 ;fused
    add.l (sp)+,d1 ;1 P2
I can create more examples.
| |