 |
Welcome to the Natami / Amiga ForumThis forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.
|
Do you have questions about the Natami? Post it here and we will answer it!
|
|
|---|
Gabriele Budelacci Italy
| | Posts 111 11 Jul 2009 00:25
| Another solution may this: cmp.l #30,d0 sgt d1 and.l #10,d1 sub.l d1,d0 Not limited to have 16 bit d0. 4 cycles and not needed EXTB instruction since #10 is contained in the lower 8 bits => plain 68000 code. Obviously, these instructions are coupled too, so no advantage in superscalar execution.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 11 Jul 2009 01:19
| Thomas Richter wrote:
| How do you handle the cases with overlapping instructions, that is, if the effective address of the addi instruction is used by the following instruction? In that case, decoding and probably fetching the input data of the following instruction is already completed before the previous instruction (let's say addi) had the chance to write its output to the cache? Example (considerably stupid, but whatever): addq.l #1,(a0)+ subq.l #1,-(a0)
|
The above is not a problem. I'll explain you why. 1) The 68050 has to write ports. One for the EA-Unit, and one for the ALU. 2) Also the 68050 uses result forwarding on both of them. The Write-port of the EA-Unit is used by the following instructions: 1) Address modes that alter the Adress pointer (An)+ and -(An) 2) By ADDI, SUBI, ADDQ AN, ... and friends These instructions use the EA-Writeport to update the Addres-register. The result of the EA-Unit is forwarded to the next Instruction using the EA-Unit and also to following instructions using the Addressregister. This means stuff like this works without any problem clr (An)+ clr (An)+ clr (An)+ The updated content of An is forwarded to the next instruction in time. The 2nd write port belongs to the ALU. The ALU does also forward is results. This means a instruction group like this works without any penalty: move D1,(A0)+ add D0,D1 neg D0 All the instructions are depending, nevertheless the 68050 can execute them all in 1 cycle each. But there is one "gotcha". This is happens when an ALU result is used by one of the following 2 instructions in the EA-Unit. The 68050 can not forward in time in this case of course. The reason that result forwarding is impossible is that the ALU stage is simple executed later than the EA-unit. This means when the first instruction reaches the ALU the 2nd already left the EA_unit. Typical example is this one: move (A0,D0),D1 neg D0 D0 is used in the EA-stage - but only updated in the ALU stage. If the 68050 runs into such code is has to issues 2 wait cycles. The bubble can be reduced and removed by adding other instructions inbetween. This behavior is by design of the 68K architecture. This problem is shared with all other 68K CPUs. E.g. The 68060 will create just the same penalty in this example. But this case is relative rare and the bubble can easily be avoided by instrution reordering. Did this help? Cheers
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 11 Jul 2009 01:47
| Gabriele Budelacci wrote:
| Another solution may this: cmp.l #30,d0 sgt d1 and.l #10,d1 sub.l d1,d0 |
This is very clever! Also 4 cycles is an excellent result! I think the 68050 is the only 68K CPU which can beat this. And it can beat this even without any cleverness on the programmer side :-) If you just code it this way: cmp.l #30,d0 ble .jump sub.l #10,d0 .jump Then its 3 instruction, which are executed in 3 cycles on the 68050. Normally the BRANCH after the CMP is a pain in the arse. We all know this. The branch needs the condition code - but the condition codes are generated at the end of the pipeline. The branch wants to alter the PC which sits at the beginning of the pipeline. This is normally the problem which makes a CPU pay with its pipeline length for such code.So normally this combination would need to wait until the CMP went down the complete pipeline before the jump can be executed. So we look at the pipeline diagram, then we can see: Unit diagram: 1) Fetch/ICache 2) Decoder 3) Register load 4) EA-Unit 5) DCache-Load 6) ALU Operation 7) Register-Store/DCache-Store That "normally" a jump after the CMP is expensive. In our case the pipeline is even short "only" 6 clocks for this. Other CPU would pay up to 30 clocks for a jump. Luckily the 68050 does simply cheat on this one. The decoder can rewrite the instructions to: cmp.l #30,d0 subgt.l #10,d0 The SUB is now conditional. The clue is that the SUB can be executed just normally. The condition is not needed early like with the branch - its only needed in the very last stage, the writeback stage. This means the rewritten instruction does not hurt the pipeline anymore. And the SUB can be executed just after the CMP. Nice isnt't it? Cheers Gunnar
| |
Mike Sieweke USA
| | Posts 11 11 Jul 2009 02:56
| Thomas Richter wrote:
| i = i - (((30 - i) >> 31) & 10) This, indeed, will do faster on many modern processors (and is of course a hack) due to the barrel shifter. |
This is a clever solution, but not correct in all cases. If i == -MAXINT then the subtraction will overflow and give the wrong answer. Also, some "modern" CPUs (like the P4) are quite slow in the barrel shifter. The MIPS32 does best with the simple "if" statement. The Gnu compiler produces this code, which executes in 3 cycles. The other suggestions take at least 4 instructions and 4 cycles. # if i > 30 then i = i - 10 slti t0, i, 31 # t0 = i .lt. 31 addi t1, i, -10 # t1 = i-10 movz i, t1, t0 # if t0 .eq. 0 then i=t1
ARM also supports conditional execution, so it would use similar code. It's best to check what your compiler does before writing clever code.
| |
Gone Gahgah Australia
| | Posts 237 11 Jul 2009 05:09
| Gabriele Budelacci wrote:
| Another solution may this: cmp.l #30,d0 sgt d1 and.l #10,d1 sub.l d1,d0
|
Very clever. I imagine the best way is if we could just write the branch form and the compiler reworked it to the most efficient way for the CPU. And if it could do this for all patterns then we wouldn't have to worry about it and just write the most easily read. I imagine that human ingenuity means that someone somewhere will be able to create code patterns that weren't predicted and which the compiler/optimiser didn't know how to rework more efficiently than a clever assembler coder could. Gunnar von Boehn wrote:
| The instruction does indeed quite a lot of work. :-) The 68050 can do this amount of work in 1 clock because of its pipelined design. Unit diagram: 1) Fetch/ICache 2) Decoder 3) Register load 4) EA-Unit 5) DCache-Load 6) ALU Operation 7) Register-Store/DCache-Store
|
Thanks for that; now I understand and appreciate pipelining. Cool existing solution.Gunnar von Boehn wrote:
| The predication conversion is supposed to rewrite BCC branches to become much faster!
|
Cool. I didn't even know to think that such a solution was about so thanks for volunteering this. It adds to my general understanding of things. I imagine it increases the number of required instructions fairly significantly? But so beneficial. All of this CPU writing and testing you are doing at the moment is being done virtually at the moment isn't it Gunnar? Do you actually get to have any FPGAs to burn too and test your ideas? Does that come later when it is all 'debugged'? Does an FPGA version always behave exactly like the virtual version? If it doesn't what sort of things cause the variances? These would be interesting things for me just to know. I'm unlikely to ever delve into CPU coding in a broad way but it is still fun to know these little things.
| |
Team Chaos Leader USA
| | (Moderator) Posts 2094 11 Jul 2009 09:01
| subgt.l #10,d0
Is that a real instruction on the 050 that I can write in my own asm programs?I have been wanting instructions like that since 1992!
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 11 Jul 2009 12:19
| gone gahgah wrote:
| All of this CPU writing and testing you are doing at the moment is being done virtually at the moment isn't it Gunnar?
|
Yes, I do all my development with chip simulator.gone gahgah wrote:
| Do you actually get to have any FPGAs to burn too and test your ideas? |
Yes, we have several FPGA dev-boards even including big Virtex and Stratix boards. But development in the simulator is faster so I prefer doing this. This is by the way very normal. Today all CPUs are developed in using simulator environments. Cheers
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 11 Jul 2009 12:28
| Team Chaos Leader wrote:
|
subgt.l #10,d0
Is that a real instruction on the 050 that I can write in my own asm programs? I have been wanting instructions like that since 1992!
|
Inside the 68050 : - all instructions are 3 operant instructions - and all instructions have a condition code. The 68K family sometimes uses conditions codes and sometimes 3 opperants - so its easier for the hardware to have this internally always. So far we do not expose this to the user with new opcodes. Doing this would mean we would need to "invent" new instructions. To be honest I'm not sure if we should ever do this. What do you think?
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 11 Jul 2009 21:04
| Here is a small (untested) piece of code that Rotate 3d points around the y-axis and perform perspective transformation.. The code is not optimized. The obvious bottleneck in this loop is the two divs's per point. The divses can be transformed to muls if I set the zcoord in the buffer to 1/(Z1+dist) but let us assume that the zcoords are being changed from frame to frame. If the N050 could perform the division in paralell to the other instructions this loop could be optimized almost 100%. Another improvement would be if movem.w could move to 2 or more registers in one clock(if the buffer is in the datacache) For faster fixed point transformations, a muls.w dx,dx,#xx where xx is asymertric shift right. would be extremely powerful. I always wanted a swap.b The optimal loop is different from processor to processor. If somebody could compile the c-code I provided we can check how my code is doing compared to the compiler. If I get the complete cycle-timing of the N050 I will try to create the fastest loop... Execution timing(could be small errors). Assuming the code is in the cache, and no superscalar. Mc68000 move 4 cycles muls.w 43 cycles sub.l 6 cycles asr.l(12) 32 cycles (8+2*12) lsl.l(12) 32 cycles ext.l 4 cycles divs.w 158 cycles cmpa.l 6 cycles bne.b 10 cycles Mc68030 move 2 cycles muls.w 28 cycles sub.l 2 cycles asr.l(12) 4 cycles lsl.l(12) 4 cycles ext.l 2 cycles divs.w 58 cycles cmpa.l 2 cycles bne.b 6 cycles Mc68060 move 1 cycle muls.w 2 cycle sub.l 1 cycle asr.l(12) 1 cycle lsl.l(12) 1 cycle ext.l 1 cycle divs.w 22 cycle cmpa.l 1 cycles bne.b 1-7 cycle N050 move 1 cycle muls.w 1 cycle sub.l 1 cycle asr.l(12) 1 cycle(?) lsl.l(12) 1 cycle(?) ext.l 1 cycle divs.w x cycle(?) cmpa.l 1 cycle bne.b 1-7 cycle(?) D0=X D1=Y D2=Z D3=SIN(V) (FIXED POINT 4:12) D7=COS(V) (FIXED POINT 4:12) A0=points to buffer with 3d coords.(input) A1=points to buffer with 2d coords.(output) a6=points to end of buffer with 3d coords. dist=192 moveq.l #12,d6 .loop movem.w (a0)+,d0-d2 ;x ;X1=X*COS-Y*SIN ;Y1=X*SIN+Y*COS move.l d0,d4 muls.w d7,d0 ;X*COS muls.w d3,d4 ;X*SIN move.l d1,d5 muls.w d7,d1 ;Y*cos muls.w d3,d5 ;y*sin sub.l d5,d0 ;X*COS-Y*SIN sub.l d1,d4 ;X*SIN+Y*COS ; x=x1*dist / (z1+dist) ; y=y1*dist / (z1+dist) asr.l d6,d0 asr.l d6,d4 muls.w #dist,d0 muls.w #dist,d4 add.w #dist,d2 ext.l d2 lsl.l d6,d2 move.l d2,d1 divs.w d0,d1 divs.w d4,d2 movem.w d1-d2,(a1)+ cmpa.l a0,a6 bne.b .loop .. C-Code: for(int i=0;i
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 11 Jul 2009 21:06
| The editor is destroying my c-code.
| |
Gone Gahgah Australia
| | Posts 237 11 Jul 2009 23:20
| S P wrote:
| The editor is destroying my c-code.
|
Such is Natami website life...
Gunnar von Boehn wrote:
| Luckily the 68050 does simply cheat on this one. The decoder can rewrite the instructions to: cmp.l #30,d0 subgt.l #10,d0
|
Gunnar von Boehn wrote:
| So far we do not expose this to the user with new opcodes. Doing this would mean we would need to "invent" new instructions. To be honest I'm not sure if we should ever do this. What do you think?
|
Another thing I have just learnt/just dawned. Thanks Gunnar. Cool. The decoder has its own sub-psuedo-instruction set that it can modify actual instructions to; eliminating branching. I think that if this is just as quick as allowing the coder do it directly then there isn't any point adding the new commands. I guess it is just a psychological thing. I would have a easier time mentally if I were able to code it directly as it is then obvious what is happening and I am happy because I can directly see that I do not to have a branch anymore. But what I should do is rest easy confident that the N68050 will eliminate the nasty branch for me automatically. At the same time my code maintains backward compatibility so that I can run it on older 680x0s without exceptions. I imagine also it makes things more jammed if we start adding new instructions; especially multiple flag variations on existing instructions. The above are my interpretations from what you have said Gunnar. I hope it is correct. I imagine the decoder has to be bigger? Does it have its own little clever tricks and shortcuts?
Gunnar von Boehn wrote:
| JMP xxx.w/d16(pc)/xxxx.l = 1 clock JSR xxx.w/d16(pc)/xxxx.l = 1 clock All other JMP and JSR = 3 clocks
|
Could you possibly do a little expose on what makes those other JMP and JSR instructions 3 clocks instead of 1 Gunnar? I'm interesed by exceptions to the rule and I have not a clue to guess it myself.
| |
Denis Markovic Germany
| | (Natami Team) Posts 41 12 Jul 2009 07:20
| Hi Gunnar, > The 68K family sometimes uses conditions codes and sometimes 3 > opperants - so its easier for the hardware to have this internally > always. Cool, that makes it really powerfull, just like e.g. ARM in ARM32 mode or other processors :) > So far we do not expose this to the user with new opcodes. > Doing this would mean we would need to "invent" new instructions. > To be honest I'm not sure if we should ever do this. > What do you think? I fear, you are right. If you always recognize such code (conditional branch over following instruction) and take the shortcuts, you do optimization while staying quite compatible, so no need to have new opcodes for that and it runns on other 68k processors. On the other hand ... if you maybe decide to add things like vector instructions, array min/max, norm etc. to some maybe future (far far away) 68080 then I would also make the conditional instructions available in the future (then it does not matter anyway because you have a lot of new instructions). Btw.: I think I missed all discussion about Robin capabilities and 3d Core capabilities. Could you give me a hint in which forum post to find some information about instruction set, vector processing, parallel instructions etc. in Robin and the 3D core? Just interessted as I was programming such things the last 10 years :-)
| |
Marcel Verdaasdonk Netherlands
| | Posts 3977 12 Jul 2009 12:30
| I think we should do the same thing IBM is upto with there new compiler. they are working on a compiler that compilers and then runs code to see how fast is goes then compiles it in a different way and test again, then compairs the run time, and takes the fastest code.
| |
Denis Markovic Germany
| | (Natami Team) Posts 41 12 Jul 2009 15:53
| Isn't gcc able to do the same? If I remember right, there was a mode where you could compile a special executable, let it run on typical testdata and during that the executable writes information about loop counts etc. into a file.Then the same code can be recompiled using the collected information for optimization, usually gives a few % more speed. Marcel Verdaasdonk wrote:
| I think we should do the same thing IBM is upto with there new compiler. they are working on a compiler that compilers and then runs code to see how fast is goes then compiles it in a different way and test again, then compairs the run time, and takes the fastest code.
|
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 12 Jul 2009 17:21
| gone gahgah wrote:
| Gunnar von Boehn wrote:
| JMP xxx.w/d16(pc)/xxxx.l = 1 clock JSR xxx.w/d16(pc)/xxxx.l = 1 clock All other JMP and JSR = 3 clocks |
Could you possibly do a little expose on what makes those other JMP and JSR instructions 3 clocks instead of 1 Gunnar? I'm interesed by exceptions to the rule and I have not a clue to guess it myself. |
Sure, this is easy to explain. Unit diagram: 1) Fetch/ICache 2) Decoder 3) Register load 4) EA-Unit 5) DCache-Load 6) ALU Operation 7) Register-Store/DCache-Store The Branch acceleration unit is located in stage 1). The Fetch/Branch acceleration unit contains the program counter PC. The 68k knows two main branch types. Conditional Branches - which need the condition codes which are available at the end of the Pipeline. And unconditional branches. Unconditional branches are: BRA,BSR,JSR,JMP and also RTS,RTE,... The unconditional branches can ge grouped in four different types of how to address to jump to is calculated. a) JUMP to an absolute address, included in the opcode JMP $100000 b) PC-relative JUMP BRA.w label BSR label JMP label(pc) c) A jump to a calculated Effective Address (EA) JMP -240(A6) d) Jump to an address located in memory. RTS The branch acceleration unit can on its own handle the branch types A and B. (A) is easy to handle as the address is contained in the instructions. (B) is also easy to handle as only an addition to the PC is needed. The branch acceleration unit contains three adders-unit for this purpose. The branch type C needs access to the register set and an EA calculation. This can only be done by the EA-unit. Which comes 3 stages later in the pipe. This is way this type of branch take longer. The branch to memory (RTS) is basicly a MOVE.l (A7)+,PC And therefore needs also be done later in the pipeline Did this answer your question?
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 13 Jul 2009 08:24
| Gabriele Budelacci wrote:
| Another solution may this: cmp.l #30,d0 sgt d1 and.l #10,d1 sub.l d1,d0 |
If I understand the main usage of the SCC instruction then its most useful in combination with an AND and then with any other instruction that you want to replace. If we wanted, then we could create a combined SCC/AND instruction which does the same but in 1 cycle. Something like : SCC.L #immidiate,DnWould that make sense?
| |
Marcel Verdaasdonk Netherlands
| | Posts 3977 13 Jul 2009 09:06
| oddly enough it does make sence. However this shouldn't force the programmer to allways use it commbined.
| |
Gabriele Budelacci Italy
| | Posts 111 13 Jul 2009 11:54
| @Gunnar I disagree, it whould be used only in few cases like this. Personally, if new instructions will be added, I strongly prefer math+cc instructions, eg: cmp.l #30,d0 subgt.l #10,d0 cheers ;-)
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 13 Jul 2009 20:39
| Gabriele Budelacci wrote:
| Personally, if new instructions will be added, I strongly prefer math+cc instructions, eg: cmp.l #30,d0 subgt.l #10,d0 cheers ;-)
|
There is one challenge with doing "CC" on any type of instruction - which I should not hide. If someone sees a clever solution to this problem please speak up! CC-ing an instruction can be easily done on any type of instruction with do their alteration only in the ALU-stage. If an instruction does any type of alteration in the EA-stage then this is very difficult to "CC" is the condition codes are not available in that early stage. In other words: move (a0),D0 - can be easily CC'ed as it only alter D0, and this is done in th ALU-stage. move (a0)+,D0 -- is a problem as it alter the content of A0 in the EA-stage also. Any idea?
| |
Gabriele Budelacci Italy
| | Posts 111 13 Jul 2009 22:04
| Gunnar von Boehn wrote:
| In other words: move (a0),D0 - can be easily CC'ed as it only alter D0, and this is done in th ALU-stage. move (a0)+,D0 -- is a problem as it alter the content of A0 in the EA-stage also. Any idea?
|
Firsts, I don't know much on HW development :-P Idea #1: place the CC test before in the pipeline: Unit diagram: 1) Fetch/ICache 2) Decoder 3) Register load 3-bis) CC test, continue if true 4) EA-Unit 5) DCache-Load 6) ALU Operation 7) Register-Store/DCache-Store Idea #2: allow CC-test only for specified EA (eg: allowed for (a0), not allowed for (a0)+ ) Idea #3: insert a rollback stage after ALU operation stage: Unit diagram: 1) Fetch/ICache 2) Decoder 3) Register load 4) EA-Unit 5) DCache-Load 6) ALU Operation 6-bis) CC rollback test 7) Register-Store/DCache-Store
| |
|
|
|
|