|
|---|
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 24 Jun 2010 00:39
| Wojtek P wrote:
| If only the 68050 decoder can decode more instructions than the ALU can execute, it will execute all IFs as linear code, with unexecuted instructions taking 0 cycles!! |
You praise the bitmap IF instruction, but I fail to see its benefit over a real "BCC". Why should it be any faster than a normal BCC instruction? I see no reason why it should or could.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 24 Jun 2010 00:47
| Angel of Paradise wrote:
| Gunnar von Boehn wrote:
| But PowerPC is unfortunately a bit more difficult to code for. And some PowerPC chips do not perform well with "normal" or "average" algorithms - these chips need very specially tuned algorithms to reach their performance. |
Which PowerPC chips are easiest to code for and which are most difficult? |
60x is good to code for.
G3 is good to code for. The G3 is of average speed and has no major issues that you need to work around. One limitation of the G3 is its relatively weak L2 cache/protocol and the resulting limited memory performance.
G4 is good to code for. The G4 has relatively good performance. It has few issues, except maybe that instruction fetch is relatively weak and the CPU can starve itself a little after jumps. The G4 is relatively elegant with its short pipeline. Freescale did quite a good job on it. If the memory throughput had been increased it would have been even better.
G5 is one of the best PowerPCs. It's fast and has very good out-of-order execution. This allows it to reach good performance with normal code.
Any PPC which came out after the G5 was much more difficult to code for. Cell has many big issues that you need to work around as a coder to prevent serious degradation of performance. Xenon has the same issues as Cell. Power6 also has high theoretical performance, but in real life it has way too many issues that eat into the performance.
If you ask me what the best PPC is, then I would say that the 970 (G5), also known as "Antares", was it. The load/store unit was strong. All address modes worked reasonably well (which is unfortunately not the case with most later PowerPCs). The integer unit of the G5 was quite good clock for clock. The G4 was sometimes a little better in integer clock for clock. But the G5 reached a much higher clock rate, was stronger in float, and because of its faster memory interface was not as limited as the G4. The G5 decoder had some surprises, but its good out-of-order execution could make up for this.
It's a pity that the G5 was discontinued so quickly. The G5 design was neat and had good potential. If IBM had iterated on the G5 and brought out newer versions with more cache or less power consumption, then those could all have been great PowerPCs. It's a shame that the PowerPC has vanished from the desktop.
But of course it would have been difficult to build proper desktop machines with PowerPC anyway. After the G5, no PowerPC CPU was produced that would make commercial sense to build a desktop around. All PowerPC CPUs which came out after the G5 were weaker than the G5 in general code performance.
| |
Asaf Ayoub United Kingdom
| | Posts 332 24 Jun 2010 04:19
| What do you think about Freescale: P3041: QorIQ P3041 Quad Core EXTERNAL LINK and Freescale: P5020: QorIQ P5020 and P5010 Dual and Single 64-bit Core Processors EXTERNAL LINK? Have they improved at all?
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 24 Jun 2010 07:12
| Asaf Ayoub wrote:
| What do you think about Freescale: P3041: QorIQ P3041 Quad Core EXTERNAL LINK and Freescale: P5020: QorIQ P5020 and P5010 Dual and Single 64-bit Core Processors EXTERNAL LINK? Have they improved at all?
|
You are right, we forgot the QUICKIES. The PowerQUICC chips live in their own little world. I don't have enough practical experience with them, so I cannot give a fair assessment of them.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 24 Jun 2010 07:19
| I think we can learn from the coming and going of the different Power chips. The reason why the G5 was so good was that a lot of energy was put into making the chip able to execute all code properly. This means that even though this chip was fast, it was relatively good to code for. The later high-clocked chips focused on high clock rates but traded away ease of use and the ability to execute all code well. While the POWER architecture was a RISC architecture, some of the newer chips suddenly dropped support for RISC execution of many instructions and only implemented them in microcode. Suddenly instructions which used to take 1 cycle on previous CPUs needed 10, 20 or more cycles. This of course caused performance problems with existing software.
The moral that I take from this is: ease of coding and support for existing code are very important. But we can also learn from the good and bad of the existing 68K chips. Would someone like to write up a little summary of the different 68K chips, their abilities, and their strengths and weaknesses?
| |
Cesare Di Mauro Italy
| | Posts 526 24 Jun 2010 08:17
| Wojtek P wrote:
| @Cesare The only common ISA I know to have denser code than 68k is actually Thumb-2 :) The least dense 32-bit CISC is Intel 386 and newer. Usually you need many more instructions with x86 to do the same work than with most RISCs, which is really funny. |
Do you have some data about it? It doesn't sound right to me.
And the achievement of the Intel and AMD teams in making this completely stupid instruction set execute so fast is worthy of a Nobel prize. And I did A LOT of coding with this instruction set. Every time I've seen others, most importantly 68k and ARM, I thought: how simple and efficient it is! Just think of when Motorola defined the 68k ISA and when Thumb-2 was defined. Pure 32-bit ARM has been from the beginning the densest and most efficient RISC. The trick of including barrel shifting as part of instruction execution (in one clock with everything else) is great. 99% of the constants used in a program can be put directly into the opcode that way. |
I don't agree. You are missing one of the key points of CISCs: the ability to use memory addressing directly in regular operations (arithmetic, logic, and move too). That's why CISCs had much better code density than the RISCs, ARM included (which I think is one of the best RISCs, and which has denser code compared to other RISCs).
| |
Cesare Di Mauro Italy
| | Posts 526 24 Jun 2010 08:21
| Wojtek P wrote:
| @Cesare Bitmap storage is not a problem. For simple cases, up to 3 following instructions, it can fit in a single word. Everything else is encoded in 2 words. |
I don't see where you want to put the IF bitmap mask. Do you want to add a brand new register to the 68K ISA?
MOST if/else fragments in C code are not larger than 16 instructions. Many fit in 4, so even a 4-instruction IF-ELSE is a great improvement, which is what ARM Thumb-2 actually shows. A 16-instruction IF-ELSE will give zero-cost conditional execution for more than 95% of code fragments. |
16 conditional instructions completely kill the advantage of this feature, which was born to prevent pipeline flushes and reloads. It works well with a handful of instructions: 4 is a good compromise.
| |
Wojtek P Poland
| | Posts 1597 24 Jun 2010 09:09
| Gunnar von Boehn wrote:
| Why should it be any faster than a normal BCC instruction? I see no reason why it should or could.
|
Simple:
    instruction 1
    jump if not zero l1
    instruction 1
    instruction 2
    instruction 3
    . . .
    instruction 10
    jump l2
l1: instruction 11
    . . .
    instruction 20
l2:
This is a simple case where 10 instructions have to be executed on zero and the other 10 on non-zero. How do you optimize this? Imagine:
    if nonzero, 20 instructions, bitmap 01010101010101010101
    instruction 1
    instruction 11
    instruction 2
    instruction 12
    instruction 3
    instruction 13
    . .
    instruction 10
    instruction 20
Here it is enough to have a decoder that can digest 2 instructions per cycle. It would be good if it could take 3 simple ones per cycle. Each cycle the CPU will have 2 instructions to choose from and will select one according to the condition.
ACTUALLY no bitmap is needed. All you need is to state how many instructions are IF and how many are ELSE. The sequence could simply be predefined so that the instructions are evenly mixed. You know that memory throughput isn't a problem but memory latency is. This gives one-cycle (or even zero-cycle) "branch" costs with unpredictable branches.
The problem will be doing calls within such if-else fragments, which could perform nested if-else. I think it can be handled by an if-else stack, but it's not needed. Simply disallow such things. If someone needs to call a procedure from that sequence, he should make it the last instruction in the sequence. This both removes the need for nesting and improves branch logic: the instructions before the call will fill the pipeline and the decoder could easily "predict" the unconditional call/branch.
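To make the proposed layout concrete, here is a minimal Python sketch of the interleaving described in this post. It is only an illustration under assumptions: the function names `interleave` and `select` are hypothetical, bit 0 marks an instruction from the "zero" half and bit 1 one from the "non-zero" half (matching the bitmap 0101... above), and nothing here models decoder or pipeline timing.

```python
from itertools import zip_longest


def interleave(on_zero, on_nonzero):
    """Mix the two halves of an if/else evenly and build the matching bitmap.

    Bit '0' marks an instruction that belongs to the "zero" half,
    bit '1' one that belongs to the "non-zero" half (an assumed convention).
    """
    stream, bits = [], []
    for a, b in zip_longest(on_zero, on_nonzero):
        if a is not None:
            stream.append(a)
            bits.append("0")
        if b is not None:
            stream.append(b)
            bits.append("1")
    return stream, "".join(bits)


def select(stream, bitmap, nonzero):
    """Return the instructions whose bitmap bit matches the condition."""
    want = "1" if nonzero else "0"
    return [ins for ins, bit in zip(stream, bitmap) if bit == want]


on_zero = [f"instruction {i}" for i in range(1, 11)]
on_nonzero = [f"instruction {i}" for i in range(11, 21)]
stream, bitmap = interleave(on_zero, on_nonzero)
```

With the 10+10 example, `bitmap` comes out as the alternating pattern from the post and `select` recovers either half in its original order. Whether hardware can pick the right instruction without first passing both through the pipeline is exactly what the thread goes on to dispute.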
| |
Wojtek P Poland
| | Posts 1597 24 Jun 2010 09:12
| Cesare Di Mauro wrote:
| Wojtek P wrote:
| Do you have some data about it? It doesn't sound right to me. |
Yes, I wrote a lot of assembly programs for up to the 486 (in 32-bit flat mode of course) and a bit for ARM/Thumb-1. |
There is no comparison.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 24 Jun 2010 09:22
| Wojtek P wrote:
| CPU will have each cycle 2 instructions to choose from and select one according to needs. |
I fail to see the performance advantage in doing it like this. :-/
The CPU cannot decide which half of the instructions to take until the BITMAP instruction has reached the very end of the pipeline, where the flags are valid. This means the CPU needs to execute all instructions of both groups. In your example with groups of 10 instructions, the CPU has to execute 21 instructions. A real BCC could be predicted, and if predicted right it would only execute 12 instructions. If predicted wrong, the BCC would execute 1 + pipeline length + the other half. So in the wrong case the BCC would execute about 16-17 instructions. So either way the BCC is far more efficient than the BITMAP solution.
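The arithmetic in this argument can be written out as a small Python sketch. This is only a restatement of the counts given above, not a model of any real CPU: the function names are hypothetical, each instruction is counted as one issue slot, and the pipeline lengths of 5 and 6 are assumptions chosen because they reproduce the "about 16-17" figure.

```python
def bitmap_cost(n_if, n_else):
    """IF/bitmap instruction plus every instruction of both halves."""
    return 1 + n_if + n_else


def bcc_cost_predicted(n_path, exit_jump=True):
    """BCC, the instructions of the path taken, and the jump over the other half."""
    return 1 + n_path + (1 if exit_jump else 0)


def bcc_cost_mispredicted(n_path, pipeline_length):
    """BCC, the slots lost to the pipeline refill, then the correct half."""
    return 1 + pipeline_length + n_path


def bcc_expected_cost(n_path, pipeline_length, miss_rate):
    """Average BCC cost for a given misprediction rate between 0.0 and 1.0."""
    hit = bcc_cost_predicted(n_path)
    miss = bcc_cost_mispredicted(n_path, pipeline_length)
    return (1.0 - miss_rate) * hit + miss_rate * miss


bitmap = bitmap_cost(10, 10)                 # 21
predicted = bcc_cost_predicted(10)           # 12
wrong_short = bcc_cost_mispredicted(10, 5)   # 16
wrong_long = bcc_cost_mispredicted(10, 6)    # 17
```

Under these assumptions even a branch that is always mispredicted (`miss_rate=1.0`) stays below the 21 slots of the bitmap version. The comparison would change if skipped instructions really cost nothing, which is the point under dispute in the thread.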
| |
Ceti 331 United Kingdom
| | Posts 282 24 Jun 2010 09:52
| I don't know enough about the implementation details, but it seems the advantage of the if-else bitmap is the early encoding of the else case. If I had to bet on it, the bitmap seems intuitively like the easier one to speed up. From a software perspective, the option to interleave for the 50:50 case sounds great.
| |
Deep Sub Micron Germany
| | (MX-Board Owner) Posts 567 24 Jun 2010 12:33
| So where do you store the bitmap (or the size of the else branch code)? In the case of bsr/jsr/interrupts/traps/context switches (inside the true branch) it must be stored somewhere. Also, when one tries to do cascaded if-then-else it will fail because the bitmap register is already occupied. And what is the benefit? There is none. So let's say it with the words of Wojtek: "IfThenElse instructions are crap!"
| |
Ceti 331 United Kingdom
| | Posts 282 24 Jun 2010 12:36
| deep sub micron wrote:
| So where do you store the bitmap (or the size of the else branch code)? In the case of bsr/jsr/interrupts/traps/context switches (inside the true branch) it must be stored somewhere. Also, when one tries to do cascaded if-then-else it will fail because the bitmap register is already occupied. And what is the benefit? There is none. So let's say it with the words of Wojtek: "IfThenElse instructions are crap!"
|
I don't know the implementation details, but I strongly suspect that storing an extra predefined bit along with the instructions is insignificant compared to all the effort modern CPUs go to, e.g. trying to do superscalar dispatch from variable-length instructions, or accumulating information for branch prediction.
| |
Deep Sub Micron Germany
| | (MX-Board Owner) Posts 567 24 Jun 2010 13:53
| It is not the instruction encoding. It is the context information that concerns me. Let's say an IfThenElse command is executed that says to execute either the next instructions 1,2,3,4 or the next instructions 5,6,7. In the "true" case it starts by executing instruction 1. Then instruction 2 is a subroutine call (which can contain other "IfThenElse" commands as well). After it returns from this subroutine call, the processor must still know that after executing instructions 3,4 it must continue with instruction 8. There must be a kind of stack to store this instruction skip (which equals a simple forward jump). The question is: is storing the skip information so early really better than the jump instruction at the end of the true branch? No, it is not.
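The bookkeeping described here can be illustrated with a toy interpreter in Python. This is only a sketch under assumptions: the instruction tuples (`"ite"`, `"op"`, `"call"`, `"ret"`, `"halt"`), the `run` function, and the `save_context` switch are all hypothetical, and an if-then-else nested directly inside another one (without a call) is deliberately not handled.

```python
def run(program, cond, save_context=True):
    """Run a toy program and return the names of the executed "op" instructions.

    ("ite", n_true, n_else) means: if cond, execute the next n_true
    instructions and then skip n_else; otherwise skip n_true.
    """
    trace = []
    stack = []   # saved (return_pc, block_context) pairs
    ctx = None   # [instructions left in the true half, size of the else half]
    pc = 0
    while pc < len(program):
        ins = program[pc]
        pc += 1
        kind = ins[0]
        if kind == "halt":
            break
        if kind == "ite":
            _, n_true, n_else = ins
            if cond:
                ctx = [n_true, n_else]
            else:
                pc += n_true
            continue
        if ctx is not None:
            ctx[0] -= 1          # this instruction uses up one slot of the true half
        if kind == "op":
            trace.append(ins[1])
        elif kind == "call":
            # The pending skip must survive the call, or it is lost.
            stack.append((pc, ctx if save_context else None))
            ctx = None
            pc = ins[1]
            continue
        elif kind == "ret":
            pc, ctx = stack.pop()
        if ctx is not None and ctx[0] == 0:
            pc += ctx[1]         # true half finished: jump over the else half
            ctx = None
    return trace


program = [
    ("ite", 4, 3),
    ("op", "i1"), ("call", 10), ("op", "i3"), ("op", "i4"),   # true half, 1-4
    ("op", "i5"), ("op", "i6"), ("op", "i7"),                 # else half, 5-7
    ("op", "i8"), ("halt",),
    ("op", "sub"), ("ret",),                                  # subroutine at index 10
]
```

With the context saved across the call, the true case runs i1, the subroutine, i3, i4 and then continues at i8. With `save_context=False` the skip is forgotten and the else half is executed as well, which is the failure the post warns about.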
| |
Wojtek P Poland
| | Posts 1597 24 Jun 2010 13:53
| @Gunnar It's incredibly simple. The decoder should turn all instructions within the if-else block into conditional instructions, some of them with the condition that the IF specifies, the others with the reversed condition (else). The ALU must be able to handle multiple instructions at once, but ONLY by checking conditions and executing the one per cycle whose condition holds. The result is like ALWAYS properly predicted branches.
| |
Wojtek P Poland
| | Posts 1597 24 Jun 2010 13:56
| deep sub micron wrote:
| | So let's say it with the words of Wojtek: "IfThenElse instructions are crap!"
|
Maybe crap, but at least with ARM Thumb-2 it allows a big part of the code to be executed branchless, with no branching penalty. Even with only 4 instructions allowed within an if-else block.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 24 Jun 2010 13:57
| Wojtek P wrote:
| The result is like ALWAYS properly predicted branches. |
No - the result is 50% performance. I think you are missing the fact that ALL instructions will be executed, but not all results will be written back. This means that an ARM using the IfThen instruction to make 4 instructions conditional will in any case have to execute 5 instructions, taking the time for 5 instructions.
| |
Deep Sub Micron Germany
| | (MX-Board Owner) Posts 567 24 Jun 2010 14:00
| Wojtek P wrote:
| It's incredibly simple. The decoder should turn all instructions within the if-else block into conditional instructions, some of them with the condition that the IF specifies, the others with the reversed condition (else).
|
It is not simple if it supports subroutine calls or can be interrupted by interrupts.
Wojtek P wrote:
| The result is like ALWAYS properly predicted branches.
|
Nonsense! You end up with plenty of bubbles in your pipeline.
| |
Marcel Verdaasdonk Netherlands
| | Posts 3976 24 Jun 2010 15:47
| Wojtek, it's a nice idea, but it wouldn't work in a CISC-style CPU. If-else code, IIRC, is conditional branches!
| |
Megol .
| | Posts 674 24 Jun 2010 16:14
| deep sub micron wrote:
|
Wojtek P wrote:
| It's incredibly simple. The decoder should turn all instructions within the if-else block into conditional instructions, some of them with the condition that the IF specifies, the others with the reversed condition (else). |
It is not simple if it supports subroutine calls or can be interrupted by interrupts.
Wojtek P wrote:
| The result is like ALWAYS properly predicted branches. |
Nonsense! You end up with plenty of bubbles in your pipeline.
|
| Speculation: In a limited case one could perhaps use bits 7-5 in the status register. It would only allow 3 masked instructions per conditional but should be safe for interrupts. For a scalar/in-order processor it would be easy to implement too: let the pipeline check bit 5 before executing (or committing) an instruction. On each clock in which the pipeline isn't stalled and doesn't execute a conditional mask instruction, shift in a zero from bit 7 downwards. A one at bit 5 would indicate that the current instruction shouldn't execute. A conditional mask instruction loads the 3 bits with the provided pattern (if true) or the inverse pattern (if false). In this scheme a subroutine call would be treated either as a normal instruction (so it can be skipped) or would always do the call.
    EOR D5,D5   ; Z = 1
    IFZ %101    ; LSB SR=101XNZVC
    ADD D1,D2   ; bit 5=1, don't execute. LSB SR=010XNZVC
    SUB D3,D4   ; bit 5=0, execute. LSB SR=001XNZVC
    ADD D5,D2   ; bit 5=1, don't execute. LSB SR=000XNZVC
    ...         ; execute following

    EOR D5,D5   ; Z = 1
    IFNZ %110   ; LSB SR=001XNZVC -> the inverse of %110
    ADD D1,D2   ; bit 5=1, don't execute. LSB SR=000XNZVC
    SUB D3,D4   ; bit 5=0, execute. LSB SR=000XNZVC
    ADD D5,D2   ; bit 5=0, execute. LSB SR=000XNZVC
    ...         ; execute following
Don't know if it would work. Don't think it's worth it. ;)
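The shifting mask from this speculation can be checked with a short Python simulation. It is a sketch only: `run_masked` is a hypothetical helper, the three mask bits are kept in a plain integer instead of SR bits 7-5, and IFZ/IFNZ are the hypothetical instructions from the post, not part of the 68K instruction set.

```python
def run_masked(pattern, condition_true, instructions):
    """Return (executed, masks) for the 3-bit skip mask scheme.

    pattern         3-bit value given to the IF instruction (bit 7..5 order)
    condition_true  whether the tested condition holds
    instructions    the instructions that follow the IF instruction
    masks           the mask value seen before each instruction
    """
    # Load the pattern if the condition is true, else its 3-bit inverse.
    mask = pattern & 0b111 if condition_true else ~pattern & 0b111
    executed, masks = [], []
    for ins in instructions:
        masks.append(format(mask, "03b"))
        if mask & 0b001 == 0:    # lowest mask bit stands for SR bit 5
            executed.append(ins)
        mask >>= 1               # shift a zero in from bit 7 downwards
    return executed, masks


code = ["ADD D1,D2", "SUB D3,D4", "ADD D5,D2", "following"]

# EOR D5,D5 sets Z, so IFZ is true and IFNZ is false.
ifz_executed, ifz_masks = run_masked(0b101, True, code)
ifnz_executed, ifnz_masks = run_masked(0b110, False, code)
```

For the first example the masks run 101, 010, 001, 000 and only the SUB and the following code execute; for the second they run 001, 000, 000, 000 and only the first ADD is skipped, matching the comments in the listing.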
| |
|