 |
Welcome to the Natami / Amiga Forum. This forum is for AMIGA fans interested in the new NATAMI platform.
|
Do you have ideas and feature wishes? Post them here and discuss your ideas. |
|
|---|
Phil "meynaf" G. France
| | (Natami Team) Posts 393 10 Mar 2011 10:38
| My personal opinion about these...
1) Useful for programmers who are constantly out of data regs, like myself. But not worth slowing down the whole chip if it's costly to implement.
2) This cannot really be encoded without instruction growth. For example, there will be no MOVEQ, and no MOVE, in 2 bytes only. So for me pure data regs can't be properly added.
3) 4) I have to see a routine using that ;)
5) Prefixes smell too much of x86 for my taste :p But bit 3 of the extension word to encode more registers sounds fine and reminds me of something I proposed earlier...
6) Read the first sentence of 5) again :)
To add more addressing modes to instructions, it may well be interesting to poll people about what they really need, that is, ask them to write whole routines using what they want to get added (yeah, I know, sometimes I repeat myself :)).
| |
Phil "meynaf" G. France
| | (Natami Team) Posts 393 10 Mar 2011 10:44
| Gunnar von Boehn wrote:
| Very "strong" instructions are: ADDM (An)+,(Am)+
|
I'm more for: ADD -(An),-(An), which follows the ADDX scheme (and can be followed by ADDX in multi-precision code).
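To illustrate why the predecrement form fits multi-precision code: with numbers stored most significant word first, walking backwards from the end visits the least significant limb first, so a plain add can start the chain and add-with-extend can continue it. The following is only a minimal Python sketch of that data flow. The function name and the 32-bit limb size are assumptions for illustration, and the memory-to-memory ADD is the proposed instruction, not an existing 68k one (ADDX -(Ay),-(Ax) does exist).

```python
MASK32 = 0xFFFFFFFF


def multiprecision_add(dst, src):
    """Add src into dst in place.

    Both arguments are lists of 32-bit limbs, most significant limb first.
    The loop walks from the last limb to the first, the way two -(An)
    pointers would. Returns the final carry (the X flag after the last limb).
    """
    if len(dst) != len(src):
        raise ValueError("operands must have the same number of limbs")
    carry = 0  # the first step behaves like ADD: no carry comes in
    for i in range(len(dst) - 1, -1, -1):
        total = dst[i] + src[i] + carry  # later steps behave like ADDX
        dst[i] = total & MASK32
        carry = total >> 32
    return carry


number = [0x00000001, 0xFFFFFFFF]
carry_out = multiprecision_add(number, [0x00000000, 0x00000001])
print([hex(limb) for limb in number], carry_out)
```

The carry out of the low limb propagates into the high limb, which is the same job the X flag does between ADD and the ADDX instructions that follow it.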
| |
Megol .
| | Posts 690 10 Mar 2011 12:25
| Phil G. wrote:
| My personal opinion about these... 1) Useful for programmers who are constantly out of data regs, like myself. But not worth slowing down the whole chip if it's costly to implement. 2) This cannot really be encoded without instruction growth. For example, there will be no MOVEQ, and no MOVE, in 2 bytes only. So for me pure data regs can't be properly added. 3) 4) I have to see a routine using that ;) 5) Prefixes smell too much of x86 for my taste :p But bit 3 of the extension word to encode more registers sounds fine and reminds me of something I proposed earlier... 6) Read the first sentence of 5) again :) To add more addressing modes to instructions, it may well be interesting to poll people about what they really need, that is, ask them to write whole routines using what they want to get added (yeah, I know, sometimes I repeat myself :)). |
Prefixes are used for RISC processors too nowadays. It is a "clean" way to extend the instruction set without too much trouble. Basing one's opinion of a proposed extension on whether x86, ARM, or some other disliked architecture uses it is idiotic. The basis of the decision should be the complications due to it, the real-world usability, and backwards compatibility.
What about redefining the extension word if the previously unused bit is set? In this case several of the proposals could be implemented. If I read the manual right, one could "free" up to 7 bits in the full extension word (but in reality it would perhaps be hard). The cases that are most useful IMHO would be:
1. suppress memory operation -> pass the address to execute stage
2. update base register
3. extend scale field with one bit
4. select address/data register as base
#2 and #4 would be incompatible, perhaps #1 and #2 too? There would IMHO be room for future expandability here if wanted.
Extending the number of registers can be difficult for backwards compatibility, so I no longer think that's a good choice. In most cases the number of 68k registers is enough anyway.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 10 Mar 2011 12:59
| Megol . wrote:
| #2 and #4 would be incompatible, perhaps #1 and #2 too? There would IMHO be room for future expandability here if wanted.
|
The 68K EA mode does include the "An" EA encoding. While basically all instructions have this encoding, almost none of them supports it. This means we have potential here for improving the core by supporting this encoding. The idea of both #1 and #2 would be to use this encoding. This means #1 and #2 exclude each other.
#1 = allowing access to An in ALU operations. This is obviously very sexy from a programmer's point of view, as the instruction set becomes more flexible. #1 has a drawback which we need to speak openly about. The 68K architecture is designed so that the "ideal 68K CPU" has two ALUs above each other in the pipeline. The first ALU does the EA calculation, and the 2nd ALU the ALU operation. This is a beautiful design. #1 will create a hazard which a normal programmer does not see. Example:
1) MOVE (A0)+,D0
2) ADD (A0)+,D0
There is no hazard between these two instructions and a 68K can execute both in 2 cycles.
1) MOVE (A0)+,D0
2) ADD A0,D0
Because of the different timing of the 2 ALUs these two instructions create a big hazard. The above instructions will now take 4 cycles. To fully understand this hazard you need to understand what the pipeline diagram of the ideal 68K core looks like. Is this fully clear to everyone? If not, please ask! I'll happily answer any questions.
#1 looks very good from a SW programmer's point of view. But the slowdown hazard is easily overlooked. While it's no problem to write hazard-free code, novice programmers or suboptimal compilers might have a lot of problems with this.
#2 offers us the chance to use the "free encoding" for a different purpose: adding more DATA registers. This means #2 will create more registers which can be used in ALU operations WITHOUT any hazards. This hazard-free operation makes this idea look very good to me. Of course using address registers could be re-enabled in this mode with the #4 hack. By falling through the EA stage an address register could always be passed to the ALU. Ex: MUL {A0},D0
What do you think?
| |
Megol .
| | Posts 690 10 Mar 2011 15:11
| Gunnar von Boehn wrote:
|
Megol . wrote:
| #2 and #4 would be incompatible, perhaps #1 and #2 too? There would IMHO be room for future expandability here if wanted. |
The 68K EA mode does include the "An" EA encoding. While basically all instructions have this encoding, almost none of them supports it. This means we have potential here for improving the core by supporting this encoding. The idea of both #1 and #2 would be to use this encoding. This means #1 and #2 exclude each other. #1 = allowing access to An in ALU operations. This is obviously very sexy from a programmer's point of view, as the instruction set becomes more flexible. #1 has a drawback which we need to speak openly about. The 68K architecture is designed so that the "ideal 68K CPU" has two ALUs above each other in the pipeline. The first ALU does the EA calculation, and the 2nd ALU the ALU operation. This is a beautiful design. #1 will create a hazard which a normal programmer does not see. Example: 1) MOVE (A0)+,D0 2) ADD (A0)+,D0 There is no hazard between these two instructions and a 68K can execute both in 2 cycles. 1) MOVE (A0)+,D0 2) ADD A0,D0 Because of the different timing of the 2 ALUs these two instructions create a big hazard. The above instructions will now take 4 cycles. To fully understand this hazard you need to understand what the pipeline diagram of the ideal 68K core looks like. Is this fully clear to everyone? If not, please ask! I'll happily answer any questions. #1 looks very good from a SW programmer's point of view. But the slowdown hazard is easily overlooked. While it's no problem to write hazard-free code, novice programmers or suboptimal compilers might have a lot of problems with this. #2 offers us the chance to use the "free encoding" for a different purpose: adding more DATA registers. This means #2 will create more registers which can be used in ALU operations WITHOUT any hazards. This hazard-free operation makes this idea look very good to me. Of course using address registers could be re-enabled in this mode with the #4 hack. By falling through the EA stage an address register could always be passed to the ALU. Ex: MUL {A0},D0 What do you think?
|
As I previously wrote, I'm fully aware of this. Why otherwise would I suggest considering moving (= replicating) the address generation into the data cache access stage to make it even more useful for computation? The 80486/Pentium/Cyrix 6x86 and all VIA x86 processors except the last (the Nano processor uses out-of-order execution) have the same design with address generation/cache access/execute/writeback stages. The LEA instruction was useful even though one would get stalls (called AGI - Address Generation Interlock) if one didn't try to avoid them. As GCC has been good at generating Pentium code, getting good N68k performance should be relatively easy. The x86 LEA instruction is used whenever possible in optimizing compilers and this extension would be even more useful/powerful/orthogonal. Do you think normal programmers would even try to optimize assembly code? ;) In short: the delay isn't that big of a problem and supporting it would make the N68k even more powerful per instruction than even the ARM shift+ALU design. If deemed useful it would be possible to replicate the address generation stage for computation purposes in later processors.
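The address generation interlock described above can be illustrated with a toy in-order pipeline model. This is only a sketch: the stage offsets, the one-instruction-per-cycle issue, and the `Instr` and `issue_cycles` names are assumptions made up for this example, not the real timing of the N68k or of any x86 core.

```python
from typing import NamedTuple

EA_STAGE = 1   # assumed: cycles after issue at which the EA unit reads address registers
ALU_STAGE = 3  # assumed: cycles after issue at which the ALU produces its result


class Instr(NamedTuple):
    text: str
    ea_reads: frozenset = frozenset()    # registers the EA stage needs
    ea_writes: frozenset = frozenset()   # registers updated by the EA stage, e.g. (A0)+
    alu_writes: frozenset = frozenset()  # registers written by the ALU stage


def issue_cycles(program):
    """Return the cycle in which each instruction issues in the toy pipeline."""
    ready = {}  # register -> first cycle in which its new value can be read
    cycles = []
    next_free = 0
    for ins in program:
        issue = next_free
        for reg in ins.ea_reads:
            # The EA stage runs at issue + EA_STAGE and must wait for the value.
            issue = max(issue, ready.get(reg, 0) - EA_STAGE)
        for reg in ins.ea_writes:
            ready[reg] = issue + EA_STAGE + 1
        for reg in ins.alu_writes:
            ready[reg] = issue + ALU_STAGE + 1
        cycles.append(issue)
        next_free = issue + 1
    return cycles


A0 = frozenset({"A0"})
D0 = frozenset({"D0"})

no_stall = [
    Instr("MOVE (A0)+,D0", ea_reads=A0, ea_writes=A0, alu_writes=D0),
    Instr("ADD (A0)+,D0", ea_reads=A0, ea_writes=A0, alu_writes=D0),
]
agi_stall = [
    Instr("ADDA.L D1,A0", alu_writes=A0),               # A0 produced late, in the ALU
    Instr("MOVE (A0),D0", ea_reads=A0, alu_writes=D0),  # A0 needed early, in the EA unit
]

print(issue_cycles(no_stall))   # [0, 1]
print(issue_cycles(agi_stall))  # [0, 3]
```

In this model the second pair loses two cycles because the address register is written two stages later than it is read. A compiler that knows the gap can often fill it with independent instructions, which is what scheduling around AGI stalls amounts to.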
| |
Cesare Di Mauro Italy
| | Posts 528 11 Mar 2011 05:34
| In short:
Gunnar von Boehn wrote:
| We have some ideas floating around regarding the EA unit .... Maybe we can together summarize them and find out which of them are the most useful ones? 1) Allow access to address registers from the ALU. This allows stuff like: MUL A0,D0 or ANDI #121231,A0 The benefit of this mode is clear. It gives the CPU more registers to work as data registers and it makes the CPU easier to use in some ways, as the registers are more flexible. |
Can be of little interest, since address registers aren't fully equivalent to data ones.
2) Use the above encoding to instead add 8 more DATA registers. This would allow us to have 16 data registers + 8 address registers. This can be encoded without instruction growth. This option would separate the An and Dn more cleanly but increase the number of data registers, which many will find useful. |
It breaks ISA compatibility. I don't like it.
3) Use an unused bit encoding of the FULL EXTENSION WORD to enable address register update. This would enable this: Ex: MOVE 1234(A0,D0*2)!,D0 -- would store the EA in A0 This would let certain memory operations save the extra instruction needed to update the pointer. As the CPU has this path already this is basically free to add. |
Please consider this the top priority, since it can help pointer handling A LOT. It's also both human- and compiler-friendly.
4) Use an unused bit encoding of the FULL EXTENSION WORD to allow passing the EA to the ALU without doing a memory access. This would allow combining the result of a LEA with any ALU operation. Ex: OR {A0,D0},D2 This is like a special-case optimization. The drawback of this mode is the latency dependency between the banks, as their updates are done in different pipeline stages. This might make it difficult to support this in a compiler. |
I agree. It isn't easy to change the back-end to take this optimization into consideration. Anyway, using the same bit for 3) will exclude this feature, because I consider 3) much more useful.
5) Another proposed option would be adding more ADDRESS registers. This could relatively simply be done by using PREFIX words. Without using PREFIX words this is tricky. By using bit 3 in the FULL EXTENSION WORD someone could add a cheat mode which allows doubling the registers. This could be combined with the option to add 8 more data registers ... |
With the full extension word you can have 2 address registers, and bit 3 can "cover" (extend to the new address registers) just one of them. Anyway, I'm against such incompatible features.
6) PREFIX WORDS We can also use PREFIX WORDS to make a single 68K instruction do more work. We partly do this already by FUSING certain combinations. E.g. right now we do: MOVEQ #5,D2 ADD.L D1,D2 This fuses two operations into one more powerful one. This could be enhanced by doing even more complex operations. This would make the code denser and enhance our possibilities to use all of the powerful 070 ALU per clock. Very "strong" instructions are: ADDM (An)+,(Am)+ This instruction does 2 memory loads, 1 memory update, 2 EA calculations, 2 address register updates, 1 ALU operation. The 070 design is tweaked to be able to do this in 1 single cycle. Therefore we have to have all units needed for this. By strengthening more instructions to do this much work we can improve the power per clock of our system. |
I agree, especially if a second EA can be added to (almost) each existing instruction. It can produce much denser code, more work per clock, and fewer dependencies, which can help the pipelines and speed in general. Prefixes AREN'T evil.
I think we all see that there are some ideas and potential for strengthening the CORE.... Which benefits do you see? |
Definitely 3 and 6 are the easiest and most useful, both for programmers and compilers. And for speed, of course.
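As a rough illustration of proposal 3, the sketch below decodes a 68020-style full-format extension word, computes the effective address for something like `1234(A0,D0*2)`, and writes it back to the base register when bit 3 is set. The field layout follows the documented full extension word format, where bit 3 is reserved. Treating that bit as an "update base" flag is the proposal discussed here, not an existing feature, and the function names and the register dictionary are assumptions for this example. Memory indirect modes are not modelled.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed word."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode_full_extension(word):
    """Split a 16-bit full-format extension word into its fields."""
    if not (word >> 8) & 1:
        raise ValueError("bit 8 is clear: this is a brief extension word")
    return {
        "index_is_address": bool((word >> 15) & 1),  # D/A
        "index_reg": (word >> 12) & 7,
        "index_long": bool((word >> 11) & 1),        # W/L
        "scale": 1 << ((word >> 9) & 3),             # 1, 2, 4 or 8
        "base_suppress": bool((word >> 7) & 1),      # BS
        "index_suppress": bool((word >> 6) & 1),     # IS
        "bd_size": (word >> 4) & 3,
        "update_base": bool((word >> 3) & 1),        # HYPOTHETICAL: reserved on real CPUs
        "i_is": word & 7,
    }


def effective_address(regs, base_reg, word, displacement):
    """Compute base + displacement + index*scale, with optional base update."""
    f = decode_full_extension(word)
    if f["i_is"] != 0:
        raise NotImplementedError("memory indirect modes are not modelled")
    base = 0 if f["base_suppress"] else regs[base_reg]
    index = 0
    if not f["index_suppress"]:
        name = ("A" if f["index_is_address"] else "D") + str(f["index_reg"])
        index = regs[name] if f["index_long"] else sign_extend16(regs[name])
        index *= f["scale"]
    ea = (base + displacement + index) & 0xFFFFFFFF
    if f["update_base"] and not f["base_suppress"]:
        regs[base_reg] = ea  # the proposed "!" behaviour: store the EA in An
    return ea


# 0x0B28: index D0.L, scale 2, word-sized base displacement, hypothetical bit 3 set
regs = {"A0": 0x1000, "D0": 0x10}
ea = effective_address(regs, "A0", 0x0B28, 1234)
print(ea, regs["A0"])
```

With bit 3 clear (0x0B20) the same call returns the same address and leaves A0 untouched, so existing code that keeps the reserved bit at zero would behave as before.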
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 11 Mar 2011 06:44
| Hi Cesare, Cesare Di Mauro wrote:
| In short: Gunnar von Boehn wrote:
| We have some ideas floating around regarding the EA unit .... Maybe we can together summarize them and find out which of them are the most useful ones? 1) Allow access to address registers from the ALU. This allows stuff like: MUL A0,D0 or ANDI #121231,A0 The benefit of this mode is clear. It gives the CPU more registers to work as data registers and it makes the CPU easier to use in some ways, as the registers are more flexible. |
Can be of little interest, since address registers aren't fully equivalent to data ones. |
What do you mean by "aren't fully equivalent"? Technically you could do all operations on them which you can do on Data registers .... Cesare Di Mauro wrote:
| 2) Use the above encoding to instead add 8 more DATA registers. This would allow us to have 16 data registers + 8 address registers. This can be encoded without instruction growth. This option would separate the An and Dn more cleanly but increase the number of data registers, which many will find useful. |
It breaks ISA compatibility. I don't like it. |
What do you mean by "breaks ISA compatibility"? All enhancements will create incompatibility with previous CPUs. The 68000 cannot execute 68020 instructions. The 68020 cannot execute MOVE16 ... The 68ks cannot execute the new Coldfire instructions. Whatever new we add won't run on old cores. But this is no problem: we want to make the CPU able to provide more performance than old cores have anyway... Cesare Di Mauro wrote:
| | SuperCISC - enabling direct memory operation to all Instructions. |
I agree, especially if a second EA can be added to (almost) each existing instruction. It can produce much denser code, more work per clock, and fewer dependencies, which can help the pipelines and speed in general.
|
Yes, I agree. The 68070 cache is designed to be extremely powerful for direct memory operations. We could enable the 68K instructions to make use of this potential. This way we could basically merge up to three 68K instructions into one. If you include predication it could even be four instructions. Cheers
| |
Wojtek P Poland
| | Posts 1597 11 Mar 2011 07:46
| Megol . wrote:
| As GCC have been good at generating Pentium code getting good N68k performance should be relatively easy. The x86 LEA instruction is
|
Another funny post. If you want to get "relatively easy" good output for a completely different architecture by patching gcc x86 output, it is really funny. You can't really think of any processor in any way other than by seeking the x86 instruction equivalent. You just should not discuss anything other than improving the x86 architecture. Actually most of your sentences look like pseudo-knowledge obtained as a mix/cut&paste from x86 technical papers - no understanding at all. IMHO discussion with you is nothing but a waste of time.
| |
Wojtek P Poland
| | Posts 1597 11 Mar 2011 07:55
| Gunnar von Boehn wrote:
| The 68070 cache is designed to be extremely powerful for direct memory operations. We could enable the 68K instructions to make use of this potential. This way we could basically merge up to three 68K instructions into one. If you include predication it could even be four instructions. Cheers
|
Cache and memory access logic is actually the most important part of a processor, the part that determines its power on real workloads. You work hard to make it powerful and keep it that way. Hardware coprocessors are for number crunching, the main CPU is not. The main CPU should quickly execute complex code paths. What you already did is exactly that. If you make it even better then it's great - just don't get distracted by PC-style ideas. Today's CPUs are designed to quickly compress movies or do similar number crunching things. As this old 1970s-era test showed, my 2.3GHz CPU is just 13 times faster than an old 18MHz CPU. The old one can do no more than 2 instructions per cycle (one of them being a branch), the new one can in theory do four, but in practice can't even do one per 5 cycles. What is already done in N68k is more like the old mainframe style than the new PC style: 0-cycle branches, multiple accesses to cache within a cycle, single-cycle instructions even if they operate on memory operands, optimized DBRA etc. Keep it that way, just don't get distracted by stupid ideas. No need to design another PC-style processor while I can just go to the nearest shop and buy one with a high clock speed.
| |
Megol .
| | Posts 690 11 Mar 2011 12:45
| Wojtek P wrote:
|
Megol . wrote:
| As GCC have been good at generating Pentium code getting good N68k performance should be relatively easy. The x86 LEA instruction is |
Another funny post. If you want to get "relatively easy" good output for a completely different architecture by patching gcc x86 output, it is really funny. You can't really think of any processor in any way other than by seeking the x86 instruction equivalent. You just should not discuss anything other than improving the x86 architecture. Actually most of your sentences look like pseudo-knowledge obtained as a mix/cut&paste from x86 technical papers - no understanding at all. IMHO discussion with you is nothing but a waste of time.
|
This is the last of your posts I will reply to. You are all talk and no action, you can't back up what you say with facts, and you don't acknowledge measured facts from real systems that show you don't know shit.
| |
Cesare Di Mauro Italy
| | Posts 528 11 Mar 2011 13:17
| Gunnar von Boehn wrote:
| Hi Cesare, Cesare Di Mauro wrote:
| In short: Gunnar von Boehn wrote:
| We have some ideas floating around regarding the EA unit .... Maybe we can together summarize them and find out which of them are the most useful ones? 1) Allow access to address registers from the ALU. This allows stuff like: MUL A0,D0 or ANDI #121231,A0 The benefit of this mode is clear. It gives the CPU more registers to work as data registers and it makes the CPU easier to use in some ways, as the registers are more flexible. |
Can be of little interest, since address registers aren't fully equivalent to data ones. |
What do you mean by "aren't fully equivalent"? Technically you could do all operations on them which you can do on Data registers .... |
Yes, but instruction encodings are often tied to data registers only. For example, you can't do: MUL EA,An SWAP An,Am LSL #8,An BSET An,EA and so on.
Cesare Di Mauro wrote:
| 2) Use the above encoding to instead add 8 more DATA registers. This would allow us to have 16 data registers + 8 address registers. This can be encoded without instruction growth. This option would separate the An and Dn more cleanly but increase the number of data registers, which many will find useful. |
It breaks ISA compatibility. I don't like it. |
What do you mean by "breaks ISA compatibility"? All enhancements will create incompatibility with previous CPUs. The 68000 cannot execute 68020 instructions. The 68020 cannot execute MOVE16 ... The 68ks cannot execute the new Coldfire instructions. Whatever new we add won't run on old cores. But this is no problem: we want to make the CPU able to provide more performance than old cores have anyway... |
You are talking about instructions, but take into account an operating system that has to switch task context: adding new registers requires a new scheduler (at least). That was my primary concern.
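The context-switch concern can be made concrete with a small model. It is only a sketch under stated assumptions: the register names D8 to D15 stand for the hypothetical extra data registers, the `switch` helper and the frame lists are invented for this example, and a real scheduler also saves the PC, the status register, and any coprocessor state.

```python
LEGACY_FRAME = [f"D{i}" for i in range(8)] + [f"A{i}" for i in range(8)]
# Hypothetical extended frame that also covers the proposed D8-D15.
EXTENDED_FRAME = LEGACY_FRAME + [f"D{i}" for i in range(8, 16)]


def switch(cpu, save_into, load_from, frame):
    """Save the outgoing task's registers, then load the incoming task's."""
    for reg in frame:
        save_into[reg] = cpu[reg]
    for reg in frame:
        cpu[reg] = load_from[reg]


def d8_seen_by_task_a_after_resume(frame):
    """Run task A, preempt it with task B, resume A, and report A's D8."""
    cpu = {reg: 0 for reg in EXTENDED_FRAME}
    ctx_a = {reg: 0 for reg in EXTENDED_FRAME}
    ctx_b = {reg: 0 for reg in EXTENDED_FRAME}
    cpu["D8"] = 0xAAAA                 # task A uses one of the new registers
    switch(cpu, ctx_a, ctx_b, frame)   # A -> B
    cpu["D8"] = 0xBBBB                 # task B uses the same register
    switch(cpu, ctx_b, ctx_a, frame)   # B -> A
    return cpu["D8"]


print(hex(d8_seen_by_task_a_after_resume(LEGACY_FRAME)))    # 0xbbbb
print(hex(d8_seen_by_task_a_after_resume(EXTENDED_FRAME)))  # 0xaaaa
```

With the legacy frame, task A resumes with task B's value in D8, so any code path that saves and restores task state has to learn about the new registers before two tasks may use them at the same time.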
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 11 Mar 2011 15:00
| Cesare Di Mauro wrote:
| | For example, you can't do: MUL EA,An SWAP An,Am LSL #8,An BSET An,EA
|
Of course you can add this if you want. Long MUL and DIV even have an extra bit reserved to enable this. Cesare Di Mauro wrote:
| You are talking about instructions, but take into account an operating system that has to switch task context: adding new registers requires a new scheduler (at least). That was my primary concern.
|
But how simple is changing the register save of the scheduler compared to developing a CPU?
| |
SID Hervé France
| | Posts 666 11 Mar 2011 15:29
| Gunnar von Boehn wrote:
| But how simple is changing the register save of the scheduler compared to developing a CPU?
|
Hello, I remember the Executive program modifying the scheduler and creating problems with system stability in some cases. Does this not pose a compatibility issue? Thank you
| |
Cesare Di Mauro Italy
| | Posts 528 11 Mar 2011 19:48
| Gunnar von Boehn wrote:
|
Cesare Di Mauro wrote:
| For example, you can't do: MUL EA,An SWAP An,Am LSL #8,An BSET An,EA |
Of course you can add this if you want. Long MUL and DIV even have an extra bit reserved to enable this. |
Only these instructions. What about all the others that don't have extra space in the opcode? DBRA, for example; can you make this: DBRA An,Loop ?
Cesare Di Mauro wrote:
| You are talking about instructions, but take into account an operating system that has to switch task context: adding new registers requires a new scheduler (at least). That was my primary concern. |
But how simple is changing the register save of the scheduler compared to developing a CPU? |
Because you need to change all the code that can do it, as SID stated.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 11 Mar 2011 21:44
| Cesare Di Mauro wrote:
| Because you need to change all the code that can do it, as SID stated.
|
Come on! This is no problem. We need to change this anyhow to add support for an enhanced FPU or SIMD.
| |
SID Hervé France
| | Posts 666 11 Mar 2011 22:22
| SID Hervé wrote:
| I remember the Executive program modifying the scheduler and creating problems with system stability in some cases.
|
Big mistake! Executive did not change the scheduler but replaced it. This is different!
| |
Cesare Di Mauro Italy
| | Posts 528 12 Mar 2011 04:36
| Gunnar von Boehn wrote:
| Come on! This is no problem. We need to change this anyhow to add support for an enhanced FPU or SIMD. |
FPU isn't a problem, since it uses the coprocessor protocol to save & restore its state. For SIMD, it depends on the implementation. If you plan to "map" it into the FPU (a la MMX / 3DNow!) there's no problem, since the coprocessor protocol covers it. If you make it a different coprocessor, then a new scheduler is needed if the old one isn't able to handle multiple coprocessors. Anyway, how do you think that new registers can be added? With the unused An mode it is very limited, since you can use it only on EA, and not in all cases. If you want to make it more flexible, you need to leave the unused An in EA, and use a prefix word with 2 bits available to cover all the data register usages. One more bit if you plan to add more address registers (which is a more sensible thing, IMO).
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 12 Mar 2011 07:08
| Cesare Di Mauro wrote:
| If you want to make it more flexible, you need to leave the unused An in EA, and use a prefix word with 2 bits available to cover all the data register usages.
|
Using a prefix of course gives more bits for encoding. But the point was that using the unused EA encoding will allow us to add 8 more registers without an instruction size increase. This means you can get more registers and keep the high code density. Of course there are some instructions which do not have the free bit for this at the moment. For those instructions a new encoding allowing access to the other 8 registers would need to be used. There are a number of encodings free, so with some cleverness this could also be covered. Well, #2 (adding more registers) is an option. It could be done.
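To show where the "free" encoding sits: a 68k effective address field is 6 bits, a 3-bit mode followed by a 3-bit register number, with mode 000 meaning Dn direct and mode 001 meaning An direct. The sketch below decodes just those two register-direct modes and, behind a flag, reinterprets mode 001 as D8 to D15 in the sense of proposal #2. The function name and the flag are assumptions for illustration, and the reinterpretation is hypothetical.

```python
def decode_ea_register(ea_field, an_as_extra_data=False):
    """Name the register selected by a register-direct 6-bit EA field.

    Returns None for memory and other modes, which are not modelled here.
    """
    mode = (ea_field >> 3) & 7
    reg = ea_field & 7
    if mode == 0b000:
        return f"D{reg}"
    if mode == 0b001:
        if an_as_extra_data:
            return f"D{reg + 8}"  # HYPOTHETICAL proposal #2: same bits, new data register
        return f"A{reg}"
    return None


print(decode_ea_register(0b000_011))                         # D3
print(decode_ea_register(0b001_011))                         # A3
print(decode_ea_register(0b001_011, an_as_extra_data=True))  # D11
```

The instruction stays the same length because the bits already exist in the EA field. Operands that are encoded as a bare 3-bit data register number outside an EA field get no such extra bit, which is the limitation raised elsewhere in the thread.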
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 12 Mar 2011 08:40
| Code fusing 2.0: Let's revisit which instruction fusing makes the most sense ... I think the following fusing will be useful:
1) MOVEQ #imm,Dn OPP <ea>,Dn
2) MOVE.L Dm,Dn OPP <ea>,Dn
3) MOVE.L #imm,Dn OPP <ea>,Dn
Fusing can be done using only a single ALU. This means a superscalar 68070 could this way do two fused operations of two instructions each per cycle.
Not possible with fusing, but should be handled with result forwarding:
4) OPP <ea>,Dn MOVE.L Dn,<ea>
Case 4 does need both EA units; this means both pipes will be utilized by this. The first instruction of case 4 could also be fused with another one, doing this:
4B) MOVEQ #imm,Dn OPP <ea>,Dn MOVE.L Dn,<ea>
Forwarding Part 2: Not only ALU result forwarding could be done, but also read-result forwarding, before the ALU. This would then look like this:
5) MOVE <ea>,Dn OPP Dn,<ea>
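Fusing cases 1) to 3) above all share one shape: a move of an immediate or a data register into Dn, directly followed by an ALU operation whose destination is the same Dn. A minimal Python sketch of a detector for that shape follows. The tuple representation, the `ALU_OPS` set, and the function names are assumptions for this example and say nothing about how the actual decoder is built.

```python
import re

# Assumed set of ALU operations eligible as the second half of a fused pair.
ALU_OPS = {"ADD.L", "SUB.L", "AND.L", "OR.L"}


def is_data_reg(operand):
    return re.fullmatch(r"D[0-7]", operand) is not None


def is_immediate(operand):
    return operand.startswith("#")


def can_fuse(first, second):
    """Check cases 1) to 3): MOVEQ/MOVE.L into Dn followed by OP <ea>,Dn."""
    op1, src1, dst1 = first
    op2, _src2, dst2 = second
    if not is_data_reg(dst1) or dst1 != dst2 or op2 not in ALU_OPS:
        return False
    if op1 == "MOVEQ":
        return is_immediate(src1)
    if op1 == "MOVE.L":
        return is_immediate(src1) or is_data_reg(src1)
    return False


def fuse_pairs(program):
    """Group a list of (mnemonic, source, destination) tuples into issue slots."""
    slots, i = [], 0
    while i < len(program):
        if i + 1 < len(program) and can_fuse(program[i], program[i + 1]):
            slots.append((program[i], program[i + 1]))
            i += 2
        else:
            slots.append((program[i],))
            i += 1
    return slots


program = [
    ("MOVEQ", "#5", "D2"),
    ("ADD.L", "D1", "D2"),
    ("MOVE.L", "D2", "(A0)"),
]
for slot in fuse_pairs(program):
    print(slot)
```

Here the MOVEQ and ADD.L end up in one slot and the store in another. Case 4), the ALU operation followed by a store of its result, is deliberately not fused by this sketch, since the post treats it as a job for result forwarding instead.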
| |
Matt Hey USA
| | Posts 737 12 Mar 2011 10:13
| Gunnar von Boehn wrote:
| Cesare Di Mauro wrote:
| Because you need to change all the code that can do it, as SID stated. |
Come on! This is no problem. We need to change this anyhow to add support for an enhanced FPU or SIMD. |
Any Kickstarts that people would want to use would have to be changed too. This is a pretty big negative to some people that want maximum retro compatibility. I also don't feel it is necessary. The N68k can work in memory with minimal slowdown. It also has or could have the following register reducing features...
1) Byte and word is as fast as long. (pipelined for all)
2) Word extended immediate longs don't need a trash register.
3) Fast bit field instructions reduce register usage.
4) More instructions supporting An means less Dn<->An swapping.
5) Large instruction fetch makes these efficient...
A) Memory Pre-indexed and Post-indexed Indirect addressing modes
B) 3 op register miser instructions like...
and.l dn,dm,(An)
and.l #123,dn,(An) ;no change/use penalty on src
and.l dn,dm,NIL ;or syntax and.l dn,dm,{}
Cesare Di Mauro wrote:
| FPU isn't a problem, since it uses the coprocessor protocol to save & restore its state. For SIMD, it depends on the implementation. If you plan to "map" it into the FPU (ala MMX / 3DNow!) there's no problem since the coprocessor protocol covers it. |
Yes. Not as much of a problem as main CPU. Cesare Di Mauro wrote:
| Anyway, how do you think that new registers can be added? With the unused An mode it is very limited, since you can use it only on EA, and not in all cases. |
Yes. Several instructions have 3 bits for a data register only.
| |