|
|---|
Thomas Richter Germany
| | (MX-Board Owner) Posts 1425 12 Apr 2011 09:48
| Gunnar von Boehn wrote:
| Why are you against more registers? Do you see unsolvable problems with more registers that we do not see?
|
Very simple: The exec scheduler would have to change, which means that you could no longer run old Kickstarts if you had to. It also increases the latency when switching tasks, because more registers have to be saved. Of course you could argue that this is not an issue and that programs should use the (then new) OS instead, but as we all know, this is not a given. A longer latency might be less of a problem provided the throughput remains high, i.e. one could address this with a secondary FP ALU which could absorb a second FPU mul/add instruction. Similarly, such a secondary ALU could handle SIMD instructions and double the throughput. So long, Thomas
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 12 Apr 2011 10:00
| Thomas Richter wrote:
|
Gunnar von Boehn wrote:
| Why are you against more registers? Do you see unsolvable problems with more registers that we do not see? |
Very simple: The exec scheduler would have to change, which means that you could no longer run old Kickstarts if you had to.
|
You can run old Kickstarts. You just can't run old Kickstarts together with new games using the FPU stuff. But would this be a problem?
Thomas Richter wrote:
| A longer latency might be less of a problem provided the throughput remains high, i.e. one could address this with a secondary FP ALU which could absorb a second FPU mul/add instruction. Similarly, such a secondary ALU could handle SIMD instructions and double the throughput.
|
What do you mean by this? The throughput is already high. The FPU does not stop; it runs fully pipelined. As you know, for a 4x4 matrix mul you need roughly this:

    MUL MUL MUL MUL ADD ADD ADD MOVE
    MUL MUL MUL MUL ADD ADD ADD MOVE
    MUL MUL MUL MUL ADD ADD ADD MOVE
    MUL MUL MUL MUL ADD ADD ADD MOVE

You can execute the MULs easily as they are independent, but the ADDs become a problem, as they depend on the MULs being finished, and the later ADDs depend on the previous ADDs being finished. Do you see a trick to do the matrix mul "bubble free"? SIMD does not solve this issue, as SIMD will also have a latency, and with such dependent constructs the SIMD unit will also have to wait and stall. Regarding SIMD, how wide (in bits) would you make the SIMD unit?
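To make the dependency problem concrete, here is a minimal sketch in Python (not 68k code) of a toy in-order pipeline. The 8-cycle latency is the figure quoted later in this thread; the one-instruction-per-cycle issue rule, the unlimited supply of registers, the omission of the MOVE per result, and the names `schedule`, `naive_order` and `interleaved_order` are all assumptions made for illustration, not a description of the real N050/N070 core.

```python
def schedule(program, latency):
    """Cycle at which the last result is ready in a toy in-order pipeline.

    program is a list of (dest, sources) pairs. At most one instruction
    issues per cycle, and an instruction waits until all of its sources
    have been produced. Operands not written by the program are treated
    as ready from cycle 0.
    """
    ready = {}      # register name -> cycle its value becomes available
    next_issue = 0  # earliest cycle the next instruction may issue
    for dest, sources in program:
        start = max([next_issue] + [ready.get(s, 0) for s in sources])
        ready[dest] = start + latency
        next_issue = start + 1
    return max(ready.values())


def muls(i):
    # Four independent multiplies for output element i.
    return [(f"m{i}{k}", []) for k in range(4)]


def add(i, step):
    # Step 0 adds the first two products; later steps extend the running sum.
    left = f"m{i}0" if step == 0 else f"s{i}{step - 1}"
    return (f"s{i}{step}", [left, f"m{i}{step + 1}"])


def naive_order():
    # MUL MUL MUL MUL ADD ADD ADD, one output element after the other.
    program = []
    for i in range(4):
        program += muls(i) + [add(i, step) for step in range(3)]
    return program


def interleaved_order():
    # All 16 multiplies first, then the adds of the four elements interleaved.
    program = [op for i in range(4) for op in muls(i)]
    program += [add(i, step) for step in range(3) for i in range(4)]
    return program


if __name__ == "__main__":
    for name, order in (("naive", naive_order()),
                        ("interleaved", interleaved_order())):
        print(name, schedule(order, latency=8))
```

In this toy model, with a latency of 8 the naive order finishes after 111 cycles and the interleaved order after 45, while 28 back-to-back instructions could ideally finish after 35. Reordering helps a lot, but the dependent adds still leave bubbles, and the interleaved order keeps 16 products live at once, which is more than 8 FP registers can hold.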
| |
Deep Sub Micron Germany
| | (MX-Board Owner) Posts 567 12 Apr 2011 10:33
| In my opinion more registers should be implemented by renaming memory to registers, so it would work similarly to a tiny multi-ported level 0 cache. These renamed registers can use all the forwarding paths (if there are any in the FPU). That is a way to add registers in a backwards-compatible way, because you can access them with d16(An). But this is just a rough proposal.
| |
Thomas Richter Germany
| | (MX-Board Owner) Posts 1425 12 Apr 2011 10:36
| Gunnar von Boehn wrote:
|
Thomas Richter wrote:
| Gunnar von Boehn wrote:
| Why are you against more registers? Do you see unsolvable problems with more registers that we do not see? |
Very simple: The exec scheduler would have to change, which means that you could no longer run old Kickstarts if you had to. |
You can run old Kickstarts. You just can't run old Kickstarts together with new games using the FPU stuff. But would this be a problem?
|
Two answers for this: First of all, as far as I know my fellow Amiga users, it wouldn't stop them from trying, regardless of what the manual says, and having the option to *avoid* something breaking, we should rather take this option. The second answer is a "war time story": versions of Windows 2000 did not save and restore the SSE registers without proper patches installed, and this meant that for some of my "professional products", usage of such extensions simply had to be disabled, because customers did not understand or did not want to install official MS patches. Thus, creating additional registers without having a stable OS platform to use them is asking for trouble. It basically means that I cannot use the registers in professional software that should run stably. I will come back to you concerning the matrix multiplication. Greetings, Thomas
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 12 Apr 2011 10:40
| Jens, can you explain your idea in more detail, please? Maybe with an example which shows how this would work on old and new systems?
| |
Jakob Eriksson Sweden
| | (Moderator) Posts 1097 12 Apr 2011 12:04
| Jens, Gunnar, this reminds me of the register windows of the SPARC.
| |
Steve Thomas United Kingdom
| | Posts 178 12 Apr 2011 12:41
| Thomas Richter wrote:
|
Gunnar von Boehn wrote:
| Why are you against more registers? Do you see unsolvable problems with more registers that we do not see? |
Very simple: The exec scheduler would have to change, which means that you could no longer run old Kickstarts if you had to. It also increases the latency when switching tasks, because more registers have to be saved. Of course you could argue that this is not an issue and that programs should use the (then new) OS instead, but as we all know, this is not a given.
|
Isn't rewriting parts of the OS unavoidable? There must be a lot of additional data to be stored from the Superaga, 3D core, bopper and multiple 050/070 processors. Multiple 050/070s could/should reduce the need for extra registers.
| |
Jakob Eriksson Sweden
| | (Moderator) Posts 1097 12 Apr 2011 12:55
| Multiple 68k CPUs would not reduce the need for extra registers, IMHO. It is a completely different level of abstraction.
| |
Megol .
| | Posts 676 12 Apr 2011 15:38
| Jakob Eriksson wrote:
| Jens, Gunnar, this reminds me of the register windows of the SPARC.
|
How come? A small multi-ported L0 cache that looks like memory vs. a register set with overlapping windows. Not the same in any way.
| |
Jakob Eriksson Sweden
| | (Moderator) Posts 1097 12 Apr 2011 15:41
| No, but they are opposites. One is registers backed by memory, the other is memory backed by registers.
| |
Megol .
| | Posts 676 12 Apr 2011 16:32
| I can see what you mean :)
| |
Marcel Verdaasdonk Netherlands
| | Posts 3976 12 Apr 2011 17:10
| Jakob Eriksson wrote:
| Jens, Gunnar, this reminds me of the register windows of the SPARC.
|
Bench swapping....
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 12 Apr 2011 19:08
| In games you normally perform a limited number of 4x4 * 4x4 products. This calculation is needed to compute the camera matrix, light sources, recursively dependent objects, etc. After the 4x4 product is finished, the game will normally multiply it with all the coordinates in the scene. A 3D scene can consist of 10 static 4x4 matrices (static per frame) and 100 000 coordinates. Then it makes perfect sense to hardcode the matrices directly into the code. This will only require 10 inner loops. It can be made cache safe by creating the inner loops at new memory locations every frame. The inner-loop buffer can be set to e.g. 1 megabyte (much bigger than the L1 cache). If the N050 can fetch 16 bytes per cycle and can fuse 4, then this would be possible:

    .loop
        fmove.d (a0)+,fp0    ; fused (4 bytes)
        fmul.d  #A11,fp0     ; 1 (12 bytes)
        fmove.d (a0)+,fp1    ; fused
        fmul.d  #A21,fp1     ; 2
        fmove.d (a0)+,fp2    ; fused (4 bytes)
        fmul.d  #A31,fp2     ; 3 (12 bytes)
        fmove.d (a0)+,fp3
        fmul.d  #A41,fp3     ; 4
        fmove.d (a0),fp4
        fmul.d  #A42,fp4     ; 5
        fmove.d -(a0),fp5
        fmul.d  #A32,fp5     ; 6
        fmove.d -(a0),fp6
        fmul.d  #A22,fp6     ; 7
        fmove.d -(a0),fp7
        fmul.d  #A21,fp7     ; 8

A0 points to the coordinates and will slide up and down, since the coordinates need to be read 4 times. This is possible because the constant matrix (in the code) is scrambled. .... By using the same stack trick I used in the 4x4 * 4x4 Latency_free_N070 source code (in the benchmark), I think I can get it down to 32+8 cycles for a 4x4 * 4x1 transformation on an N050. Without SMC I will use at least 64+8 cycles, and my 3D game will run at 12.5 fps instead of 25 fps. With the matrix multiplier compiled with VBCC or GCC, my game will probably run at 6.25 fps. With my optimizations from the '80s, I let the N050 run as if at 400 MHz, compared to the compiled code, which will run at 100 MHz.
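The idea of baking a per-frame constant matrix into the inner loop can also be illustrated at a higher level. The following is a minimal Python sketch, not the 68k self-modifying code from the post: `make_transform` and `transform_all` are hypothetical names, and the row-major layout and 4-component coordinates are assumptions. The matrix entries are captured once, and the returned function then touches only coordinate data.

```python
def make_transform(matrix):
    """Return a function that multiplies a 4-vector by a fixed 4x4 matrix.

    matrix is a row-major sequence of four rows with four numbers each.
    The entries are unpacked once here, which loosely mirrors writing the
    constants into the inner loop before the loop runs.
    """
    (
        (a11, a12, a13, a14),
        (a21, a22, a23, a24),
        (a31, a32, a33, a34),
        (a41, a42, a43, a44),
    ) = matrix

    def transform(point):
        x, y, z, w = point
        return (
            a11 * x + a12 * y + a13 * z + a14 * w,
            a21 * x + a22 * y + a23 * z + a24 * w,
            a31 * x + a32 * y + a33 * z + a34 * w,
            a41 * x + a42 * y + a43 * z + a44 * w,
        )

    return transform


def transform_all(matrix, points):
    # One specialised function per matrix, reused for every coordinate.
    transform = make_transform(matrix)
    return [transform(p) for p in points]


if __name__ == "__main__":
    # Translation by (5, 6, 7) applied to the point (1, 2, 3, 1).
    translate = [[1, 0, 0, 5], [0, 1, 0, 6], [0, 0, 1, 7], [0, 0, 0, 1]]
    print(transform_all(translate, [(1, 2, 3, 1)]))
```

With 10 static matrices per frame, this would mean building 10 such functions per frame and running each over its share of the coordinates.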
| |
Megol .
| | Posts 676 12 Apr 2011 19:49
| Using the 3D accelerator for polygon transformations would limit the load on the CPU to mostly physics calculations. The optimal solution would of course be a vector coprocessor doing the transformations/physics calculations backed up by a polygon rendering core. Would require loads of LUTs...
| |
Marcel Verdaasdonk Netherlands
| | Posts 3976 12 Apr 2011 20:01
| Good point, Megol. Gunnar, would the 3D core have a specialized 4*4 matrix multiplier? And adding or not adding a VPU has been talked about here literally for years already. ;)
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 12 Apr 2011 21:43
| I've got it pretty fast already: 16 muls and 10 adds in 40 cycles, including 4 doubles (32 bytes) to read and 4 doubles to write. On the N070 it would probably run in 20-30 cycles. With a VPU we need a bus. Pushing and pulling data out of the unit will steal cycles. When a 64-bit FPU has 8 cycles of latency, a VPU that performs a 4x4 matrix mul will have more. To make it run in parallel we need to add a lot of logic (expensive). If the VPU is located outside the FPGA, data transfer will be slow. Perhaps 8 cycles to read, 8 cycles to write, and 16 cycles to perform the operation? Adding another pipeline in the N070 would be better. 2 memory reads per clock from the data cache will speed things up a lot. I'm just a software guy, so the hardware guys will probably correct me :D
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 12 Apr 2011 21:59
| Gunnar von Boehn wrote:
| I'm very sorry, but actually the SMC will make your code _slower_ and not faster. The reason is that a cache snooper works a little differently than you expect here. The cache snooper will purge the affected cache regions; it will not merge your updates by itself. This means your SMC will make you lose the content of ICache lines.
|
Take a look at my second optimization. Here the constant matrix is multiplied with 100 000 coordinates. Then it's swapped with SMC, and 100 000 more multiplications are performed. Losing the content of an ICache line is no problem when I only create a new loop every 3 200 000 cycles. What I am asking for is a cache snooper that can fix the SMC in 50-1000 cycles. If not, I will just implement the inner loop in different memory locations. You can call SMC dirty and outdated, but it can double the speed of your chip with proper programming.
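The amortization argument here is simple arithmetic, sketched below in Python. The 32 cycles per coordinate is an assumption inferred from the post's own numbers (3 200 000 cycles for 100 000 coordinates), and `smc_overhead` is a hypothetical helper name, not anything from the benchmark.

```python
def smc_overhead(coords_per_loop, cycles_per_coord, patch_cycles):
    """Fraction of total cycles spent on patching one inner loop."""
    work = coords_per_loop * cycles_per_coord
    return patch_cycles / (work + patch_cycles)


if __name__ == "__main__":
    # 100 000 coordinates at an assumed 32 cycles each gives the
    # 3 200 000 cycles between patches mentioned in the post.
    for patch in (50, 1000):
        print(patch, f"{smc_overhead(100_000, 32, patch):.4%}")
```

Under these assumptions even a 1000-cycle fix-up stays well below a tenth of a percent of the loop's run time.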
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 12 Apr 2011 22:19
| Thomas Richter wrote:
| Testing a 2010 CPU design "through a 1980s looking glass" is not quite appropriate. The CPU is designed for different coding principles, and one of the principles of the Harvard architecture is that the CPU *does not* modify its own code. Which means that you're trying to work against the design, and hence face the challenge.
|
I have removed the latency problem in a matrix mul on a CPU that has 8 registers and 8 cycles of latency on mul and add. I can do this because the instruction cache is fetching data in parallel with the data cache. By using self-modifying code, I push this 100 MHz CPU to its limits. I have turned the 100 MHz FPGA into a 400 MHz FPGA, and didn't write parallel.for() with 8 cores :D The stone-age techniques still work.
| |
Wojtek P Poland
| | Posts 1597 12 Apr 2011 22:19
| S P wrote:
| Take a look at my second optimization. Here the constant matrix is multiplied with 100 000 coordinates. Then it's swapped with SMC, and 100 000 more multiplications are performed. Losing the content of an ICache line is no problem when I only create a new loop every 3 200 000 cycles. What I am asking for is a cache snooper that can fix the SMC in 50-1000 cycles. If not, I will just implement the inner loop in different memory locations. You can call SMC dirty and outdated, but it can double the speed of your chip with proper programming. |
Everything that makes code more efficient is not outdated but good. Self-modifying code was widely used in older times. The proper solution for your case, and 1000 others, would be an instruction to invalidate a given cache line, as the programmer knows which lines were updated. Anyway, other optimizations of the CPU core for DSP-style work only make sense if they are not costly. What is important for the main CPU is to efficiently handle code with lots of branching and unpredictable paths. The proper solution would be a DSP-like coprocessor, ultra-simple and optimized for repetitive calculations with little branching, with a high power-to-cost ratio, possibly replicated.
| |
Thomas Richter Germany
| | (MX-Board Owner) Posts 1425 12 Apr 2011 22:44
| S P wrote:
| I have removed the latency problem in a matrix mul on a CPU that has 8 registers and 8 cycles of latency on mul and add. I can do this because the instruction cache is fetching data in parallel with the data cache. By using self-modifying code, I push this 100 MHz CPU to its limits. I have turned the 100 MHz FPGA into a 400 MHz FPGA, and didn't write parallel.for() with 8 cores :D The stone-age techniques still work.
|
No, they don't. And this is exactly the point. To make it work, you would need to call CacheClearE() to ensure that the cache is written out correctly before the CPU executes the code, which makes the code slow. It is an inappropriate solution for the design. Fast programs are fast because the algorithm is fast, not because someone wrote self-modifying code.
| |
|