SIMD Unit for the NatAmi
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 10 Sep 2010 00:39
| Amiga Believer wrote:
| I was thinking of the fact that integer calculation is usually faster than floating point calculation.
|
This was true in the 1980s. Today this is different. A properly pipelined CPU can do an Integer ADD per cycle. A properly pipelined CPU can also do a Floating point FADD per cycle. A properly pipelined CPU can do an Integer MUL per cycle. A properly pipelined CPU can also do a Floating point FMUL per cycle. That Integer is faster than Float is not a fact anymore. ;-)
| |
Denis Markovic Germany
| | (Natami Team) Posts 41 10 Sep 2010 01:32
| Amiga Believer wrote:
| @Loïc Dupuy > It is the case since the 68020 I know, but this was an answer to the suggestion to drop 64 bit support. I was saying that dropping 64 bit support for float is fine (in the SIMD only, the FPU should support it), but that we should at least keep partial 64 bit support for integers.
|
This sounds very reasonable to me (i.e. double only in FPU and float for SIMD).
Amiga Believer wrote:
| > 16bit float is very very imprecise. > Therefore I would not implement it. I disagree. Half precision floating point (IEEE754 binary16) is obviously not precise enough for scientific calculations. It is however quite enough for many everyday tasks, such as MPEG
|
I agree with Amiga Believer. 16 bit float is very common nowadays for vector processors. I would also add complex multiplication support (at least for 16 bit); this will speed up a lot of signal processing tasks (FFT, complex filters, ...). For 16 bit it is not very expensive; if it is too much for one cycle, you can pipeline the execution phase so that the latency goes up but the initiation interval is still 1. For float16 it would be nice to support 2 precisions: the IEEE half (if I remember right, one sign bit, 5 for the exponent and one hidden plus 10 stored bits for the mantissa) and maybe another format with more exponent and fewer mantissa bits, possibly switchable at runtime by setting a config bit in some register? Things like sqrt or log can be implemented very fast for float16.
Amiga Believer wrote:
| > 1) Convert Int to Float, > 2) Do SQRT > 3) Convert Float to Int
| It may be useful to have, like I said, a square root instruction for integers.
| > Do you talk about FLOAT or Integer?
| Well... both... I was thinking of adding it for signed integers and floating point.
| > In 3D games you always need to rotate Vectors with a Matrix.
| I was thinking of creating a SIMD unit with the goal of accelerating compression and decompression of video, audio and images, not to accelerate games. For games, I have another solution which I will bring in another forum topic.
| @Loïc Dupuy
| > Ok, SIMD can do 4 operations in //
| > But what are the real cases that benefit from this?
| > Outside this, all other instructions will be a waste, and it will be very difficult to justify the gates/power ratio given by the SIMD.
| See the example which I gave about MPEG in this message. When working with macroblocks, a SIMD unit can give a boost for the DCT or the IDCT calculation.
|
Exactly. And for MPEG, speech recognition etc. there should be some support for sum of absolute differences (L1 norm); of course the latency can be a bit longer as long as the initiation interval is 1. Also useful: more intra-vector instructions like find min/max (the best implementation would search for both with the same instruction), also loopable over several vectors for integer normalization (and of course min/max for SIMD float); abs for SIMD float is very simple, just set the sign bit to 0; abs for SIMD integer would be helpful; and of course a nice intra-vector permutation unit etc. :-) ...
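The operations listed in this post can be sketched in plain Python to make their semantics concrete. This is only an illustration, not a proposed NatAmi instruction definition: the function names are made up for the example, and a real SIMD unit would process all lanes in parallel rather than in a loop.

```python
import struct


def sad(block_a, block_b):
    """Sum of absolute differences (L1 norm of the difference) of two blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same length")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def fabs_by_sign_bit(x):
    """abs() of a 32-bit float, done by clearing bit 31 (the sign bit)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0x7FFFFFFF))[0]


def min_max(vector):
    """Find minimum and maximum in one pass, like a combined min/max instruction."""
    lo = hi = vector[0]
    for v in vector[1:]:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi
```

In motion estimation, `sad` would be evaluated for many candidate blocks and the candidate with the smallest result kept, which is why a throughput of one result per cycle matters more than the latency of a single result.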
| |
Denis Markovic Germany
| | (Natami Team) Posts 41 10 Sep 2010 01:36
| Denis Markovic wrote:
| I would also add complex multiplication support (at least for 16 bit), ...
|
Of course I meant complex mul for 16 bit floating point. Any chance we get 256 bit registers?
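For reference, a complex multiplication (a + bi)(c + di) = (ac - bd) + (ad + bc)i costs four real multiplies and two adds/subtracts. The sketch below assumes an interleaved lane layout [re0, im0, re1, im1, ...], which is an assumption made for illustration only; the thread does not specify a layout. With 16 bit elements, a 128 bit register would hold four such complex numbers.

```python
def cmul_lanes(x, y):
    """Element-wise complex multiply of two vectors stored as interleaved
    [re0, im0, re1, im1, ...] lanes (layout assumed for this sketch)."""
    if len(x) != len(y) or len(x) % 2:
        raise ValueError("inputs must have the same even length")
    out = []
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]      # a + bi
        c, d = y[i], y[i + 1]      # c + di
        out.append(a * c - b * d)  # real part
        out.append(a * d + b * c)  # imaginary part
    return out
```

An FFT butterfly is essentially one such multiply (by a twiddle factor) followed by an add and a subtract, which is why a single-issue complex multiply helps signal processing code.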
| |
Amiga Believer Canada
| | Posts 282 10 Sep 2010 05:58
| @ Gunnar von Boehn
> That Integer is faster than Float is not a fact anymore.
Is this true for divisions and square roots too?
> That Integer is faster than Float is not a fact anymore. ;-)
This just means that in theory it is possible to make a FPU which is as fast as an integer unit. It does not mean that all processors now have such a unit (Pentium 4 class CPUs do floating point much slower than integer) or even that we shall have such a powerful floating point on our SIMD unit, the design of said unit hasn't started yet. This would however hold true to our case if the floating point of our SIMD is performant, so the final answer is: it depends.
> Any chance we get 256 bit registers?
This seems unlikely to me, moreover, unless the unit is massively extended, instructions would take twice the clock cycles. I do think that 128 bit is fine.
@ Denis Markovic
> the IEEE half (if I remember right one sign bit, 5 for the exponent and one hidden plus 10 stored bits for the mantissa)
You remember right, the IEEE754 half precision floating point format (binary16) is as you described.
> For float16 it would be nice, to support 2 precisions, the IEEE half [precision floating point format] [...] and maybe another format with more exponent and less mantissa bits, possibly switchable during runtime by setting a config bit in some register?
I am not against another 16 bit floating point format, however I cannot think of an example of a case where the IEEE754 binary16 would be unsuitable (it is certainly suitable for compression or decompression of MPEG, JPEG, audio) and a format with more bits for the exponent would be suitable. Can you please provide such an example? Having IEEE754 binary16 has the advantage that it is a standard (this does not mean that non-standard options are bad).
> I would also add complex multiplication support (at least for 16 bit), this will speed up a lot of signal processing tasks (FFT, complex filters, ...); for 16 bit it is not very expensive
> Things like sqrt or log can be implemented very fast for float16.
> And for MPEG and speech recognition etc. there should be some support for sum of absolute differences (L1 norm)
> Also useful: more intra vector instructions like find min/max (best implementation would search for both with the same instruction), also loopable over several vectors for integer normalization (and of course min/max for SIMD float); abs for simd float is very simple by setting the sign bit to 0; abs for simd integer would be helpful;
> and of course a nice intra vector permutation unit etc.
You seem to be very knowledgeable about the following task: implementing signal processing on a SIMD unit. Why don't you make a proposal as to the instruction set for the SIMD unit? You seem to be the right person to design the SIMD instruction set.
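The binary16 layout confirmed in this post (1 sign bit, 5 exponent bits, 10 stored mantissa bits plus one hidden bit) can be sketched as a small decoder. The exponent bias of 15 and the handling of subnormals, infinities and NaN follow IEEE 754; the function name is made up for the example.

```python
def decode_binary16(bits):
    """Decode a 16-bit IEEE 754 binary16 pattern into a Python float."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits, bias 15
    fraction = bits & 0x3FF          # 10 stored mantissa bits
    if exponent == 0x1F:             # all ones: infinity or NaN
        value = float("nan") if fraction else float("inf")
    elif exponent == 0:              # subnormal: no hidden bit
        value = fraction * 2.0 ** -24
    else:                            # normal: hidden leading 1
        value = (1.0 + fraction / 1024.0) * 2.0 ** (exponent - 15)
    return -value if sign else value
```

The largest finite value is 65504 (pattern 0x7BFF) and the precision is roughly three decimal digits, which is the trade-off discussed in this thread.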
| |
Cesare Di Mauro Italy
| | Posts 528 10 Sep 2010 08:59
| Gunnar von Boehn wrote:
| Regarding the SIMD WIDTH I would have thought that 128 bit will be perfect. A design with 512 bit registers is way too costly IMHO, even when limiting to a 128 bit ALU that needs 4 clocks.
| Example: Cost of the Register File:
| A FPU register file of 16 registers, each 80 bit, implemented as real registers with the needed muxes will cost about: 2500 LE
| A register file of 16 registers, each 128 bit, implemented as real registers with muxes will cost about: 4000 LE. If you combine the FPU and SIMD Unit you will get the SIMD register file for 1500 LE.
| A register file of 16 registers of 512 bit width does cost about: 16000 LE. This is outrageously expensive.
| If you implement the register file with FPGA internal SRAM then you need 32 memory blocks for a full-clocked 512 bit register file with 2 read ports and 1 write port. This means this register file costs as much as 32 KB CPU Cache!
| Our goal is not to beat Larrabee or any upcoming PS4. I would be more than happy with a 128 bit SIMD Unit like PowerPC has. Also I think that a SIMD Unit which does one 128 bit operation every clock is easier to program and a lot more powerful at the end of the day than a SIMD Unit which does one 512 bit operation every 4 clocks. |
OK, I understand that there are limits with current FPGAs. My only concern was about having an ISA that can take advantage of future FPGA enhancements in a way that is transparent to applications (no need to rewrite the code for wider SIMD units), as I stated previously. That's all. Anyway, a good ISA design can be a starting point. I think that the ISA can define 32 registers (which is a good number; 16 can be too few), but make only the first 16 available in the first release, for example. I don't know if you plan to use line-F to implement new opcodes for the SIMD ISA, but I would like to use line-A plus an extension word, which gives a good 12 + 16 = 28 bits to carefully define the new opcodes: three operands (15 bits), a register mask (3 bits for the Kn "mask" register), the addressing mode (3 bits) for the sole memory operand, and the remaining 7 bits for the instruction code. It'll be a clean and powerful design, I think. Line-A is not used by normal applications, and an additional processor doesn't have to take into account legacy code that COULD use it (but who ever did?). Maybe line-F can be used just to implement SAVE/RESTORE and similar "control" instructions.
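The bit budget described in this post (4 bits for the line-A prefix, then 12 + 16 = 28 bits split as 7 + 3 + 3 + 3 x 5) can be checked with a small encoder/decoder. The post only gives the field widths; the field order, the names `rd`/`ra`/`rb` and the helper functions below are assumptions made purely for illustration.

```python
LINE_A = 0xA  # top nibble of the first 16-bit opcode word

# (name, width in bits), most significant field first. Order is hypothetical.
FIELDS = (("opcode", 7), ("mode", 3), ("mask", 3), ("rd", 5), ("ra", 5), ("rb", 5))
assert sum(width for _, width in FIELDS) == 28


def encode(**values):
    """Pack the fields into (first_word, extension_word)."""
    payload = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        payload = (payload << width) | value
    first = (LINE_A << 12) | (payload >> 16)  # upper 12 payload bits
    extension = payload & 0xFFFF              # lower 16 payload bits
    return first, extension


def decode(first, extension):
    """Unpack (first_word, extension_word) back into a dict of fields."""
    if first >> 12 != LINE_A:
        raise ValueError("not a line-A opcode")
    payload = ((first & 0x0FFF) << 16) | extension
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = payload & ((1 << width) - 1)
        payload >>= width
    return values
```

Five bits per register operand is what makes the 32 register file suggested above addressable, even if a first release only implements 16 of them.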
| |
Deep Sub Micron Germany
| | (MX-Board Owner) Posts 567 10 Sep 2010 12:01
| Amiga Believer wrote:
| @ Gunnar von Boehn > That Integer is faster than Float is not a fact anymore. Is this true for divisions and square roots too?
|
Especially for division and square root. This is because floating point does not require normalizing the operands first, for example for the Goldschmidt algorithm or for SRT division.
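To illustrate the normalization point: Goldschmidt division needs the divisor scaled into a fixed range before it converges. A floating point mantissa is already stored normalized, whereas an integer divisor would first need a leading-zero count and a shift. The sketch below is a software model only; `math.frexp` stands in for reading the mantissa and exponent fields, and the iteration count is an assumption chosen for double precision.

```python
import math


def goldschmidt_divide(n, d, iterations=6):
    """Approximate n / d with Goldschmidt's multiplicative iteration."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    # d = m * 2**e with 0.5 <= |m| < 1: the normalization floats get for free.
    m, e = math.frexp(d)
    sign = -1.0 if m < 0 else 1.0
    den = abs(m)
    num = sign * math.ldexp(n, -e)   # scale the numerator by the same power of two
    for _ in range(iterations):
        f = 2.0 - den                # correction factor
        num *= f                     # numerator converges to the quotient
        den *= f                     # denominator converges to 1
    return num
```

Each step squares the distance of the denominator from 1, so with a start value in [0.5, 1) six steps are more than enough for double precision; the two multiplies per step are independent and can be pipelined.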
Amiga Believer wrote:
| > That Integer is faster than Float is not a fact anymore. ;-) This just means that in theory it is possible to make a FPU which is as fast as an integer unit. It does not mean that all processors now have such a unit (Pentium 4 class CPUs do floating point much slower than integer) or even that we shall have such a powerful floating point on our SIMD unit, the design of said unit hasn't started yet. This would however hold true to our case if the floating point of our SIMD is performant, so the final answer is: it depends.
|
Only for throughput: a floating point unit can have the same throughput as an integer unit. The latency will always be at least a little bit larger (the pipeline is probably longer). So it is a good idea not to use the result of an operation as an operand in the next operation. As long as this is possible, the latency is hidden.
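The throughput versus latency argument can be made concrete with a toy cycle count. The model below assumes a fully pipelined unit that can issue one operation per cycle and delivers each result `latency` cycles after issue; the latency values used in the usage note are hypothetical, not figures for any real CPU.

```python
def cycles_dependent(n_ops, latency):
    """Every operation needs the previous result: full latency is paid each time."""
    return n_ops * latency


def cycles_independent(n_ops, latency):
    """Independent operations issue back to back; only the last latency is exposed."""
    return 0 if n_ops == 0 else (n_ops - 1) + latency


def cycles_interleaved(ops_per_chain, chains, latency):
    """Several dependent chains issued round-robin to hide the latency."""
    if ops_per_chain == 0 or chains == 0:
        return 0
    period = max(chains, latency)  # cycles between two operations of the same chain
    return (ops_per_chain - 1) * period + (chains - 1) + latency
```

With a hypothetical latency of 4, 100 operations in one dependent chain take 400 cycles in this model, while the same 100 operations split into 4 interleaved chains take 103 cycles, the same as 100 fully independent operations.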
| |
Marcel Verdaasdonk Netherlands
| | Posts 3991 11 Sep 2010 11:45
| Gunnar von Boehn wrote:
|
Amiga Believer wrote:
| I was thinking of the fact that integer calculation is usually faster than floating point calculation. |
This was true in the 1980s. Today this is different. A properly pipelined CPU can do an Integer ADD per cycle. A properly pipelined CPU can also do a Floating point FADD per cycle. A properly pipelined CPU can do an Integer MUL per cycle. A properly pipelined CPU can also do a Floating point FMUL per cycle. That Integer is faster than Float is not a fact anymore. ;-)
|
Gunnar, pipelining hides the problem. ;) But you're right on the big picture. Besides, I am no fan of counting picoseconds. I might be a nitpicker but even I have my limits. ;)
| |
Deep Sub Micron Germany
| | (MX-Board Owner) Posts 567 11 Sep 2010 17:26
| A properly pipelined jingamalator can do one jingamalation per cycle ;-)
| |
Lord Aga
| | Posts 129 11 Sep 2010 18:58
| See, I told you it was the ultimate feature :) And a strong card you can always play :)
| |
Denis Markovic Germany
| | (Natami Team) Posts 41 12 Sep 2010 10:31
| Marcel Verdaasdonk wrote:
|
Gunnar von Boehn wrote:
| That Integer is faster than Float is not a fact anymore. ;-) |
Gunnar, pipelining hides the problem. ;) But you're right on the big picture. Besides, I am no fan of counting picoseconds. I might be a nitpicker but even I have my limits. ;)
|
Right on the big picture but not in general, integer is still faster than float, at least in the rare cases when your calculation has a lot of dependencies and you just want 1 value as output of a long chain of calculations; I really like counting cycles :)
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 12 Sep 2010 10:49
| Denis Markovic wrote:
| Marcel Verdaasdonk wrote:
| Gunnar von Boehn wrote:
| That Integer is faster than Float is not a fact anymore. ;-) |
Gunnar, pipelining hides the problem. ;) But you're right on the big picture. Besides, I am no fan of counting picoseconds. I might be a nitpicker but even I have my limits. ;) |
Right on the big picture but not in general, integer is still faster than float, at least in the rare cases when your calculation has a lot of dependencies and you just want 1 value as output of a long chain of calculations; I really like counting cycles :) |
No, it really depends on what you are doing. If you do 3D stuff where you can run over your integer range or where you need to do a lot of normalizations, then float will be faster than integer calculations, as the floats do range correction / normalization automatically and for free.
| |
Denis Markovic Germany
| | (Natami Team) Posts 41 12 Sep 2010 10:54
| Amiga Believer wrote:
| > Any chance we get 256 bit registers? This seems unlikely to me, moreover, unless the unit is massively extended, instructions would take twice the clock cycles. I do think that 128 bit is fine.
|
Yes, you are probably right. If you have 128 bit registers, you should be able to fetch at least 128 data bits from memory/cache in one cycle in parallel to another operation (add/mac/cmac on another register) or you might starve your core. The only advantage I see with 256 bit registers is that you could implement the instructions slower in a first run, but for some future implementation with 256 bit speedup you would already have all the programs for it in place, i.e. programs would run faster automatically without a change (I guess the pipeline will not be exposed, otherwise you might get problems with this?).
Amiga Believer wrote:
| | @ Denis Markovic > For float16 it would be nice, to support 2 precisions, the IEEE half [precision floating point format] [...] and maybe another format with more exponent and less mantissa bits, possibly switchable during runtime by setting a config bit in some register? I am not against another 16 bit floating point format, however I cannot think of an example of a case where the IEEE754 binary16 would be unsuitable (it is certainly suitable for compression or decompression of MPEG, JPEG, audio) and a format with more bits for the exponent would be suitable. Can you please provide such an example? Having IEEE754 binary16 has the advantage that it is a standard (this does not mean that non-standard options are bad). |
Hm, maybe matrix inversions? But of course you could say that if you need more exponent bits, you could just use 32 bit float; I guess most use cases would be for high performance low power architectures in mobile phones, so maybe not so important for Natami ... One advantage of having fewer mantissa bits is that you could implement functions like a 1 cycle sqrt, ... very cheaply and simply in hardware. Anyhow, I would at least try to get rid of the not-a-number stuff etc. in 16 bit mode and use saturation instead; that is much more useful for signal processing if you have such a small number of bits (maybe at least settable via a config bit or encoded in the opcode); NaN can get really ugly for recursive algorithms. |
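The remark about NaN in recursive algorithms can be illustrated with a small model that rounds every intermediate result to 16 bit float, once with IEEE overflow to infinity and once with saturation to the largest finite value (65504). The mapping of NaN inputs to 0.0 in the saturating mode is a hypothetical policy chosen for this sketch; the post does not specify one.

```python
import math
import struct

F16_MAX = 65504.0  # largest finite IEEE 754 binary16 value


def round_ieee_f16(x):
    """Round to binary16 with IEEE behaviour: overflow becomes +/- infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct refuses values too large for binary16
        return math.copysign(math.inf, x)


def round_sat_f16(x):
    """Round to binary16 with saturation: overflow clamps to +/- F16_MAX."""
    if math.isnan(x):
        return 0.0  # hypothetical policy, not taken from the thread
    x = max(-F16_MAX, min(F16_MAX, x))
    return struct.unpack("<e", struct.pack("<e", x))[0]


def feedback_demo(rounder, steps=10):
    """A recursive computation that overflows, then subtracts the state from itself."""
    y = 1000.0
    for _ in range(steps):
        y = rounder(y * 2.0)
    return rounder(y - y)
```

In IEEE mode the state reaches infinity and `y - y` then yields NaN, which would poison every later output of a recursive filter; in saturating mode the state sticks at 65504 and the same expression yields 0.0.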
Amiga Believer wrote:
| > I would also add complex multiplication support (at least for 16 bit), this will speed up a lot of signal processing tasks (FFT, complex filters, ...); for 16 bit it is not very expensive ... Why don't you make a proposal as to the instruction set for the SIMD unit? You seem to be the right person to design the SIMD instruction set.
|
Thanks a lot. I have only used such instruction sets (vector processors and DSPs) but I never designed one. I would be happy to make some proposals if anyone from the Natami team is interested? (I guess a lot of ideas can already be found in Neon, Altivec, etc. :)
|
|
Ceti 331 United Kingdom
| | Posts 282 12 Sep 2010 17:57
| Right on the big picture but not in general, integer is still faster than float, at least in the rare cases when your calculation has a lot of dependencies and you just want 1 value as output of a long chain of calculations; I really like counting cycles :)
|
Latency is an important issue for ease of programming. Comparing Xbox 360 to PS3, the Xbox 360 CPU scores points with its dot-product instruction, i.e. lower latency than is possible when emulating the same function on the SIMD unit of the other (lost again in its issues with handling conditionals..). Very useful in general game AI/physics/collision etc. I think the Dreamcast's dot-product instruction had a precision tradeoff, most likely to reduce latency too, i.e. using one exponent across all 4 multiply-adds, almost like halfway between fixed and floating point. (Can anyone confirm/deny?)
| |
Marcel Verdaasdonk Netherlands
| | Posts 3991 13 Sep 2010 00:34
| Ceti, have you ever read some papers on the dot product? It's a matter of how it is implemented. Mathematically speaking it is a multiplication of two vectors, which are often denoted in a matrix. Euler, quaternion or matrix defines the amount of data you are computing. So what notation did your code use to get its dot product? ;)
| |
Ceti 331 United Kingdom
| | Posts 282 13 Sep 2010 03:59
| Well, just about every 3D programmer has dealt with dot products in various forms :) On the 360 the dot-product instruction has a latency of 18, whereas most vector float instructions have a latency of 12. But if you implement the dot product on that PowerPC (PS3 PPU) or even the SPUs, it will take quite a bit longer. You can get good throughput with parallelism (i.e. do 4 at once, SOA style) but the latency will always be more than double; you always have the 'multiply' and 'add' operations either side of a permute. SOA style suits collision detection etc. but not general purpose game logic / AI code. So: at the cost of this specialized silicon the 360 CPU gives you a low-latency implementation of this very common function.. good move. Microsoft gives you many implementations of common maths functions (like quaternion multiply) reformulated to use their dot-product instruction, specifically for that low-latency benefit. Matrix multiply can be done 2 ways depending on whether the DP instruction is available. Some cross-platform code will swizzle the matrices if the DP instruction is present.
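The two ways of computing dot products mentioned here can be sketched as follows. This is a scalar Python model of the data layouts only; the function names are made up, and the comments about SIMD steps describe a generic 4-lane unit rather than any specific console CPU.

```python
def dot4_aos(a, b):
    """One dot product of two 4-element vectors (array-of-structures layout).
    On a 4-lane SIMD unit: one multiply, then a horizontal sum that needs
    permutes and adds across the lanes of a single register."""
    products = [x * y for x, y in zip(a, b)]
    return products[0] + products[1] + products[2] + products[3]


def dot4_soa(ax, ay, az, aw, bx, by, bz, bw):
    """Four dot products at once (structure-of-arrays layout).
    Each argument holds one component of four different vectors, so every
    step is a plain lane-wise multiply or add with no permute."""
    return [ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i] + aw[i] * bw[i]
            for i in range(4)]


def transpose4(vectors):
    """Turn four (x, y, z, w) vectors into four component arrays (the 'swizzle')."""
    return [list(component) for component in zip(*vectors)]
```

The SOA form only pays off when four independent dot products are available at the same time, which fits batch work such as collision detection; for a single result in game logic code the AOS form with a dedicated low-latency instruction is the more natural fit.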
| |
|