Deep Sub Micron Germany
| | (MX-Board Owner) Posts 566 22 Feb 2011 13:45
| Matt Hey wrote:
| Newton-Raphson, SRT and Goldschmidt |
You are right: these division methods work well with floating point because they all require normalized input, although SRT is a little different in that it uses a lookup table in each iteration. I think the required normalization makes all of them much less attractive for integer division. In an FPGA, an adder costs much less than normalization hardware, which makes these methods even less appealing. As far as I know, OpenSPARC uses the floating-point unit for integer division. An FPU already has normalization hardware, so I think it makes sense to use one of these divide algorithms in the FPU.
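Why these methods want normalized input can be seen in a Newton-Raphson reciprocal: the iteration only converges from a seed that suits a known divisor range, so the divisor is first scaled into [0.5, 1). A minimal sketch in Python floats, assuming positive divisors; the linear seed 48/17 - 32/17*m is a common textbook choice, not something proposed in this thread, and the function name is made up:

```python
import math

def nr_divide(n: float, d: float, iterations: int = 5) -> float:
    """Approximate n / d via Newton-Raphson refinement of 1/d (d > 0 only)."""
    if d <= 0:
        raise ValueError("this sketch only handles positive divisors")
    # Normalization step: d = m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(d)
    # Linear seed for 1/m on [0.5, 1); its relative error is at most 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        # x_{k+1} = x_k * (2 - m * x_k): the relative error roughly squares each step.
        x = x * (2.0 - m * x)
    # 1/d = (1/m) * 2**-e, so undo the normalization by scaling with 2**-e.
    return math.ldexp(n * x, -e)
```

Without the frexp step, a single fixed seed could not serve every divisor, which is the normalization cost that makes these methods unattractive for a plain integer unit.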
| |
Matt Hey USA
| | Posts 727 22 Feb 2011 14:06
| Thierry Atheist wrote:
| The FPU in a 68060 can do 80-bit calculations? |
The 68881 through 68060 FPUs all use the 80-bit extended format internally, which has 64 bits of "fraction". The 68060 can't load an immediate 80-bit value with fop #imm,fpn without trapping (very slow); it must use fop EA,fpn. That would ensure the correct result of such a divide even if a double-precision result is sought. Right? |
Yes, extended precision would provide enough extra precision for double-precision math, and double precision (52-bit fraction) provides enough extra precision for a 32-bit integer. There can still be other issues, though. Claudio's idea is good if it works in all cases and doesn't slow down other divisions too much. Converting division by powers of 2 into a right shift is even simpler and faster, yet are any processors doing this?
@Deep Sub Micron: I guess all we need is an FPU for the N68050 then ;). The great thing about an FPGA is that we will be able to test different methods.
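The claim that double precision has enough headroom for 32-bit integers can be checked in software. A minimal sketch, assuming unsigned 32-bit operands and relying on Python's float being an IEEE double with correctly rounded division; the function name is made up for illustration:

```python
def divu32_via_double(a: int, b: int) -> int:
    """Unsigned 32-bit quotient a // b computed with one double-precision divide."""
    if not (0 <= a < 2**32 and 0 < b < 2**32):
        raise ValueError("operands must be unsigned 32-bit, divisor nonzero")
    # Both operands convert to double exactly (at most 32 significant bits).
    # If a/b is not an integer it lies at least 1/b away from any integer, while the
    # rounding error is at most (a/b) * 2**-53 < 2**-21 / b, so the rounded quotient
    # cannot cross an integer boundary and truncation yields the exact quotient.
    return int(float(a) / float(b))
```

The same argument does not carry over to 64-bit integers, whose values no longer fit in a double's 53-bit significand.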
| |
Ander Strom Sweden
| | Posts 9 22 Feb 2011 16:36
| This division discussion should be moved to a separate thread, but since we are way off topic by now: is it possible to create an analog divider for fast and sloppy division? E.g., feed an op-amp circuit from two D/A converters and read back the result with an A/D converter. Or are the D/A and A/D converters too slow in the FPGA unit for this?
| |
Amiga Ppc
| | Posts 246 22 Feb 2011 20:41
| @Thomas (guru): what phase are we in these days? Is any hardware testing in progress? Is everything going as expected? Any photos of the board in action? Thanks,
| |
André Jernung Sweden
| | (MX-Board Owner) Posts 988 22 Feb 2011 22:29
| amiga ppc wrote:
| @Thomas (guru): what phase are we in these days? Is any hardware testing in progress? Is everything going as expected? Any photos of the board in action? Thanks, |
Thomas is currently working on adapting the old memory controller to the new board. So no "action" just yet. :)
| |
Fabian Nunez USA
| | Posts 312 23 Feb 2011 03:14
| Wojtek P wrote:
| Gunnar von Boehn wrote:
| Wojtek P wrote:
| Claudio Wieland wrote:
| Gunnar von Boehn wrote:
| Claudio Wieland wrote:
| Gunnar: When I was on the team, I verified it. You just never cared. That's all there is to it. |
OK, then let's look at it in detail. :-D Can you please explain which cases your proposal covers? Are there any cases it does not cover? How did you verify the 3-cycle timing? Did you do a timing analysis? Was this analysis done together with the 050 ALU? Where can I see this simulation? I have seen nothing checked in; did I miss it? |
Since you seem to have forgotten it completely, here it is again. I proposed a special-purpose divide instruction for cases X.?/[1..255]. Divides by small numbers are common. For this, a 1/[1..255] lookup table holding 32-bit inverse values is used, and X can be multiplied by the inverse within one clock cycle. One 1 kB SRAM block holds the inverses of [1..255]. The End. |
This will work: a 32-bit inverse table fits in 1 KB. What you need then is: take the value from the table (1 clock cycle); perform a 32*32 multiply (at least 4 clock cycles) and discard the 31 LSBs, leaving 33 MSBs, where the upper 32 are the result and the lowest bit is a rounding bit; if that bit is 1, add 1 to the result (1 cycle). So 6 cycles. How long does the normal divider take? I'm not sure this makes sense. |
And where do you calculate the remainder? Normal 68K DIV instruction can also provide the remainder. |
Right, a few extra cycles. So that feature makes even less sense. |
Any reason why there can't be a new "divq #x, ea" instruction that divides ea by x (0 < x < 256) and doesn't calculate a remainder? I bet that in a lot of the common cases where one divides by a small number, that number is a constant.
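The reciprocal-table idea discussed above can be modeled in software to check the arithmetic. This is a Python model, not HDL, and it departs from Wojtek's description in one deliberate way: the hypothetical table INV_TABLE holds floor((2^32 - 1)/d), and instead of a rounding bit the estimate is fixed with a remainder check. With that table the high word of the product is never more than one too small, so a single compare-and-correct step gives both quotient and remainder:

```python
MASK32 = 0xFFFFFFFF

# Hypothetical table: 255 entries x 4 bytes = 1020 bytes, so it fits one 1 kB SRAM block.
# Index 0 is unused; division by zero is not handled by this sketch.
INV_TABLE = [0] + [MASK32 // d for d in range(1, 256)]

def small_divu(x: int, d: int) -> tuple[int, int]:
    """Unsigned 32-bit x divided by d in 1..255; returns (quotient, remainder)."""
    if not (0 <= x <= MASK32 and 1 <= d <= 255):
        raise ValueError("x must be unsigned 32-bit and d in 1..255")
    m = INV_TABLE[d]          # step 1: table lookup
    q = (x * m) >> 32         # step 2: 32x32 -> 64-bit multiply, keep the high 32 bits
    r = x - q * d             # remainder: one more multiply and a subtract
    if r >= d:                # the estimate is at most one too small
        q += 1
        r -= d
    return q, r
```

The correction needs a second multiply (q * d) and a compare, so in hardware it would add cycles on top of the 6-cycle estimate, although it also produces the remainder asked about above.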
| |
Matt Hey USA
| | Posts 727 23 Feb 2011 04:10
| Fabian Nunez wrote:
| Any reason why there can't be a new "divq #x, ea" instruction that divides ea by x (0 < x < 256) and doesn't calculate a remainder? |
divs.l and divu.l already have forms that do not generate a remainder. An assembler or compiler optimization can convert division by an immediate value into a sequence of instructions that does not use division. This would already be better than a divq #imm,EA, because it works with numbers greater than 256 and can turn powers of 2 into right shifts, which are faster than a table lookup. Another "quick" option is to use the shorter divs.w when possible, which takes <=22 cycles on the 68060 (similar on the N68k?) versus 38 cycles for the long forms. A remainder is not that difficult to create either: it's one more multiply and a subtract, which is fast compared to a 38-cycle divide. The big advantage of Claudio's idea is that it would also work with non-immediate divides, which are more common. I expect that over 50% of divides are by numbers less than 256. Maybe 3 different versions of div could be started in parallel, and the fastest one would return first: a right shift for powers of 2, Claudio's lookup table for small divisors, and a regular convergence division. Would this slow things down much if all 3 were started in parallel and the first one to finish correctly stopped the others?
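Turning a division by a power of 2 into a right shift is exact for unsigned values, but for signed values an arithmetic shift rounds toward minus infinity while DIVS truncates toward zero. Compilers therefore add a bias of 2**k - 1 to negative dividends before shifting. A minimal Python sketch of that standard trick; the function name is hypothetical:

```python
def divs_pow2(x: int, k: int) -> int:
    """Signed x / 2**k rounded toward zero (like DIVS), using only an add and a shift."""
    # A plain arithmetic shift gives -7 >> 1 == -4, while DIVS gives -7 / 2 == -3.
    # Adding 2**k - 1 to negative inputs turns the floor into a truncation.
    bias = (1 << k) - 1 if x < 0 else 0
    return (x + bias) >> k
```

The remainder then costs one shift and a subtract: x - (divs_pow2(x, k) << k). Generated code usually derives the bias from the sign bit without a branch; the `if` here is only for readability.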
| |
Fabian Nunez USA
| | Posts 312 23 Feb 2011 05:02
| I imagine the problem would mostly be the on-chip resources needed to divide everything three times, when division is a relatively rare operation. It would make a lot more sense on a DSP.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 23 Feb 2011 10:45
| Matt Hey wrote:
| Would this slow down much if all 3 were started in parallel and the first one done correctly would stop the others? |
| The challenge is doing the pipelining right. Let's look at the pipeline: 1) Instruction read 2) Instruction decode 3) Register fetch 4) EA calculation 5) Cache fetch 6) ALU operation 7) Write back.
For the sake of argument, let's say your DIV can take 10 clocks or 1 clock depending on the size: DIV #1,D0 takes 1 clock, DIV #666,D0 takes 10 clocks. Let's say you have this code: DIV D1,D0 followed by ADD D0,D2. The CPU will see a dependency between the two instructions. This means the CPU has to delay the second instruction so that it only starts in the ALU when the first is done. When is this point? In clock cycle 2 or clock cycle 11? It is difficult for the CPU to know this in advance, which makes planning the pipeline hard.
So even if you make a DIV take 1 cycle, the CPU might not be able to use that properly, because it will not know in advance that it can delay the second instruction less than expected. And if there is no dependency between the instructions, the number of cycles the DIV takes does not matter anyhow, as long as the CPU is able to execute it in parallel.
| |
Marcel Verdaasdonk Netherlands
| | Posts 3974 23 Feb 2011 10:50
| DIV timing depends on the data size: you can make a good estimate of how long a DIV takes based on how big the numbers involved are. Thus a CPU should be aware of a few things: the current instruction, the next instruction, and the data size of both. (There is a lot more to keep in mind, but I wanted to keep it simple.)
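The size-based estimate can be made concrete for one kind of divider: a radix-2 restoring divider that skips quotient bits known to be zero needs bitlen(n) - bitlen(d) + 1 steps, because the quotient cannot have more bits than that. The sketch below assumes such an early-out divider and a made-up fixed overhead of 2 cycles; it is not the N68050 divider, whose design isn't described in this thread:

```python
def restoring_divide(n: int, d: int) -> tuple[int, int, int]:
    """Radix-2 restoring division that skips leading zero quotient bits.
    Returns (quotient, remainder, steps)."""
    if n < 0 or d <= 0:
        raise ValueError("sketch handles n >= 0, d > 0 only")
    # The quotient has at most bitlen(n) - bitlen(d) + 1 bits, so start there.
    steps = max(0, n.bit_length() - d.bit_length() + 1)
    q, r = 0, n
    for i in range(steps - 1, -1, -1):
        q <<= 1
        if r >= (d << i):       # trial subtraction of the shifted divisor
            r -= d << i
            q |= 1
    return q, r, steps

def div_latency_estimate(n: int, d: int, overhead: int = 2) -> int:
    """Hypothetical cycle count: fixed overhead plus one cycle per quotient bit."""
    return overhead + max(0, n.bit_length() - d.bit_length() + 1)
```

For example, 100 / 7 needs 5 steps (7-bit dividend, 3-bit divisor), while a 32-bit dividend divided by 1 needs 32, so the estimate is available from the operands before the divide starts.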
| |
Wojtek P Poland
| | Posts 1597 23 Feb 2011 13:04
| Gunnar von Boehn wrote:
| In clock cycle 2 or clock cycle 11? |
Doesn't your CPU implement scoreboarding now? Simply start the next instruction when the first one is done, or when the first one is not writing to a register/memory location that the next instruction needs to read.
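The scoreboarding suggestion can be sketched as an in-order, single-issue model that records when each register's new value becomes readable and stalls a dependent instruction until then. This is a toy under stated assumptions (register operands only, one instruction started per cycle, no write-port conflicts, no handling of write-after-write hazards); the Instr and schedule names are hypothetical. Cycles count from 1 to match Gunnar's example:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dst: str
    srcs: tuple[str, ...]
    latency: int  # execute cycles; for DIV this may depend on the operands

def schedule(program: list[Instr]) -> list[int]:
    """Return the cycle in which each instruction starts executing."""
    ready_at: dict[str, int] = {}  # register -> first cycle its new value can be read
    starts = []
    cycle = 1
    for ins in program:
        # Stall until every source register has been produced (in-order issue).
        start = max([cycle] + [ready_at.get(r, 0) for r in ins.srcs])
        starts.append(start)
        ready_at[ins.dst] = start + ins.latency
        cycle = start + 1  # the next instruction can start one cycle later at the earliest
    return starts
```

With a 1-cycle DIV D1,D0 the dependent ADD D0,D2 starts in cycle 2, and with a 10-cycle DIV in cycle 11, the two cases Gunnar mentions. The model knows the DIV latency when the instruction starts; the difficulty raised in the thread is that real hardware may not, which is why some designs fix the latency to simplify scheduling.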
| |
Thomas Hirsch Germany
| | (MX-Board Owner) Posts 647 23 Feb 2011 14:03
| Niclas Aronsson wrote:
| Here is a quick cleanup on my first picture with more info added :P EXTERNAL LINK |
Hello Niclas, say, may I use your picture? I would like to use it for a short flyer to promote the board and gain some interest from chip manufacturers/distributors at a fair.
| |
Megol .
| | Posts 672 23 Feb 2011 15:51
| Wojtek P wrote:
|
Gunnar von Boehn wrote:
| In clock cycle 2 or clock cycle 11? |
Doesn't your CPU implement scoreboarding now? Simply start the next instruction when the first one is done, or when the first one is not writing to a register/memory location that the next instruction needs to read.
|
It isn't always that simple. Even OOO processors sometimes increase instruction latency to make it easier to schedule.
| |
Niclas Aronsson Sweden
| | Posts 57 23 Feb 2011 16:09
| Thomas Hirsch wrote:
| Niclas Aronsson wrote:
| Here is a quick cleanup on my first picture with more info added :P EXTERNAL LINK |
Hello Niclas, say, may I use your picture? I would like to use it for a short flyer to promote the board and gain some interest from chip manufacturers/distributors at a fair. |
Of course. You can do what you want with it :) Here is the original Photoshop file. EXTERNAL LINK
| |
Thomas Hirsch Germany
| | (MX-Board Owner) Posts 647 23 Feb 2011 16:59
| Thank you very much!!
| |
Jacek Rafal Tatko Espania
| | Posts 607 23 Feb 2011 20:04
| Great to read the Good News :)
| |
Thierry Atheist Canada
| | Posts 1828 23 Feb 2011 22:00
| Niclas Aronsson wrote:
|
Thomas Hirsch wrote:
| Hello Niclas, say, may I use your picture? I would like to use it for a short flyer to promote the board and gain some interest from chip manufacturers/distributors at a fair.
|
Of course. You can do what you want with it :)
|
That's the first time I've seen that... Don't know how I missed it!!! That is one CLEAN, nice-looking board!!!!! I'm going to (if it's okay) use it as my windross x86 wallpaper!!!! Try to give Win some dignity.... On the other hand... WHY?
| |
Ed Dream Russia
| | Posts 28 24 Feb 2011 14:14
| Tell me, when will the MX ship to developers? I really want to help the project and am willing to buy a dev platform, even though I have nothing to do with the design.
| |
Thomas Hirsch Germany
| | (MX-Board Owner) Posts 647 24 Feb 2011 15:49
| OK, a short update on where I am right now: unfortunately, I found two schematic/layout errors that short-circuited the 1.8 V supply. I had to work around them, which took a little time. This means I can now safely turn on the board and debug it. Next come the FPGA design changes, namely pinning and adapting the IP to the new main FPGA.
| |
André Jernung Sweden
| | (MX-Board Owner) Posts 988 24 Feb 2011 16:17
| *MODERATION POST* I removed the off-topic discussion about file sharing sites, JavaScript, stalinism etc. This thread is not for discussing these things, but for discussing the MX board and the progress of the bring-up. If you wish to resume the discussion, please start a new thread in the talk section. Wojtek, I know that you were perfectly able to download the file in question using several methods but just saw a chance to crusade a bit about web standards. If you want to do that, start a new thread and do not clutter this one. That goes for all of you.
| |
|