Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Do you have ideas and feature wishes? Post them here and discuss your ideas.

Integer ALU Design Ideas
Marcel Verdaasdonk
Netherlands

Posts 3991
01 Oct 2010 10:14


To keep the register thread from going off topic, I am starting this one.

Okay, I hope Gunnar won't mind me moving his last post here.

Oh, and since we are talking about ALUs anyhow:
Do we use a demux on the carry-out line between bytes and words (either carry over, or set the carry flag, depending on instruction length)?

Deep Sub Micron
Germany
(MX-Board Owner)
Posts 567
01 Oct 2010 14:32


Not a demux; a multiplexer selects adder result bit 32, bit 16 or bit 8. The carry-over is not gated by anything, so that the fast carry path can be used.
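
A minimal Python sketch of that selection, purely as an illustration. The function name `add_sized` and the assumption that both operands are zero-extended to the operation size before they reach the adder are mine, not taken from the actual core. The carry chain runs ungated, and the carry flag is simply picked from result bit 8, 16 or 32.

```python
def add_sized(a, b, size_bits):
    """Model one adder whose carry flag is picked by operand size.

    Assumption: both operands are zero-extended to size_bits first,
    so result bit number size_bits is exactly the carry out of that size.
    """
    if size_bits not in (8, 16, 32):
        raise ValueError("size_bits must be 8, 16 or 32")
    mask = (1 << size_bits) - 1
    total = (a & mask) + (b & mask)      # single add, carry path never gated
    carry = (total >> size_bits) & 1     # the "multiplexer": bit 8, 16 or 32
    return total & mask, carry


print(add_sized(0xFF, 0x01, 8))    # byte add: result wraps and the carry is set
print(add_sized(0xFF, 0x01, 16))   # word add: same inputs, no carry
```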


Marcel Verdaasdonk
Netherlands

Posts 3991
01 Oct 2010 14:45


Brain storming Deep sub, Brain storming. ;)

Perhaps I should have added that it was an idea I thought of for SIMD, if that wouldn't limit throughput severely (it adds latency on the carry...).

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
01 Oct 2010 14:55


Well, the very best 68K ALU that I can imagine,
that someone could implement inside an FPGA would look like this:
 
 
 
  Quick legend:
  We have two AGUs.
  The AGU can calculate the addresses for a memory/cache access.
  Each AGU can combine one immediate value and two register values per clock. The result can be used to update an address pointer.
 
  We also have two ALUs
  Each ALU can combine Immediate, Memory values and register values.
  The ALU can update Data registers and memory location.
 
  We also have a Branch Acceleration Unit
 
  and a Parallel Load Unit, which is able to update registers and cache.
 
 
  This is the best 68K ALU that I can imagine.

Deep Sub Micron
Germany
(MX-Board Owner)
Posts 567
01 Oct 2010 15:38


> Brain storming Deep sub, Brain storming. ;)

Yes, OK. When thinking about SIMD, inserting a dummy bit into both operands might be a good way to gate the carry. Using a simple AND gate instead can cause a huge delay of an additional lookup table (because it breaks the special fast carry path). If the sum of the inserted bits is 1, the carry is propagated. If both bits are '0', the carry is gated for SIMD. And finally, for SIMD, both bits '1' inserts a carry (like in ADDX). The result also contains a dummy bit that needs to be removed/ignored or used as the carry out for SIMD.

So for example a 64bit adder that can also be used as SIMD 8 * 8 bit adder has a carry path timing like a 64+7bit adder.
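
To make the dummy-bit trick concrete, here is a small Python model of it. This is only a sketch under my own assumptions: the helper names `spread`, `squeeze` and `add_wide` are made up for illustration, and the same dummy value is used at all 7 byte boundaries. Each byte is moved into a 9-bit slot, the two widened operands go through one ordinary addition (the 64+7 bit adder), and the dummy bits are dropped from the result.

```python
LANES = 8              # eight byte lanes in a 64-bit word
LANE_BITS = 8
SLOT = LANE_BITS + 1   # each lane is followed by one dummy bit


def spread(value, dummy):
    """Move each byte into a 9-bit slot and fill the 7 gaps with `dummy`."""
    wide = 0
    for i in range(LANES):
        wide |= ((value >> (i * LANE_BITS)) & 0xFF) << (i * SLOT)
        if i < LANES - 1:                      # no dummy bit above the top lane
            wide |= (dummy & 1) << (i * SLOT + LANE_BITS)
    return wide


def squeeze(wide):
    """Drop the dummy bits again and return the plain 64-bit result."""
    value = 0
    for i in range(LANES):
        value |= ((wide >> (i * SLOT)) & 0xFF) << (i * LANE_BITS)
    return value


def add_wide(a, b, dummy_a, dummy_b):
    """One 64+7 bit add; the dummy bit pair selects the carry behaviour."""
    return squeeze(spread(a, dummy_a) + spread(b, dummy_b))


a, b = 0x00000000000000FF, 0x0000000000000001
print(hex(add_wide(a, b, 0, 1)))   # dummy sum 1: carry propagates (plain 64-bit add)
print(hex(add_wide(a, b, 0, 0)))   # dummy sum 0: carry gated (8 x 8-bit SIMD add)
print(hex(add_wide(a, b, 1, 1)))   # both 1: a carry is inserted into each upper lane
```

In this model the per-lane carry out ends up in the dummy position of the widened result, which `squeeze` simply ignores.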


Marcel Verdaasdonk
Netherlands

Posts 3991
01 Oct 2010 16:27


Deep Sub, I suppose if there is a shortage of space my idea could be used.
Realistically speaking, we could also build it with several ALUs, at the cost of burning up a lot of free space.

@Gunnar, is your post the idea for the 68070, or am I mistaken?

Megol .

Posts 690
01 Oct 2010 21:13


deep sub micron wrote:

  > Brain storming Deep sub, Brain storming. ;)
 
  Yes, OK. When thinking about SIMD, inserting a dummy bit into both operands might be a good way to gate the carry. Using a simple AND gate instead can cause a huge delay of an additional lookup table (because it breaks the special fast carry path). If the sum of the inserted bits is 1, the carry is propagated. If both bits are '0', the carry is gated for SIMD. And finally, for SIMD, both bits '1' inserts a carry (like in ADDX). The result also contains a dummy bit that needs to be removed/ignored or used as the carry out for SIMD.
 
  So for example a 64bit adder that can also be used as SIMD 8 * 8 bit adder has a carry path timing like a 64+7bit adder.

Wouldn't it be better to split a 64-bit addition into two stages? Most FPGAs seem to be optimized for ~32-bit adds, and one extra cycle of latency for something not backwards compatible shouldn't be too big a problem. SIMD code in my experience is more about throughput than low latency. With an extra bypass one could still get back-to-back dependent additions; however, that would of course cause a slowdown again.
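
As a rough Python illustration of the two-stage split (the function names and the exact split at bit 32 are assumptions for this sketch, not a description of any real core): the first stage adds the low halves and hands the carry on, the second stage adds the high halves plus that carry.

```python
MASK32 = 0xFFFFFFFF


def stage1(a, b):
    """First cycle: add the low halves and keep the carry for the next stage."""
    low = (a & MASK32) + (b & MASK32)
    return low & MASK32, low >> 32


def stage2(a, b, low, carry):
    """Second cycle: add the high halves plus the latched carry."""
    high = ((a >> 32) & MASK32) + ((b >> 32) & MASK32) + carry
    return ((high & MASK32) << 32) | low


def add64_two_stage(a, b):
    low, carry = stage1(a, b)        # in hardware a pipeline register sits here
    return stage2(a, b, low, carry)


print(hex(add64_two_stage(0x00000000FFFFFFFF, 0x1)))  # carry crosses the stage boundary
```

Each stage only needs a 32-bit carry chain, at the price of one extra cycle before the full result is available.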

Marcel Verdaasdonk
Netherlands

Posts 3991
02 Oct 2010 11:41


It wouldn't really matter much, since the ALU would never have the full 64 bits as one (or am I wrong here?).

Megol .

Posts 690
02 Oct 2010 13:49


Marcel Verdaasdonk wrote:

It wouldn't really matter much, since the ALU would never have the full 64 bits as one (or am I wrong here?).

An extended float uses a 64 bit mantissa so a compatible FPU would require such an adder (in reality somewhat bigger due to rounding). On the other hand a carry-select adder wouldn't be much bigger and should allow a 64 bit add in one pipestage.
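
For readers who have not met the carry-select idea: a minimal Python sketch, assuming a simple two-block split at bit 32 (the function name and the block size are my choice for illustration). The high half is computed twice, once for carry-in 0 and once for carry-in 1, and the carry out of the low half only selects between the two finished results.

```python
MASK32 = 0xFFFFFFFF


def carry_select_add64(a, b):
    """64-bit add built from 32-bit adds; returns (sum, carry_out)."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    low = a_lo + b_lo              # low half
    high_if_0 = a_hi + b_hi        # high half, assuming carry-in 0
    high_if_1 = a_hi + b_hi + 1    # high half, assuming carry-in 1

    # All three adds are independent; the low carry only drives this select.
    high = high_if_1 if (low >> 32) else high_if_0
    total = ((high & MASK32) << 32) | (low & MASK32)
    return total, high >> 32


print(carry_select_add64(0xFFFFFFFFFFFFFFFF, 0x1))  # wraps to zero with carry out
```

The extra cost is one additional 32-bit adder plus the select, which is why it is only somewhat bigger than a plain adder.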


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
06 Oct 2010 09:34


The 68K ISA is really designed for super scalar.

There are many 68K instructions which need several clocks and need to be executed in microcode on a non-super scalar CPU.

These are:

move (ea),(ea)
subx
addx
cmpm
exg (some forms)
link
unlk
pea

A super scalar CPU outlined as above would be able to execute them in parallel on both pipes.

This means a well-designed super scalar core reduces the need for most of the microcode.

Megol .

Posts 690
06 Oct 2010 14:53


Gunnar von Boehn wrote:

The 68K ISA is really designed for super scalar.
 
  There are many 68K instructions which need several clocks and need to be executed in microcode on a non-super scalar CPU.
 
  These are:
 
  move (ea),(ea)
  subx
  addx
  cmpm
  exg (some forms)
  link
  unlk
  pea
 
  A super scalar CPU outlined as above would be able to execute them in parallel on both pipes.
 
  This means a well-designed super scalar core reduces the need for most of the microcode.

MOVE (EA),(EA) "only" needs an extra EA unit for the other address, as it's still a LD-EX-ST operation and should fit the pipeline.
If one has the extra EA unit, PEA should also be a one-clock affair.
SUBX and ADDX shouldn't need microcode unless you do as Intel does and split them into an ordinary addition/subtraction plus an add/sub with the X flag.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
06 Oct 2010 15:24


Megol . wrote:

MOVE (EA), (EA) "only" needs an extra EA unit for the other address

Yes, you need two EA calculations for a MOVE like this:
MOVE.L d(A0),d(A1)

Megol . wrote:

If one has the extra EA unit, PEA should also be a one-clock affair.

Megol . wrote:

SUBX and ADDX shouldn't need microcode unless you do as Intel does and split them into an ordinary addition/subtraction plus an add/sub with the X flag.

As you certainly know, ADDX and SUBX support these modes too:
ADDX -(Ay), -(Ax)
For those you of course need two EA calculations.

The same addressing is done with CMPM.

But as mentioned all these 68K instructions can be done without microcode on a Superscalar design.

Marcel Verdaasdonk
Netherlands

Posts 3991
06 Oct 2010 17:42


Gunnar, does this mean the EA unit of one pipeline gets stolen by the other to execute some code?
I wonder what our in-house ASM coders have to say about this.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
06 Oct 2010 18:05


Marcel Verdaasdonk wrote:

Gunnar, does this mean the EA unit of one pipeline gets stolen by the other to execute some code?
  I wonder what our in-house ASM coders have to say about this.
 

I assume that they would like it as it makes the CPU both faster and simpler to optimize for.

The CPU would have 2 A-Units and 2 D-Units.

68K Instructions would use 1 or more units in parallel.

Instructions using 1 A-Unit:
ADDA
SUBA
LEA

Instructions using 1 D-Unit
MOVEQ
SWAP
EXT
And generally all instructions using immediates or Data registers as sources and updating Data registers. These are very many instructions.

Instructions using 1 A-Unit and 1 D-Unit.
These are instructions using memory as source or destination.

Instructions using 2 A-Units (0 D-Units)
move (Ea),(Ea)

Instructions using 2 A-Units and 1 D-Unit
pea
addx -(An),-(An)
cmpm

So technically such a CPU has 4 execution units (+ branch unit).
Depending on the instruction flow it would execute 1 to 5 68K instructions per clock.
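
The unit counts listed above can be written down as a small table. The Python sketch below is only an illustration under strong simplifying assumptions of mine: the keys such as `move_mem` and `addx_mem` are made-up labels for the instruction forms named in the list, instructions are grouped greedily in program order, and register dependencies, decoder limits and the branch unit are ignored entirely.

```python
# (A-Units, D-Units) needed per instruction form, taken from the list above.
UNITS = {
    "adda": (1, 0), "suba": (1, 0), "lea": (1, 0),
    "moveq": (0, 1), "swap": (0, 1), "ext": (0, 1),
    "move_mem": (2, 0),            # move (Ea),(Ea)
    "pea": (2, 1),
    "addx_mem": (2, 1),            # addx -(An),-(An)
    "cmpm": (2, 1),
}


def group_per_clock(instrs, a_units=2, d_units=2):
    """Greedily pack instructions, in order, into groups that fit the units."""
    groups, current = [], []
    used_a = used_d = 0
    for name in instrs:
        need_a, need_d = UNITS[name]
        # Start a new clock when this instruction no longer fits.
        if current and (used_a + need_a > a_units or used_d + need_d > d_units):
            groups.append(current)
            current, used_a, used_d = [], 0, 0
        current.append(name)
        used_a += need_a
        used_d += need_d
    if current:
        groups.append(current)
    return groups


for clock, group in enumerate(group_per_clock(["lea", "moveq", "adda", "swap", "cmpm"])):
    print(clock, group)
```

A real core would of course also have to check operand dependencies between the instructions in a group; this only counts units.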

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
06 Oct 2010 19:43


The 68060 was superscalar and allowed 1 to 2 instructions per clock.
The matrix of which instruction combinations could be executed in parallel was relatively complicated.

The CPU outlined above would be much more powerful and would allow 2 instructions per clock much more often. Instructions which took several clocks would instead be executed in parallel, needing only 1 clock, and with good code alignment even 4 or more instructions could be executed per clock.

Now the question is: would it make sense to have an assembler which is able to highlight how many instructions are executed per clock?
Or an assembler which highlights when instructions cannot be executed in parallel?

What do you think, is such a tool needed?
Would someone be willing to write it?

Marcel Verdaasdonk
Netherlands

Posts 3991
07 Oct 2010 00:19


Gunnar, we already have 5 ALUs basically. What would be the added cost of adding another EA unit and letting its primary use be support of the branch prediction unit, and its secondary use be a 'free address unit'?

As an answer to your question: I have been working on adding m68k syntax to Notepad++.
I am not sure if I can add what you're looking for, but when I am done with what I am working on I would like to look into it.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
07 Oct 2010 04:02


Marcel Verdaasdonk wrote:

Gunnar, we already have 5 ALUs basically.

Hmm, how do you count this?

Marcel Verdaasdonk wrote:

What would be the added cost of adding another EA unit and letting its primary use be support of the branch prediction unit, and its secondary use be a 'free address unit'?

The branch prediction unit can already do its own calculation.
It can do 1 free branch per clock.
 


Cesare Di Mauro
Italy

Posts 528
07 Oct 2010 05:56


Gunnar von Boehn wrote:
So technically such a CPU has 4 execution units (+ branch unit).
  Depending on the instruction flow it would execute 1 to 5 68K instructions per clock.

Terrific! How many instructions can the decoder unit decode? Do you use a decoded instruction cache (similar to the Pentium 4 trace cache)?

Marcel Verdaasdonk
Netherlands

Posts 3991
07 Oct 2010 10:40


1 Branch unit (Very specific ALU?)
2 Address units (limited ALU)
2 Data units (general ALU)

I thought the branch prediction needed address calculation too.
That is why I asked for a 3rd Address unit that can be 'used' by the other pipelines as well,
making it easier to sustain throughput.

I might be wrong.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
07 Oct 2010 12:35


Marcel Verdaasdonk wrote:

1 Branch unit (Very specific ALU?)
  2 Address units (limited ALU)
  2 Data units (general ALU)
 
  I thought the branch prediction needed address calculation too.
  That is why I asked for a 3rd Address unit that can be 'used' by the other pipelines as well,
  making it easier to sustain throughput.
 
  I might be wrong.
 

It does not work that easily.

I'll explain why.
There are three types of units:

A) Branch acceleration.
This unit is in the very first pipeline step.
Before(!) reading any registers.
Only because it comes before reading registers can it accelerate branches this well.

B) The Address units are in the middle of the pipeline, before reading memory.

C) The Data units are at the end of the pipeline, after reading memory.

It is not only the registers a unit works with but also its location (beginning, middle or end of the pipeline) that defines its behaviour.

This CPU design depends on having the units placed one behind another.
This "trick" allows the CPU to do so many things in parallel.


