
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



68050 Gets Bitfield Instructions
Angel of Paradise
Germany

Posts 32
09 Mar 2010 10:33


Isn't it a big advantage that the 68050 can operate on all 16 registers now?

Sounds like what AMD did with their x86-64bit enhancement.
Didn't AMD say that having 16 registers gave a 20% performance boost?


Marcel Verdaasdonk
Netherlands

Posts 2100
09 Mar 2010 11:21


Angel of Paradise wrote:

Isn't it a big advantage that the 68050 can operate on all 16 registers now?
 
  Sounds like what AMD did with their x86-64bit enhancement.
  Didn't AMD say that having 16 registers gave a 20% performance boost?
 

True, it is a good thing, but with the same bit it was possible to double the address range, so it is more a question of what is important design-wise.

Thomas Richter
Germany

Posts 699
09 Mar 2010 22:08


Gunnar von Boehn wrote:

Thomas Richter wrote:

  Hmm. Probably all the bit-fiddling instructions are too special to allow a broad use of them on address registers
 

 
  The 68K instruction set defines a 4-bit field (bits 15/14/13/12) to allow addressing of all 16 registers freely.
 
  The N68050 can do full logic and arithmetic on address registers.
  This means the N68050 can do MUL, AND, OR, and BIT operations on them.

Well, thanks, understood. All I'm saying is that - in my experience - using address registers because I run out of data registers was never really an issue. The other way 'round was indeed an issue, but that's harder to address. Well, if you say we can do a

move.l dn(d0),reg

quickly (via the 020 addressing mode of move.l dn(za0,d0),reg), then this would do it. IIRC, the extension is one word longer, so it is still not quite as efficient, is it? Sorry, I forgot that part.

 
Gunnar von Boehn wrote:
 
 
Thomas Richter wrote:

  Hey, an idea! The following *would* be actually useful: Address register indirect with multiplier:
 
  move.l (a0*4),d0 -or- move.l (d2*4),d1
 

 
  Yes, this is supported and these address modes are free.
  They do NOT cost any clock extra.

The worst part is that I'm not even sure whether I was joking or not. (-; I understand that, by the design of the CPU, this should be faster than the 020 implementation of the same mode (actually, indirect with outer displacement, index register with shift and suppressed base register), but having a construct in the CPU just to support BCPL still looks to me like using a power drill to wind up your grandfather clock. (-;

 
Gunnar von Boehn wrote:
 
  All clear now?
  Cheers

It was mostly clear to begin with, already; I wasn't totally serious about the "BCPL address mode" (which is, in a way, existing, of course).

Just the size of the extension word for the data register "indirect" mode (actually, a different mode, see above) is something I should look up. My guess is still that it takes longer because one additional instruction word needs to be fetched.

Greetings,
Thomas
 

Thomas Richter
Germany

Posts 699
09 Mar 2010 22:09


Team Chaos Leader wrote:

 
Thomas Richter wrote:

  Hey, an idea! The following *would* be actually useful: Address register indirect with multiplier:
 
  move.l (a0*4),d0 -or- move.l (d2*4),d1
 

 
Gunnar von Boehn wrote:
 
  Yes, this is supported and these address modes are free.
  They do NOT cost any clock extra.
 
 

  Maybe I am mixed up, but I thought Thor was inventing a brand new ASM BCPL addressing mode?
 
  Did you use your time-machine to back-support this new mode? (:
 

See above. I wasn't completely serious. Or probably, not completely sane. Whatever you prefer. (-: But that mode does exist, as a special case of the extended 020 addressing modes.

So long,
Thomas


Team Chaos Leader
USA
(Natami Team Member)
Posts 1199
10 Mar 2010 01:11


@Gunnar

So you are saying that these are both the same speed on 050?


move.l (d0),d1
move.l (a0),a1

Is there such a thing as data-register indirect with post-increment?


move.l (d0)+,d1



Matt Hey
USA

Posts 204
10 Mar 2010 02:41


Thomas Richter wrote:
 
  quickly (via the 020 addressing mode of move.l dn(za0,d0),reg), then this would do it. IIRC, the extension is one word longer, so it is still not quite as efficient, is it? Sorry, I forgot that part.

The brief extension word (d8,An,Xn.SIZE*SCALE) and full extension word (bd,An,Xn.SIZE*SCALE) are both a word. It's only when there is a base displacement or an outer displacement of the latter that the instruction grows in size with the added extensions. The scale size is encoded in the brief and full extension words so does not require an extension. These 2 extension words are very similar with bits 9-15 being practically the same. I use Frank Wille's vasm assembler which doesn't require typing the zero register and can optimize into the most efficient encoding. Everything looks much cleaner without those silly zero registers, just like you typed in your first examples. That's how my adopted disassembler ADis outputs which reassembles with vasm without changes. Check out vasm if you haven't before. It's very good.
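
As a rough illustration of the layout described above, the following Python sketch packs the two extension words using the bit positions of the 68020 indexed addressing format. The function names are made up for this example, and the full-format helper only covers the simple case with no base displacement and no memory indirection.

```python
SCALE_BITS = {1: 0, 2: 1, 4: 2, 8: 3}


def _index_bits(index_reg, is_addr, long_index, scale):
    """Bits 15-9, shared by both formats: D/A, register, W/L, scale."""
    word = (int(is_addr) << 15) | ((index_reg & 7) << 12)
    word |= int(long_index) << 11
    word |= SCALE_BITS[scale] << 9
    return word


def brief_ext(index_reg, is_addr=False, long_index=True, scale=1, d8=0):
    """Brief extension word for (d8,An,Xn.SIZE*SCALE); bit 8 is 0."""
    if not -128 <= d8 <= 127:
        raise ValueError("d8 must fit in a signed byte")
    return _index_bits(index_reg, is_addr, long_index, scale) | (d8 & 0xFF)


def full_ext_no_disp(index_reg, is_addr=False, long_index=True, scale=1,
                     suppress_base=True):
    """Full extension word with a null base displacement, no memory indirect."""
    word = _index_bits(index_reg, is_addr, long_index, scale)
    word |= 1 << 8                      # bit 8 = 1 selects the full format
    word |= int(suppress_base) << 7     # BS: base register suppressed
    word |= 0b01 << 4                   # BD SIZE = 01: null displacement
    return word                         # IS = 0 and I/IS = 000 stay clear


if __name__ == "__main__":
    print(hex(brief_ext(0)))                   # index D0.L*1, d8 = 0
    print(hex(full_ext_no_disp(2, scale=4)))   # index D2.L*4, base suppressed
```

Both helpers return a single 16-bit word, which matches the point that the scale factor alone does not make the instruction longer.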


Gunnar von Boehn
Germany
(Natami Team Member)
Posts 3738
10 Mar 2010 07:36


Team Chaos Leader wrote:

@Gunnar
 
  So you are saying that these are both the same speed on 050?
 
 

  move.l (d0),d1
  move.l (a0),a1
 

Yes, both lines take 1 cycle (assuming cache hit).
But one type of them is 2 bytes longer than the other.
The instruction move.l (d0),d1 is 4 bytes long
The instruction move.l (a0),a1 is 2 bytes long.

 
Team Chaos Leader wrote:

  Is there such a thing as data-register indirect with post-increment?
 

  move.l (d0)+,d1
 

 

No, such a mode does not exist.

I would not recommend using the (Dn)-mode for work loops doing something like this.


.loop
  move.l (d0),d1
  addq.l #4,d0
  dbra d7,.loop

The post-increment is done in the EA-ALU in the upper part of the pipeline. ADDA and SUBA are also done in the upper EA-ALU.
ADD on Dn is done in the lower MAIN-ALU, which can also set the flags.
The EA-ALU can forward to the next instruction without a penalty. This means that an update of An followed by direct use of that An in the next instruction works without drawback.
An update of Dn followed by use of that Dn in an EA calculation in the next clock cannot work, because the MAIN-ALU is 2 pipeline steps further away from the EA calculation. This update penalty rule holds for all 68K and Coldfire CPUs.
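
The forwarding rule above can be sketched as a toy issue model in Python. This is only an illustration under stated assumptions: one instruction issues per clock, a register written by the EA-ALU is usable for address calculation in the next clock, and a register written by the MAIN-ALU becomes usable `main_alu_delay` clocks later (set to 2 here, taken from the "2 pipeline steps" remark; the real bubble length is not given in the thread). Branch costs of dbra are ignored.

```python
def total_cycles(program, main_alu_delay=2):
    """Count clocks for a straight-line instruction list.

    program: list of (ea_reads, writes) where ea_reads is a list of
    registers used for address calculation and writes maps each written
    register to the unit that produces it, "EA" or "MAIN".
    """
    ready = {}      # register -> first clock an EA calculation may read it
    cycle = -1
    for ea_reads, writes in program:
        # Issue one clock after the previous instruction, or later if an
        # address register operand is not available yet (a bubble).
        cycle = max([cycle + 1] + [ready.get(r, 0) for r in ea_reads])
        for reg, unit in writes.items():
            extra = main_alu_delay if unit == "MAIN" else 0
            ready[reg] = cycle + 1 + extra
    return cycle + 1


# move.l (d0),d1 / addq.l #4,d0 / dbra d7,.loop
dn_loop = [(["d0"], {"d1": "MAIN"}), ([], {"d0": "MAIN"}), ([], {"d7": "MAIN"})]
# move.l (a0),d1 / addq.l #4,a0 / dbra d7,.loop
an_loop = [(["a0"], {"d1": "MAIN"}), ([], {"a0": "EA"}), ([], {"d7": "MAIN"})]

if __name__ == "__main__":
    print(total_cycles(dn_loop * 2), total_cycles(an_loop * 2))
```

Under these assumptions the Dn-based loop picks up a bubble on every iteration after the first, while the An-based loop issues one instruction per clock.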
 

If you run out of ADDRESS registers, there is sometimes a simple trick which allows you to "borrow" more pointers.

Let's say you want to code a memcopy but have only 1 ADDRESS register free.

Then you can do the following.
D0 = PTR to Source
A0 = PTR to Destination.
D1 = Loopcounter


sub.l A0,D0  ; D0 is now the difference between src and dst
.loop:
  move.l (A0,D0),(A0)+
  dbra d1,.loop

This is a simple way to create more ADDRESS registers.
This usage of registers does not create any bubbles as the OFFSET in D0 does not need to be updated in such workloops.
This means the above is just as fast as using 2 address registers.

Maybe this simple trick can help someone.
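
To check that the offset trick really copies the right data, here is a small Python model of the loop. It is a sketch, not CPU-accurate code: memory is a bytearray, registers are plain integers masked to 32 bits, the index is treated as a full 32-bit value, and the function name is invented for this example.

```python
MASK32 = 0xFFFFFFFF


def copy_with_offset_trick(mem, src, dst, longwords):
    """Model of: sub.l A0,D0 / .loop: move.l (A0,D0),(A0)+ / dbra d1,.loop"""
    a0 = dst
    d0 = (src - a0) & MASK32            # sub.l A0,D0: D0 = src - dst (mod 2^32)
    d1 = longwords - 1                  # dbra runs the body d1 + 1 times
    while True:
        ea = (a0 + d0) & MASK32         # (A0,D0): always equals the source pointer
        mem[a0:a0 + 4] = mem[ea:ea + 4] # move.l ...,(A0)
        a0 = (a0 + 4) & MASK32          # post-increment advances both pointers
        d1 -= 1
        if d1 < 0:
            break
    return a0


if __name__ == "__main__":
    mem = bytearray(range(32))
    copy_with_offset_trick(mem, src=0, dst=16, longwords=4)
    print(mem[16:32] == mem[0:16])
```

Because D0 holds the constant difference between the two pointers, incrementing A0 moves source and destination together, which is why D0 never has to be updated inside the loop.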

Claudio Wieland
Germany
(Natami Team Member)
Posts 365
10 Mar 2010 08:36


It helped me ;-)) .

Thomas Richter
Germany

Posts 699
10 Mar 2010 19:26


Gunnar von Boehn wrote:

Team Chaos Leader wrote:

  @Gunnar
 
  So you are saying that these are both the same speed on 050?
 
 

  move.l (d0),d1
  move.l (a0),a1
 

 

 
  Yes, both lines take 1 cycle (assuming cache hit).
  But one type of them is 2 bytes longer than the other.
  The instruction move.l (d0),d1 is 4 bytes long
  The instruction move.l (a0),a1 is 2 bytes long.

No, sorry, I still don't get this. Are you trying to say that the *execution* stage of the instruction takes one cycle, or that the complete overall *throughput* is one cycle?

I wouldn't understand the latter. After all, the CPU would have to fetch and decode one additional word from memory, namely the extended instruction word. How could an instruction that is twice as wide have the same throughput? It would require twice the bandwidth for the instruction (code pipeline), probably making it about 50% slower overall due to bandwidth limitations?

Ok, an example to make my point more clear:

Given move.w (a0),d0, the CPU has to fetch one word (the instruction) and another word (the data), requiring a total bandwidth of four bytes for this instruction. If that executes in one cycle, that is four bytes per cycle (in the cache-hit case).

Given move.w (d0),d1, the CPU has to fetch two words (the instruction and its extension word), and another word (the data), requiring in total six bytes, thus 2/4 = 50% more. If that instruction is *also* taking one cycle, where's the increased bandwidth coming from?

Confused,

Thomas



One Thousand
USA
(Natami Team Member)
Posts 716
10 Mar 2010 20:15


Yes, he means the throughput is one cycle.
 
How?  The 050 cache is made to give burnt offerings of up to 8 bytes a clock.  And the rest of the pipeline is made to flow with it all.

Gunnar von Boehn
Germany
(Natami Team Member)
Posts 3738
10 Mar 2010 20:19


Thomas Richter wrote:

Gunnar von Boehn wrote:

  Yes, both lines take 1 cycle (assuming cache hit).
  But one type of them is 2 bytes longer than the other.
  The instruction move.l (d0),d1 is 4 bytes long
  The instruction move.l (a0),a1 is 2 bytes long.
 

  No, sorry, I still don't get this. Are you trying to say that the *execution* stage of the instruction takes one cycle, or that the complete overall *throughput* is one cycle?

Both.
Throughput 1 instruction per clock.

Thomas Richter wrote:

I wouldn't understand the latter. After all, the CPU would have to fetch and decode one additional word from memory, namely the extended instruction word.

But this would only be a problem if the CPU ICache were limited to 2 bytes per clock.

This was the case on the 68000/68020/68030.

The 68050 loads 8 bytes per clock from the ICache.
Therefore it makes no difference whether the instruction is 2 bytes or 4 bytes long.

 

Thomas Richter wrote:

  How could a instruction that is twice as wide have the same throughput? It would require twice the bandwidth for the instruction (code pipeline),

This would only be a problem if the instruction is longer than the bandwidth of the CPU.

The 68040 did have a bandwidth of 8 bytes per clock.
The 68060 did have a bandwidth of 4 bytes per clock.
The 68050 has a bandwidth of 8 bytes per clock, and we want to double this in preparation for the 070.

The above numbers are for ICache.
The Datacache bandwidth comes in addition to this.

Does this answer your question? :-)
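
The bandwidth argument can be written down as a one-line calculation. The sketch below uses the ICache figures quoted in this post and assumes, for simplicity, that fetch is the only limit and that instructions are ideally aligned; the dictionary and function names are invented for this example.

```python
import math

# ICache bytes per clock, as quoted in the post above.
ICACHE_BYTES_PER_CLOCK = {"68040": 8, "68060": 4, "68050": 8}


def fetch_clocks(instr_bytes, cpu):
    """Minimum clocks needed just to fetch an instruction of this size."""
    return math.ceil(instr_bytes / ICACHE_BYTES_PER_CLOCK[cpu])


if __name__ == "__main__":
    for size in (2, 4, 8):
        print(size, {cpu: fetch_clocks(size, cpu) for cpu in ICACHE_BYTES_PER_CLOCK})
```

With 8 bytes per clock, a 2-byte and a 4-byte instruction both fit within a single fetch, so the extra extension word does not reduce throughput in this simplified view.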

Thomas Richter
Germany

Posts 699
10 Mar 2010 20:34


Gunnar von Boehn wrote:

  Does this answer your question? :-)


Ah, thanks, indeed! So the program is possibly longer, but the bandwidth is high enough to satisfy the pipeline all the time.

Greetings,
Thomas


Gunnar von Boehn
Germany
(Natami Team Member)
Posts 3738
10 Mar 2010 21:45


Thomas Richter wrote:

Gunnar von Boehn wrote:

 
  Does this answer your question? :-)
 

  Ah, thanks, indeed! So the program is possibly longer, but the bandwidth is high enough to satisfy the pipeline all the time.
 
  Greetings,
  Thomas

Yes, the bandwidth is high enough on the 68050.

For some CPUs this was not the case, and increasing the instruction size slowed these CPUs down. So your concern was very valid for some older CPUs.

The 68000 needed 4 clocks for each 2 bytes.
The 68020 and 68030 needed 2 clocks for each 2 bytes.
AFAIK the TG68 also needs 2 clocks for each 2 bytes.

On the 68050 this is not the case.
Even long and complex instructions like
MOVE.L #$12346789,$20(A0,D0.L*8)
take only 1 clock.

Cheers :-)
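
For reference, the length of the example instruction can be added up by hand: one opcode word, a 32-bit immediate, and one brief extension word (the displacement $20 fits in 8 bits), giving 8 bytes. The Python sketch below does that sum and applies the clocks-per-word figures quoted in this post to estimate the fetch cost on the older CPUs. The helper names are invented, and the result covers instruction fetch only, not execution time.

```python
# Clocks needed per 2-byte instruction word, as quoted in the post above.
CLOCKS_PER_WORD = {"68000": 4, "68020": 2, "68030": 2}


def instruction_bytes(immediate_bytes=0, extension_words=0):
    """Opcode word plus immediate data plus 16-bit extension words."""
    return 2 + immediate_bytes + 2 * extension_words


def fetch_only_clocks(nbytes, cpu):
    """Clocks spent fetching the instruction words, nothing else."""
    return (nbytes // 2) * CLOCKS_PER_WORD[cpu]


if __name__ == "__main__":
    # MOVE.L #imm32,$20(A0,D0.L*8): 4-byte immediate, 1 brief extension word
    size = instruction_bytes(immediate_bytes=4, extension_words=1)
    print(size, fetch_only_clocks(size, "68000"), fetch_only_clocks(size, "68020"))
```

At 8 bytes the instruction exactly fills one 8-byte ICache fetch, which is consistent with the single-clock figure given for the 68050.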

Team Chaos Leader
USA
(Natami Team Member)
Posts 1199
10 Mar 2010 23:17



  The 68050 does load 8 bytes per clock from the ICache.
  Therefore it makes no difference whether the instruction is 2 bytes or 4 bytes long.

 
  So this means that when Jens finishes cooking up the superscalar 68070 that we can then execute both of these instructions simultaneously in 1 clock?
 

  move.l (d0),d1
  move.l (d2),d3
 

 

Matt Hey
USA

Posts 204
11 Mar 2010 04:58


@TCL
  The 68060 can do both in 2 cycles. (d8,An,Xi*SF) is free but (bd,An,Xi*SF) takes an additional cycle even though both instructions are 4 bytes. If the base address register could be suppressed (it can't) with (d8,An,Xi*SF) then the 68060 would be able to do both in 1 cycle. It's really not that inefficient on the 68060 but pretty slow on 68040 and below.
 
  Note that move.l (d0),d1, move.l (d2*4),d1 and move.l (a0,d0),(a0)+ will assemble without warning in vasm (and probably other assemblers), but the index defaults to word size, giving move.l (d0.w),d1 and move.l (d2.w*4),d1. You and Thomas would need to write move.l (d0.l),d1 and move.l (d2.l*4),d1 if using a data register as a pointer. I believe Gunnar's example should also be move.l (a0,d0.l),(a0)+.
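
Why the index size matters can be seen with a quick Python model of the effective address calculation. A word-sized index uses only the low 16 bits of the register, sign-extended to 32 bits, so a full pointer held in a data register is truncated. The function names are invented for this sketch and the base register is modeled as a plain integer.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_16(value):
    """Sign-extend the low 16 bits of value to a Python integer."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


def index_ea(base, index_reg, long_index, scale=1):
    """Effective address of (base,Xn.SIZE*SCALE), wrapped to 32 bits."""
    idx = (index_reg & MASK32) if long_index else sign_extend_16(index_reg)
    return (base + idx * scale) & MASK32


if __name__ == "__main__":
    d0 = 0x00248000                                  # a pointer held in D0
    print(hex(index_ea(0, d0, long_index=True)))     # (d0.l)
    print(hex(index_ea(0, d0, long_index=False)))    # (d0.w)
```

With the .l index the pointer is used unchanged; with the .w default the upper half is lost and the low half is sign-extended, so the access lands at an unrelated address.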


Matt Hey
USA

Posts 204
11 Mar 2010 05:57


Gunnar von Boehn wrote:

  The post-increment is done in the EA-ALU in the upper part of the pipeline. ADDA and SUBA are also done in the upper EA-ALU.
  ADD on Dn is done in the lower MAIN-ALU, which can also set the flags.
  The EA-ALU can forward to the next instruction without a penalty. This means that an update of An followed by direct use of that An in the next instruction works without drawback.
 

 
  Are addq and subq to an address register done in the upper EA-ALU? How about the new instruction variations like AND and MUL to an address register?


Gunnar von Boehn
Germany
(Natami Team Member)
Posts 3738
11 Mar 2010 06:19


Matt Hey wrote:

Gunnar von Boehn wrote:

    The post-increment is done in the EA-ALU in the upper part of the pipeline. ADDA and SUBA are also done in the upper EA-ALU.
    ADD on Dn is done in the lower MAIN-ALU, which can also set the flags.
    The EA-ALU can forward to the next instruction without a penalty. This means that an update of An followed by direct use of that An in the next instruction works without drawback.
 

 
  Are addq and subq to an address register done in the upper EA-ALU?

Yes, this is how the better 68K CPUs are designed.
This is how the 68040,68050, and 68060 are designed.
This is also how the Coldfire V4, and V5 are designed.

Matt Hey wrote:

How about the new instruction variations like AND and MUL to an address register?

Instructions which previously could only write to DATA registers are always executed in the MAIN-ALU. If such an instruction is now enhanced to also write to ADDRESS registers, this is of course done in the MAIN-ALU too.

Cheers
Gunnar


