
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Welcome to the Natami lounge.
Meet new AMIGA friends here and enjoy having a friendly chit chat.

OK Teamers, Could Someone Show Us the Progress?
Matt Hey
USA

Posts 735
22 Mar 2012 00:04


wawa tk wrote:

 
Rune Stensland wrote:

    CF instructions are left out. But they can be added. 020+ indirect addressing modes might be moved out to emulation/trapped.
 

   
  im fine with that so far, this would only introduce incompatibility to existing hardware.
 

 
  The ColdFire instruction additions to the N68k will NOT cause compatibility problems. The 68k of the N68k will work 100% the same as any other 68020+ CPU. ColdFire compatibility can't be 100% on the N68k though.
 
  68020+ software = 100% compatibility
  ColdFire software = maybe 97% compatibility
 
  It's fine that the CF instructions are not included yet. It may be beneficial to have a simpler base line for testing. It's good to get the core out faster too.
 
  Thanks for the updates Rune :).
 
 
comp arch wrote:

  Yes, you're right again. Anyone remember the CALLM or RTM instructions? Has anybody ever used them? Or at least know how they work? :)

 
  Don't use CALLM or RTM on an Amiga. The N68k may use that encoding space in the future. They were slow and a mistake anyway.
 
 
comp arch wrote:

  Is there already a list of all opcodes that can be fused? Or is there still room left for additions?

 
  There is room for more additions if they are common. There were several threads where fusion candidates were discussed but here is the main one:
 
  CLICK HERE 

 
comp arch wrote:

  Could I ask you, please, whether the instruction 'MOVEQ' will be fused with other instructions as well? I'm especially thinking of the combination
        MOVEQ  #0, d0
        RTS
 

 
  I believe Samuel's answer is correct (no fusion possible) but the rts should be much faster on the N68k with the addition of a link stack. The rts is 7 cycles on the 68060 but should be about 2 cycles on the N68k with a link stack. The link stack is talked about in the fusion thread above.
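The link stack idea can be sketched as a small return-address predictor. This is only an illustrative model, not the N68k design: the class name `LinkStack`, the depth of 8 entries and the trace format are assumptions made for the example.

```python
class LinkStack:
    """Toy return-address predictor (hypothetical model, not the N68k logic)."""

    def __init__(self, depth=8):
        self.depth = depth          # assumed number of entries
        self.entries = []

    def on_call(self, return_address):
        # A jsr/bsr pushes the address of the instruction after the call.
        if len(self.entries) == self.depth:
            self.entries.pop(0)     # the oldest entry is lost on overflow
        self.entries.append(return_address)

    def predict_return(self):
        # An rts pops the most recent entry as the predicted target.
        return self.entries.pop() if self.entries else None


def count_correct_predictions(trace, depth=8):
    """trace is a list of ("call", return_address) or ("ret", actual_target)."""
    stack = LinkStack(depth)
    correct = 0
    for kind, address in trace:
        if kind == "call":
            stack.on_call(address)
        elif stack.predict_return() == address:
            correct += 1
    return correct


# Two nested calls followed by their returns.
nested = [("call", 0x1004), ("call", 0x2008), ("ret", 0x2008), ("ret", 0x1004)]
print(count_correct_predictions(nested))
```

When the prediction is right, the fetch unit can start at the return target without waiting for the stack read, which is where a saving over the 68060's rts would come from.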
 

Wawa Tk
Germany

Posts 581
22 Mar 2012 01:23


Matt Hey wrote:

  The ColdFire instruction additions to the N68k will NOT cause compatibility problems. The 68k of the N68k will work 100% the same as any other 68020+ CPU. ColdFire compatibility can't be 100% on the N68k though.
 

they would encourage backwards incompatibility by their sole availability. this i would prefer to avoid as long as it is not really bringing an unbelievable boost.
i'm sick of all these cpu dependent versions of software on amiga.


Comp Arch

Posts 33
22 Mar 2012 02:10


Samuel D Crow wrote:

As nice as it would be, it would not be using 3 operand encoding nor predication so it cannot be fused.

Ah, I see. Thanks for the explanation. I hoped there were different internal buses for data and address registers. Well, maybe they will merge in the next superscalar version. :) Anyway, fusing opcodes like
  MOVEQ #value, Dx
  ADD.L d16(Ax), Dx
is already a great advancement!
I guess NOT is also included in the list of operations, right?
  MOVE.W D2, D1  ; common sequence for generating a bit mask
  NOT.W D1
And also SWAP, which is often used like this
  MOVE.L Dx, Dy
  SWAP Dy
Adjusting fixed-point arithmetic results is often done by
  ADD.L D0, D0
  SWAP D0
or
  ASL.L #2, D0
  SWAP D0
Could a fusion in these cases work although all the 'three' operands are the same and thus depend on each other? (I guess no, but I'm no expert.)

BTW: What about the LineA opcodes? Will they cause an exception? AFAIK some old programs depend on them. (IIRC adventures by Magnetic Scrolls. There was also a flight simulator that crashed for this reason, but I'm not sure. It's so long ago...)

Matt Hey wrote:

The ColdFire instruction additions to the N68k will NOT cause compatibility problems. The 68k of the N68k will work 100% the same as any other 68020+ CPU. ColdFire compatibility can't be 100% on the N68k though.

According to my manual e.g. the ColdFire instruction MOVE to Accumulator (MOVE.L Ry, ACC) has the instruction format
  1010 0001 00 mmmmmm
and will thus cause a conflict with the LineA handling. (Unless, of course you trap these ACC instructions via LineA to emulate them...) Sorry I have to ask ('cause I really don't know): is ColdFire compatibility essential or somehow needed?
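The conflict is easy to check from the bit pattern alone: any opcode word whose top four bits are 1010 lands in the Line A space. A minimal sketch (the function name `is_line_a` is made up for this example, and the `mmmmmm` field is simply set to zero):

```python
def is_line_a(opcode_word):
    """True if a 16-bit opcode word falls in the 68k Line A space (top nibble 1010)."""
    return ((opcode_word >> 12) & 0xF) == 0xA


# Pattern quoted above: 1010 0001 00 mmmmmm, shown here with mmmmmm = 000000.
move_to_acc = 0b1010_0001_0000_0000

print(hex(move_to_acc), is_line_a(move_to_acc))
print(hex(0x7000), is_line_a(0x7000))   # moveq #0,d0
print(hex(0x4E75), is_line_a(0x4E75))   # rts
```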

Matt Hey wrote:

Don't use CALLM or RTM on an Amiga.

Don't worry, I won't. :) Never used them, and I doubt anyone did.

Matt Hey wrote:

They were slow and a mistake anyway.

At last, I found someone who actually knows what they were doing. :)

Matt Hey wrote:

The rts is 7 cycles on the 68060 but should be about 2 cycles on the N68k with a link stack.

2 cycles only? That's real fast and would help a lot.
In a previous post I asked about the typical combination of
  JSR library_offset(A6) ; A6 = pointer to jump list
  ...

<jump list>
==> JMP label .L

label:    ; some library routine here
  RTS

Could I ask you, how does the 68050 handle such an indirect subroutine call? Is there any influence on the pipeline or are branches with a fixed address free of charge?
Oh, and please let me know if I ask too many questions... (I know I do)


Matt Hey
USA

Posts 735
22 Mar 2012 04:58


wawa tk wrote:

Matt Hey wrote:

    The ColdFire instruction additions to the N68k will NOT cause compatibility problems. The 68k of the N68k will work 100% the same as any other 68020+ CPU. ColdFire compatibility can't be 100% on the N68k though.

 
  they would encourage backwards incompatibility by their sole availability. this i would prefer to avoid as long as it is not really bringing an unbelievable boost.
  im sick of all these cpu dependent versions of software on amiga.

The 68k with CF and additional changes adds modern features that were not popular back in the day but are common and expected today. They make programming easier and compiler support easier. CF support in compilers can easily be enabled and most existing CF software can be used. The code density is improved to what I expect to be one of the best available for any modern CPU and a good selling point for low end systems. The speed is improved and smaller caches and memory are needed. Individually, the advantages are small but, as a whole, the advantages are obvious IMHO. I would like to see a standard ISA that is supported by Natami and FPGA Arcade to minimize the number of variations and encourage their use. We must plan for a positive future. It's the same reason the Natami is developed with modern advances and adds chunky screen modes and 3D to the Amiga. It's what the Amiga should have been if development had gone on. It's the Amiga and Natami spirit :).

comp arch wrote:

  According to my manual e.g. the ColdFire instruction MOVE to Accumulator (MOVE.L Ry, ACC) has the instruction format
  1010 0001 00 mmmmmm
  and will thus cause a conflict with the LineA handling. (Unless, of course you trap these ACC instructions via LineA to emulate them...) Sorry I have to ask ('cause I really don't know): is ColdFire compatibility essential or somehow needed?

The CF MAC instructions will not be supported. The CF instructions planned for support are bitrev, byterev, ff1, mov3q, mvs, mvz and sats. rems/remu would interfere with divsl/divsu so they can't be used without changing the encoding. None of the additions should use line A encodings, which will be left open if possible for Macintosh 68k emulation. All 68k instructions will be done 68k style including stack use. Some CF instructions will likely have more usable addressing modes than the CF supports. CF additions are not necessary but there shouldn't be any 68k compatibility problems and there are many benefits.

comp arch wrote:

  In a previous post I asked about the typical combination of
  JSR library_offset(A6) ; A6 = pointer to jump list
  ...
 
  <jump list>
  ==> JMP label .L
 
  label:    ; some library routine here
  RTS
 
  Could I ask you, how does the 68050 handle such an indirect subroutine call? Is there any influence on the pipeline or are branches with a fixed address free of charge?

This was discussed in a thread but there was not a definitive answer on the best way to handle library function tables. They will probably be a few cycles faster than a 68060 but not several times faster like rts. It may be possible to detect and fold out (eliminate=0 cycles) the jmp instruction sometimes. The jmp would probably have to be in cache (caches are much bigger) and the library base may have to be in a6 several instructions before the jsr. You certainly want to load the a6 library base as early as possible so change:

  moveq #0,d0
  move.l execbase,a6
  jsr (-$24,a6)

to:

  move.l execbase,a6
  moveq #0,d0
  jsr (-$24,a6)

Loading address registers that can be used as pointers first is generally best for 68k family processors as it can avoid or reduce change/use delays in some cases. There may be a branch cache added at some point to the N68k which should help as the target is usually the same from each jsr. Gunnar seemed like he did not want to add a branch cache for some reason though. The branch cache on the 68060 works great. A branch cache on the N68k with function table recognition might be able to fold out the jsr (except stack use) and jmp. Don't expect it any time soon but I have been pleasantly surprised lately with the superscalar announcement ;).
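The effect of the reordering can be illustrated with a toy model. Everything numeric here is an assumption: `ADDRESS_LATENCY` is a made-up figure for how many instructions must separate a register write from its use as a pointer, not a measured 68060 or N68k value, and the tuple format is invented for the example.

```python
# Toy model only: ADDRESS_LATENCY is an assumed figure, not a real timing.
ADDRESS_LATENCY = 2


def stall_cycles(program, address_register, latency=ADDRESS_LATENCY):
    """program is a list of (text, written_register, address_register_used)."""
    last_write = None
    stalls = 0
    for index, (_text, written, used) in enumerate(program):
        if used == address_register and last_write is not None:
            gap = index - last_write - 1   # instructions between write and use
            stalls += max(0, latency - gap)
        if written == address_register:
            last_write = index
    return stalls


original = [
    ("moveq #0,d0", "d0", None),
    ("move.l execbase,a6", "a6", None),
    ("jsr (-$24,a6)", None, "a6"),
]
reordered = [original[1], original[0], original[2]]

print(stall_cycles(original, "a6"), stall_cycles(reordered, "a6"))
```

Under these assumptions the reordered sequence stalls less, because the moveq fills one of the slots between loading a6 and using it.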


Team Chaos Leader
USA
(Moderator)
Posts 2094
22 Mar 2012 06:27


@Matt Hey

I love the 060 branch cache as much as anyone.

But the interesting thing about the branch cache on the 060 is that by having it enabled, you limit the maximum MHz of the 060 by a few MHz.

This is according to Rune's hardcore clocking experiments.



Marcel Verdaasdonk
Netherlands

Posts 3979
22 Mar 2012 08:17


I am not sure what Rune tested but yes, the branch cache limits the maximum clock rate.
This is because these caches are synchronous with the CPU.

Nixus Minimax
Germany

Posts 273
22 Mar 2012 09:23


Rune Stensland wrote:

 

  RegToReg:      2662 (6%)
  MemToReg:      4414 (11%)
  RegToMem:      2505 (6%)
  Branch/Lea #:  6877 (17%)
  One reg:        848 (2%)
  MemToMem:      1364 (3%)
  No reg:        1151 (2%)
 

What are the other 53%?
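The 53% comes from adding up the percentages in the quoted list. A quick check, using only the figures as posted:

```python
# Percentages as quoted in the list above.
listed = {
    "RegToReg": 6,
    "MemToReg": 11,
    "RegToMem": 6,
    "Branch/Lea #": 17,
    "One reg": 2,
    "MemToMem": 3,
    "No reg": 2,
}

covered = sum(listed.values())
print(covered, 100 - covered)
```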


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
22 Mar 2012 10:31


Nixus Minimax wrote:

  What are the other 53%?
 

 
  As usual I was too quick to post the results. Let me fix the bugs and publish the exe file so you can play with it yourself.

Nixus Minimax
Germany

Posts 273
22 Mar 2012 10:36


Rune Stensland wrote:
As usual I was too quick to post the results.

:)

I really wasn't sure whether there could be something more that wasn't listed, so I preferred to ask.


Matt Hey
USA

Posts 735
22 Mar 2012 13:20


Team Chaos Leader wrote:

  I love the 060 branch cache as much as anyone.
 
  But the interesting thing about the branch cache on the 060 is that by having it enabled, you limit the maximum MHz of the 060 by a few MHz.
 
  This is according to Rune's hardcore clocking experiments.

Marcel Verdaasdonk wrote:

I am not sure what Rune tested but yes, the branch cache limits the maximum clock rate.
  This is because these caches are synchronous with the CPU.

Hmm. Interesting. I would think the branch cache would be as fast as the CPU caches and able to always keep up. Any delay should be hidden in the pipeline. The 68060 has 0 cycle branches taken on branch cache hits and that's with virtual address lookups. It makes even more sense on the N68k where there are no virtual addresses and the branch cache should not need to be flushed during task switching. All modern CPUs have much higher clock rates and virtual addresses with branch caches but maybe they have longer pipelines to hide the branch cache accesses? I would think that the minimum pipeline length to provide 0 cycle loop branches and 0 cycle predicted branches would be optimum. What is more important than reducing branch overhead in a modern CPU? If code doesn't use many branches then send it to a modern GPU ;).


Nixus Minimax
Germany

Posts 273
22 Mar 2012 13:40


Matt Hey wrote:
  Hmm. Interesting. I would think the branch cache would be as fast as the CPU caches and able to always keep up.

I would guess that there is a deliberate design limitation. If you read from I or Dcache, you always have to consider the case that the requested data aren't in the cache. Thus, you implement the possibility of a pipeline stall. Now if you overclock, it won't really matter whether the (cached) data is available within a cycle or whether it is two. The interface is flexible and designed to interface to different memory subsystems.

The branch cache, on the other hand, is part of the pipeline logic. You won't clock the logic higher than what you find in your tests to be the highest possible clock rate. If you clock just a tiny bit too high, you would get a pipeline stall for each branch. This would hurt the processing speed much more than you could ever gain from the extra clock rate. Thus, there is no reason to make this data path between the branch cache and the pipeline flexible. You would never clock the whole thing any higher than what can be done in the clock period. I expect that the higher clockrate with disabled branch cache shows a lower overall performance. The critical path has to be somewhere. I don't think it is too surprising to find it in the branch prediction which is very important for speed. Also, you wouldn't necessarily notice critical paths in some ALU units such as MUL or DIV which may depend on the data.


Samuel D Crow
USA
(Natami Team)
Posts 1295
22 Mar 2012 15:58


comp arch wrote:

 
Samuel D Crow wrote:

  As nice as it would be, it would not be using 3 operand encoding nor predication so it cannot be fused.
 

  Ah, I see. Thanks for the explanation. I hoped there were different internal buses for data and address registers. Well, maybe they will merge in the next superscalar version. :) Anyway, fusing opcodes like
    MOVEQ #value, Dx
    ADD.L d16(Ax), Dx
  is already a great advancement!
 

  This may have to wait for a future version of the core.  Currently only register-to-register moves are being considered for 3-operand fusion.
 

  I guess NOT is also included in the list of operations, right?
    MOVE.W D2, D1  ; common sequence for generating a bit mask
    NOT.W D1
 

  It's only word length so no.  It won't work.
 

  And also SWAP, which is often used like this
    MOVE.L Dx, Dy
    SWAP Dy
 

  This one should work.
 

  Adjusting fixed-point arithmetic results is often done by
    ADD.L D0, D0
    SWAP D0
  or
    ASL.L #2, D0
    SWAP D0
 

  This one won't work.  It needs to have a register-to-register move at the top.
 

  Could a fusion in these cases work although all the 'three' operands are the same and thus depend on each other? (I guess no, but I'm no expert.)
 

  Not possible, just a waste of memory:
 
  MOVE.L D0, D0 ; wasted memory
  ADD.L D0, D0 ; this line by itself will do the job

Matt Hey
USA

Posts 735
22 Mar 2012 23:24


Nixus Minimax wrote:

  I would guess that there is a deliberate design limitation. If you read from I or Dcache, you always have to consider the case that the requested data aren't in the cache. Thus, you implement the possibility of a pipeline stall. Now if you overclock, it won't really matter whether the (cached) data is available within a cycle or whether it is two. The interface is flexible and designed to interface to different memory subsystems.

 
  So the cache is handled like very fast memory and when the access time is greater than the pipeline slot will allow, a wait state is added to the memory access and the pipeline stalls? Perhaps this explains why the cache performance did not improve linearly on overclocked 68060s. I assumed (most likely incorrectly) that the cache logic would be the last to fail when the clock is increased because of its simple logic compared to everything else in the CPU. The branch cache is quite a bit more complex logic than the cache.
 
 
Nixus Minimax wrote:
 
  The branch cache, on the other hand, is part of the pipeline logic. You won't clock the logic higher than what you find in your tests to be the highest possible clock rate. If you clock just a tiny bit too high, you would get a pipeline stall for each branch. This would hurt the processing speed much more than you could ever gain from the extra clock rate. Thus, there is no reason to make this data path between the branch cache and the pipeline flexible. You would never clock the whole thing any higher than what can be done in the clock period. I expect that the higher clockrate with disabled branch cache shows a lower overall performance. The critical path has to be somewhere. I don't think it is too surprising to find it in the branch prediction which is very important for speed. Also, you wouldn't necessarily notice critical paths in some ALU units such as MUL or DIV which may depend on the data.

 
  I would think that it would be better to add 1 more stage/level to the pipeline when the branch cache logic has trouble completing in its slot in the pipeline causing a pipeline stall and useless bubble. Perhaps the branch cache stage could be broken into 2 stages or complete during 2 stages. Other work could be done in the extra pipeline stage like working on complex addressing modes which would reduce their penalties (like change/use penalties). It's better to have a branch cache that predicts with 95% accuracy and 1 cycle/branch average than no prediction with 60% accuracy and 4 cycles/branch average even if it means extending the pipeline length by 1. The N68k branch prediction unit should be simpler than for the 68060 because of lack of virtual addresses. Maybe this would allow higher clock speeds without the branch cache being the limitation.
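To get a feel for what the two branch figures mean for overall throughput, here is a rough cycles-per-instruction estimate. The 1 and 4 cycles/branch averages are taken from the paragraph above; the 15% branch share and the 1.0 CPI for everything else are assumptions picked only for illustration.

```python
def cycles_per_instruction(branch_fraction, cycles_per_branch, other_cpi=1.0):
    """Average CPI for a mix of branches and non-branch instructions."""
    return (1.0 - branch_fraction) * other_cpi + branch_fraction * cycles_per_branch


BRANCH_FRACTION = 0.15   # assumed share of branches in the instruction mix

with_cache = cycles_per_instruction(BRANCH_FRACTION, 1.0)     # 1 cycle/branch
without_cache = cycles_per_instruction(BRANCH_FRACTION, 4.0)  # 4 cycles/branch

print(round(with_cache, 2), round(without_cache, 2),
      round(without_cache / with_cache, 2))
```

With these assumed inputs the version without prediction needs roughly 45% more cycles for the same code, so one extra pipeline stage would have to cost a lot to cancel that out.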
 
  @comp arch & Samuel
  The code fusion is more flexible and powerful than just a move.l+op.l fusion. It does allow combinations like scc+extb.l and 2 shifts. Values other than long sizes and more than 2 reads and 1 write may not be possible. It's best to read the old threads about code fusion for a better explanation. A list of code fusions in the N68k would be nice.


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
23 Mar 2012 12:07


When I tested the CPU card of the Natami I managed to boot at 113 MHz with the copyback mode switched off, or the branch cache switched off. With the instruction cache and burst enabled it booted at 110 MHz. When the copyback mode or the branch cache is enabled I can boot at a maximum of 105 MHz.

Erik Bauer
Italy

Posts 301
23 Mar 2012 13:58


Rune Stensland wrote:

When I tested the CPU card of the Natami I managed to boot at 113 MHz with the copyback mode switched off, or the branch cache switched off. With the instruction cache and burst enabled it booted at 110 MHz. When the copyback mode or the branch cache is enabled I can boot at a maximum of 105 MHz.

I wonder if booting with not that much more MHz is worth disabling the functions you named.
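The question can be put into numbers with the two clock rates Rune posted. The sketch below only computes the break-even point; the function name and the 10% CPI penalty used in the comparison are hypothetical, since nobody in the thread posted measured CPI figures.

```python
def relative_speed(clock_mhz, cycles_per_instruction):
    """Million instructions per second for a given clock and average CPI."""
    return clock_mhz / cycles_per_instruction


FAST_CLOCK_MHZ = 113   # branch cache or copyback off (figure from the post)
SAFE_CLOCK_MHZ = 105   # branch cache and copyback on (figure from the post)

# How much the average CPI may grow before the higher clock stops paying off.
break_even_cpi_growth = FAST_CLOCK_MHZ / SAFE_CLOCK_MHZ - 1.0
print(f"{break_even_cpi_growth:.1%}")

# Hypothetical case: disabling the features costs 10% more cycles per instruction.
print(relative_speed(SAFE_CLOCK_MHZ, 1.0) > relative_speed(FAST_CLOCK_MHZ, 1.1))
```

So the extra 8 MHz only wins if switching the feature off slows the code down by less than about 7.6% in cycles.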

Louis Dias
USA

Posts 217
23 Mar 2012 14:30


Erik Bauer wrote:

Rune Stensland wrote:

  When I tested the CPU card of the Natami I managed to boot at 113 MHz with the copyback mode switched off, or the branch cache switched off. With the instruction cache and burst enabled it booted at 110 MHz. When the copyback mode or the branch cache is enabled I can boot at a maximum of 105 MHz.
 

 
  I wonder if booting with not that much more MHz is worth disabling the functions you named.

For us end users, it won't matter.  We won't be running the '060 card ... unless you absolutely need an MMU...

Can anyone tell us the current speed of the new cpu core?


Samuel D Crow
USA
(Natami Team)
Posts 1295
23 Mar 2012 15:14


Matt Hey wrote:

@comp arch & Samuel
The code fusion is more flexible and powerful than just a move.l+op.l fusion. It does allow combinations like scc+extb.l and 2 shifts. Values other than long sizes and more than 2 reads and 1 write may not be possible. It's best to read the old threads about code fusion for a better explanation. A list of code fusions in the N68k would be nice.

You are right, Matt.  There are several instances other than 3-operand and predication opportunities for opcode fusion.

While it would be nice to have a list, it would be nicer still to know which ones are actually implemented in the core at this time as well.  Fortunately, Gunnar and Deep Sub Micron are hard at work implementing the core so we'll have to wait until they are done to have a definitive list.

The following encodings are known for sure to be fusible:

3-operand
    move.l d0, d3
    add.l xxx, d3

becomes
    d3= d0 + xxx

predication
    bne.b label
    add.l  d0, d1
label:

becomes
    addeq.l d0,d1

sign extension
    seq d0
    extb.l d0

becomes
    seq.l d0

These are the bare minimum.  There may be others.  The sign extension may use some Coldfire-style codepaths so they might not be implemented yet.
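The three patterns above can be written down as a small peephole pass. This is a sketch of the matching rules only, not of how the N68k decoder does it: the tuple representation, the `("label", name)` marker and the internal name `add3.l` are invented for the example (`addeq.l` and `seq.l` are the names used in the list above).

```python
def fuse(instructions):
    """Peephole pass over (mnemonic, operand, ...) tuples; returns a new list."""
    out = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        after = instructions[i + 2] if i + 2 < len(instructions) else None

        # 3-operand: move.l Dx,Dy + add.l src,Dy  ->  Dy = Dx + src
        if (nxt and cur[0] == "move.l" and nxt[0] == "add.l"
                and cur[2] == nxt[2] and nxt[1] != cur[2]):
            out.append(("add3.l", cur[1], nxt[1], cur[2]))
            i += 2
        # predication: bne.b L + add.l + L:  ->  addeq.l (the label is kept)
        elif (nxt and after and cur[0] == "bne.b" and nxt[0] == "add.l"
                and after == ("label", cur[1])):
            out.append(("addeq.l", nxt[1], nxt[2]))
            i += 2
        # sign extension: seq Dn + extb.l Dn  ->  seq.l Dn
        elif (nxt and cur[0] == "seq" and nxt[0] == "extb.l"
                and cur[1] == nxt[1]):
            out.append(("seq.l", cur[1]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


program = [
    ("move.l", "d0", "d3"), ("add.l", "d2", "d3"),
    ("bne.b", "skip"), ("add.l", "d0", "d1"), ("label", "skip"),
    ("seq", "d0"), ("extb.l", "d0"),
]
for instruction in fuse(program):
    print(instruction)
```

A pair such as add.l d0,d0 followed by swap d0 passes through unchanged, since it matches none of the three rules.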

Wawa Tk
Germany

Posts 581
23 Mar 2012 15:17


Louis Dias wrote:

  For us end users, it won't matter.  We won't be running the '060 card ... unless you absolutely need an MMU...
 


especially that turning off those features means significant performance loss.

  Can anyone tell us the current speed of the new cpu core?

yeah, feed us some stuff..

Samuel D Crow
USA
(Natami Team)
Posts 1295
23 Mar 2012 18:32


Louis Dias wrote:

Can anyone tell us the current speed of the new cpu core?

Probably not until it's ready.  Keep in mind that its clock-speed may be adversely affected by some of the opcodes that are not yet implemented.  The current clock speed is the lowest common denominator of the existing opcodes.

Matt Hey
USA

Posts 735
23 Mar 2012 19:12


Samuel D Crow wrote:

  These are the bare minimum.  There may be others.  The sign extension may use some Coldfire-style codepaths so they might not be implemented yet.
 

 
  Hopefully, some of the CF code paths already exist internally. For example, mvs may be better internally than extb.l and ext.l because it's more general purpose allowing for more code reuse (less logic). The decoder could translate:
 
  extb.l Dn -> mvs.b Dn,Dn
  ext.l Dn -> mvs.w Dn,Dn
 
  Others could share code also but may become larger and I don't know if that would be a problem like ff1 -> bfffo.
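The equivalence behind the extb.l/ext.l to mvs translation can be checked with a few lines that model the register results. This only models the 32-bit value left in Dn, not flags or timing, and the helper names are made up for the example.

```python
MASK32 = 0xFFFFFFFF


def extb_l(dn):
    """68020+ extb.l Dn: sign-extend the low byte to 32 bits."""
    low = dn & 0xFF
    return (low | 0xFFFFFF00) if low & 0x80 else low


def ext_l(dn):
    """68k ext.l Dn: sign-extend the low word to 32 bits."""
    low = dn & 0xFFFF
    return (low | 0xFFFF0000) if low & 0x8000 else low


def mvs(source, size_bits):
    """ColdFire mvs.b / mvs.w: move with sign extension to 32 bits."""
    value = source & ((1 << size_bits) - 1)
    if value & (1 << (size_bits - 1)):
        value -= 1 << size_bits
    return value & MASK32


samples = [0x00000000, 0x0000007F, 0x12345680, 0x00018000, 0xFFFFFFFF]
for value in samples:
    # extb.l Dn behaves like mvs.b Dn,Dn and ext.l Dn like mvs.w Dn,Dn
    print(hex(value), extb_l(value) == mvs(value, 8), ext_l(value) == mvs(value, 16))
```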

Combinations (code fusions) like scc+extb.l may need the equivalent of a new instruction internally (or use a longer more generic version of scc internally). While it's better to share code/logic if possible, this is common, important to be fast to avoid branches and the result can be forwarded to another OEP (integer unit) as it turns into a long instruction.

