
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



Do you have ideas and feature wishes? Post them here and discuss your ideas.

N68k Enhancements Revisited
Wojtek P
Poland

Posts 1597
14 Apr 2011 09:24


Matt Hey wrote:

Wojtek P wrote:

  As usual - attacks from your side, then i would be accused for attacks.
 
  Not the first time. YOU don't read what i write.

 
  Wojtek, please can you be more tactful?

Exactly as i say. I am attacked and then - attacked again.
Does this have to repeat once again.
No i do not agree to be treated as trash as you do.
Why do you feel as "higher entities" to treat me that way?
Who gave you such rights.


Matt Hey
USA

Posts 734
14 Apr 2011 14:39


S P wrote:

Matt Hey wrote:

    After the SMC initialization, cache clearing and cache reloading, much of the gain of SMC will have gone away.
   

   
    This might be true for some problems. When I optimized the loop by storing values in the code that would execute 26 clocks later, I introduced a problem I would like to have solved. Jens proposed a solution for this. This is good. The hardware developer is able to read my code and solve my problem. Then we end up with a faster chip.

There are places where SMC looks like the simple solution but it's not as simple as it appears. I don't ever use it. Flushing the appropriate caches after SMC is supported of course. I also have no problem with methods that make this more efficient if it can be done in a way that will be compatible in the future.

S P wrote:

    In my next proposal I put the constant matrix inside the code in advance. This is very clever because it frees registers and it enables the instruction cache to fetch matrix entries in parallel with the data cache. By using this technique a matrix transformation will be almost as fast as copying memory. (Removing the need for an external VPU unit)

I believe the same advantage can be achieved with more registers, another EA, code fusion, and possibly another fpu ALU. This would speed up the chip for all processing without the need and overhead of using SMC.

S P wrote:

    Now I propose another optimization:
   
    In many matrices you have 0 values and 1 values. With my SMC generated inner loops I can remove instructions that multiply by zero. I can remove instructions that multiply by 1.

Multiply is not a bottleneck. Multiplication is faster than a branch and removing multiplications requires a branch. There is some benefit in removing multiplications in advance in some cases but it is likely more overall work.

S P wrote:

    Do you understand the difference?
    If I generate 10 innerloops with the size 200 bytes for each frame.
    I need to flush 2000 bytes. This doesn't take a long time.
   
    With 100 000 coordinates in the 3D scene I save 100 000 * 10 * 32 cycles.
 
  If the Natami has a cache snooper that does this job for me in parallel, then I don't need the cache clear.
 
  The logic I want is simple: if the data cache is trying to write to the instruction cache, then remove the instruction cache line.

Cache snooping would be nice for backward compatibility and safety as well. However, I don't like the idea of guaranteeing it and relying on it for the future. The caches are complicated enough already and messing around near the code being executed is tricky and could change with a different CPU.
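
The snooping rule S P asks for (a data store that hits a cached instruction line drops that line) can be pictured with a toy model in C. This is only a sketch to make the rule concrete: the direct-mapped layout, `LINE_BYTES`, `NUM_LINES` and all function names are illustrative assumptions, not Natami parameters or anything proposed in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: a direct-mapped instruction cache that tracks tags, not data.
   LINE_BYTES and NUM_LINES are made-up values, not Natami parameters. */
#define LINE_BYTES 16u
#define NUM_LINES  64u

typedef struct {
    bool     valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
} ICache;

static uint32_t line_index(uint32_t addr) { return (addr / LINE_BYTES) % NUM_LINES; }
static uint32_t line_tag(uint32_t addr)   { return addr / (LINE_BYTES * NUM_LINES); }

/* Instruction fetch: mark the line holding addr as cached. */
void icache_fill(ICache *c, uint32_t addr) {
    uint32_t i = line_index(addr);
    c->valid[i] = true;
    c->tag[i]   = line_tag(addr);
}

/* True if the line holding addr is currently cached. */
bool icache_hit(const ICache *c, uint32_t addr) {
    uint32_t i = line_index(addr);
    return c->valid[i] && c->tag[i] == line_tag(addr);
}

/* Snoop: every data store checks the instruction cache and drops a matching line. */
void snoop_store(ICache *c, uint32_t addr) {
    if (icache_hit(c, addr))
        c->valid[line_index(addr)] = false;
}
```

In this model a store to an address that merely shares a cache index with cached code leaves the line alone; only a store into the cached line itself invalidates it, so the next fetch would reload the modified code.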


Matt Hey
USA

Posts 734
14 Apr 2011 14:53


Wojtek P wrote:

  Exactly as i say. I am attacked and then - attacked again.
  Does this have to repeat once again.
  No i do not agree to be treated as trash as you do.
  Why do you feel as "higher entities" to treat me that way?
  Who gave you such rights.

No one is attacking you or singling you out. Everyone here has the same goal so we are all on the same side. We just have different opinions of how to accomplish the goal. Please focus on how you want to see this goal accomplished.


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
14 Apr 2011 14:56


Matt Hey wrote:

  I believe the same advantage can be achieved with more registers, another EA, code fusion, and possibly another fpu ALU. This would speed up the chip for all processing without the need and overhead of using SMC.
 

   
  Adding another EA unit will allow 2 datacachereads per clock. Adding another FPU/integer pipe will allow 2 instruction cachereads per clock.
   
    With my trick the N070 can perform 4 memory reads in one clock.
    2 from the inst. cache. 2 from the datacache.
 
 
    move.l (a0)+,d0  ;1 P1
    muls.l #xxx,d0:d0 ;1 P1
    move.l (a1)+,d1  ;1 P2
    muls.l #xxx,d1:d1 ;1 P1
 

   
    Your proposal is good. It will make my code run 2 times faster.
   
 
Matt Hey wrote:

    Multiply is not a bottleneck. Multiplication is faster than a branch and removing multiplications requires a branch. There is some benefit in removing multiplications in advance in some cases but it is likely more overall work.
 

   
    You don't get it...
   
    I don't only remove the multiplication. I remove all instructions that are not needed (move, muls, add). This is because my program compiles inner loops at runtime for optimal performance.
   
    Instead of generating instructions that do nothing useful like this:
   
 

    move.l (a0)+,d0  ;1 P1
    muls.l #0,d0:d0 ;1 P1
    move.l (a1)+,d1  ;1 P2
    muls.l #0,d1:d1 ;1 P1
    ..
    add.l d0,d1 ;2
 

   
    They are simply removed from my innerloop.
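
The idea of dropping the work for zero entries before the inner loop runs can be sketched in C without self-modifying code. Note that this is a different technique from the one described above: instead of generating instructions at runtime, a hypothetical `build_plan` step records only the non-zero matrix terms once, and `apply_plan` then loops over that list for each vertex. The 3x3 integer matrix, the `Term` struct and both function names are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* One surviving multiply-add: out[row] += coeff * in[col]. */
typedef struct {
    uint8_t row, col;
    int32_t coeff;
} Term;

/* "Compile" step, run once per matrix (row-major 3x3):
   keep only the terms that actually contribute. Returns the term count. */
size_t build_plan(const int32_t m[9], Term plan[9]) {
    size_t n = 0;
    for (uint8_t r = 0; r < 3; r++) {
        for (uint8_t c = 0; c < 3; c++) {
            int32_t k = m[r * 3 + c];
            if (k != 0) {
                plan[n].row = r;
                plan[n].col = c;
                plan[n].coeff = k;
                n++;
            }
        }
    }
    return n;
}

/* Inner step, run once per vertex: zero entries cost nothing here,
   and there is no per-entry test for zero inside the loop. */
void apply_plan(const Term *plan, size_t n, const int32_t in[3], int32_t out[3]) {
    out[0] = out[1] = out[2] = 0;
    for (size_t i = 0; i < n; i++)
        out[plan[i].row] += plan[i].coeff * in[plan[i].col];
}
```

For a diagonal matrix the plan holds three terms instead of nine. Entries equal to 1 could be split into a second list that adds without multiplying, at the cost of a second loop.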
   
 
Matt Hey wrote:

  Cache snooping would be nice for backward compatibility and safety as well. However, I don't like the idea of guaranteeing it and relying on it for the future. The caches are complicated enough already and messing around near the code being executed is tricky and could change with a different CPU.
 

   
  Gunnar solved this for me. I patch the OS with an empty cache clear on the Natami (N050). When new CPUs arrive they can patch it back.

Wojtek P
Poland

Posts 1597
14 Apr 2011 21:04


Matt Hey wrote:

Wojtek P wrote:

  Exactly as i say. I am attacked and then - attacked again.
  Does this have to repeat once again.
  No i do not agree to be treated as trash as you do.
  Why do you feel as "higher entities" to treat me that way?
  Who gave you such rights.
 

 
  No one is attacking you or singling you out. Everyone here has the same goal so we are all on the same side.

yes this is true. The problem is half-reading.



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
17 Apr 2011 17:49


Hi SP,

How good do you think we can get without changing the 68k ABI and without using self-modifying code?

S P wrote:

   

    fmove.d (a0)+,fp0 ;  fused (4 bytes)
    fmul.d  #A11,fp0 ;1  (12 bytes)
    fmove.d (a0)+,fp1 ;  fused 
    fmul.d  #A21,fp1 ;2
   

The 68K FPU can also use Data-Registers as source.
The typical Matrix Mul will operate on SINGLE size float data.
We can hold 8 single in the data registers.

How much, in your opinion, would this help us to tune the code without using SMC?

Cheers

Marcel Verdaasdonk
Netherlands

Posts 3976
17 Apr 2011 18:03


Wait, don't get me wrong, but don't most games make use of doubles instead of single precision math?

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
17 Apr 2011 18:15


Marcel Verdaasdonk wrote:

Wait, don't get me wrong, but don't most games make use of doubles instead of single precision math?

Well why should they?
Single precision is more than enough for games.

And the fact that certain architectures like Cell (PS3)
are quite slow with doubles indicates that games won't use them.

Megol .

Posts 676
17 Apr 2011 20:31


Marcel Verdaasdonk wrote:

Wait, don't get me wrong, but don't most games make use of doubles instead of single precision math?

Amiga games? They don't use floating point.
PC/console games? Single precision.

Wojtek P
Poland

Posts 1597
17 Apr 2011 21:23


Marcel Verdaasdonk wrote:

Wait, don't get me wrong, but don't most games make use of doubles instead of single precision math?

If they actually do, it only shows how dumb programmers are.



Matt Hey
USA

Posts 734
17 Apr 2011 21:33


Gunnar von Boehn wrote:

  The 68K FPU can also use Data-Registers as source.
  The typical Matrix Mul will operate on SINGLE size float data.
  We can hold 8 single in the data registers.

 
  If fp.s to fp.x doesn't cost any cycles (2 cycles on 68060), then using data registers for constant single precision floating point values is a great idea. Is it possible to make this no cycle cost? This might eliminate the need for adding more fp registers. It would be nice to add support in assemblers for specifying fp values in the integer unit. Something like...
 
    move.l #2.5.s,d0 ;not too easy to read :(
    move.l #2.5fp,d0 ;better?
    move.s #2.5,d0  ;another option
    move.l #$40200000,d0 ;hex fp notation works

  then these would be the same cycles?...
 
    fmul.s d0,fp0
    fmul.s fp1,fp0
 
  Loading a long fp.s value in a data register would be 6 bytes compared to loading a fp.s value in a floating point register being 8 bytes. That wouldn't be so bad as integer registers are easier to save and restore but efficient fpu code would look much different. If 1/2 IEEE fp was supported, a fp.h value could be loaded as a word in a data register taking 4 bytes compared to 6 in a fp register. I would still like to see the 1/2 IEEE fp conversion only (no calculation). Not only does it reduce code size but it's THE best format for a Z buffer. Integer Z buffers are not linear and values outside of the min and max wrap around, causing very undesirable effects. The min and max can be checked but then it's faster to use floating point. The Z buffer is read and written manually by the CPU for certain effects by the way. A 24 bit integer Z buffer doesn't work so well in the 68k EA, just like 24 bit chunky, and 16 bit fp.h saves memory too.
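
The hex form `$40200000` used for 2.5 in the `move.l` example can be checked on any host with a few lines of C. This is a minimal sketch, assuming `float` is a 32-bit IEEE 754 single (true on common targets); `float_bits` is a helper name invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Return the IEEE 754 single-precision bit pattern of f.
   Assumes float is a 32-bit IEEE 754 type. memcpy avoids aliasing problems. */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

`float_bits(2.5f)` gives `0x40200000`: sign 0, biased exponent 128 (2^1), mantissa 1.25. An assembler that accepted something like `#2.5fp` would only have to perform this same conversion at assembly time.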
 
 
Gunnar von Boehn wrote:

  How much, in your opinion, would this help us to tune the code without using SMC?

 
  Yes, the fpu needs to be tuned without SMC because it will be used from shared drivers (libraries) and compilers may have trouble generating optimal code for SMC. I like your thinking, let's see what others think.


Matt Hey
USA

Posts 734
24 Apr 2011 16:37


I've been thinking about a little thread over on EAB...
 
  EXTERNAL LINK 
  I had previously requested bchg, bclr and bset be made conditional with unused bits in the 32 bit forms when I was talking about adding 3 op instructions. These would be valuable even without 3 op and it seems I am not the only one that would use them. Adding them is almost free, reduces code size and would reduce entries in the branch cache when we get one. S P is working hard on adding support for instructions in Asm Pro and it would be nice if efforts like this only needed to be done once. Some clarification on what changes are planned for the N68k would be nice at this point.
 
  From the same thread, Kalms pointed out another possible common fusion with SCC...
 
 
Kalms wrote:

      scs    d0
      and.l    #2,d0
      add.l    d0,a0

 
  It would be nice if the SCC and AND.L with immediate only having bits in the lower byte could be fused so the 32 bit result could be forwarded. This should be possible as the upper 24 bits are zeroed in this case. SCC and EXTB.L should be able to be fused so the result can be forwarded in a similar way I would imagine?
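
Why the upper 24 bits do not matter in Kalms' sequence can be shown with a small C model. This is a sketch of the instruction semantics only, not of the proposed fusion hardware; `scs_and_add` and its parameters are names made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of:  scs d0 ; and.l #2,d0 ; add.l d0,a0
   Returns the new value of a0. */
uint32_t scs_and_add(uint32_t d0, uint32_t a0, bool carry) {
    /* scs writes only the low byte: $FF if carry is set, $00 otherwise. */
    d0 = (d0 & 0xFFFFFF00u) | (carry ? 0xFFu : 0x00u);
    /* and.l #2 keeps bit 1 and clears everything else, including bits 8..31. */
    d0 &= 2u;
    return a0 + d0;
}
```

Whatever was in the upper bytes of d0 beforehand, the result is a0 + 2 when carry is set and a0 otherwise, which is why the 32 bit result could be forwarded from the fused pair.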
 

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
24 Apr 2011 17:47


Gunnar von Boehn wrote:

  The 68K FPU can also use Data-Registers as source.
  The typical Matrix Mul will operate on SINGLE size float data.
  We can hold 8 single in the data registers.

This seems to be the best possible solution, and it doesn't even require a change or a patch, nor does it have to store more registers on a task switch, and it gives eight additional registers. (Or, actually, eight additional registers are already available for single precision).

Single precision should be absolutely sufficient for most game engines, I agree. Of course, it requires better compilers than the ones we have now. Due to type promotion in C, computations usually give double results unless you are careful. For example "x * 0.3" is a double even if x is a float; instead you need to write "x * 0.3f" for this. I believe the compiler is not even allowed to replace the former by the latter, even if the result is assigned back to a float, because the result can be different due to different roundings (0.3 is not exactly representable as a floating point number, thus 0.3 != 0.3f). Thus, programmers must be careful.

Greetings,
Thomas
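
The promotion rule described above can be checked directly in C. The sketch below uses `sizeof`, which reports the type of an expression without evaluating it, and compares the two literals after storing them in variables of their own type. The function names are made up for this example, and it assumes `float` and `double` are IEEE 754 single and double.

```c
#include <stdbool.h>
#include <stddef.h>

/* float * double literal: the float operand is converted, the result type is double. */
size_t size_with_double_literal(void) {
    float x = 1.0f;
    return sizeof(x * 0.3);
}

/* float * float literal: the result type stays float. */
size_t size_with_float_literal(void) {
    float x = 1.0f;
    return sizeof(x * 0.3f);
}

/* 0.3 rounded to float is a different value from 0.3 rounded to double.
   The volatile stores force each constant to be rounded to its declared type. */
bool literals_differ(void) {
    volatile float  f = 0.3f;
    volatile double d = 0.3;
    return (double)f != d;
}
```

On such a target the first function returns `sizeof(double)`, the second `sizeof(float)`, and `literals_differ()` is true, so the missing `f` suffix changes both the width of the arithmetic and the constant being multiplied.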


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
24 Apr 2011 18:24


Thomas Richter wrote:

Gunnar von Boehn wrote:

  The 68K FPU can also use Data-Registers as source.
  The typical Matrix Mul will operate on SINGLE size float data.
  We can hold 8 single in the data registers.
 

  This seems to be the best possible solution, and it doesn't even require a change or a patch, nor does it have to store more registers on a task switch, and it gives eight additional registers. (Or, actually, eight additional registers are already available for single precision).

Actually we can pass both DATA and ADDRESS registers to the FPU.
This can give us up to 15 extra single registers...
But code using the ADDRESS registers as FPU registers will not work on legacy classic 68k CPUs.

Matt Hey
USA

Posts 734
18 Sep 2011 17:31


This is an old thread! I dug it up because I have a related idea to suggestions starting about page 5 of this thread.
 
 
Megol . wrote:

    An equivalent to the x86 TEST instruction would be nice, it takes two arguments and changes the flags as a logical AND but doesn't change anything else. The implementation would be the same as AND but with the data write suppressed.
 

 
    We talked about the 68k BTST Dn,#<byte> which we found lacking as it only does byte size immediate constant masks and is not easy to use for me at least. We also talked about a 68k 3 op instruction set with a NIL destination which is pretty neat but I'm guessing we won't get 3 op for the N68050 at least. The benefit of the x86 TEST is that it doesn't destroy a register. We often reuse the same variable and only need the condition codes generated from an instruction. I was reading over part of this thread again and I don't know why we didn't think of using immediate values in a similar way to BTST Dn,#<byte>. It looks like the immediate encoding is free for the destination in most Dn,EA instructions. This would allow...
 
    and.size Dn,#<data.size>
    or.size Dn,#<data.size>
    eor.size Dn,#<data.size>
 
    add.size Dn,#<data.size>
    sub.size Dn,#<data.size>
 
  The destination is an immediate so the condition flags are updated but no data registers are written. The encoding and idea fit well with the current 2 op 68k. The compression of smaller longs into the upper half of a byte size immediate data is also possible. For example...
 
    and.l d0,#3 ;encoding is 2 words, CC flags updated, d0 preserved
 
  Here's the way I see the advantages vs disadvantages...
 
  + encoding is a good fit and compact
  + preserves a register
  + should be easy to use including for compilers
  - limited to data registers and immediate "constant" values
  - takes some encoding space
 
  Does anyone see any problems with this? Do you think this is worthwhile?
 

Phil "meynaf" G.
France
(Natami Team)
Posts 393
19 Sep 2011 08:19


I admit i like the idea of using destination immediates like that.
But i don't think it's really worth if it costs a significant amount of logical gates.


Matt Hey
USA

Posts 734
19 Sep 2011 14:28


@meynaf
    I believe the LE count would be very low as it's just enabling what already exists. The operation of the instruction is the same and btst Dn,EA already has this behavior. The encoding space is a bigger concern. Are the restrictions of constant immediate values and data registers only worth the encoding space?
 
  Example of use...
 
  if (!(x & 3) && (x != 16))
 
    This currently requires destroying and restoring the x variable in a register...
 
    moveq #3,d1
    and.l d0,d1
    bne .failed_if ;or scc
    cmp.l #16,d0
    beq .failed_if ;or scc (the if fails when x == 16)
 
  ;68k=7 words, N68k=6 words, 2 registers used
 
    Add 2 words and 2 cycles if a register is not available. In that case, it's more efficient to keep the variable x in memory and do the operation to a register...
 
    moveq #3,d0
    and.l (var_x,sp),d0
    bne .failed_if ;or scc
    cmp.l #16,(var_x,sp)
    beq .failed_if ;or scc (the if fails when x == 16)
 
  ;68k=9 words, N68k=8 words, 1 register used
 
    Allowing immediate destinations allows a single register variable for x with no restoring...
 
    and.l d0,#3
    bne .failed_if ;or scc
    cmp.l #16,d0
    beq .failed_if ;or scc (the if fails when x == 16)
 
  ;68k=no, N68k=6 words, 1 register used
 
  The speed is similar in all versions. I think the last version is most readable and simplest. It might even make a compiler writer's job easier ;).


Megol .

Posts 676
19 Sep 2011 16:51


I like it. However it would be more useful if the OR operation (which wouldn't be useful in this form) was replaced with an AND with some changes in the result:
Z=normal AND behaviour
X=((imm AND src)==imm)

Then we could check three conditions in one bitmask operation, if none of the bits in imm are set in the src (Z true), if at least one of the bits in the imm are set in src (Z false, X false) or if all the imm bits are set in the src (Z false, X true).
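
The three-way test described above can be written out in C to make the flag definitions concrete. This is only a sketch of the proposed semantics: the `Flags` struct, the `MaskResult` enum and both function names are assumptions for the example, and no such instruction exists on the 68k.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { BITS_NONE, BITS_SOME, BITS_ALL } MaskResult;

typedef struct {
    bool z; /* normal AND behaviour: result is zero */
    bool x; /* proposed: (imm AND src) == imm       */
} Flags;

/* Compute the two flags; src itself is not modified. */
Flags and_test(uint32_t src, uint32_t imm) {
    uint32_t r = src & imm;
    Flags f = { r == 0, r == imm };
    return f;
}

/* Decode the flags into the three cases from the post. */
MaskResult classify(uint32_t src, uint32_t imm) {
    Flags f = and_test(src, imm);
    if (f.z)
        return BITS_NONE;               /* Z true           */
    return f.x ? BITS_ALL : BITS_SOME;  /* Z false, X = all */
}
```

With a mask of 3, a source of 8 gives `BITS_NONE`, 1 gives `BITS_SOME` and 7 gives `BITS_ALL`. A zero mask sets both Z and X, a corner case the post does not cover; this sketch reports it as `BITS_NONE`.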

Matt Hey
USA

Posts 734
20 Sep 2011 02:54


@Megol
  I see your point about some variations of immediate destinations being of limited use. OR would be useless with only the zero and negative CC flags set as defined for the OR instruction. SUB would also be of limited use because CMP also does a subtract and throws away the result. Changing the CC flags of logic operations could make OR useful but also starts to take LEs I imagine. A CC flag (V or C+X flags?) that represents -1 (all bits set) might do it and would be useful for AND also. The X bit by itself is a little bit difficult to use as it's not used in branches. There are definitely disadvantages to immediate destinations that I did not see although the AND is very powerful and would be worth enabling by itself in my opinion. It's great for testing bit masks as is.


Phil "meynaf" G.
France
(Natami Team)
Posts 393
22 Sep 2011 09:03


Matt Hey wrote:

@meynaf
    I believe the LE count would be very low as it's just enabling what already exists. The operation of the instruction is the same and btst Dn,EA already has this behavior. The encoding space is a bigger concern. Are the restrictions of constant immediate values and data registers only worth the encoding space?

Encoding space is there already. As you said, it's just about allowing something that's already there.

However, I do not know if the internals can easily permit using #n as destination. BTST does that but it doesn't write data; allowing immediates for writes may well pose a few problems.

Now, apart from AND, are there examples where it could directly work and be useful?
OR is indeed not of much use.

