Matt Hey wrote:
| Is there any preliminary timing info or optimization tips for N050 instructions or addressing modes yet?
|
Maybe the following could be filled out...
PROC   CACHE      RAdd  MAdd  Mul  Index  Bcc
68000  0/0        6     18    40   18     10/6
68020  256/0      2     6     28   9      6/4
68030  256/256    2     5     28   8      6/4
CPU32  0/0        2     9     26   12     8/4
68040  4K/4K      1     1     16   3      2/3
68060  8K/8K      1     1     2    1      0/1
N050   variable*  1     1     1    1      (0/0)*

1) The cache size of the 050 is variable and is set at compile time. Depending on your FPGA space you can have, e.g., 64/64 KB. We aim for a cache of 32/32 or 64/64 in the NATAMI.

2) The 68050 does support some sort of branch acceleration, but this is not 100% finished yet. Currently some branches can be folded away. We are working on adding branch merging, which means that a short branch will be merged with the instruction that follows it, so that this instruction becomes conditional. The advantage of this technique is that the branch will ALWAYS be predicted correctly. As of today, with the half-finished branch acceleration, some branches can still take 5 cycles.

3) We are going to add a LINK-Stack, which means that returning from a subroutine call (the RTS instruction) will be fast on the 68050.

RAdd:  Register to register 32 bit add (add.l d0,d1).
MAdd:  Absolute long address to register add (add.l _mem,d1).
Mul:   16x16 multiplication (max. time) (mulu.w d0,d1).
Index: Indexed addressing mode (move.l 2(a0,d0),d1).
Bcc:   Byte conditional branch taken/not taken (bne.b label).

And maybe add N050 answers for these questions...

Operations with long immediate values between -128 and 127:

A: add.l #20,d1        B: moveq #20,d0
                          add.l d0,d1

68040/xx: A
68000/20/60: B
N68050: A

The 050 can load 16 bytes per clock from the ICache. This means the length of an instruction does not slow the CPU down.

Byte/word operations that could be replaced with long operations:

A: add.w d0,d1         B: add.l d0,d2

68000/xx: A
68020/40: Any
68060: B
N68050: Any

Byte/Word/Long operations always take 1 clock on the 050.

Keep memory operands in registers:

A: add.l _var,d1       B: move.l _var,d0
   add.l _var,d2          add.l d0,d1
                          add.l d0,d2

68040: A (as long as the total # of instructions is less)
68000/20/60/xx: B
N68050: A

The 050 can do a memory read per clock, therefore A is faster.
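
To see how the "keep memory operands in registers" answers follow from the table above, here is a rough Python sketch that adds up the per-instruction cycle counts for options A and B. It is only a back-of-the-envelope model: it ignores cache misses, pipelining and superscalar pairing (which is why the 68060 is left out), and it assumes that the plain load `move.l _var,d0` costs the same as the MAdd column, because the table has no separate column for loads.

```python
# Cycle counts copied from the table above (RAdd and MAdd columns only).
CYCLES = {
    "68000": {"RAdd": 6, "MAdd": 18},
    "68020": {"RAdd": 2, "MAdd": 6},
    "68040": {"RAdd": 1, "MAdd": 1},
    "N050":  {"RAdd": 1, "MAdd": 1},
}


def option_a(cpu):
    """Two adds straight from memory: add.l _var,d1 / add.l _var,d2."""
    return 2 * CYCLES[cpu]["MAdd"]


def option_b(cpu):
    """One load followed by two register adds.

    ASSUMPTION: move.l _var,d0 is charged the MAdd cost, since the table
    has no column for a plain memory load.
    """
    return CYCLES[cpu]["MAdd"] + 2 * CYCLES[cpu]["RAdd"]


def cheaper(cpu):
    """Return 'A', 'B' or 'Any' depending on which option has fewer cycles."""
    a, b = option_a(cpu), option_b(cpu)
    if a < b:
        return "A"
    if b < a:
        return "B"
    return "Any"


if __name__ == "__main__":
    for cpu in CYCLES:
        print(cpu, "A =", option_a(cpu), "B =", option_b(cpu), "->", cheaper(cpu))
```

Under these assumptions the model picks B for the 68000 and 68020 and A for the 68040 and the N050, which agrees with the answers given in the thread.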

Reschedule operations using address registers:

A: add.l d0,d1         B: move.l (a1),a0
   move.l (a1),a0         add.l d0,d1
   move.l (a0),d2         move.l (a0),d2

68000/20: Any
68040/60/xx: B
N68050: B

Like the 040 and the 060, the 050 has a load/usage delay on address registers. Using and updating an address register like this takes no penalty:

1) (A0)+,Dn
2) (A0)+,Dn

But if you load a register from memory and then use it straight away, there is a bubble between the two instructions. Just like on the 040 and on the 060.

Replace constant multiplications with adds/subs/shifts:

A: mulu.w #254,d1      B: move.l d1,d0
                          lsl.l #8,d1
                          lsl.l #1,d0
                          sub.l d0,d1

68060: A
68000/20/40/xx: B
N68050: A

Mul is fast (1 clock) on the 050.

Operations using indexing modes:

A: add.l (a0,d7),d1    B: add.l d7,a0
   add.l (a0,d7),d2       add.l (a0),d1
                          add.l (a0),d2

68000/60: A
68020/40/xx: B
N68050: A

The indexed addressing mode is free. Option A takes 2 clocks, Option B takes 3 clocks.

Saving/restoring registers:

A: movem.l d4-d7,-(a7) B: move.l d7,-(a7)
                          move.l d6,-(a7)
                          move.l d5,-(a7)
                          move.l d4,-(a7)

68000/20/60/xx: A
68040: B (if time critical)
N68050: A

MOVEM takes 1 clock per loaded/stored register.

Any tips like these...

68020:
  Use short instructions
  Keep values in registers
  Almost no scheduling necessary
  Code optimized for the 68060 runs great

68040:
  Use as few instructions as possible (even if they are longer)
  Values can be kept in memory
  Avoid pipeline stalls for some effective addresses
  Avoid subtracts to address registers

68060:
  Use short instructions
  Keep values in registers
  Schedule instructions for superscalar execution
  Inline short functions

N050:
  Almost no scheduling necessary
  Use as few instructions as possible (even if they are longer)
  Values can be kept in memory
  Avoid memory indirect addressing modes.

I expect it will be smart to schedule instructions for superscalar execution for future N070 compatibility. Anything else to watch out for or shy away from for future compatibility?
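
Going back to the constant-multiplication question above: the shift/subtract replacement relies on 254*x = 256*x - 2*x. The Python sketch below checks that identity against a simple model of `mulu.w` (unsigned 16x16 -> 32 multiply on the low word of the register). The function names are made up for this illustration, and the model only tracks register values, not condition codes.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def mulu_w_254(d1):
    """Model of mulu.w #254,d1: only the low word of d1 takes part."""
    return ((d1 & 0xFFFF) * 254) & MASK32


def shift_sub_254(d1):
    """Model of option B, one line per instruction."""
    d0 = d1                    # move.l d1,d0
    d1 = (d1 << 8) & MASK32    # lsl.l  #8,d1   -> 256 * x
    d0 = (d0 << 1) & MASK32    # lsl.l  #1,d0   ->   2 * x
    d1 = (d1 - d0) & MASK32    # sub.l  d0,d1   -> 254 * x
    return d1


def equivalent_for_all_words():
    """True if both versions agree for every 16-bit input value."""
    return all(mulu_w_254(x) == shift_sub_254(x) for x in range(0x10000))


if __name__ == "__main__":
    print(equivalent_for_all_words())
    # With a non-zero upper word the two versions differ, because mulu.w
    # ignores the upper word while the shifts operate on all 32 bits.
    print(mulu_w_254(0x00010001) == shift_sub_254(0x00010001))
```

So the replacement is only a drop-in for `mulu.w` when the upper word of d1 is known to be zero. It also clobbers d0 and leaves different condition codes, which is one more reason to prefer plain `mulu.w` on a CPU where the multiply is fast.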

Yes, Jens is currently reworking the cache to prepare for superscalarity. There are a few things which are unique to the N68K line.

1) Like the 040, the length of an instruction does not matter. This means even LONG instructions will only take 1 cycle.

2) The 050 is internally a 3-operand machine. Because of this the 050 can sometimes combine two 68K instructions into 1. This feature is currently 50% finished in the core. We need to add a hint to the ICache to finish this. When it's fully finished the following will happen:

Example:
    move.l D0,D1
    add.l (A0),D1

The 050 will in the future do both instructions together in 1 clock. What the 050 internally does is
    ADD.l (A0)+D0,D1
that is, D1 = (A0) + D0 as a single operation. The 070 is planned to be able to do this twice per clock. To enable support for this in the future, please leave such dependent instructions together. Do not put another instruction between them.

3) Branch converting. The 050 will convert a short conditional branch and the instruction it skips into a conditional instruction.

Example:
    bne .dontadd
    add.l (A0),D1
.dontadd

Will be converted to:
    addeq.l (A0),D1

This combined instruction will only take 1 clock. This means in theory the best throughput that the 050 could reach will be:
    bne .dontadd
    move.l D0,D1
    add.l (A0),D1
.dontadd
    bra somewhere else
The 050 is designed to do all 4 of the above instructions together in 1 cycle.

The 050 can do an unconditional branch for free. *working
The 050 can merge 2 instructions. *needs hint tag in cache.
The 050 can rewrite Bcc to conditional instructions. *needs hint tag in cache.

I believe the 050 is very easy to program for. All instructions are fast and all addressing modes are very fast. You can use long instructions as you like. This means there is no need to split complex instructions into several simpler ones, like people did on the older CPUs to speed them up.

What the 050 does NOT like are memory indirect addressing modes. But I believe that this is not a disadvantage, as memory indirect addressing modes ALWAYS were very slow and were very rarely used.

Does this answer all of your questions? Or do you have more questions?
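
As a rough illustration of how the three features above (free unconditional branch, instruction merging, Bcc conversion) combine, here is a toy Python model that counts issue slots for a short instruction list. It is a sketch only: the rules, the instruction encoding as tuples, and names such as `FUSABLE` and `count_clocks` are assumptions made for this example and do not describe how the core actually detects these cases.

```python
# Instructions are tuples like ("add.l", "(A0)", "D1"); labels are strings.
CONDITIONAL = {"bne", "beq"}                   # assumed subset of Bcc
FUSABLE = {"add.l", "sub.l", "and.l", "or.l"}  # assumed set of mergeable ops


def is_label(item):
    return isinstance(item, str)


def fuse_moves(program):
    """Group 'move.l Dx,Dy' with a directly following op that writes Dy."""
    units, i = [], 0
    while i < len(program):
        cur = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if is_label(cur):
            units.append(cur)
            i += 1
        elif (cur[0] == "move.l" and cur[1].startswith("D")
              and cur[2].startswith("D")
              and nxt is not None and not is_label(nxt)
              and nxt[0] in FUSABLE and nxt[-1] == cur[2]):
            units.append([cur, nxt])
            i += 2
        else:
            units.append([cur])
            i += 1
    return units


def fold_branches(units):
    """Fold a Bcc that skips exactly one unit into that unit."""
    out, i = [], 0
    while i < len(units):
        cur = units[i]
        if (not is_label(cur) and len(cur) == 1
                and cur[0][0] in CONDITIONAL
                and i + 2 < len(units)
                and not is_label(units[i + 1])
                and units[i + 1][0][0] not in CONDITIONAL | {"bra"}
                and units[i + 2] == cur[0][1]):
            out.append(cur + units[i + 1])
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


def count_clocks(program):
    """One clock per remaining unit; labels and 'bra' are counted as free."""
    units = fold_branches(fuse_moves(program))
    return sum(1 for u in units if not is_label(u) and u[0][0] != "bra")


best_case = [
    ("bne", ".dontadd"),
    ("move.l", "D0", "D1"),
    ("add.l", "(A0)", "D1"),
    ".dontadd",
    ("bra", "somewhere_else"),
]

if __name__ == "__main__":
    print(count_clocks(best_case))
```

In this model the four-instruction sequence collapses into a single issue slot, while a `move.l` followed by an unrelated instruction still counts as two, which is the reason for keeping dependent pairs adjacent.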
|