Shader Processor v2a

For the purposes of this document a bit is a bit, a byte is 8 bits, a word is 16 bits, and a dword is 32 bits.

The shader processor has a Harvard architecture. A 32 bit instruction (d)word is used. There are eight general purpose 32 bit registers and a single hidden 16 bit program counter register. The processor addresses RAM at a granularity of 4 bytes (it is not byte addressable). The processor does not implement a branch delay slot. There is no condition code register. Exceptions are not triggered for overflow; overflow detection requires an explicit test.

The data memory of the shader processor is very small and expected to be very local to the processor. The framebuffer is also only addressable at dword granularity, is presumed to be located 'far' from the shader processor, and can be accessed only by DMAing data into or out of the shader local memory. It is anticipated that shader programs will manually prefetch required data to hide memory latency. Shader program execution is not affected by DMA operation, except by inadvertent (silent) memory collisions.

Due to the FPGA target, instruction selection was not deemed critical. At the inception of the processor design it was expected that shader programs would be hand written in assembly, so instructions that simplified hand assembler coding, such as bit sets/clears and branches on bit states, were added. Instructions were selected while writing the first programs, which became the floating point support library. Eight registers were settled upon after development of these first programs was complicated by the initial choice of four GP registers.

Hardware floating point support was omitted in this implementation for several reasons:

v3 / Future Enhancements

Critical: add instructions and dedicated hardware to support synchronized operation between shader processors. Explore vector ops and registers, i.e. add some SIMD instructions. Maybe add floating point. Decisions about how best to integrate the DSP elements within Xilinx series-7 FPGAs are expected to heavily affect instruction selection and overall architecture in future revisions.

Shader program length, working set size, and the mix of available logic resources on future FPGA targets will determine whether a new memory scheme is needed. One possible alternative is for the instruction and data RAMs to be switchable between the current internal-only mode and a conventional cache mode. I think this could be implemented with small tweaks to the DMA unit(s).

A far-future design question is whether it would be beneficial to dynamically partition a sub-array of shader cores to create small pools of coherence for certain tasks.

Instruction Encoding

The upper 7 bits are the instruction opcode (O), the next 3 bits are source register a (A), the next 3 bits are source register b (B), and the next 3 bits are the destination register (D). The final 16 bits of the instruction dword are the immediate word (I).

Grouped by [] into 8 bit bytes:

(31) [OOOOOOOA] [AABBBDDD] [IIIIIIII] [IIIIIIII] (0)

If the immediate word is used as an address, this allows 65536*4 bytes (256 KB) of native address space for both the data and program memories. Given that many of these processors are meant to be tiled onto a single FPGA, this address space seems large enough.
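
As a worked example of the encoding above (using raw register numbers rather than assembler register names), an ADDL (opcode 0x05) that adds an immediate of -4 to source register 1 and stores the result in destination register 2 assembles as:

    O = 0000101 (0x05)   A = 001 (1)   B = 000 (unused)   D = 010 (2)   I = 0xFFFC (-4)

    [00001010] [01000010] [11111111] [11111100]  =  0x0A42FFFC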

Native Instructions

Note: In the table below, %A, %B, %D, and %R refer to any GP register. The register letters in this section refer to the instruction encoding documented in the previous section.

Each entry below lists the mnemonic with its operands, its opcode, and a description with any notes.

NOP (0x00)
No operation. Lower bits are ignored.

LUI %R IMM (0x01)
Load Upper from Immediate. Sets the highest 16 bits of the register to the immediate word. The assembler encodes %A and %D as the specified %R.

LLI %R IMM (0x02)
Load Lower from Immediate. Sets the lowest 16 bits of the register to the immediate word. The assembler encodes %A and %D as the specified %R.

ADD %A %B %D (0x03)
Add %A to %B and store the result in %D.

SUB %A %B %D (0x04)
Subtract %B from %A and store the result in %D. Inputs and outputs are signed.

ADDL %A %D IMM (0x05)
Add the sign extended IMM to %A and store the result in %D.

AND %A %B %D (0x06)
Logical AND %A with %B and store the result in %D.

OR %A %B %D (0x07)
Logical OR %A with %B and store the result in %D.

XOR %A %B %D (0x08)
Logical XOR %A with %B and store the result in %D.

NOT %A %D (0x09)
Logical NOT %A and store the result in %D.

BSET %A %D IMM (0x0a)
Set bit IMM of %A and store the result in %D. IMM cannot be larger than 31.

BCLR %A %D IMM (0x0b)
Clear bit IMM of %A and store the result in %D. IMM cannot be larger than 31.

RSL %A %D IMM (0x0c)
Left shift %A by IMM bits and store the result in %D. IMM cannot be larger than 31 and must be positive. Only a logical shift is performed.

RSR %A %D IMM (0x0d)
Right shift %A by IMM bits and store the result in %D. IMM cannot be larger than 31 and must be positive. Only a logical shift is performed.

MUL %A %B %D (0x10)
Multiply %A by %B and store the result in %D. Currently only a 16 bit multiply is implemented; the upper source bits are ignored.

CMP %A %B %D (0x30)
Compare %A to %B and put the result in %D. Bit 0: equal; bit 1: %A > %B unsigned; bit 2: %A > %B signed; bit 3: %A < %B unsigned; bit 4: %A < %B signed. (See the example following this table.)

SRI %A IMM (0x40)
Store Register to Immediate. Store register %A to the address held in the immediate word.

SRR %A %B IMM (0x41)
Store register %A to the address generated by %B + sign extended IMM.

LRI %D IMM (0x42)
Load %D from the address held in the immediate word.

LRR %B %D IMM (0x43)
Load %D from the address generated by %B + sign extended IMM. The base register is encoded in the %B field rather than %A to save an adder in load-store.vhd.

SEQZ %A (0x50)
Skip the next instruction if %A is zero.

SNEQZ %A (0x51)
Skip the next instruction if %A is not zero.

SBSET %A IMM (0x52)
Skip the next instruction if bit IMM of %A is set. IMM cannot be larger than 31.

SBCLR %A IMM (0x53)
Skip the next instruction if bit IMM of %A is clear. IMM cannot be larger than 31.

JI IMM (0x60)
Jump to the address in the immediate word.

JR %A (0x61)
Jump to the address in register %A. Perhaps add a sign extended IMM to the address in %A, but this seems too complex for now.

HLT (0x70)
Halt.
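
The skip instructions pair naturally with CMP and the jumps to form conditional branches. The fragment below is a sketch only: the concrete register names %A, %B, and %C, the ';' comment style, and the hex immediates are assumptions about the assembler, and 0x0040 is just an example target address. It jumps to instruction address 0x0040 when %A equals %B and falls through otherwise:

    CMP   %A %B %C     ; bit 0 of %C is set when %A == %B
    SBCLR %C 0         ; not equal -> skip the jump
    JI    0x0040       ; executed only when %A == %B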


ABI

Register 5 (%F) is used as a stack frame pointer (by convention only, see below).

Register 6 (%G) is an assembler temporary. As it is only used during the call and return assembler macros, it can normally be used outside of these calls, provided the programmer knows its value will not be preserved across calls or returns.

Register 7 (%H) is treated as the stack pointer. The stack grows down. The stack pointer points to the next open slot; in other words, when writing to the stack the write occurs before the SP is decremented.

Assembler Macros

LI %R c

Does a LUI and an LLI of the constant c into register %R. (A possible expansion is sketched after the RETURN entry below.)

PUSH %R

Pushes register %R's contents to the stack and decrements the stack pointer by 1 (dword addressing).

PULL %R

Pulls a value from the stack, places it into register %R, and increments the stack pointer by 1.

CALL <addr>

Computes the proper return address, puts it in %G (AT), and then jumps to addr. See the Notes section.

FNSETUP

The second half of a function call. FNSETUP pushes all register contents to the stack (except %H). Register %G (the CALL-calculated return address) is saved first at stack offset '0'. It then updates the stack pointer to account for all the values added; independent PUSHes are not performed. It is expected to be called as the first 'instruction' of a function. See the Notes section.

RETURN

Restores (PULLs) all register contents from the stack (except registers %G and %H),

PULLs the return address from the stack,

and jumps to that address.
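
For illustration, plausible expansions of the LI, PUSH, and PULL macros are sketched below. These are not the assembler's literal output; the register names, immediate notation, and comment style are assumptions, and the PUSH/PULL expansions simply follow the stack convention from the ABI section (write to the slot the SP points at, then decrement; the reverse for PULL).

    ; LI %A 0x00012345 could expand to:
    LUI  %A 0x0001
    LLI  %A 0x2345

    ; PUSH %R could expand to:
    SRR  %R %H 0       ; store %R at the open slot SP points to
    ADDL %H %H -1      ; SP = SP - 1

    ; PULL %R could expand to:
    LRR  %H %R 1       ; the most recently pushed value is one slot above SP
    ADDL %H %H 1       ; SP = SP + 1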

Compiler Function Calls

The assembler macro CALL computes the desired return address and leaves it in the %G (AT) register, then branches to the call address. The callee is expected to save processor register state on the stack via FNSETUP, using 7 stack locations. The compiler knows this and writes function arguments into the space below this -7 offset, so that after the CALL/FNSETUP pair the new function's frame pointer is set up pointing to these blocks.

Procedure for a Function call:

The Caller:

Considers the space required for all registers to be saved (stack space to be used by the CALL and FNSETUP macros).

Considers the space required for the return value (1 more slot below the CALL/FNSETUP modified stack pointer).

Writes function parameters into the space below the SP value calculated above.

performs the CALL (expected behavior – nothing tricky here).

Final instruction of CALL macro jumps to function address

The Callee:

Calls FNSETUP to save caller's register state. (expected behavior)

The first instructions of a function are compiler generated (not user code).

The stack pointer is copied into the %F register, forming the frame pointer for all the function's variables. The compiler uses this frame pointer to access the return value, all named local variables, and compiler temporaries.

The stack pointer is decremented by the space required by all of the above, so the next PUSH does not stomp on anything.

Upon a compiled code return <value>:

The <value> is copied to the reserved space in the stack frame.

Restores the stack pointer from our 'frame' pointer

issues the RETURN assembler macro
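
Putting the steps above together, one plausible picture of the stack just after the compiler generated prologue is sketched below. The ordering follows the text above; the exact slot offsets (and exactly which slot %F points at) are not specified here and depend on the FNSETUP implementation.

    (higher addresses)
      ... caller's frame ...
      saved return address (%G)   \  7 slots written by
      six more saved registers    /  CALL / FNSETUP
      <- %F, the callee's frame pointer (SP copied after FNSETUP)
      return value slot            (reserved by the caller)
      function arguments           (written by the caller before CALL)
      named locals and compiler temporaries
      <- SP, the next open slot
    (lower addresses)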

DMA Unit (Software Interface)

The shader core has an integrated 4 entry scatter gather DMA unit that allows the shader to transfer data between the shader memory and the framebuffer. The DMA unit operates independently of the execution of the shader core so the shader can save results and/or load new raw data while operating on some previously fetched data. This is the mechanism by which the shader core can hide memory latencies.

The DMA unit is interfaced by writing DMA instructions into special addresses at the end of the data memory address space. These memory locations allow the shader programs to activate or check the status of the DMA unit.

The DMA unit has slots for up to four DMA operations. Each operation is programmed by two dword values. The first dword is the DMA_CMD and the second dword is the DMA_FB_ADDR (the framebuffer address). The lowest word of the DMA_CMD dword is the 16 bit address for the DMA in the shader processor's data RAM. The lowest 12 bits of the upper word in the DMA_CMD are the number of transfers. The highest bit in the DMA_CMD is the direction: if the highest bit is set, the transfer is into the shader; if it is not set, the transfer is out of the shader.

DMA commands are executed in order. The DMA is triggered by writing a count value for the number of transfers to perform (i.e. 1, 2, 3, or 4) to one address past the end of the 8 configuration registers. Reading from this same location returns a non-zero value if the DMA unit is busy, or a zero if the DMA unit is idle.

DMA_CMD_X

Lower 16 bits are the transfer address (A). The high bit (D) is the direction bit (set to 1 for reading into shader RAM from the framebuffer). Bits (C) give the number of dwords transferred. Bits (X) are not examined. So:

[D][XXX][CCCCCCCCCCCC][AAAAAAAAAAAAAAAA]
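
For example (pure arithmetic on the layout above), a command to read 16 dwords from the framebuffer into shader data ram address 0x0100 has D = 1, C = 0x010, and A = 0x0100, giving a DMA_CMD of 0x80100100.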

DMA_FB_ADDR_X

Starting address in framebuffer
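
The sequence below sketches the software interface end to end: it queues slot 0 to fetch 16 dwords from framebuffer address 0x00001000 into shader data ram at 0x0100, triggers the unit, and spins until it is idle. The addresses 0xFF00, 0xFF01, and 0xFF08 assume the 8 configuration dwords start at 0xFF00 with the trigger/status location immediately after them; the register names, label syntax, and ';' comments are likewise assumptions about the assembler.

    LI    %A 0x80100100   ; DMA_CMD: direction = in, count = 16, local address 0x0100
    SRI   %A 0xFF00       ; DMA_CMD_0      (assumed address)
    LI    %A 0x00001000   ; framebuffer source address
    SRI   %A 0xFF01       ; DMA_FB_ADDR_0  (assumed address)
    LI    %A 0x0001       ; execute 1 queued command
    SRI   %A 0xFF08       ; trigger location, one past the 8 configuration dwords (assumed)
    wait:
    LRI   %B 0xFF08       ; reads non-zero while the DMA unit is busy
    SEQZ  %B              ; idle -> skip the jump and continue
    JI    wait

In practice the wait would be replaced with useful work on previously fetched data, which is how the shader hides framebuffer latency.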

RTL Implementation

The shader processor v2 has four pipeline stages; see Illustration 1 for a diagram.

Illustration 1: Pipeline Diagram v2a


Data Address Space

As stated above, the instruction and data address spaces are 65536 dwords long. The implemented data space in each RAM is expected to be significantly less than this value. The 256 addresses 0xFF00 – 0xFFFF in the data address space are considered to be external to the shader core. The shader's load_store unit traps reads and writes to this address range and performs reads/writes on a processor local bus that has 8 bit addresses and 32 bit data. It is through this local data bus that the DMA unit is activated by the shader core. This local data bus is very similar to the TVC's top level command bus. It is expected that specialized functional units will be designed and implemented first on the command bus, then moved from the top-level command bus to these processor local data buses. It is through this architectural feature that accelerated processing units will be added to the shader cores.

Notes

The buffer stage was added pre-release to double the clock speed from 40 MHz to 80 MHz. (Spartan IIIe)

Pre-release versions of the assembler combined FNSETUP into CALL, making the caller responsible for saving its state. Because there are normally many more function calls than function definitions in a program, it makes sense to keep most of the overhead of a function call in the callee, shrinking the emitted code size.

The first optimization was to have CALL only compute the return address and push it onto the stack, then jump to the callee, which would first call FNSETUP, thereby pushing the rest of the registers to the stack. The cost of this optimization is a single instruction; the stack pointer is modified once in CALL and once in FNSETUP.

The next optimization was to have CALL compute the return address and place it at the top of the stack without modifying the stack pointer; the matching FNSETUP assumes the return address is already there and that the stack pointer has not yet been modified. This adds no cost and reduces the binary size.

The final optimization was to simply keep the return address in %G (the assembler temporary) for CALL, a very conventional design.

These machinations were done to squeeze more functionality into the fixed size instruction ram.