TVC R4 design document

The Test Video Card (TVC) currently consists of 16 VHDL modules. The organization of these modules may be seen in Illustration 1. (The Poly Scanline has internal Block SelectRAM+ caches that are not shown in Illustration 1; the synchronizers used throughout the design are also not shown.) The modules are of two main types: those that implement drawing commands, and those that implement the overall structure of the TVC. The structural VHDL modules consist of the TVC top level module, a Memory Control Unit (MCU), a module called pixel_pusher that controls the monitor sync signals and outputs pixel data, and the EPP control unit.

Illustration 1: TVC Architecture

The most important structural module is the MCU. The MCU provides and controls a memory user bus that runs throughout the TVC to all units that need to access or modify the frame buffer contents. The MCU through its operation establishes the timing for the entire TVC.

The drawing functional units (FU) consist of four separate modules. The first is a Block Set unit that writes fixed pixel values to consecutive bus aligned memory addresses. Next are a block reader and block writer that allow reading and writing 32 bit bus aligned memory locations. The poly scanline unit draws a single pixel wide z-buffered triangle scanline (with optional texture mapping) to the frame buffer.

The Memory User Bus

The Memory Users (mu) are driven by the Memory Control Unit (mcu). The mcu, when idle, checks whether any mus have raised an interrupt. If a scanline cache mu has raised an interrupt, that interrupt is serviced ahead of any other mu. This helps guarantee that pixel values will be available to the pixel pusher when it needs them for output to the display. Memory Users other than the scanline caches are serviced in round-robin order. A mu may have to wait an arbitrarily long time before being serviced by the mcu. After selecting a given mu, the mcu will completely satisfy the request of that mu before it moves on to another mu. This means that the maximum memory request length must be kept short enough that the scanline caches are never starved.
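The selection policy described above can be sketched as a small behavioral model. This is only an illustration, not the VHDL used in the mcu: the function name, its arguments, and the choice to scan the scanline caches in index order are all assumptions made for the example.

```python
from typing import List, Optional


def select_next_mu(irq: List[bool], is_scanline_cache: List[bool],
                   last_served: int) -> Optional[int]:
    """Pick the next memory user (mu) to service, or None if no irq is raised.

    irq[i] is True when mu i has raised its interrupt line.
    is_scanline_cache[i] marks the mus that get priority service.
    last_served is the index of the most recently serviced mu.
    """
    n = len(irq)
    # Scanline caches are serviced ahead of every other mu.
    # ASSUMPTION: ties between scanline caches go to the lowest index.
    for i in range(n):
        if irq[i] and is_scanline_cache[i]:
            return i
    # All other mus are serviced round-robin, starting after last_served.
    for offset in range(1, n + 1):
        i = (last_served + offset) % n
        if irq[i] and not is_scanline_cache[i]:
            return i
    return None
```

Because the selected mu is serviced to completion, the round-robin loop only bounds how many other mus are served first. It does not bound how long each of them holds the bus.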

After the mcu selects a mu, the mcu reads several special mu addresses that contain the actions that the mu wants the mcu to perform. The mcu fully reads these addresses as needed before it starts processing the request. The mcu will be capable of several operations.

Unfortunately the mcu is currently not capable of read/modify/write pixel operations.

Bus Details

Remember that the mcu is completely in control of the mu bus. Mus are purely reactive to values driven on the bus by the mcu. Because all mus see the same mu_bus, a mu reacts to it only while it is selected in response to an interrupt it raised.

When this line is high, the mu should transfer the dword at its selected internal address to or from the mu_bus_data lines.

Data width is 32 bits. Read and written by both the mcu and the mu. Only full dwords are written or read at a time.

Address bus width is 12 bits. The addresses refer to 32 bit registers, not individual bytes. This means that a total of 4K dwords (16K bytes) can be addressed by the mcu in a given mu.

The mcu raises this line when it wants the mu to store the value currently on the mu_bus_data lines to its internal dword 'register' at the mu's internal address currently on the mu_bus_address lines. When this is low, the mcu wants the mu to put the value in the selected address on the mu_bus_data lines.

In addition to the bus lines mentioned above, each mu has two lines connecting directly to the mcu.

The mu raises this line when it wants the mcu to perform an operation on its behalf.

The mcu raises this line in response to the irq line raised by the mu when it is starting to service its request. The mcu lowers this line after it is finished servicing the mu's request.
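The way a mu reacts to the bus lines described above can be modelled in a few lines. This is a sketch under stated assumptions: the parameter names select, strobe and write are placeholders chosen for this example and are not the actual signal names, and the model ignores timing entirely.

```python
class MemoryUserModel:
    """Behavioral model of one mu's bus-facing register file (a sketch)."""

    ADDRESS_BITS = 12          # 12 bit dword address: 4K dwords per mu
    DATA_MASK = 0xFFFFFFFF     # 32 bit data bus

    def __init__(self):
        self.regs = [0] * (1 << self.ADDRESS_BITS)

    def bus_cycle(self, select, strobe, write, address, data=0):
        """React to one set of values driven by the mcu.

        Returns the dword the mu drives onto the data lines, or None when
        the mu is not driving them. Parameter names are hypothetical.
        """
        if not (select and strobe):
            return None                      # not selected: ignore the bus
        address &= (1 << self.ADDRESS_BITS) - 1
        if write:
            self.regs[address] = data & self.DATA_MASK   # mcu -> mu
            return None
        return self.regs[address]            # mu -> mcu
```

The point of the model is that the mu never initiates anything on the shared lines. Its only outputs toward the mcu are the value it returns on a read and, outside this model, its irq line.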

MCU Request Specification

As stated above, the mu must define and communicate the operation that it wants the mcu to perform on its behalf. This is done by treating the last four (?should it be eight?) addressable dwords in each mu as special values. The mcu, just after selecting a mu, will read at least the first two of these addresses in order, and may optionally read more depending on the values held within the first dword. After reading the mu request specification, the mcu will then read/write from/to the mu the requested number of dword values in order, starting from mu address 0. After the mcu completes this operation on behalf of the mu, the mcu deselects the mu by driving its select line low.

Right now the only two commands are 0x80 (write) and 0x00 (read), so only bit 31 of the command is used.

Bits 31 to 0 hold the 32 bit frame buffer starting address at which the write, read, or modify operation begins.
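As a small illustration of the request specification, the sketch below computes where the special request dwords fall in a mu's 12 bit address space and decodes the command bit. It assumes the "last four" variant, and it does not assume which of those dwords holds the command or the frame buffer address. The function names are made up for the example.

```python
MU_ADDRESS_SPACE_DWORDS = 1 << 12   # 12 bit dword addresses
REQUEST_DWORDS = 4                  # ASSUMPTION: "last four", not eight


def request_register_addresses():
    """mu addresses that the mcu treats as the request specification."""
    start = MU_ADDRESS_SPACE_DWORDS - REQUEST_DWORDS
    return list(range(start, MU_ADDRESS_SPACE_DWORDS))


def is_write_command(command_dword):
    """Only bit 31 is used: set means write (0x80...), clear means read."""
    return bool((command_dword >> 31) & 1)
```

With four request dwords the special addresses are 0xFFC through 0xFFF, leaving addresses 0 through 0xFFB for transfer data.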

Possible design changes

Should there be some signal from the MCU to the MU when it is done reading the request registers? One implementation would be dropping the strobe the first time (with the final MU request address still on the mu_bus_address) and waiting a few clock cycles to allow mus to switch from a 'mcu configuration mode' to 'data transfer mode'. None of the current mus need this, but it might simplify the design of more complex Memory Users.

Implementation of the poly-scanline FU demonstrated a weakness in the MU bus design. In order to read values from its internal cache to send to the MCU, the address needs to be known a cycle or two in advance. One solution is for the mcu to notify the mu that the desired transfer is about to begin, and then strobe the data lines with a clock signal. Any counters that are needed would have to be replicated in the various Memory Users. This seems to be a more modern data transfer model than what currently exists. It also more closely reflects the data transfer model of the SDRAM that the frame buffer will eventually use. To quickly hack around this problem, I have the MCU use the data strobe line as a clock signal for the poly-scanline.

Additional Complexities

It is expected that functional units will be implemented that need multiple mu bus connections. Although it is possible to implement all functional units with only a single connection to the mcu, it may be simpler to implement a functional unit with multiple connections. For example, the GL functional unit could have one MU that contains a row of frame buffer pixels, another MU that contains a row of depth-buffer pixels, etc. (This is not how the current Poly Scanline MU is implemented.) The MCU will not know anything about these details. The pixel_pusher is almost an example of this, in that it has two scanline caches (one for even and one for odd scanlines), each with its own independent connection to the mu bus. However, the MCU does know some details of this arrangement, because it must give the scanline caches' irqs priority over other irqs.

The MCU can drop the MU's select line at any time. Whenever this happens, the mu should consider its request filled. The MCU may do this before accessing all of the mu's requested addresses if it encounters an error satisfying the request.

The Pixel Pusher

The pixel pusher functional unit is the hardware that pulls pixel data from the frame buffer and generates the display driving signals. The Pixel Pusher is a relatively standard memory user, so it should be trivial to instantiate multiple pixel pushers to drive multiple displays from the single frame buffer (provided sufficient memory bandwidth exists). The pixel pusher has two internal clock domains. The first is the main logic clock that spans the rest of the TVC hardware. The second clock domain is the pixel clock (which in this implementation is 25 MHz). Synchronizers guard signals that cross these domains. All synchronization is implemented in the disp_driver module, to keep the multiple clock domain issues contained in a single place (and in particular out of pixel_pusher.vhd).

The pixel pusher's overall implementation leverages both interfaces of the Block SelectRAM+ in the Spartan-II. One interface is used to fill the scanline cache from the frame buffer in the main logic clock domain, and the second interface drains pixel values to the display driver in the pixel clock domain. The disp_driver does not know whether the data in the scanline caches is correct. The disp_driver just reads the data at the correct time to display it. The disp_driver uses signals into the pixel pusher to trigger alternate filling of the scanline caches.

The disp_driver has three signals into the pixel_pusher that can trigger three behaviours of the pixel pusher. The first signal occurs at the start of the display draw cycle, before any pixel data needs to be drawn to the screen. This signal resets the base addresses of the two scanlines to the start of the frame buffer. The other two signals each trigger the filling of one of the two scanline caches, followed by a post increment of the frame buffer address for that scanline (so that the subsequent scanline fill occurs at the proper address). The pixel pusher is not sophisticated enough to trigger multiple behaviors at once, so the disp_driver only requests one behavior at a time.

The disp_driver display cycle operates in a loop generating the display timing signals. Near the start of this loop the disp_driver triggers the pixel pusher to reset the scanline base addresses, then triggers the filling of both scanline caches in turn. At some (brief) time later the disp_driver begins outputting pixel data. As each scanline cache is drained, the disp_driver triggers the pixel_pusher to refill that scanline cache at a new frame buffer address.

The scanline caches are state machines that connect two 16 bit wide Spartan-II Block SelectRAMs into a single 32 bit wide cache with 8 address bits that can be triggered to be filled, via the Memory User Bus, with pixel data from an arbitrary address. As described above, the second port into the Block SelectRAMs is not used directly by the scanline cache, but is instead wired directly into the disp_driver.

Command Bus

The command bus is a bus with 32 bit data and 8 bit addresses that is used only to write data into and read data from the drawing functional units. Each functional unit has reserved a certain address space on the command bus that does not overlap with any other functional unit. Functional units are only expected to respond to the command bus when they are idle. Reads and writes to addresses that belong to functional units that are not idle are undefined (but at this point are simply ignored). There is no requirement for the functional units to provide both read and write functionality to all addresses that belong to them on the bus. The functional units are expected to implement only the command bus functionality necessary for their implementation specific operation. Functional units are activated by writing to a special address within each unit's address space.

Driving the TVC through the command bus from the EPP parallel port is less efficient than directly mapping TVC control registers into the EPP port address space (as was done previously), but the Command Bus uses fewer logic resources than the previous command/control implementation and should allow an easier transition to much higher performing host interconnects. To help offset the cost of slowing the EPP communication, the EPPCU now automatically increments its internal address pointer on data writes to valid data registers.

EPP Control Unit

The EPPCU drives the command bus. To perform a command bus write operation, the user uses the EPP address register to select, in turn, each of the first four EPP registers and writes one byte into each. These four registers form the 32 bit value that will be written to the command bus. The user then writes the command bus address they wish to write to into the command bus address register. The user then writes the magic value 0xF0 to the command bus command register (#5) to initiate the actual write. To perform a command bus read, the user writes the command bus address they wish to read into the command bus address register. The user then writes the magic number 0x0F into the command bus command register. The eppcu will perform the command bus read and the data will then be available for reading from EPP registers 0 through 3 inclusive. (Note: currently the only functional unit that reads data is the block reader, and therefore a simpler implementation of reads is currently used.)

Eventually, all interaction with the functional units will be through the command bus. However, the present implementation retains a single read only register (#6) whose bits serve as status flags indicating the idle state of each functional unit. Writes to register #6 are ignored.
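The write and read sequences above can be expressed as ordered lists of EPP register accesses. The sketch below is host-side pseudologic rather than the actual driver code. The EPP register numbers are taken from the EPP registers table, while the function names and the byte order (data register 0 holding the least significant byte) are assumptions. It selects every register explicitly and so does not rely on the EPPCU's address auto-increment.

```python
CB_DATA_REGS = (0, 1, 2, 3)   # Command Bus Data IO 0..3
CB_ADDRESS_REG = 4            # Command Bus Address
CB_CONTROL_REG = 5            # Command Bus Control
CB_WRITE_MAGIC = 0xF0
CB_READ_MAGIC = 0x0F


def command_bus_write_ops(cb_address, value):
    """Return the (epp_register, byte) writes for one command bus write."""
    ops = []
    for i, reg in enumerate(CB_DATA_REGS):
        # ASSUMPTION: data register 0 carries the least significant byte.
        ops.append((reg, (value >> (8 * i)) & 0xFF))
    ops.append((CB_ADDRESS_REG, cb_address & 0xFF))
    ops.append((CB_CONTROL_REG, CB_WRITE_MAGIC))   # triggers the write
    return ops


def command_bus_read_ops(cb_address):
    """Return the (epp_register, byte) writes that start a command bus read.

    Afterwards the result is read back from EPP registers 0 through 3.
    """
    return [(CB_ADDRESS_REG, cb_address & 0xFF),
            (CB_CONTROL_REG, CB_READ_MAGIC)]
```

Returning plain lists keeps the sequencing separate from whatever parallel port access routine actually performs the writes.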

Functional Units

Block Write

Writes a single memory bus aligned 32 bit value. Functional unit is initiated with a write to address 0x13 on the command bus.

Block Read

Reads a single memory bus aligned 32 bit value. Functional unit is initiated with a write to address 0x23 on the command bus.

Block Set

This command sets a block of consecutive dram aligned values to a certain value. This command is used for clearing the front and depth buffers.

Note: This command only operates on dram aligned data values. This is implemented by treating the least significant two bits of the fb_address and size_in_bytes values as zero. This command as currently implemented can hog the MU data bus. Its design must be changed to break its writes into several/many shorter bursts.
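The alignment rule and the proposed change to shorter bursts can be illustrated with a short sketch. This models the suggested behaviour, not the current Block Set implementation. The max_burst_bytes parameter and both function names are hypothetical.

```python
def align_block_set(fb_address, size_in_bytes):
    """Treat the two least significant bits of both values as zero."""
    return fb_address & ~0x3, size_in_bytes & ~0x3


def split_into_bursts(fb_address, size_in_bytes, max_burst_bytes):
    """Split one aligned fill into a list of (address, length) bursts.

    max_burst_bytes is a hypothetical tuning parameter that bounds how
    long the mu bus is held by any single request.
    """
    if max_burst_bytes < 4 or max_burst_bytes % 4:
        raise ValueError("max_burst_bytes must be a positive multiple of 4")
    address, remaining = align_block_set(fb_address, size_in_bytes)
    bursts = []
    while remaining > 0:
        length = min(remaining, max_burst_bytes)
        bursts.append((address, length))
        address += length
        remaining -= length
    return bursts
```

Between bursts the Block Set mu would drop and re-raise its irq, giving the mcu a chance to service the scanline caches.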

Poly Scanline

The polygon scan line functional unit implements the most computationally intensive part of OpenGL rendering: the innermost rendering loop, which processes the individual pixels of a polygon scan line, the group of pixels that spans the width of a triangle. If the texture_id is specified as zero, this functional unit assumes the polygon is flat shaded. If the texture_id is non-zero, the functional unit performs texture mapping based on nearest pixel selection.
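A software model of the inner loop may make the behaviour easier to follow. This is a minimal sketch, not the hardware algorithm. It assumes that smaller z values are nearer and that a pixel passes when its z is strictly less than the stored depth. It uses plain Python numbers instead of the hardware's fixed point formats, wraps texture coordinates, and passes texture=None where the hardware would see a texture_id of zero.

```python
def draw_scanline(framebuffer, depthbuffer, start, length,
                  start_z, z_increment, pixel_value,
                  texture=None, tc=(0, 0), tc_inc=(0, 0)):
    """Software model of one z-buffered scanline (flat or textured)."""
    z = start_z
    tcx, tcy = tc
    for i in range(start, start + length):
        # ASSUMPTION: smaller z is nearer; pass when z < stored depth.
        if z < depthbuffer[i]:
            depthbuffer[i] = z
            if texture is None:
                framebuffer[i] = pixel_value          # flat shaded
            else:
                # Nearest texel: truncate the coordinates to integers.
                # ASSUMPTION: coordinates wrap at the texture edges.
                row = int(tcy) % len(texture)
                col = int(tcx) % len(texture[0])
                framebuffer[i] = texture[row][col]
        # Depth and texture coordinates advance once per pixel.
        z += z_increment
        tcx += tc_inc[0]
        tcy += tc_inc[1]
```

For example, drawing a four pixel flat scanline at z = 5 over a depth buffer of [10, 10, 1, 10] leaves the third pixel untouched, because the stored depth of 1 is nearer.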

EPP registers
Register	Function/name
0	Command Bus Data IO 0
1	Command Bus Data IO 1
2	Command Bus Data IO 2
3	Command Bus Data IO 3
4	Command Bus Address
5	Command Bus Control
6	Functional unit status register

Functional unit Status Register Layout
Bit number	Bit meaning
7	Cmd Error
6	Block Reader Idle
5	Block Writer Idle
4	Block Set Idle
3	Poly Scanline Idle
2	<unassigned>
1	<unassigned>
0	<unassigned>
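Decoding the status register on the host can be sketched as follows. The bit positions come from the layout table above. The flag names are made up for this example, and the sketch assumes that a set bit means the unit is idle.

```python
STATUS_BITS = {
    "cmd_error": 7,
    "block_reader_idle": 6,
    "block_writer_idle": 5,
    "block_set_idle": 4,
    "poly_scanline_idle": 3,
}


def decode_status(status_byte):
    """Map one byte read from EPP register #6 to named boolean flags.

    ASSUMPTION: a set bit means the corresponding flag is asserted.
    Bits 2..0 are unassigned and are ignored here.
    """
    return {name: bool((status_byte >> bit) & 1)
            for name, bit in STATUS_BITS.items()}
```

A driver would typically poll this register and wait for the relevant idle flag before issuing the next command to that functional unit.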

Block Write command bus registers
Command Bus Address	Function/name	Note
0x11	Desired frame buffer address	Only highest 30 bits are used
0x12	Value to write
0x13	Start FU	Value on bus not used
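A single block write then amounts to three command bus writes. The sketch below lists them as (command bus address, value) pairs. The function name is hypothetical, and the order of the first two writes is an assumption. Only the start write needs to come last.

```python
def block_write_sequence(fb_address, value):
    """Command bus writes for one 32 bit frame buffer write (a sketch)."""
    return [
        (0x11, fb_address & 0xFFFFFFFF),   # low two bits are ignored
        (0x12, value & 0xFFFFFFFF),        # value to write
        (0x13, 0),                         # start FU; bus value not used
    ]


def effective_fb_address(fb_address):
    """Address actually used: only the highest 30 bits are significant."""
    return fb_address & 0xFFFFFFFC
```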

Block Read command bus registers
Command Bus Address	Function/name	Note
0x21	Desired frame buffer address	Only highest 30 bits are used
0x22	Address from which fb value can be read
0x23	Start FU	Value on bus not used

Block Set command bus registers
Command Bus Address	Function/name	Note
0x01	Starting frame buffer address	Only highest 30 bits are used
0x02	Number of registers	Only bits 23 downto 2 are used
0x03	Value to write

Poly Scanline command bus registers
Command Bus Address	Function/name	Note
0xF0	scanline_fb_address
0xF1	depthbuffer_fb_address
0xF2	scanline_start_z_value
0xF3	scanline_z_increment
0xF4	scanline_length	Only lowest 16 bits are used
0xF5	pixel_value	Only lowest 8 bits are (presently) used
0xF6	Texture coord x & y	High 16 bits tcx Low 16 bits tcy
0xF7	Texture coord increment x & y	High 16 bits tc_xinc Low 16 bits tc_yinc
0xF8	Texture id	Only lowest 3 bits are used.
0xF9	Start FU	Value on data bus is not used
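Setting up one scanline means loading these registers and then writing the start address. The sketch below builds that list of (command bus address, value) pairs, packing the paired 16 bit fields as the notes describe. The function names and argument order are hypothetical, and the fixed point interpretation of the z and texture values is left to the caller.

```python
def pack_hi_lo16(high, low):
    """Pack two 16 bit fields into one dword, 'high' in bits 31..16."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def poly_scanline_sequence(fb_addr, depth_addr, start_z, z_inc, length,
                           pixel, tc, tc_inc, texture_id):
    """Command bus writes to draw one scanline; tc and tc_inc are (x, y)."""
    return [
        (0xF0, fb_addr & 0xFFFFFFFF),      # scanline_fb_address
        (0xF1, depth_addr & 0xFFFFFFFF),   # depthbuffer_fb_address
        (0xF2, start_z & 0xFFFFFFFF),      # scanline_start_z_value
        (0xF3, z_inc & 0xFFFFFFFF),        # scanline_z_increment
        (0xF4, length & 0xFFFF),           # only lowest 16 bits are used
        (0xF5, pixel & 0xFF),              # only lowest 8 bits are used
        (0xF6, pack_hi_lo16(*tc)),         # tcx high, tcy low
        (0xF7, pack_hi_lo16(*tc_inc)),     # tc_xinc high, tc_yinc low
        (0xF8, texture_id & 0x7),          # only lowest 3 bits are used
        (0xF9, 0),                         # start FU; bus value not used
    ]
```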

DrawLine <not implemented yet>

Command Bus Address	Function/name	Note
0xE0	Window fb_address
0xE1	Window width and height	Width stored in high 16 bits, height in lowest 16 bits
0xE2	Start coordinate	X in high 16 bits, Y in lowest 16 bits
0xE3	End coordinate	X in high 16 bits, Y in lowest 16 bits
0xE4	Pixel value	Only lowest 8 bits are (presently) used

Illustration 2: Assembled TVC Circuit

Implementation Example

Development of this hardware was done using a Digilent D2E development kit based around a Xilinx Spartan XC2S200E FPGA. The memory for the frame buffer is an old 8MB 72 pin SIMM cannibalized from a P166 computer before it was discarded. The SIMM was hardwired to a wire wrap breadboard that plugs into the D2E board. The VGA signals are generated by a resistor network on a Digilent DIO2 peripheral board. The DIO2 provides two bits for red, and three bits each for green and blue. The final circuit is shown in Illustration 2. Illustrations 3, 4, and 5 show the output of the circuit in operation. Illustration 4 was captured from TVC release #1. The block fill is now implemented using the Block Set FU and now proceeds too quickly to capture.

The Xilinx tools report that this design occupies 89% (2,113 out of 2,352) of the slices in the XC2S200E with 0% of the slices containing unrelated logic. The design meets timing requirements for a 15ns (66MHz) logic clock and a 10ns (100MHz) pixel clock. A 50MHz clock is used on the D2E board for the main logic clock. The pixel clock is 25 MHz generated by dividing the main clock by two.

For an FPM memory transfer cycle, four clock cycles are currently used, excluding setup and cleanup. This means that for my 50 MHz clock and the 32 bit bus of the SIMM, the theoretical peak memory bandwidth would be 50 MB/sec. Displaying VGA (640x480 @60Hz) requires 18.4 MB/sec of bandwidth, leaving the rest for drawing operations. A scope picture of the CAS lines is shown in Illustration 9, showing visually the available time for drawing operations (drawing operations may be scheduled where the CAS lines are not active).
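The bandwidth figures above follow from simple arithmetic, reproduced here as a short sketch. It assumes 1 MB = 10^6 bytes and the 8 bits per pixel front buffer of the current implementation. The function names are illustrative only.

```python
def peak_bandwidth_bytes_per_sec(clock_hz, cycles_per_transfer, bus_bytes):
    """Peak rate when every transfer takes a fixed number of clock cycles."""
    return clock_hz * bus_bytes // cycles_per_transfer


def display_bandwidth_bytes_per_sec(width, height, refresh_hz, bytes_per_pixel):
    """Bytes per second needed just to feed the display."""
    return width * height * refresh_hz * bytes_per_pixel


peak = peak_bandwidth_bytes_per_sec(50_000_000, 4, 4)      # 50,000,000
display = display_bandwidth_bytes_per_sec(640, 480, 60, 1) # 18,432,000
left_for_drawing = peak - display                          # 31,568,000
```

Under these assumptions roughly 31.6 MB/sec remains for drawing operations, before accounting for the setup and cleanup cycles that the peak figure excludes.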

Illustration 3: TVC output showing software rasterized lines

Illustration 4: TVC solid fill output midway through blue fill after a green fill

Illustration 5: Hardware Depth buffered scanline output

Illustration 6: Software rendered ideal TVC output

Illustration 7: Hardware texture mapping output

Illustration 8: Software texture mapping output

Illustration 9: Ram CAS lines showing display draw memory cycles

Implementation Issues

A picture of the CRT display as seen in Illustrations 5 and 7 may be compared to the ideal output of the TVC as shown in Illustrations 6 and 8. The small artifacts seen in Illustrations 5 and 7 are most likely due to logic errors in the mcu-hl. The color gradient across the hardware rendered teapot is much coarser than the software rendering due to the difference between the 8 and 24 bpp of the respective front buffer implementations.

The lack of memory bandwidth seems to be the primary limit to performance. The EPP command interface is also quite slow. The non-sdram memory means that the mcu-ll will have to be completely re-written for a different development kit.

Development of this project started on Xilinx Webpack 6.1, but has since moved through versions 8.1 and 8.2sp2 to the currently used 10.1sp3. The C++ driver code was developed initially on Slackware Linux 10.2, but has now moved to Slackware 12.2. The Webpack software for previous releases was used on Windows 2000. Due to problems running the current Webpack software under Windows 2000, I am now running the Linux version of Xilinx's Webpack on a second Slackware 12.2 computer.

Design Overview