The Test Video Card (TVC) currently consists of 16 VHDL modules, organized as shown in Illustration 1. (The Block SelectRAM+ caches inside the Poly Scanline unit and the synchronizers used throughout the design are not shown.) The modules are of two main types: those that implement drawing commands, and those that implement the overall structure of the TVC. The structural modules are the TVC top-level module, a Memory Control Unit (MCU), a module called pixel_pusher that controls the monitor sync signals and outputs pixel data, and the EPP control unit.

The most important structural module is the MCU. The MCU provides and controls a memory user bus that runs throughout the TVC to all units that need to access or modify the frame buffer contents. The MCU through its operation establishes the timing for the entire TVC.

The drawing functional units (FU) consist of four separate modules. The first is a Block Set unit that writes fixed pixel values to consecutive bus-aligned memory addresses. Next are a block reader and a block writer that allow reading and writing 32-bit bus-aligned memory locations. The last is the poly scanline unit, which draws a single-pixel-wide z-buffered triangle scanline to the frame buffer.

The Memory User Bus

Illustration 1: TVC Architecture

The Memory Users (mu) are driven by the Memory Control Unit (mcu). When idle, the mcu checks whether any mu has raised an interrupt. If a scanline cache mu has raised an interrupt, that interrupt is serviced ahead of any other mu. This helps guarantee that pixel values are available to the pixel pusher when it needs to output them to the display. Memory Users other than the scanline caches are serviced in round-robin order, so a mu may have to wait an arbitrarily long time before being serviced. After selecting a mu, the mcu completely satisfies that mu's request before moving on to another. This means the maximum memory request length must be kept short enough that the scanline caches are never starved.
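The selection policy described above can be sketched as a small behavioural model in Python. This is only an illustration, not the VHDL implementation: the class name, the method name, and the choice to favour the lowest-numbered scanline cache when both have raised an interrupt are assumptions.

```python
from typing import List, Optional


class McuArbiter:
    """Toy model of how the mcu picks the next memory user (mu) to service."""

    def __init__(self, num_other_mus: int) -> None:
        self.num_other_mus = num_other_mus
        self._next_other = 0  # round-robin pointer over the non-cache mus

    def select(self, cache_irqs: List[bool], other_irqs: List[bool]) -> Optional[str]:
        # Scanline caches are always serviced ahead of every other mu.
        # (Assumption: the lowest-numbered cache wins a tie.)
        for i, raised in enumerate(cache_irqs):
            if raised:
                return f"cache{i}"
        # Remaining mus: start at the round-robin pointer and wrap around.
        for step in range(self.num_other_mus):
            idx = (self._next_other + step) % self.num_other_mus
            if other_irqs[idx]:
                self._next_other = (idx + 1) % self.num_other_mus
                return f"mu{idx}"
        return None  # nothing pending, so the mcu stays idle
```

With three ordinary mus, two of which keep their irq raised, successive calls alternate between them, and any scanline cache irq preempts that rotation without disturbing the round-robin pointer.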

After the mcu selects a mu, the mcu reads several special mu addresses that describe the actions the mu wants the mcu to perform. The mcu reads as many of these addresses as it needs before it starts processing the request. The mcu will be capable of several operations.

Unfortunately the mcu is currently not capable of read/modify/write pixel operations.

Bus Details

Remember that the mcu is completely in control of the mu bus. The mus are purely reactive to the values the mcu drives on the bus. Because all mus see the same mu_bus, a mu reacts to it only while it is selected in response to an interrupt it raised.

When this line (the data strobe) is high, the mu should read or write its internally selected address from or to the mu_bus_data lines.

The data bus (mu_bus_data) is 32 bits wide and is driven by both the mcu and the mu. Only full dwords are read or written at a time.

The address bus (mu_bus_address) is 12 bits wide. Addresses refer to 32-bit registers, not individual bytes, so the mcu can address a total of 4K dwords (16K bytes) in a given mu.

The mcu raises this line when it wants the mu to store the value currently on the mu_bus_data lines into its internal dword 'register' at the address currently on the mu_bus_address lines. When this line is low, the mcu wants the mu to put the value held at the selected address onto the mu_bus_data lines.

In addition to the bus lines mentioned above, each mu has two lines connecting directly to the mcu.

The mu raises this line (irq) when it wants the mcu to perform an operation on its behalf.

The mcu raises this line (select) in response to the irq line raised by the mu, when it starts servicing the mu's request. The mcu lowers this line after it has finished servicing the mu's request.
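As a rough illustration of the bus behaviour described in this section, the following Python sketch models a single mu as a bank of 4K dword registers that reacts only while it is selected. The parameter names `select`, `strobe`, and `write` are labels chosen for this sketch and are not necessarily the signal names used in the VHDL.

```python
class MemoryUser:
    """Toy model of a mu: 4K dword registers behind a 12-bit address bus."""

    ADDRESS_BITS = 12

    def __init__(self) -> None:
        self.regs = [0] * (1 << self.ADDRESS_BITS)  # 4096 dwords = 16K bytes
        self.irq = False  # raised by the mu to request service

    def bus_cycle(self, select, strobe, write, address, data=0):
        """Return the dword the mu drives onto the bus, or None if it drives nothing."""
        if not (select and strobe):
            return None  # an unselected mu ignores the shared bus
        address &= (1 << self.ADDRESS_BITS) - 1
        if write:
            self.regs[address] = data & 0xFFFFFFFF  # mcu -> mu, full dword only
            return None
        return self.regs[address]  # mu -> mcu
```

The model makes the key property visible: every transfer is initiated by the mcu, and the mu never drives the data lines unless it is both selected and strobed with the write line low.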

MCU Request Specification

As stated above, the mu must define and communicate the operation it wants the mcu to perform on its behalf. This is done by treating the last four (or possibly eight; this is still an open question) addressable dwords in each mu as special values. Just after selecting a mu, the mcu reads at least the first two of these addresses in order, and may read more depending on the values held in the first dword. After reading the request specification, the mcu reads from or writes to the mu the requested number of dword values in order, starting from mu address 0. When the operation is complete, the mcu deselects the mu by driving its select line to zero.

Right now the only two commands are 0x80 (write) and 0x00 (read), so only bit 31 of the command is used.

Bits 31 to 0 hold the 32-bit framebuffer starting address at which the read, write, or modify operation begins.
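A minimal Python sketch of decoding such a request follows. Several details are assumptions made for illustration only: that the request registers are the last four of the 4096 dword addresses, that the command dword comes first with the framebuffer address immediately after it, and the names `REQUEST_BASE`, `McuRequest`, and `decode_request`. The transfer length field is left out because its position is not given here.

```python
from typing import List, NamedTuple

REQUEST_BASE = 0xFFC  # assumed: the last four of the 4096 dword addresses


class McuRequest(NamedTuple):
    is_write: bool    # True for command 0x80 (write), False for 0x00 (read)
    fb_address: int   # 32-bit framebuffer starting address


def decode_request(regs: List[int]) -> McuRequest:
    """Decode the command and address dwords of a mu request."""
    command_dword = regs[REQUEST_BASE]      # assumed position of the command
    fb_address = regs[REQUEST_BASE + 1]     # assumed position of the address
    is_write = bool(command_dword & (1 << 31))  # only bit 31 is used today
    return McuRequest(is_write, fb_address & 0xFFFFFFFF)
```

Testing only bit 31 leaves the remaining command bits free for future operations, such as the read/modify/write operation the mcu does not yet support.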

Possible design changes

Should there be some signal from the MCU to the MU when it is done reading the request registers? One implementation would be to drop the strobe the first time (with the final MU request address still on mu_bus_address) and wait a few clock cycles, allowing mus to switch from an 'mcu configuration mode' to a 'data transfer mode'. None of the current mus need this, but it might simplify the design of more complex Memory Users.

Implementation of the poly-scanline FU demonstrated a weakness in the MU bus design. In order to read values from its internal cache and present them to the MCU, the address to be read needs to be known a cycle or two in advance. One solution is for the mcu to notify the mu that the desired transfer is about to begin, and then strobe out the data lines with a clock signal. Any counters that are needed would have to be replicated in the various Memory Users. This seems to be a more modern data transfer model than what currently exists. It also better reflects the data transfer model of the sdram that the framebuffer will eventually use. To quickly hack around this problem, I have the MCU use the data strobe line as a clock signal for the poly-scanline.

Additional Complexities

It is expected that functional units will be implemented that need multiple mu bus connections. Although it is possible to implement all functional units with only a single connection to the mcu, it may be simpler to implement a functional unit with multiple connections. For example, the GL functional unit could have one MU that contains a row of framebuffer pixels, another MU that contains a row of depth-buffer pixels, and so on. (This is not how the current Poly Scanline MU is implemented.) The MCU will not know anything about these details. The pixel_pusher is almost an example of this in that it has two scanline caches (one for even and one for odd scanlines), each with its own independent connection to the mu bus. However, the MCU does know some details of this arrangement, because it must give the scanline caches' irqs priority over other irqs.

The MCU can drop the MU's select line at any time. Whenever this happens, the mu should consider its request filled. The MCU may do this before accessing all of the requested mu addresses if it encounters an error while satisfying the request.

The Pixel Pusher

The pixel pusher functional unit is the hardware that pulls pixel data from the frame buffer and generates the display driving signals. The Pixel Pusher is a relatively standard memory user, so it should be trivial to instantiate multiple pixel pushers to drive multiple displays from the single frame buffer (provided sufficient memory bandwidth exists). The pixel pusher has two internal clock domains. The first is the main logic clock that spans the rest of the TVC hardware. The second is the pixel clock (which in this implementation is 25 MHz). Synchronizers guard signals that cross these domains. All synchronization is implemented in the disp_driver module, to keep the multiple clock domain issues contained in a single place (and in particular out of pixel_pusher.vhd).

The pixel pusher's overall implementation leverages both interfaces of the Block SelectRAM+ in the Spartan-II. One interface is used to fill the scanline cache from the frame buffer in the main logic clock domain, and the second interface drains pixel values to the display driver in the pixel clock domain. The disp_driver does not know whether the data in the scanline caches is correct. It simply reads the data from the expected location at the correct time to display it. The disp_driver uses signals into the pixel pusher to trigger alternate filling of the scanline caches.

The disp_driver has three signals into the pixel_pusher that can trigger three behaviours of the pixel pusher. The first signal occurs at the start of the display draw cycle, before pixel data needs to be drawn to the screen. This signal resets the base addresses of the two scanlines to the start of the frame buffer. The other two signals each trigger the filling of one of the two scanline caches, followed by a post-increment of the frame buffer address for that scanline (so that the subsequent scanline fill occurs at the proper address). The pixel pusher is not sophisticated enough to trigger multiple behaviours at once, so the disp_driver only requests one behaviour at a time.

The disp_driver display cycle operates in a loop generating the display timing signals. Near the start of this loop the disp_driver triggers the pixel pusher to reset the scanline base address, then triggers the filling of both scanline caches in turn. A brief time later the disp_driver begins outputting pixel data. As each scanline cache is drained, the disp_driver triggers the pixel_pusher to refill that scanline cache from a new frame buffer address.

The scanline caches are state machines that connect two 16-bit-wide Spartan-II Block SelectRAMs into a single 32-bit-wide cache with an 8-bit address (256 dwords deep) that can be triggered to fill with pixel data from an arbitrary address via the Memory User Bus. As described above, the second port of each Block SelectRAM is not used directly by the scanline cache, but is instead wired directly into the disp_driver.

EPP Interface

The external host computer communicates with the tvc via 22 addressable eight-bit registers. The host uses these registers by first writing an eight-bit address (via the epp address write cycle), which is then used by subsequent data writes to select a data register. The address register is not cleared or modified by the tvc, so its value can be expected not to change between writes.

The first 20 registers hold data that depends on the command to be executed. These 20 registers may be written to and read from via the epp port. The twenty-first register is a write-only command register (reads from this address return 0x00). Writing a recognized value to this epp address causes a functional unit within the tvc to begin executing, based on the data loaded into the first 20 registers. The executing functional unit may return data in some or all of the first 20 registers, to be read back out via the epp port. The twenty-second register is a read-only register that holds the status of the functional units.
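The host-side sequence implied by this register layout can be sketched in Python as follows. `FakeEppPort` is a stand-in invented for this example that records cycles instead of touching a real parallel port, the register indices assume the registers are numbered from 0 as in the register tables, and the command value `0x01` is purely hypothetical.

```python
COMMAND_REG = 20  # twenty-first register, write only
STATUS_REG = 21   # twenty-second register, read only


class FakeEppPort:
    """Stand-in for the EPP port: records cycles instead of doing real I/O."""

    def __init__(self) -> None:
        self.address = 0
        self.regs = [0] * 22
        self.log = []

    def write_address(self, addr: int) -> None:
        self.address = addr & 0xFF  # latched until the next address write
        self.log.append(("addr", self.address))

    def write_data(self, value: int) -> None:
        self.regs[self.address] = value & 0xFF
        self.log.append(("data", value & 0xFF))

    def read_data(self) -> int:
        return self.regs[self.address]


def issue_command(port, params, command):
    """Load the parameter registers, write the command, return the status byte."""
    for reg, value in enumerate(params):
        port.write_address(reg)   # select the data register
        port.write_data(value)    # then write its value
    port.write_address(COMMAND_REG)
    port.write_data(command)      # starts the functional unit
    port.write_address(STATUS_REG)
    return port.read_data()


port = FakeEppPort()
status = issue_command(port, [0x00, 0x10, 0x00, 0x00], 0x01)  # 0x01 is hypothetical
```

A real driver would keep polling the status register until the relevant functional unit reports idle before reading any results back.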

Note: after loading epp registers 0-3 with the desired fb address, execute the command. Once the functional unit becomes idle, read the frame buffer data values from registers 5-19.

Note: This command only operates on DRAM-aligned data values. This is implemented by treating the least significant two bits of the fb_address and size_in_bytes values as zero. As currently implemented, this command can hog the MU data bus. Its design must be changed to break its writes into several shorter bursts.
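The alignment rule in this note, together with one possible way of breaking a long write into shorter bursts, can be sketched in Python. The function names and the 64-byte default burst size are assumptions made for illustration; the 24-bit size limit reflects the three size_in_bytes registers.

```python
def align_block_set(fb_address: int, size_in_bytes: int):
    """Treat the two least significant bits of both values as zero."""
    aligned_address = fb_address & 0xFFFFFFFC   # 32-bit address, dword aligned
    aligned_size = size_in_bytes & 0x00FFFFFC   # 24-bit size, whole dwords only
    return aligned_address, aligned_size


def split_into_bursts(fb_address: int, size_in_bytes: int, max_burst_bytes: int = 64):
    """Split one long write into (address, length) bursts of bounded size."""
    bursts = []
    offset = 0
    while offset < size_in_bytes:
        length = min(max_burst_bytes, size_in_bytes - offset)
        bursts.append((fb_address + offset, length))
        offset += length
    return bursts
```

Between bursts the functional unit would drop and re-raise its irq, giving the mcu a chance to service the scanline caches before the fill continues.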

Implementation Example

Development of this hardware was done using a Digilent D2E development kit based around a Xilinx Spartan XC2S200E FPGA. The memory for the frame buffer is an old 8MB 72-pin SIMM cannibalized from a P166 computer before it was discarded. The SIMM was hardwired to a wire-wrap breadboard that plugs into the D2E board. The VGA signals are generated by a resistor network on a Digilent DIO2 peripheral board. The DIO2 provides two bits for red, and three bits each for green and blue. The final circuit is shown in Illustration 2. Illustrations 3, 4, and 5 show the output of the circuit in operation. Illustration 4 was captured from TVC release #1. The block fill is now implemented using the Block Set FU and proceeds too quickly to capture.

Register	Function/name
0	fb_address_0
1	fb_address_1
2	fb_address_2
3	fb_address_3
4	Num values to write
5-19	Data values to write

Register	Function/name
0	fb_address_0
1	fb_address_1
2	fb_address_2
3	fb_address_3
4	Num values to read
5-19	Data values from framebuffer

Register	Function/name
0	fb_address_0
1	fb_address_1
2	fb_address_2
3	fb_address_3
4	size_in_bytes_0
5	size_in_bytes_1
6	size_in_bytes_2
7	fb_data_0
8	fb_data_1
9	fb_data_2
10	fb_data_3

Register	Function/name
0	fb_address_0
1	fb_address_1
2	fb_address_2
3	fb_address_3
4	window_width_0
5	window_width_1
6	window_height_0
7	window_height_1
8	line_x1_0
9	line_x1_1
10	line_y1_0
11	line_y1_1
12	line_x2_0
13	line_x2_1
14	line_y2_0
15	line_y2_1
16	pixel_value

Register	Function/name
0	scanline_fb_address_0
1	scanline_fb_address_1
2	scanline_fb_address_2
3	scanline_fb_address_3
4	depthbuffer_fb_address_0
5	depthbuffer_fb_address_1
6	depthbuffer_fb_address_2
7	depthbuffer_fb_address_3
8	scanline_start_z_value_0
9	scanline_start_z_value_1
10	scanline_start_z_value_2
11	scanline_start_z_value_3
12	scanline_z_increment_0
13	scanline_z_increment_1
14	scanline_z_increment_2
15	scanline_z_increment_3
16	scanline_length_0
17	scanline_length_1
18	pixel_value

Bit number	Bit meaning
7	Cmd Error
6	Block Reader Idle
5	Block Writer Idle
4	Block Set Idle
3	Poly Scanline Idle
2	<unassigned>
1	<unassigned>
0	<unassigned>
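The bit assignments in the table above, which appear to describe the read-only status register, can be decoded on the host with a few lines of Python. The flag names and helper functions are chosen for this sketch and are not taken from the driver code.

```python
STATUS_BITS = {
    7: "cmd_error",
    6: "block_reader_idle",
    5: "block_writer_idle",
    4: "block_set_idle",
    3: "poly_scanline_idle",
    # bits 2..0 are unassigned
}


def decode_status(status: int) -> dict:
    """Map each assigned status bit to a named boolean flag."""
    return {name: bool(status & (1 << bit)) for bit, name in STATUS_BITS.items()}


def all_idle(status: int) -> bool:
    """True when every functional unit reports idle."""
    flags = decode_status(status)
    return all(value for name, value in flags.items() if name.endswith("_idle"))
```

For example, a status byte of 0x78 has bits 6 through 3 set and bit 7 clear, so every functional unit is idle and no command error is flagged.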

Illustration 2: Assembled TVC Circuit

The Xilinx tools report that this design occupies 97% (2,289 out of 2,352) of the slices in the XC2S200E with 0% of the slices containing unrelated logic. The design meets timing requirements for a 15ns (66MHz) logic clock and a 10ns (100MHz) pixel clock. A 50MHz clock is used on the D2E board for the main logic clock. The pixel clock is 25 MHz generated by dividing the main clock by two.

An FPM memory transfer cycle currently takes four clock cycles, excluding setup and cleanup. With the 50 MHz clock and the 32-bit bus of the SIMM, the theoretical peak memory bandwidth is therefore 50 MB/sec. Displaying VGA (640x480 @ 60 Hz) at 8 bpp requires 18.4 MB/sec of bandwidth, leaving the rest for drawing operations. A scope picture of the CAS lines is shown in Illustration 7, showing visually the time available for drawing operations (drawing operations may be scheduled where the CAS lines are not active).
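The bandwidth figures quoted here can be reproduced with a short Python calculation. The function names are illustrative, and the one byte per pixel follows from the 8 bpp front buffer; like the figures in the text, the calculation counts only visible pixels and ignores setup, cleanup, and refresh overhead.

```python
def peak_bandwidth(clock_hz: int, cycles_per_transfer: int, bus_bytes: int) -> int:
    """Bytes per second if every transfer slot moves one full bus word."""
    return clock_hz // cycles_per_transfer * bus_bytes


def display_bandwidth(width: int, height: int, refresh_hz: int, bytes_per_pixel: int) -> int:
    """Bytes per second needed to scan out the visible frame."""
    return width * height * refresh_hz * bytes_per_pixel


peak = peak_bandwidth(50_000_000, 4, 4)      # 50 MHz, 4 cycles, 32-bit bus
display = display_bandwidth(640, 480, 60, 1)  # VGA at 8 bpp
print(peak, display, peak - display)          # prints "50000000 18432000 31568000"
```

Display refresh therefore consumes roughly 37% of the theoretical peak, leaving about 31.6 MB/sec for drawing operations.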

Implementation Issues

A picture of the display as seen in Illustration 5 may be compared to the ideal output of the TVC as shown in Illustration 6. The small artifacts seen in Illustration 5 are most likely due to logic errors in the mcu-hl. The large black areas on the teapot currently appear to be a problem in the driver software, not in the hardware design. These polygons are rendered correctly in sample renderings using constant color values. For some reason the driver assigns a black pixel value to these polygons. They are being depth buffered correctly. The color gradient across the hardware rendered teapot is much coarser than the software rendering due to the difference between the 8 and 24 bpp of the respective front buffer implementations.

The lack of memory bandwidth seems to be the primary limit to performance. The EPP command interface is also quite slow. The non-sdram memory means that the mcu-ll will have to be completely re-written for a different development kit.

Illustration 3: TVC output showing software rasterized lines

Development of this project started on Xilinx WebPACK 6.1, but has since moved through versions 8.1 and 8.2sp2 to the currently used 10.1sp3. The C++ driver code was developed initially on Slackware Linux 10.2, but has now moved to Slackware 12.2. The WebPACK software has been used on Windows 2000. Unfortunately, WebPACK 10.1 seems to have some issues with this dated platform, so I'm exploring moving all development over to Slackware Linux.

Illustration 4: TVC solid fill output midway through blue fill after a green fill

Illustration 5: Depth buffered scanline output

Illustration 6: Software rendered ideal TVC output

Illustration 7: Ram CAS lines showing display draw memory cycles

Design Overview