Design Overview

The Test Video Card (TVC) currently consists of 16 VHDL modules, organized as shown in Illustration 1. (The Block SelectRAM+ caches inside the Poly Scanline unit and the synchronizers used throughout the design are not shown in Illustration 1.) The modules are of two main types: those that implement drawing commands, and those that implement the overall structure of the TVC. The structural modules are the TVC top level module, a Memory Control Unit (MCU), a module called pixel_pusher that controls the monitor sync signals and outputs pixel data, and the EPP control unit.

The most important structural module is the MCU. The MCU provides and controls a memory user bus that runs throughout the TVC to every unit that needs to read or modify the frame buffer contents. Through its operation the MCU establishes the timing for the entire TVC.

The drawing functional units (FUs) consist of four separate modules. The first is a Block Set unit that writes fixed pixel values to consecutive bus aligned memory addresses. Next are a block reader and a block writer that read and write 32 bit bus aligned memory locations. Finally, the poly scanline unit draws a single pixel wide, z-buffered triangle scanline to the frame buffer.

The Memory User Bus




Illustration 1: TVC Architecture

The Memory Users (mus) are driven by the Memory Control Unit (mcu). When idle, the mcu checks whether any mu has raised an interrupt. If a scanline cache mu has raised an interrupt, that interrupt is serviced ahead of any other mu. This helps guarantee that pixel values are available to the pixel pusher when it needs to output them to the display. Memory Users other than the scanline caches are serviced in round-robin order, so a mu may have to wait an arbitrarily long time before being serviced. After selecting a mu, the mcu completely satisfies that mu's request before moving on to another mu. The maximum memory request length must therefore be kept short enough that the scanline caches are not starved.
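
The selection policy described above can be sketched in software. This is only an illustrative model, not the VHDL implementation: the class name, the mu names, and the choice to check the scanline caches in list order are all assumptions made for the example.

```python
from typing import List, Optional, Set


class Arbiter:
    """Toy model of how the mcu picks the next mu to service."""

    def __init__(self, scanline_caches: List[str], others: List[str]) -> None:
        self.scanline_caches = scanline_caches  # always checked first
        self.others = others                    # serviced round-robin
        self.next_other = 0                     # round-robin pointer

    def select(self, irqs: Set[str]) -> Optional[str]:
        """Return the mu to service, or None if no irq is raised."""
        # Scanline caches win over every other mu (list order is an assumption).
        for name in self.scanline_caches:
            if name in irqs:
                return name
        # Everyone else: start searching just after the last mu serviced.
        count = len(self.others)
        for offset in range(count):
            idx = (self.next_other + offset) % count
            if self.others[idx] in irqs:
                self.next_other = (idx + 1) % count
                return self.others[idx]
        return None


arb = Arbiter(["scan_even", "scan_odd"],
              ["block_set", "block_reader", "block_writer"])
print(arb.select({"block_set", "scan_odd"}))  # scanline cache goes first
```

Because the selected mu is serviced to completion, a model like this makes the starvation concern easy to see: a scanline cache that raises its irq just after another mu is selected must wait for that mu's entire request.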


After the mcu selects a mu, it reads several special mu addresses that describe the actions the mu wants the mcu to perform. The mcu reads as many of these addresses as it needs before it starts processing the request. The mcu will be capable of several operations.



Unfortunately the mcu is currently not capable of read/modify/write pixel operations.

Bus Details

Remember that the mcu is completely in control of the mu bus. Mus are purely reactive to the values driven on the bus by the mcu. Because all mus see the same mu_bus, a mu reacts to it only while it is selected in response to an interrupt it raised.


mu_bus_strobe

When this line is high, the selected mu should transfer the dword at its currently addressed internal location to or from the mu_bus_data lines, in the direction given by mu_bus_write.


mu_bus_data

Data width is 32 bits. The lines are read and written by both the mcu and the mu. Only full dwords are transferred at a time.


mu_bus_address

The address bus is 12 bits wide. Addresses refer to 32 bit registers, not individual bytes, so the mcu can address a total of 4K dwords (16K bytes) in a given mu.


mu_bus_write

The mcu raises this line when it wants the mu to store the value currently on the mu_bus_data lines into the mu's internal dword 'register' at the address currently on the mu_bus_address lines. When this line is low, the mcu wants the mu to put the value at the selected address on the mu_bus_data lines.


In addition to the bus lines mentioned above, each mu has two lines connecting directly to the mcu.


<mu_name>_irq

The mu raises this line when it wants the mcu to perform an operation on its behalf.


<mu_name>_select

The mcu raises this line, in response to the irq line raised by the mu, when it starts to service the mu's request. The mcu lowers this line after it has finished servicing the request.
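
The mu side of the bus lines described above can be sketched as a small behavioural model. This is a software illustration under stated assumptions, not the VHDL: the class and method names are made up, internal storage is modelled as a sparse dictionary that defaults to zero, and one call to strobe() stands in for one mu_bus_strobe pulse.

```python
from typing import Dict, Optional


class MemoryUser:
    """Toy behavioural model of one mu's side of the mu bus."""

    ADDRESS_BITS = 12                    # mu_bus_address width
    NUM_DWORDS = 1 << ADDRESS_BITS       # 4096 dword registers per mu

    def __init__(self) -> None:
        self.regs: Dict[int, int] = {}   # sparse dword storage (assumed default 0)
        self.irq = False                 # models <mu_name>_irq
        self.selected = False            # models <mu_name>_select

    def strobe(self, address: int, write: bool, data: int = 0) -> Optional[int]:
        """Handle one strobe. Returns the dword the mu drives on a read,
        or None when the mu is written to or is not selected."""
        if not self.selected:
            return None                  # another mu's transaction: ignore the bus
        address &= self.NUM_DWORDS - 1   # only 12 address lines exist
        if write:                        # mu_bus_write high: mcu -> mu
            self.regs[address] = data & 0xFFFFFFFF
            return None
        return self.regs.get(address, 0)  # mu_bus_write low: mu -> mcu


mu = MemoryUser()
mu.selected = True
mu.strobe(3, write=True, data=0x23456789)
print(hex(mu.strobe(3, write=False)))
```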

MCU Request Specification

As stated above, the mu must define and communicate the operation that it wants the mcu to perform on its behalf. This is done by treating the last four (?should it be eight?) addressable dwords in each mu as special values. Just after selecting a mu, the mcu reads at least the first two of these addresses in order, and may read more depending on the value held in the first dword. After reading the request specification, the mcu reads from or writes to the mu the requested number of dword values in order, starting from mu address 0. When the operation is complete, the mcu deselects the mu by driving its select line to zero.


MU address 0xFFC = 0xFFF - 3

Bits 31 to 24: the mcu command type

Bits 23 to 0: the number of framebuffer addresses to read/write/modify

Right now the only two commands are 0x80 (write) and 0x00 (read), so only bit 31 of the command is used.


MU address 0xFFD = 0xFFF - 2

Bits 31 to 0: the 32 bit framebuffer starting address that will be written/read/modified from.


MU address 0xFFE = 0xFFF - 1

undefined (unused as of now)


MU address 0xFFF

undefined (unused as of now)
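
The layout of the two defined request dwords can be illustrated with a short sketch. The function and constant names below are hypothetical and exist only for this example; the bit positions and command values are the ones listed above.

```python
from typing import Dict, Tuple

REQUEST_CMD_ADDR = 0xFFC   # command type and transfer count
REQUEST_FB_ADDR = 0xFFD    # framebuffer starting address
CMD_WRITE = 0x80
CMD_READ = 0x00


def encode_request(command: int, count: int, fb_address: int) -> Dict[int, int]:
    """Build the two request dwords a mu would expose, keyed by mu address."""
    if not 0 <= count < (1 << 24):
        raise ValueError("count must fit in 24 bits")
    return {
        REQUEST_CMD_ADDR: ((command & 0xFF) << 24) | count,
        REQUEST_FB_ADDR: fb_address & 0xFFFFFFFF,
    }


def decode_request(regs: Dict[int, int]) -> Tuple[bool, int, int]:
    """Split the request dwords the way the mcu would after selecting a mu."""
    word0 = regs[REQUEST_CMD_ADDR]
    command = (word0 >> 24) & 0xFF
    count = word0 & 0x00FFFFFF
    is_write = bool(command & 0x80)   # only bit 31 is currently significant
    return is_write, count, regs[REQUEST_FB_ADDR]


request = encode_request(CMD_WRITE, 160, 0x1000)
print(hex(request[REQUEST_CMD_ADDR]), decode_request(request))
```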


Possible design changes

Should the MCU signal the MU when it has finished reading the request registers? One implementation would be to drop the strobe for the first time (with the final MU request address still on mu_bus_address) and wait a few clock cycles, allowing mus to switch from an 'mcu configuration mode' to a 'data transfer mode'. None of the current mus need this, but it might simplify the design of more complex Memory Users.


Implementation of the poly-scanline FU exposed a weakness in the MU bus design. In order to read values from its internal cache and present them to the MCU, the address to be accessed needs to be known a cycle or two in advance. One solution is for the mcu to notify the mu that the desired transfer is about to begin, and then strobe out the data lines with a clock signal. Any counters that are needed would have to be replicated in the various Memory Users. This seems to be a more modern data transfer model than what currently exists. It also better reflects the data transfer model of the sdram that the framebuffer will eventually use. As a quick hack around this problem, I have the MCU use the data strobe line as a clock signal for the poly-scanline unit.

Additional Complexities

It is expected that some functional units will need multiple mu bus connections. Although every functional unit could be implemented with a single connection to the mcu, multiple connections may be simpler. For example, the GL functional unit could have one MU that contains a row of framebuffer pixels, another MU that contains a row of depth-buffer pixels, and so on. (This is not how the current Poly Scanline MU is implemented.) The MCU will not know anything about these details. The pixel pusher is almost an example of this, in that it has two scanline caches (one for even and one for odd scanlines), each with its own independent connection to the mu bus. However, the MCU knows some details of this arrangement, because it must give the scanline caches' irqs priority over other irqs.


The MCU can drop the MU's select line at any time. Whenever this happens, the mu should consider its request filled. The MCU may do this before accessing all of the mu's requested addresses if it encounters an error while satisfying the request.

The Pixel Pusher

The pixel pusher functional unit is the hardware that pulls pixel data from the frame buffer and generates the display driving signals. The Pixel Pusher is a relatively standard memory user, so it should be trivial to instantiate multiple pixel pushers to drive multiple displays from the single frame buffer (provided sufficient memory bandwidth exists). The pixel pusher has two internal clock domains. The first is the main logic clock that spans the rest of the TVC hardware. The second is the pixel clock (which in this implementation is 25 MHz). Synchronizers guard signals that cross these domains. All synchronization is implemented in the disp_driver module, to keep the multiple clock domain issues contained in a single place (and in particular out of pixel_pusher.vhd).


The pixel pusher's overall implementation leverages both interfaces of the Block SelectRAM+ in the Spartan-II. One interface is used to fill the scanline cache from the frame buffer in the main logic clock domain, and the second interface drains pixel values to the display driver in the pixel clock domain. The disp_driver does not know whether the data in the scanline caches is correct; it just reads the expected data at the correct time and displays it. The disp_driver uses signals into the pixel pusher to trigger alternate filling of the scanline caches.


The disp_driver has three signals into the pixel_pusher, each of which triggers one behaviour of the pixel pusher. The first signal occurs at the start of the display draw cycle, before any pixel data needs to be drawn to the screen. This signal resets the base addresses of the two scanlines to the start of the frame buffer. The other two signals each trigger the filling of one of the two scanline caches, followed by a post increment of the frame buffer address for that scanline (so that the subsequent scanline fill occurs at the proper address). The pixel pusher is not sophisticated enough to trigger multiple behaviours at once, so the disp_driver only requests one behaviour at a time.


The disp_driver display cycle operates in a loop generating the display timing signals. Near the start of this loop the disp_driver triggers the pixel pusher to reset the scanline base addresses, then triggers the filling of both scanline caches in turn. A (brief) time later the disp_driver begins outputting pixel data. As each scanline cache is drained, the disp_driver triggers the pixel_pusher to refill that cache from a new frame buffer address.


The scanline caches are state machines that combine two 16 bit wide Spartan-II Block SelectRAMs into a single 32 bit wide, 8 bit deep cache that can be triggered to fill with pixel data from an arbitrary address via the Memory User Bus. As described above, the second port of each Block SelectRAM is not used by the scanline cache itself, but is wired directly into the disp_driver.

EPP Interface

The external host computer communicates with the tvc via 22 addressable eight bit registers. The host uses these registers by first writing an eight bit address (via the epp address write cycle), which subsequent data cycles use to select a data register. The address register is not cleared or modified by the tvc, so its value can be expected not to change between writes.


The first 20 registers hold data that depends on the command to be executed. These 20 registers may be written and read via the epp port. The twenty-first register is a write only command register (reads from this address return 0x00). Writing a recognized value to this address causes a functional unit within the tvc to begin executing, based on the data loaded into the first 20 registers. The executing functional unit may return data in some or all of the first 20 registers, to be read back out via the epp port. The twenty-second register is a read only register that holds the status of the functional units.
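
The register file as seen from the host can be sketched as follows. This is a software model for illustration only. It assumes the registers are numbered 0 to 21 in the order described, that writes to the status register are ignored, and that reads from addresses above 21 return 0x00; none of those three points is stated by the design. All names are hypothetical.

```python
from typing import List


class EppRegisterFile:
    """Toy model of the 22 byte-wide registers seen through the EPP port."""

    NUM_DATA = 20
    COMMAND = 20     # twenty-first register, write only
    STATUS = 21      # twenty-second register, read only

    def __init__(self) -> None:
        self.data: List[int] = [0] * self.NUM_DATA
        self.address = 0               # latched by an EPP address write cycle
        self.status = 0                # would be driven by the functional units
        self.commands: List[int] = []  # command bytes received, for inspection

    def address_write(self, value: int) -> None:
        self.address = value & 0xFF    # persists until the next address write

    def data_write(self, value: int) -> None:
        if self.address < self.NUM_DATA:
            self.data[self.address] = value & 0xFF
        elif self.address == self.COMMAND:
            self.commands.append(value & 0xFF)
        # Assumption: writes to the status register or beyond are ignored.

    def data_read(self) -> int:
        if self.address < self.NUM_DATA:
            return self.data[self.address]
        if self.address == self.STATUS:
            return self.status
        return 0x00                    # command register reads back as 0x00


epp = EppRegisterFile()
epp.address_write(4)
epp.data_write(0x0A)
print(epp.data_read())
```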


Command: Block Write


Register  Function/name
0         fb_address_0
1         fb_address_1
2         fb_address_2
3         fb_address_3
4         Num values to write
5-19      Data values to write

Command byte: 0xF0


Note: The write will be complete when the functional unit becomes idle.


Command: Block Read


Register  Function/name
0         fb_address_0
1         fb_address_1
2         fb_address_2
3         fb_address_3
4         Num values to read
5-19      Data values from framebuffer

Command byte: 0x0F


Note: After loading epp registers 0-3 with the desired fb address and register 4 with the number of values to read, execute the command. After the functional unit becomes idle, read the frame buffer data values from registers 5-19.


Command: Block Set


Register  Function/name
0         fb_address_0
1         fb_address_1
2         fb_address_2
3         fb_address_3
4         size_in_bytes_0
5         size_in_bytes_1
6         size_in_bytes_2
7         fb_data_0
8         fb_data_1
9         fb_data_2
10        fb_data_3


Command byte: 0x03


Note: This command only operates on dram aligned data values. This is implemented by treating the least significant two bits of the fb_address and size_in_bytes values as zero. As currently implemented, this command can hog the MU data bus. Its design must be changed to break its writes into several shorter bursts.
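
The effect of ignoring the two least significant bits can be shown with a small sketch. The helper names are hypothetical; the masking itself is the behaviour described in the note above.

```python
from typing import Tuple

DWORD_MASK = 0xFFFFFFFC   # clears the two least significant bits


def align_block_set(fb_address: int, size_in_bytes: int) -> Tuple[int, int]:
    """Return the (address, size) pair the Block Set unit effectively uses."""
    return fb_address & DWORD_MASK, size_in_bytes & DWORD_MASK


def dwords_written(size_in_bytes: int) -> int:
    """Number of 32 bit writes implied by a size given in bytes."""
    return (size_in_bytes & DWORD_MASK) // 4


# An unaligned request is silently rounded down to dword boundaries.
print(align_block_set(0x1003, 0x0A))
```

Note that a size smaller than four bytes rounds down to zero, so such a request would write nothing.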


Command: Draw Line <not implemented yet>


Register  Function/name
0         fb_address_0
1         fb_address_1
2         fb_address_2
3         fb_address_3
4         window_width_0
5         window_width_1
6         window_height_0
7         window_height_1
8         line_x1_0
9         line_x1_1
10        line_y1_0
11        line_y1_1
12        line_x2_0
13        line_x2_1
14        line_y2_0
15        line_y2_1
16        pixel_value


Command byte: 0x04


Command: Poly Scanline


Register  Function/name
0         scanline_fb_address_0
1         scanline_fb_address_1
2         scanline_fb_address_2
3         scanline_fb_address_3
4         depthbuffer_fb_address_0
5         depthbuffer_fb_address_1
6         depthbuffer_fb_address_2
7         depthbuffer_fb_address_3
8         scanline_start_z_value_0
9         scanline_start_z_value_1
10        scanline_start_z_value_2
11        scanline_start_z_value_3
12        scanline_z_increment_0
13        scanline_z_increment_1
14        scanline_z_increment_2
15        scanline_z_increment_3
16        scanline_length_0
17        scanline_length_1
18        pixel_value

Command byte: 0x30
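
A host-side helper that lays out the Poly Scanline fields across the byte registers might look like the sketch below. It rests on one important assumption that the register table does not state: that the suffix _0 marks the least significant byte of each multi-byte field. The function name and the field table are hypothetical.

```python
from typing import List

POLY_SCANLINE_COMMAND = 0x30

# (field name, first register, number of byte registers), from the table above
POLY_SCANLINE_FIELDS = [
    ("scanline_fb_address", 0, 4),
    ("depthbuffer_fb_address", 4, 4),
    ("scanline_start_z_value", 8, 4),
    ("scanline_z_increment", 12, 4),
    ("scanline_length", 16, 2),
    ("pixel_value", 18, 1),
]


def pack_poly_scanline(**values: int) -> List[int]:
    """Return the bytes for registers 0-18 of a Poly Scanline command.

    ASSUMPTION: register suffix _0 holds the least significant byte.
    """
    regs = [0] * 19
    for name, first, size in POLY_SCANLINE_FIELDS:
        # Truncate to the field width, then split into bytes, low byte first.
        value = values[name] & ((1 << (8 * size)) - 1)
        regs[first:first + size] = value.to_bytes(size, "little")
    return regs


regs = pack_poly_scanline(
    scanline_fb_address=0x00012345,
    depthbuffer_fb_address=0x00400000,
    scanline_start_z_value=0x10,
    scanline_z_increment=1,
    scanline_length=0x0123,
    pixel_value=0xE0,
)
print(regs[0:4], regs[16:18], regs[18])
```

The host would then write each byte to its register through the epp port and finally write POLY_SCANLINE_COMMAND to the command register.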


Status Register Layout

Bit number  Bit meaning
7           Cmd Error
6           Block Reader Idle
5           Block Writer Idle
4           Block Set Idle
3           Poly Scanline Idle
2           <unassigned>
1           <unassigned>
0           <unassigned>

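Driver code could decode the status byte along the lines of the sketch below. It assumes each bit is active high (1 means the named condition holds), which the layout above does not state explicitly, and the function names are hypothetical.

```python
from typing import Dict

# Bit positions from the status register layout; unassigned bits are omitted.
STATUS_BITS = {
    7: "cmd_error",
    6: "block_reader_idle",
    5: "block_writer_idle",
    4: "block_set_idle",
    3: "poly_scanline_idle",
}


def decode_status(status: int) -> Dict[str, bool]:
    """Map each assigned status bit to a named flag (assumed active high)."""
    return {name: bool(status & (1 << bit)) for bit, name in STATUS_BITS.items()}


def all_idle(status: int) -> bool:
    """True when every functional unit reports idle."""
    flags = decode_status(status)
    return all(value for name, value in flags.items() if name.endswith("_idle"))


print(decode_status(0x78), all_idle(0x78))
```

A driver would typically poll the status register after issuing a command and wait for the relevant idle flag before reading results back.
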
Implementation Example

Development of this hardware was done using a Digilent D2E development kit based around a Xilinx Spartan-IIE XC2S200E FPGA. The memory for the frame buffer is an old 8MB 72 pin SIMM cannibalized from a P166 computer before it was discarded. The SIMM was hardwired to a wire wrap breadboard that plugs into the D2E board. The VGA signals are generated by a resistor network on a Digilent DIO2 peripheral board. The DIO2 provides two bits for red, and three bits each for green and blue. The final circuit is shown in Illustration 2. Illustrations 3, 4, and 5 show the output of the circuit in operation. Illustration 4 was captured from TVC release #1. The block fill is now implemented using the Block Set FU and proceeds too quickly to capture.


Illustration 2: Assembled TVC Circuit




The Xilinx tools report that this design occupies 97% (2,289 out of 2,352) of the slices in the XC2S200E with 0% of the slices containing unrelated logic. The design meets timing requirements for a 15ns (66MHz) logic clock and a 10ns (100MHz) pixel clock. A 50MHz clock is used on the D2E board for the main logic clock. The pixel clock is 25 MHz generated by dividing the main clock by two.

For an FPM memory transfer cycle, four clock cycles are currently used, excluding setup and cleanup. With my 50 MHz clock and the 32 bit bus of the SIMM, the theoretical peak memory bandwidth is therefore 50 MB/sec. Displaying VGA (640x480 @ 60Hz) requires 18.4 MB/sec of bandwidth, leaving the rest for drawing operations. A scope picture of the CAS lines is shown in Illustration 7, showing visually the time available for drawing operations (drawing operations may be scheduled where the CAS lines are not active).
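
The bandwidth figures above can be reproduced with a few lines of arithmetic. The sketch assumes one byte per pixel, matching the 8 bpp front buffer, and ignores setup and cleanup cycles as the text does; the variable names are only for illustration.

```python
CLOCK_HZ = 50_000_000       # main logic clock on the D2E board
CYCLES_PER_TRANSFER = 4     # FPM transfer cycle, excluding setup and cleanup
BUS_BYTES = 4               # 32 bit SIMM data bus

# One dword moves every four clocks.
peak_bytes_per_sec = CLOCK_HZ // CYCLES_PER_TRANSFER * BUS_BYTES

WIDTH, HEIGHT, REFRESH_HZ = 640, 480, 60
BYTES_PER_PIXEL = 1         # assumption: 8 bpp front buffer

# Every visible pixel is fetched once per frame.
display_bytes_per_sec = WIDTH * HEIGHT * REFRESH_HZ * BYTES_PER_PIXEL

remaining = peak_bytes_per_sec - display_bytes_per_sec
print(peak_bytes_per_sec, display_bytes_per_sec, remaining)
# prints: 50000000 18432000 31568000
```

On these numbers, display refresh consumes a little over a third of the theoretical peak, leaving roughly 31.6 MB/sec for drawing operations before setup and cleanup overhead is counted.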

Implementation Issues

A picture of the display as seen in Illustration 5 may be compared to the ideal output of the TVC as shown in Illustration 6. The small artifacts seen in Illustration 5 are most likely due to logic errors in the mcu-hl. The large black areas on the teapot currently appear to be a problem in the driver software, not in the hardware design. These polygons are rendered correctly in sample renderings using constant color values. For some reason the driver assigns a black pixel value to these polygons. They are being depth buffered correctly. The color gradient across the hardware rendered teapot is much coarser than the software rendering due to the difference between the 8 and 24 bpp of the respective front buffer implementations.

The lack of memory bandwidth seems to be the primary limit to performance. The EPP command interface is also quite slow. The non-sdram memory means that the mcu-ll will have to be completely re-written for a different development kit.


Illustration 3: TVC output showing software rasterized lines






Development of this project started on Xilinx WebPACK 6.1, but has since moved through versions 8.1 and 8.2sp2 to the currently used 10.1sp3. The C++ driver code was initially developed on Slackware Linux 10.2, but has now moved to Slackware 12.2. The WebPACK software has been used on Windows 2000. Unfortunately, WebPACK 10.1 seems to have some issues with this dated platform, so I'm exploring moving all development over to Slackware Linux.




Illustration 4: TVC solid fill output midway through blue fill after a green fill






Illustration 5: Depth buffered scanline output






Illustration 6: Software rendered ideal TVC output






Illustration 7: Ram CAS lines showing display draw memory cycles