
More on Memory

I’ve been working hard the past week on a DDR2 memory controller for the Xilinx starter kit, and refining my estimates for 3D Graphics Thingy’s memory bandwidth requirements. There’s been progress, but it feels like things are moving at a snail’s pace.

I wrote earlier about some basic bandwidth estimates, and have revised them somewhat here. Assume pixels and texels are 16 bits (5-6-5 RGB format), z-buffer entries are 24 bits, and the screen resolution is 640×480 @ 60Hz. Let’s also assume a simple case where there’s no alpha blending being performed, and every triangle has one texture applied to it, using point sampling for the texture lookup. For every pixel, every frame, the hardware must:

  1. Clear the z-buffer, at the start of the frame: 3 bytes
  2. Clear the frame buffer, at the start of the frame: 2 bytes
  3. Read the z-buffer, when a new pixel is being drawn: 3 bytes
  4. Write the z-buffer, if the Z test passes: 3 bytes
  5. Read the texture data, if the Z test passes: 2 bytes
  6. Write the frame buffer, if the Z test passes: 2 bytes
  7. Read the frame buffer, when the display circuit paints the screen: 2 bytes

Assume too that the scene’s depth complexity is 4, meaning the average pixel is covered by 4 triangles, and steps 3-6 will be repeated 4 times. Add everything up (7 bytes for steps 1, 2, and 7, plus 4 × 10 bytes for steps 3-6), and that’s 47 bytes per pixel, times 640×480 is 14.44 MB per frame, times 60 Hz is 866.3 MB/s.
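The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate as stated: the constant names are mine, and it assumes the worst case where every Z test passes and MB means 10^6 bytes.

```python
# Per-pixel memory traffic, taken from the seven steps listed above.
Z_BYTES = 3            # 24-bit z-buffer entry
COLOR_BYTES = 2        # 16-bit 5-6-5 pixel or texel

WIDTH, HEIGHT, REFRESH_HZ = 640, 480, 60
DEPTH_COMPLEXITY = 4   # average number of triangles covering each pixel

# Steps 1, 2 and 7 happen once per pixel per frame:
# clear Z, clear frame buffer, read frame buffer for display.
once_per_frame = Z_BYTES + COLOR_BYTES + COLOR_BYTES

# Steps 3-6 happen once per triangle covering the pixel:
# read Z, write Z, read texel, write frame buffer.
per_triangle = Z_BYTES + Z_BYTES + COLOR_BYTES + COLOR_BYTES

bytes_per_pixel = once_per_frame + DEPTH_COMPLEXITY * per_triangle
bytes_per_frame = bytes_per_pixel * WIDTH * HEIGHT
bytes_per_second = bytes_per_frame * REFRESH_HZ

print(f"{bytes_per_pixel} bytes per pixel")
print(f"{bytes_per_frame / 1e6:.2f} MB per frame")
print(f"{bytes_per_second / 1e6:.1f} MB/s")
```

Changing DEPTH_COMPLEXITY or the resolution shows how quickly the requirement grows, since steps 3-6 dominate the total.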

The DDR2 memory on the Xilinx starter kit board has a theoretical maximum bandwidth of 1064 MB/s, so that might just fit. I have serious reservations about my ability to later recreate such a high-speed memory interface on a custom PCB, but ignore that for now. Unfortunately, you’ll never get anything close to the theoretical bandwidth in real-world usage, unless you’re streaming a huge chunk of data to consecutive memory addresses. Even half the theoretical bandwidth would be doing well. I’ll be conservative and assume I can reach 1/3 of the theoretical bandwidth, which means 355 MB/s. That’s not enough. And I’ll also need some bandwidth for vertex manipulations, since I’ve only considered pixel rasterization, and possibly for CPU operations too. It looks like things will definitely be bandwidth constrained.

Fortunately there are some clever tricks that can be used to save lots of memory bandwidth.

  1. Z occlusion: When a pixel fails the Z test at step 3, then steps 4-6 can be skipped. With a depth complexity of 4, and assuming randomly-ordered triangles, on average 1 + 1/2 + 1/3 + 1/4 = 2.08 triangles will pass the Z test and get drawn, not 4. Rounding that to 2 skipped triangles at 7 bytes each, that’s a savings of about 14 bytes per pixel, or 258 MB/s!
  2. Back-face culling: When drawing solid objects, it’s guaranteed that any triangle facing away from the camera will be overdrawn by some other triangle facing towards the camera. These back-face triangles can be ignored completely, skipping steps 3-6 and saving 10 bytes per culled pixel. Assuming half the pixels are part of back-facing triangles, then that’s a savings of 369 MB/s. Of course some of the pixels rejected due to back-face culling would also have been rejected by Z occlusion, so it’s not valid to simply add the savings from the two techniques.
  3. Z pre-pass: Another technique is to draw the entire scene while skipping steps 5 and 6, so only the Z buffer is updated. Then the scene is drawn again, but step 3 is changed to test for an exactly equal Z value, and step 4 is eliminated. This guarantees that steps 5 and 6 are only performed once per pixel, for the front-most triangle. However, step 3 must now be performed twice as many times, and all the vertex transformation and triangle setup work not accounted for here must be done twice. Whether this results in an appreciable overall savings depends on many factors.
  4. Skip frame buffer clear: If the rendered scene is indoors and covers the entire screen, then the frame buffer clear in step 2 can be omitted. That’s a savings of 37 MB/s.
  5. Skip Z-buffer clear: If the rendered scene covers the entire screen, then the Z-buffer clear in step 1 can also be omitted, but sacrificing one bit of Z-buffer accuracy. On even frames, the low half of the Z-buffer range can be used. On odd frames, the high half can be used, along with a reversal in the sense of direction, so larger values are treated as being closer to the camera. This means that every Z value from an even frame is farther away than any Z value from an odd frame, so each frame effectively clears the Z-buffer for the next one. This provides a savings of 55 MB/s.
  6. Texture compression: Compression formats like DXT1 can provide a 4:1 or better compression ratio for texture data. If the rasterizer can be structured so that an entire texture is read into a cache, and then used for calculations on many adjacent pixels, this can translate directly into a 4:1 bandwidth savings on step 5. Assuming less than perfect gains of 2:1, that saves 1 byte per texture read, or about 18 MB/s if each pixel is textured once per frame.
  7. Texture cache: Neighboring pixels on the screen are likely to access the same texels, when the textures are drawn magnified. A texture that’s tiled many times across the face of a triangle may also result in many reads of the same texel. The expected savings depend on the particular model that’s rendered, but are probably similar to those for texture compression, or about 18 MB/s.
  8. Tiled Z-Buffer: The Z-buffer can be divided into many 8×8 squares, with a small amount of state data cached for each square: the farthest point (largest Z value) in the square, and a flag indicating if the square has been cleared. That’s 25 bits per square, or 15 KB for a 640×480 Z-buffer. That should fit in the FPGA’s block RAM. Then when considering a pixel before step 3, if the pixel’s Z value is larger than the cached Z-max for that square, the pixel can be rejected without actually doing the Z-buffer read. Furthermore, when the Z-buffer needs to be cleared, the cleared flag for the block can be set without actually clearing the Z-buffer values. Then the next time that Z-buffer square is read, if the cleared flag is set, the hardware can return a square filled with Z-far without actually reading the Z-buffer values. This skips both a Z write and a Z read for the entire square. In order to gain the benefit of the cleared flag, the hardware must operate on entire 8×8 blocks at once before writing the result back to the Z-buffer. The total savings for both these techniques is at least 110 MB/s, and possibly as much as 165 MB/s depending on how much is occluded with the square-level Z test.
  9. Z-buffer compression: 8×8 blocks of Z-buffer data can be stored compressed in memory, using some kind of differential encoding scheme. Like the previous technique, this would require the hardware to operate on an entire 8×8 block at a time in order to see any benefit. The cost of all Z-buffer reads and writes might be reduced by 2:1 to 4:1, at the cost of additional latency and hardware complexity to handle the compression. This could provide a savings in the range of 350 MB/s.
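The standalone figures for techniques 1, 2, 4, 5, and 8 in the list above are simple enough to recompute. The sketch below does that with exact fractions. The variable names are mine, and the reading of the tiled Z-buffer minimum as one skipped Z write plus one skipped Z read per pixel is my interpretation of the description, not something stated outright. Note that the exact harmonic sum gives a slightly smaller Z-occlusion figure than the rounded 14 bytes per pixel.

```python
from fractions import Fraction

Z_BYTES, COLOR_BYTES = 3, 2          # 24-bit depth, 16-bit pixel or texel
DEPTH_COMPLEXITY = 4
PIXELS_PER_SECOND = 640 * 480 * 60   # 18,432,000 pixels per second


def mb_per_s(bytes_per_pixel):
    """Convert bytes saved per screen pixel per frame into MB/s (10^6 bytes)."""
    return float(bytes_per_pixel) * PIXELS_PER_SECOND / 1e6


# Technique 1: with randomly ordered triangles, the k-th triangle drawn at a
# pixel is the nearest one so far with probability 1/k.
expected_passes = sum(Fraction(1, k) for k in range(1, DEPTH_COMPLEXITY + 1))
steps_4_to_6 = Z_BYTES + COLOR_BYTES + COLOR_BYTES      # 7 bytes
z_occlusion = (DEPTH_COMPLEXITY - expected_passes) * steps_4_to_6

# Technique 2: half of the triangles covering a pixel skip steps 3-6.
steps_3_to_6 = Z_BYTES + steps_4_to_6                   # 10 bytes
backface = Fraction(DEPTH_COMPLEXITY, 2) * steps_3_to_6

savings = {
    "Z occlusion (exact harmonic sum)": z_occlusion,
    "Z occlusion (rounded to 2 skips)": 2 * steps_4_to_6,
    "back-face culling": backface,
    "skip frame buffer clear": COLOR_BYTES,
    "skip Z-buffer clear": Z_BYTES,
    "tiled Z-buffer, cleared flag only": 2 * Z_BYTES,   # assumed: 1 write + 1 read
}
for name, saved in savings.items():
    print(f"{name:36s}{float(saved):6.2f} B/pixel {mb_per_s(saved):7.1f} MB/s")

# On-chip state for the tiled Z-buffer: a 24-bit Z-max plus a 1-bit flag per 8x8 tile.
tiles = (640 // 8) * (480 // 8)
tile_state_bytes = tiles * (24 + 1) // 8
print(f"{tiles} tiles, {tile_state_bytes} bytes of tile state")
```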
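The Z-buffer clear trick in technique 5 is easier to follow with a toy model. The sketch below is a software illustration only, not hardware: the function names are made up, depths are assumed to arrive as 23-bit values with 0 nearest, and the buffer is assumed to get one real clear before the very first frame.

```python
Z_BITS = 24
HALF = 1 << (Z_BITS - 1)     # 2**23 usable depth values per frame
Z_TOP = (1 << Z_BITS) - 1    # largest 24-bit value


def encode(depth, frame):
    """Map a depth in [0, HALF), 0 = nearest, into this frame's half of the range."""
    if not 0 <= depth < HALF:
        raise ValueError("depth must fit in 23 bits")
    # Even frames use the low half as-is; odd frames mirror into the high half.
    return depth if frame % 2 == 0 else Z_TOP - depth


def z_test(new, stored, frame):
    """Even frames: smaller is closer. Odd frames: larger is closer."""
    return new < stored if frame % 2 == 0 else new > stored


def draw(zbuf, pixel, depth, frame):
    """Z-test one fragment and update the buffer; return True if it is visible."""
    value = encode(depth, frame)
    if z_test(value, zbuf[pixel], frame):
        zbuf[pixel] = value
        return True
    return False


# One-pixel demo. The buffer is really cleared only once, before frame 0.
zbuf = [Z_TOP]
print(draw(zbuf, 0, 100, frame=0))       # first fragment of frame 0
print(draw(zbuf, 0, 200, frame=0))       # farther fragment, same frame
print(draw(zbuf, 0, 5000, frame=1))      # stale value from frame 0 is ignored
print(draw(zbuf, 0, 6000, frame=1))      # farther fragment, same frame
print(draw(zbuf, 0, HALF - 1, frame=2))  # farthest legal depth still beats frame 1
```

The trick only holds if every pixel is written every frame, which is why the scene has to cover the whole screen: a pixel left untouched would keep a value from two frames earlier, in the same half of the range as the current frame.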
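The tiled Z-buffer in technique 8 can also be modeled in software to see where the external memory accesses disappear. This is a rough sketch under my own assumptions: the class and method names are hypothetical, a plain list stands in for the DDR2 memory, the Z test is a strict less-than, and the rasterizer is assumed to hand over all fragments for one 8×8 tile at a time.

```python
TILE = 8                   # tile edge in pixels
Z_FAR = (1 << 24) - 1      # farthest 24-bit depth value


class TiledZBuffer:
    """Toy model: `dram` stands in for external memory, and the
    z_max / cleared lists stand in for state kept in FPGA block RAM."""

    def __init__(self, width, height):
        n_tiles = (width // TILE) * (height // TILE)
        self.dram = [[Z_FAR] * (TILE * TILE) for _ in range(n_tiles)]
        self.z_max = [Z_FAR] * n_tiles    # farthest depth in each tile
        self.cleared = [True] * n_tiles   # tile logically holds Z_FAR everywhere
        self.reads = 0                    # Z entries read from external memory
        self.writes = 0                   # Z entries written to external memory

    def clear(self):
        """Clear the whole buffer by touching only the on-chip state."""
        n_tiles = len(self.cleared)
        self.z_max = [Z_FAR] * n_tiles
        self.cleared = [True] * n_tiles

    def draw_tile(self, tile, fragments):
        """fragments maps a pixel index within the tile (0..63) to a depth.
        Returns the pixel indices that passed the Z test."""
        # Tile-level test: anything farther than Z-max cannot be visible.
        candidates = {i: z for i, z in fragments.items() if z <= self.z_max[tile]}
        if not candidates:
            return []                     # rejected with no external read
        if self.cleared[tile]:
            depths = [Z_FAR] * (TILE * TILE)   # synthesized, no external read
        else:
            depths = list(self.dram[tile])
            self.reads += TILE * TILE
        passed = [i for i, z in candidates.items() if z < depths[i]]
        if passed:
            for i in passed:
                depths[i] = candidates[i]
            self.dram[tile] = depths      # write the whole tile back
            self.writes += TILE * TILE
            self.cleared[tile] = False
            self.z_max[tile] = max(depths)
        return passed


zb = TiledZBuffer(640, 480)
zb.draw_tile(0, {0: 1000, 1: 1000})           # first touch of a cleared tile
zb.draw_tile(0, {i: 500 for i in range(64)})  # fills the tile, Z-max drops to 500
zb.draw_tile(0, {5: 900})                     # rejected by the tile-level test
print(zb.reads, zb.writes)
```

Writing the whole tile back on every update is what makes the cleared flag safe, since the stale external data is never read while the flag is set. It is also why the hardware has to work on full 8×8 blocks to see the benefit.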

Unfortunately, the savings from all these techniques can’t merely be summed, and the savings I’ve estimated for each one assume it’s done by itself, without any of the other techniques. However, when used together, the combination of back-face culling plus Z occlusion should provide at least 400 MB/s in savings, texture compression and caching another 30 MB/s, and Z-buffer tiling another 110 MB/s. That lowers the total bandwidth needs to 326 MB/s, slightly under my conservative estimate of 355 MB/s of real-world available bandwidth.
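As a quick sanity check on that final tally, here is the subtraction in Python. The combined savings are the rough guesses from the paragraph above, not derived values, and the dictionary keys are just labels of mine.

```python
# Baseline requirement: 47 bytes per pixel at 640x480, 60 Hz.
baseline_mb_s = 47 * 640 * 480 * 60 / 1e6

# Combined savings estimates in MB/s, as guessed in the text.
combined_savings = {
    "back-face culling + Z occlusion": 400,
    "texture compression + caching": 30,
    "Z-buffer tiling": 110,
}

remaining_mb_s = baseline_mb_s - sum(combined_savings.values())
usable_mb_s = 1064 / 3     # conservative 1/3 of the theoretical DDR2 bandwidth

print(f"required after savings: {remaining_mb_s:.0f} MB/s")
print(f"conservatively usable:  {usable_mb_s:.0f} MB/s")
print(f"headroom:               {usable_mb_s - remaining_mb_s:.0f} MB/s")
```

The remaining headroom is small, and it still has to cover the vertex and CPU traffic that the pixel-only estimate leaves out.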


3 Comments so far

  1. elpuri - September 9th, 2009 8:24 pm

    Hi. Thanks for the interesting read. I had never really given thought to what kind of optimizations (like the tiled z-buffer) there are inside the GPU.

  2. Neeraj Kulkarni - November 10th, 2009 1:51 am

    A really good read. I was looking for such a plain-english explanation about these strategies for a long time 🙂

  3. Marcel - May 31st, 2013 6:14 pm

    You can’t expect to buffer the whole scene/frame in the FPGA chip; there simply is not enough SRAM internally. You can, however, use some of it as a cache to reduce some of the bandwidth requirements.

    Most of the things you need can be fetched during the Horizontal Blanking and Front and Back porches.
