This is going to be just a blog post with random thoughts outlining some of the successes and failures from my time working on Haemimont Games’ in-house engine. Most of these techniques are not particularly new, but they still helped to squeeze out some performance or improve the graphics in one way or another. Of course, it is a mature/old engine - it has been in use since 2004, so it is expected that there is lots of legacy code that does not get redone because of time constraints or purely political reasons. Keep in mind that it is a small studio and only a limited number of people were working on the graphics part of the engine: 2-3 at any given time while I was working there. So, with the caveats established, I am going to talk about the actual work.
Reducing the stalling
Well, we shipped on five different platforms (Windows, Mac OS X, Linux, Xbox 360, PS4). The performance on consoles had some issues related to legacy code that was started on PC under D3D9. As some people are well aware, D3D9 is not a particularly explicit API. You get shadow copies of resources in many places that hide the fact that mapping resources is actually quite expensive. There were many places where the code was mapping memory mid-frame and consequently stalling everything. We put some effort into rearranging everything so that it doesn’t stall. When possible we kept static buffers.
Another issue that we faced is that on OpenGL some flags behave differently across vendors. GL_MAP_INVALIDATE_BUFFER_BIT is a prime example of that. In our case, it was making things worse on the more popular vendors (compared to glBufferSubData) and actually fixing performance issues on the GPUs that are considered below minimum system requirements by many games. We experimented with GL_MAP_UNSYNCHRONIZED_BIT, which was quite awful on some drivers, but provided a decent boost when mapping constant buffers on others. In the end we introduced some support for persistently mapped buffers, which basically bypasses most of the weird synchronizations on some drivers. We fall back to unsynchronized mapping when it is not available. It would be nice to map all constant buffers once per frame and upload them in one go. That would definitely resolve some of the issues stated above, but this optimization was postponed multiple times, so I can’t talk about success stories related to using it.
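To make the "map once per frame" idea a bit more concrete, here is a minimal CPU-side sketch. It is not the engine's code: the class name ConstantArena, its methods and the 256-byte default alignment are all assumptions made for illustration (in OpenGL the real alignment would be queried via GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT). The idea is simply to gather every constant block of the frame into one contiguous staging range, so that a single upload or a single copy into a mapped buffer replaces many small mid-frame maps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// CPU-side staging arena for all constant data of one frame.
// Hypothetical helper for illustration, not the engine's actual code.
class ConstantArena {
public:
    // Returned by push() when the arena has no room left.
    static constexpr std::size_t kFull = static_cast<std::size_t>(-1);

    // `alignment` must be a power of two. 256 is only an assumed default;
    // a real renderer would query the driver for the required value.
    explicit ConstantArena(std::size_t capacity, std::size_t alignment = 256)
        : storage_(capacity), alignment_(alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    }

    // Copies `size` bytes into the arena and returns the aligned offset that
    // a draw call would later bind, or kFull if the block does not fit.
    std::size_t push(const void* data, std::size_t size) {
        const std::size_t offset = (used_ + alignment_ - 1) & ~(alignment_ - 1);
        if (offset > storage_.size() || size > storage_.size() - offset) {
            return kFull;
        }
        if (size != 0) {
            std::memcpy(storage_.data() + offset, data, size);
        }
        used_ = offset + size;
        return offset;
    }

    // The single range to upload (or copy into a mapped buffer) per frame.
    const std::uint8_t* data() const { return storage_.data(); }
    std::size_t bytesUsed() const { return used_; }

    // Call once per frame, after the upload has been issued.
    void reset() { used_ = 0; }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t alignment_;
    std::size_t used_ = 0;
};
```

Each draw would remember the offset returned by push(), and the renderer would upload bytesUsed() bytes once before submitting the frame. With persistent mapping, a real implementation would also have to cycle between several such regions so that the CPU never overwrites data the GPU is still reading.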
Optimizing texture management
We had similar stalling issues in our UI code. It was maintaining a texture with all the small icons and it was copying things around, mapping, unmapping and basically doing lots of nasty stuff. Obviously, at some point you are going to figure out that it is causing multiple synchronizations mid-frame. It was also reading from write-combined memory, which tends to be generally inefficient. The code was basically like this for some hysterical reasons and some things were done because of some strange iOS port concerns at one point. So it was stuttering quite noticeably while going around menus. On PC some of these issues were hidden by lots of caches inside drivers and the OS. On consoles it was another story. Quite obviously you can copy things on the GPU, so that was almost immediately fixed. Because it was also loading stuff from the hard drive, we made caches of our own and the issue got resolved, kind of. Not really the prettiest solution, but that’s what you do close to shipping.
When you find out that you are loading from hard-drive each frame…
That was something that was happening with the trail rendering code. You obviously need a texture to stretch across the trail, so it was loaded from a file. After it is no longer in use it gets unloaded… Wait! What happens if you have more than one of these, appearing and disappearing all the time? Well, obviously after a while the game starts to stutter for a second or two. The solution was straightforward - just cache it. Yes, we already had a system to manage textures that are loaded dynamically, but, oh well, it was not that easy to fix the art pipeline close to launch.
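A minimal sketch of the "just cache it" fix, under stated assumptions: the names TextureCache, Texture, acquire, release and endFrame are made up for this example, and the grace period in frames is one possible eviction policy, not necessarily what the engine did. The point is that a texture released by one trail stays resident for a while, so the next trail does not hit the disk again.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for a GPU texture handle; a real engine would hold an API object.
struct Texture { int id = 0; };

// Hypothetical reference-counted cache with a grace period before unloading.
class TextureCache {
public:
    using Loader = std::function<Texture(const std::string&)>;

    TextureCache(Loader loader, int graceFrames)
        : loader_(std::move(loader)), graceFrames_(graceFrames) {}

    // Returns the cached texture; the (slow) loader runs only on a miss.
    Texture acquire(const std::string& path) {
        auto it = entries_.find(path);
        if (it == entries_.end()) {
            it = entries_.emplace(path, Entry{loader_(path), 0, 0}).first;
        }
        ++it->second.users;
        it->second.idleFrames = 0;
        return it->second.texture;
    }

    // Marks one user as done; the texture is NOT unloaded immediately.
    void release(const std::string& path) {
        auto it = entries_.find(path);
        assert(it != entries_.end() && it->second.users > 0);
        --it->second.users;
    }

    // Call once per frame: a texture is dropped only after it has been
    // unused for more than graceFrames_ consecutive frames.
    void endFrame() {
        for (auto it = entries_.begin(); it != entries_.end();) {
            Entry& e = it->second;
            if (e.users == 0 && ++e.idleFrames > graceFrames_) {
                it = entries_.erase(it);
            } else {
                ++it;
            }
        }
    }

    std::size_t size() const { return entries_.size(); }

private:
    struct Entry { Texture texture; int users; int idleFrames; };

    Loader loader_;
    int graceFrames_;
    std::unordered_map<std::string, Entry> entries_;
};
```

With a grace period of a few hundred frames, a trail texture that is used, released and requested again a moment later costs one disk read instead of one read per trail.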
Loading shaders better and faster
I was not the person who fixed this one, but it is still a good example. Basically, the rendering code tends to be single-threaded (because of PC APIs); therefore, if you have a somewhat interactive loading screen (a progress bar or something similar), you might consider loading shaders in sizable chunks each frame to keep the rendering process running. However, modern drivers don’t exactly compile shaders immediately (well, unless you are compiling them on one particular vendor on OpenGL). So in some cases you are just prolonging your loading times. In this particular instance, we were loading one shader per frame with VSync “on”, so most of the time was spent waiting for the frame to finish. Obviously, you should measure time and push more data per frame, if possible. So this issue was quite easily solved.
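The "measure time and push more data" part could look roughly like the sketch below. It assumes the loading work is already split into small jobs (for example one shader each); pumpLoadQueue, the job queue type and the injectable clock are all names invented for this example. The clock is a parameter only so that the loop can be exercised with a fake clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

using Clock = std::chrono::steady_clock;

// Runs queued load jobs until the per-frame time budget is used up and
// returns how many jobs were executed. Hypothetical helper for illustration.
// `now` defaults to the steady clock and can be replaced by a fake clock.
inline std::size_t pumpLoadQueue(
    std::deque<std::function<void()>>& jobs,
    Clock::duration budget,
    const std::function<Clock::time_point()>& now = [] { return Clock::now(); }) {
    const Clock::time_point start = now();
    std::size_t done = 0;
    // The budget is checked after each job, so at least one job runs per
    // call and loading keeps progressing even if a single shader costs
    // more than the whole budget.
    while (!jobs.empty()) {
        std::function<void()> job = std::move(jobs.front());
        jobs.pop_front();
        job();
        ++done;
        if (now() - start >= budget) {
            break;
        }
    }
    return done;
}
```

Called once per frame of the loading screen with a budget of a few milliseconds, this loads as many shaders as fit instead of exactly one, which is the difference between waiting on VSync and actually using the frame.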
Another issue that is mostly OpenGL’s fault is that shaders are quite slow to compile. So you might get a lot of stuttering while new shaders get loaded. Hence, we were forced to load almost all shaders that might get used on a given level behind the loading screen. Yeah, it takes longer on OpenGL to load a map, but otherwise the game is completely unplayable, so I think that it is a decent trade-off.
Multi-threading and latency trade-offs
The essential question here is how much work you do before you submit any command via the graphics API. The obvious issue that you face if you delay and buffer stuff is latency. But why would you delay things in the first place? Well, the most important reasons are API overhead and general optimizations of the pipeline. Yes, if you are a console developer the actual overhead is much lower, but you are going to work on less powerful CPUs, so it actually evens out. Instancing is one of the obvious ways to partially solve this issue, but you first need the data properly arranged. Consequently, you must introduce some latency. In this case you might be able to split the work across different threads (visibility determination and draw command assembly), but in the end you need to perform at least some sorting of all or some objects to exploit this feature. We sort of cheated in this regard by doing half of the frame's work before handing the information to what is in our case the render thread. A much better solution would be to submit data as soon as it becomes available and to submit from multiple threads, but that’s in some ways limited to the newer set of APIs, unless you are interested in providing some workarounds for OpenGL.
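As an illustration of why instancing needs the data "properly arranged" first, here is a minimal sketch that sorts the visible items by material and mesh and then folds equal neighbours into instanced batches. DrawItem, InstanceBatch and buildInstanceBatches are hypothetical names, and sorting by (material, mesh) is just one plausible ordering, not a description of the engine's actual pipeline.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// One visible object, as produced by visibility determination.
struct DrawItem {
    std::uint32_t mesh;
    std::uint32_t material;
    std::uint32_t transformIndex;  // index into a per-object transform array
};

// One instanced draw call covering a contiguous run of sorted items.
struct InstanceBatch {
    std::uint32_t mesh;
    std::uint32_t material;
    std::uint32_t firstInstance;   // index of the first item in the sorted list
    std::uint32_t instanceCount;
};

// Sorts `items` in place and returns one batch per (material, mesh) run.
// The sort is the step that forces us to wait until all items are known.
inline std::vector<InstanceBatch> buildInstanceBatches(std::vector<DrawItem>& items) {
    assert(items.size() <= 0xFFFFFFFFu);
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return std::tie(a.material, a.mesh) <
                                std::tie(b.material, b.mesh);
                     });
    std::vector<InstanceBatch> batches;
    for (std::uint32_t i = 0; i < items.size(); ++i) {
        const DrawItem& item = items[i];
        if (!batches.empty() && batches.back().mesh == item.mesh &&
            batches.back().material == item.material) {
            ++batches.back().instanceCount;
        } else {
            batches.push_back({item.mesh, item.material, i, 1});
        }
    }
    return batches;
}
```

After the call, the transform indices of the sorted items can be written into one instance buffer, and each batch becomes a single instanced draw that reads instanceCount entries starting at firstInstance.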
Know your texture formats
Compressed texture formats had been in use for quite some time, but we still wasted lots of bandwidth on particle effects. At one point we found out that particle artists tend to make grayscale particles that are mostly alpha. So we ran some tests and it turned out that more than 25% of the particle textures were in this form. So we devised a tool that just converts them to BC4. This format doesn’t see as much use as BC3 and BC1, but it is actually available on many D3D10 graphics cards and there is a hack in D3D9 to support it (ATI1). The last bit was quite important because at one point it was decided by management and publishers to restore ye olde D3D9, and we got kind of limited as to what features we could use. Anyway, it provided a 5-10% improvement in the most basic cases, so it was a solid win.
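The detection step of such a conversion tool might look like the sketch below. This is a guess at the general shape, not the actual tool: the names Rgba8, Channel and singleChannelCandidate, the two categories and the tolerance value are all assumptions. It only decides whether a texture carries its information in a single channel; the actual BC4 encoding would be left to a texture compression library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct Rgba8 { std::uint8_t r, g, b, a; };

// Which single channel, if any, carries all the information of a texture.
enum class Channel { None, Luminance, Alpha };

// Hypothetical classifier for a texture conversion tool.
//  - Alpha:     RGB is (nearly) constant, so only alpha varies.
//  - Luminance: R, G and B are (nearly) equal and alpha is constant.
// `tolerance` absorbs small differences left by earlier lossy compression.
inline Channel singleChannelCandidate(const std::vector<Rgba8>& pixels,
                                      int tolerance = 2) {
    if (pixels.empty()) {
        return Channel::None;
    }
    const auto close = [tolerance](int a, int b) {
        return std::abs(a - b) <= tolerance;
    };
    const Rgba8 first = pixels.front();
    bool rgbConstant = true;
    bool gray = true;
    bool alphaConstant = true;
    for (const Rgba8& p : pixels) {
        rgbConstant = rgbConstant && close(p.r, first.r) &&
                      close(p.g, first.g) && close(p.b, first.b);
        gray = gray && close(p.r, p.g) && close(p.g, p.b) && close(p.r, p.b);
        alphaConstant = alphaConstant && close(p.a, first.a);
    }
    if (rgbConstant) {
        return Channel::Alpha;
    }
    if (gray && alphaConstant) {
        return Channel::Luminance;
    }
    return Channel::None;
}
```

Since BC4 stores one channel in 4 bits per pixel and BC3 uses 8 bits per pixel, every texture that passes this test halves its memory and bandwidth cost compared to BC3. The shader then has to reconstruct colour from the single channel, typically by multiplying with a per-particle tint.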
Make better fits
Particle effects were quite important for one of our titles, so many of them got created. However, after examining much of the art at a late point of development, we determined that much of the texture space was wasted: a lot of each particle quad was fully transparent and still had to be rasterized. This is not a particularly new problem. You can check out one possible solution here. We did something similar with improvements of our own and the result was quite sweet - a 2x-3x performance improvement in some cases. Particle effects tend to be quite expensive, especially if a lot of area gets covered by overlapping geometry. Arguably, you should teach your artists to stop wasting bandwidth on subtle air movement and to stop faking brightness through the use of more particles, but that’s easier said than done. Anyway, in one instance we forced an artist to turn a certain glow effect into just a single quad, because of the sheer number of particle systems, which actually made a certain boss fight playable.
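The simplest form of a tighter fit is to shrink each particle quad to the bounding rectangle of its non-transparent pixels. The sketch below shows only that axis-aligned variant; it is a simplification chosen for this example (fitting a polygon around the opaque area can save more), and TrimRect, tightAlphaBounds and trimmedAreaFraction are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pixel rectangle; x1 and y1 are exclusive. `empty` is true when no pixel
// passed the threshold, in which case the frame needs no quad at all.
struct TrimRect { int x0, y0, x1, y1; bool empty; };

// Finds the smallest rectangle containing every pixel whose alpha is above
// `threshold`. `alpha` holds width * height values in row-major order.
inline TrimRect tightAlphaBounds(const std::vector<std::uint8_t>& alpha,
                                 int width, int height,
                                 std::uint8_t threshold = 0) {
    assert(width >= 0 && height >= 0);
    assert(alpha.size() ==
           static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    TrimRect r{width, height, 0, 0, true};
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t idx =
                static_cast<std::size_t>(y) * static_cast<std::size_t>(width) +
                static_cast<std::size_t>(x);
            if (alpha[idx] <= threshold) {
                continue;
            }
            r.x0 = std::min(r.x0, x);
            r.y0 = std::min(r.y0, y);
            r.x1 = std::max(r.x1, x + 1);
            r.y1 = std::max(r.y1, y + 1);
            r.empty = false;
        }
    }
    if (r.empty) {
        r = TrimRect{0, 0, 0, 0, true};
    }
    return r;
}

// Fraction of the full quad that still has to be rasterized after trimming.
inline double trimmedAreaFraction(const TrimRect& r, int width, int height) {
    if (r.empty || width <= 0 || height <= 0) {
        return 0.0;
    }
    return static_cast<double>(r.x1 - r.x0) * (r.y1 - r.y0) /
           (static_cast<double>(width) * height);
}
```

Running this offline over every frame of a particle atlas gives both the trimmed quad corners and a quick report of how much fill rate each effect wastes, which is a useful number to show to artists.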
Sort your data wisely
Instancing is good for reducing API overhead; however, if you sort purely for batching, you end up wasting GPU performance on shading pixels that won’t be visible in the final picture. Some vendors give guidelines to sort by state, but that’s also far from ideal. The newer explicit APIs are advertised as having much lower overhead, so we might be free to do better sorting eventually when they have 80% market share, but for now we are forced to obey some of these rules. However, doing just a naive sort does not lead to perfect results. You can always do a z-prepass, but its benefit is questionable unless your shading is heavy enough to waste a lot of bandwidth. So you might also consider clustering groups of similar states according to camera distance. It provided some benefits to us; however, it is entirely dependent on the scene in question.
Another important optimization was related to the actual terrain code. The engine was used to make strategy games, so it was optimized to show a lower quality terrain in the distance, which was pre-baked and quite cheap. However, it was quite unsuitable for ARPGs, because up close it was quite expensive to render all the layers - more expensive than any other geometry. There were many plans to optimize it, but other “more important” work had to be done, so another solution was required. Well, here is something simple - there is rarely anything occluded by terrain in this kind of game and the terrain is the most expensive thing to shade. Why don’t we just draw it after everything else? Well, it turns out that, yeah, it makes it 2x faster because of the good old early-z test.
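Both ideas above (clustering states by camera distance, and drawing the terrain last) can be expressed with a single integer sort key. The bit layout, the Pass enum and makeSortKey below are assumptions made for this sketch, not the engine's actual scheme: the pass goes into the top bits so the terrain sorts after everything else, a coarse depth bucket comes next so nearer clusters are drawn first, and the state id fills the low bits so draws within one bucket are still grouped by state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Terrain gets the larger value so that it sorts after all other geometry.
enum class Pass : std::uint64_t { Opaque = 0, Terrain = 1 };

// Builds a key for an ascending sort. Assumed layout (illustration only):
//   bits 48..63  pass
//   bits 32..47  depth bucket (near buckets first, clamped to 16 bits)
//   bits  0..31  state id (shader/material/texture combination)
// `bucketSize` is the depth range, in view-space units, of one cluster.
inline std::uint64_t makeSortKey(Pass pass, float viewDepth, float bucketSize,
                                 std::uint32_t stateId) {
    assert(bucketSize > 0.0f);
    const float scaled = std::max(viewDepth, 0.0f) / bucketSize;
    const std::uint64_t bucket =
        static_cast<std::uint64_t>(std::min(scaled, 65535.0f));
    return (static_cast<std::uint64_t>(pass) << 48) | (bucket << 32) |
           static_cast<std::uint64_t>(stateId);
}
```

A small bucketSize approaches a pure front-to-back sort (good for early-z, bad for state changes), while a very large one degenerates into a pure state sort, so the value is something to tune per scene, which matches the observation that the benefit is entirely scene dependent.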
Total frame time is not a real profiling tool
This one is quite obvious, especially if you are profiling a multi-threaded application and GPU work. Total frame time can get quite spiky and it won’t tell you much about the whole execution pattern. We used to have just a CPU profiler inside the engine, which helped a lot when we were splitting work between the different cores, but it was not enough. You can’t catch stuttering with it and it is not exactly trivial to determine where most of the GPU time goes. So obviously we spent a day or two, when we had the time, to improve it. We added a GPU profiler, CPU-GPU graphs and some other information. And I think that the effort paid off, because we removed much of the stuttering and spent our time on things that actually affected performance significantly. Another thing that helps out a lot is having a watchdog thread that breaks execution when a certain frame takes too long. That’s how we caught some of the excessive hard-drive reads.
A bit of a caveat: the biggest issue with using queries for measuring GPU work is that they don’t always work and sometimes they give incorrect results. On the Apple driver they don’t work at all. If it performed like the Windows driver it would be fine, but when Boot Camped Windows performs better than Mac OS X on the same machine, you start to get angry that another tool was taken from you. Yeah, Apple has some tools of its own, but we had magic macros that wrapped the whole of OpenGL to profile it on the CPU, so those tools were pretty much worthless for us. Actually, some of them are not that terrible - we used one of their tools to catch texture overcommitment on one of our machines.
Speaking of bad tools: PC lacks good tools. There are some that work in some cases. RenderDoc is one of the good ones because nowadays it works in most cases. Some of the AMD tools sometimes work with smaller applications and completely fail when you throw something bigger at them. Nsight worked with pretty much every application that I was working on, but it was quite slow for some reason. The profiling information was nice. Not exactly console-tools level of detail, but still better than some others.
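Coming back to the watchdog mentioned above, the detection part is small enough to sketch. Everything here is an assumption for illustration: the FrameWatchdog name, the polling design and the idle state. The thread itself and the platform-specific "break into the debugger" call are left out; a background thread would simply call frameOverBudget() every few milliseconds and break when it returns true.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <limits>

// Shared state between the render/game thread and a polling watchdog thread.
// Hypothetical sketch: only the detection logic, no thread and no debug break.
class FrameWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    explicit FrameWatchdog(Clock::duration limit) : limitTicks_(limit.count()) {
        assert(limit.count() > 0);
    }

    // Frame thread: call at the start of every frame.
    void beginFrame(Clock::time_point now = Clock::now()) {
        frameStart_.store(now.time_since_epoch().count(),
                          std::memory_order_relaxed);
    }

    // Frame thread: call before known long operations (e.g. level loading)
    // so the watchdog does not fire on them.
    void pause() { frameStart_.store(kIdle, std::memory_order_relaxed); }

    // Watchdog thread: true when the current frame has been running for
    // longer than the limit. Polling catches the stall while it is still
    // happening, so the call stack shows what the frame is stuck on.
    bool frameOverBudget(Clock::time_point now = Clock::now()) const {
        const Clock::rep start = frameStart_.load(std::memory_order_relaxed);
        if (start == kIdle) {
            return false;
        }
        return now.time_since_epoch().count() - start > limitTicks_;
    }

private:
    static constexpr Clock::rep kIdle = std::numeric_limits<Clock::rep>::min();

    std::atomic<Clock::rep> frameStart_{kIdle};
    Clock::rep limitTicks_;
};
```

The useful property compared to measuring frame time afterwards is that the break happens inside the slow frame, so a stray file read shows up directly in the call stack instead of as an unexplained spike in a graph.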
And that’s the end of my assorted thoughts on optimizing on a budget. I hope that someone has some use for this, even if it is not state of the art or particularly clever. It is what sometimes happens in the industry and doesn’t get discussed much. I recently read an article by someone who suggested that we should talk about this kind of stuff, so I decided to write it up before I forget everything about it.