
C# memory and performance tips for Unity

(This is Part 1 of the series. Also available: Part 2)

There’s a lot of useful information out there about memory and performance optimizations in Unity. I myself relied heavily on Wendelin Reich’s posts and Andrew Fray’s list when getting started – they are excellent resources worth studying.

I’m hoping this post will add a few more interesting details, collected from various sources as well as from my own optimization adventures, about ways to improve performance using this engine.

The following specifically concentrates on perf improvements on the coding side: looking at different code constructs and seeing how they perform in both speed and memory usage. (There is another set of perf improvements that are also useful, such as optimizing your assets, compressing textures, or sharing materials, but I won’t touch on those here. Good idea for another post, though!)

First, let’s start with a quick recap about memory allocation and garbage collection.


Always on my mind

One of the first things gamedevs always learn is to not allocate memory needlessly. There are very good reasons for that. First, it’s a limited resource, especially on mobile devices. Second, allocation is not free – allocating and deallocating on the heap will cost you CPU cycles. Third, in languages with manual memory management like C or C++, each allocation is an opportunity to introduce subtle bugs that can turn into huge problems, anywhere from memory leaks to full crashes.

Unity uses .NET, or rather its open source cousin, Mono. It features automatic memory management which fixes a lot of the safety problems, for example it’s no longer possible to use memory after it has been deallocated (ignoring unsafe code for now). But it makes the cost of allocation and deallocation even harder to predict.

I assume you’re already familiar with the distinction between stack allocation and heap allocation, but in short: data on the stack is short-lived, but alloc/dealloc is practically free, while data on the heap can live as long as necessary, but alloc/dealloc becomes more expensive as the memory manager needs to keep track of allocations. In .NET and Mono specifically, heap memory gets reclaimed automatically by the garbage collector (GC), which is practically speaking a black box, and the user doesn’t have a lot of control over it.

.NET also exposes two families of data types, which get allocated differently. Instances of reference types, such as classes or arrays (like int[]), always get allocated on the heap, to be GC’d later. Data of value types, such as primitives (int, float, etc.) or instances of structs, can live on the stack, unless they’re inside a container that already lives on the heap (such as an array of structs). Finally, value types can be promoted from the stack to the heap via boxing.
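
As a minimal illustration (the types here are invented for the example), consider where each kind of data ends up:

struct Point { public int x, y; }      // value type
class Widget { public int id; }        // reference type

void Demo () {
  Point p = new Point();           // on the stack – no GC involvement
  Widget w = new Widget();         // on the heap – the GC will reclaim it later
  Point[] points = new Point[8];   // the array itself is a heap object, with
                                   // the structs stored inline inside it
  object boxed = p;                // boxing: copies the struct into a new heap object
}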

OK, enough setup. Let’s talk a bit about garbage collection and Mono.


It’s a sin

Finding and reclaiming data on the heap that’s no longer in use is the job of the GC, and different collectors can vary drastically in performance.

Older garbage collectors have gained a reputation for introducing framerate “hiccups”. For example, a simple mark-and-sweep collector is a blocking collector: it pauses the entire program so that it can process the entire heap at once. The length of the pause depends on the amount of data allocated by the program, and if this pause is long enough, it can result in noticeable stutter.

Newer garbage collectors have different ways for reducing those collection pauses. For example, so-called generational GCs split their work into smaller chunks, by grouping all recent allocations in one place so they can be scanned and collected quickly. Since many programs like to allocate temporary objects that get used and thrown away quickly, keeping them together helps make the GC more responsive.

Unfortunately Unity doesn’t do that. The version of Mono used by Unity is 2.6.5, and it uses an older Boehm GC, which is not generational and, I believe, not multithreaded. There are more recent versions of Mono with a better garbage collector; however, Unity has stated that its version of Mono will not be upgraded. Instead they’re working on a long-term plan to replace it with a different approach.

While this sounds like an exciting future, for now it means we have to put up with Mono 2.x and its old GC for a while longer.

In other words, we need to minimize memory allocations.


Opportunities

One of the first things that everyone recommends is to replace foreach loops with for loops when working with flat arrays. This is really surprising – foreach loops make code so much more readable, so why would we want to get rid of them?

The reason is that a foreach loop internally creates a new enumerator instance. In pseudocode, a foreach loop like this:

foreach (var element in collection) { ... }

gets compiled to something like this:

var enumerator = collection.GetEnumerator();
while (enumerator.MoveNext()) {
  var element = enumerator.Current;
  // the body of the foreach loop
}

This has a few consequences:

  1. Using an enumerator means extra function calls to iterate over the collection.
  2. Due to a bug in the Mono C# compiler that ships with Unity, the enumerator creates a throwaway object on the heap that the GC will have to clean up later.
  3. The compiler doesn’t try to auto-optimize foreach loops into for loops, even for simple List collections – except for one special-case optimization in Mono that turns foreach over arrays (but not over Lists) into for loops, as sketched below.
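
Here’s what that special case effectively compiles to – a sketch, but it matches the standard lowering of foreach over plain arrays:

// a foreach over an array effectively becomes an indexed loop;
// no enumerator object is ever created:
for (int i = 0; i < array.Length; i++) {
  int val = array[i];
  // the body of the foreach loop
}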

Let’s compare various for and foreach loops over a List<int> or an int[] of 16M elements, adding up all the elements. And let’s throw a Linq extension in there too.

(The following measurements are taken using Unity’s own performance profiler, using a standalone build under Unity 5.0.1, on an Intel i7 desktop machine. Yes, I’m aware of the limitations of synthetic benchmarks – use these as rough guidelines, always profile your own production code, etc.)

Right, back to the post…

// const int SIZE = 16 * 1024 * 1024;
// array is an int[]
// list is a List<int>

1a. for (int i = 0; i < SIZE; i++) { x += array[i]; }
1b. for (int i = 0; i < SIZE; i++) { x += list[i]; }
2a. foreach (int val in array) { x += val; }
2b. foreach (int val in list) { x += val; }
 3. x = list.Sum(); // linq extension

                              time   memory
1a. for loop over array .... 35 ms .... 0 B
1b. for loop over list ..... 62 ms .... 0 B
2a. foreach over array ..... 35 ms .... 0 B
2b. foreach over list ..... 120 ms ... 24 B
 3. linq sum() ............ 271 ms ... 24 B

Clearly, a for loop over an array is the winner (along with foreach over arrays thanks to the special case optimization).

But why is a for loop over a list considerably slower than over an array? Turns out, it’s because accessing a List element requires a function call, so it’s slower than array access. If we look at the IL code for those loops, using a tool like ILSpy, we can see that “x += list[i]” really gets turned into a function call like “x += list.get_Item(i)“.

It gets even slower with the Linq Sum() extension. Looking at the IL, the body of Sum() is essentially a foreach loop that looks like “tmp = enum.get_Current(); x = fn.Invoke(x, tmp)” where fn is a delegate to an adder function. No wonder it’s much slower than the for loop version.

Let’s try something else: the same number of elements, this time arranged in a 2D collection of 4K arrays (or lists), each 4K elements long, using nested for loops vs nested foreach loops:

                                      time    memory
1a. for loops over array[][] ......  35 ms ..... 0 B
1b. for loops over list<list<int>> . 60 ms ..... 0 B
2a. foreach on array[][] ........... 35 ms ..... 0 B
2b. foreach on list<list<int>> .... 120 ms .... 96 KB <-- !

No big surprises there, the numbers are on par with the previous run, but this highlights how much memory gets wasted with nested foreach loops: (1 + 4096) enumerators x 24 bytes ≈ 96 KB. Imagine if you’re doing nested loops on each frame!
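
For reference, here’s roughly what the measured variants look like (a sketch with hypothetical names, assuming the same data split into 4K inner collections):

// 1a/2a: int[][] jagged;  1b/2b: List<List<int>> lists;
for (int i = 0; i < jagged.Length; i++) {
  for (int j = 0; j < jagged[i].Length; j++) { x += jagged[i][j]; }
}

foreach (List<int> inner in lists) {          // one enumerator for the outer list...
  foreach (int val in inner) { x += val; }    // ...plus one per inner list
}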

In the end: in tight loops, or when looping over large collections, arrays perform better than generic collections, and for loops better than foreach loops. We can get a huge perf improvement by downgrading to arrays, not to mention saving on allocations.

Outside of tight loops and large collections, this doesn’t matter so much (and foreach and generic collections make life so much simpler).


What have I done to deserve this

Once we start looking, we can find memory allocations in all sorts of odd places.

For instance, calling a function with a variable number of arguments actually allocates those args on the heap in a temporary array (an unpleasant surprise to those coming from a C background). Let’s look at a loop of 256K max operations:

1. Math.Max(a, b) ......... 0.6 ms ..... 0 B
2. Mathf.Max(a, b) ........ 1.1 ms ..... 0 B
3. Mathf.Max(a, b, b) ...... 25 ms ... 9.0 MB <-- !!!

Calling Max with three arguments means invoking a variadic “Mathf.Max(params int[] args)“, which then allocates 36 bytes on the heap for each function call (36B * 256K = 9MB).
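
If you need the max of three values on a hot path, one easy workaround is to chain the two-argument overload instead, which avoids the params array entirely:

// allocates a temporary params array on every call:
x = Mathf.Max(a, b, c);

// allocation-free equivalent, chaining the two-argument overload:
x = Mathf.Max(a, Mathf.Max(b, c));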

For another example, let’s look at delegates. They’re very useful for decoupling and abstraction, but there’s one unexpected behavior: assigning a method to a delegate variable creates a new delegate instance on the heap. We get a spurious heap allocation even if we’re just storing the delegate in a temporary local variable.

Here’s an example of 256K function calls in a tight loop:

protected static int Fn () { return 1; }
1. for (...) { result += Fn(); }
2. Func<int> fn = Fn; for (...) { result += fn.Invoke(); }
3. for (...) { Func<int> fn = Fn; result += fn.Invoke(); }

1. Static function call ....... 0.1 ms .... 0 B
2. Assign once, invoke many ... 1.0 ms ... 52 B
3. Assign many, invoke many .... 40 ms ... 13 MB <-- !!!

Looking at IL in ILSpy, every single local variable assignment like “Func<int> fn = Fn” creates a new instance of the delegate class Func<int32> on the heap, taking up 52 bytes that are then going to be thrown away immediately, and this compiler at least isn’t smart enough to hoist the invariant local variable out of the body of the loop.
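
In a Unity context, where a loop like this might run inside Update(), the practical fix follows directly: create the delegate once, cache it, and only invoke it inside the loop. A sketch, with invented names:

// using System;
// allocated once, e.g. in Awake(), instead of once per iteration:
private Func<int> cachedFn;

void Awake () { cachedFn = Fn; }

void Update () {
  int result = 0;
  for (int i = 0; i < SIZE; i++) { result += cachedFn.Invoke(); }
}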

Now this made me worry. What about things like lists or dictionaries of delegates – for example, when implementing the observer pattern, or a dictionary of handler functions? If we iterate over them to invoke each delegate, will this cause tons of spurious heap allocations?

Let’s try iterating over a List<> of 256K delegates, invoking each one:

4. For loop over list of delegates .... 1.5 ms .... 0 B
5. Foreach over list of delegates ..... 3.0 ms ... 24 B

Whew. At least looping over a list of delegates doesn’t re-box them, and a peek at the IL confirms that.


Se a vida é

There are more random opportunities for minimizing memory allocation. In brief:

  • Some places in the Unity API want the user to assign an array of structs to a property, for example on the Mesh class:
    void Update () {
      // new up Vector2[] and populate it
      Vector2[] uvs = MyHelperFunction();
      mesh.uv = uvs;
    }
    

    Unfortunately, as we mentioned before, a local array of value types gets allocated on the heap, even though Vector2 is a value type and the array is just a local variable. If this runs on every frame, that’s 24B of overhead for each new array, plus the size of the elements (in the case of Vector2, 8B per element).

    There’s a fix that’s ugly but useful: keep a scratch array of the appropriate size and reuse it:

    // assume a member variable initialized once:
    // private Vector2[] tmp_uvs;
    //
    void Update () {
      MyHelperFunction(tmp_uvs); // populate
      mesh.uv = tmp_uvs;
    }
    

    This works because Unity API property setters will silently make a copy of the array you pass in, rather than hold on to the array reference (unlike what one might think). So it’s safe to reuse the same scratch array – there’s really no point in allocating a fresh one every time.

  • Because arrays are not resizable, it’s often more convenient to use List<> instances instead, and then add or remove elements as necessary, like this:
    List<int> ints = new List<int>();
    for (...) { ints.Add(something); }
    

    As an implementation detail, when a List is allocated this way, using the default constructor, it starts with a pretty small capacity – that is, it only allocates internal storage for a handful of elements (such as four). Once that is exceeded, it needs to allocate a new, larger chunk of memory (say, eight elements long) and copy the existing elements over.

    So if game code needs to create a list and add a large number of elements, it’s better to specify capacity explicitly like this, even overshooting a bit, to avoid unnecessary re-sizing and re-allocations:

    List<int> ints = new List<int>(expectedSize);
    
  • Another interesting side-effect of the List<> type is that, even when it’s cleared, it does not release the memory it has allocated (i.e. the capacity remains the same). If you have a List with many elements, calling Clear() will not release this memory – it will just clear out the contents and set the count to zero. Similarly, adding new elements to this list will not allocate new memory until capacity is reached.

    So, similarly to the first tip, if a function needs to populate and use large lists on every frame, a dirty but effective optimization is to pre-allocate a large list ahead of time and keep reusing it, clearing it after each use – which, as we just saw, does not release or re-allocate the memory.

  • Finally, a quick word about strings. Strings in C# and .NET are immutable objects, so string concatenation generates new string instances on the heap. When assembling a string from multiple pieces, it’s usually better to use a StringBuilder, which appends into its own internal character buffer and creates just a single new string instance at the end. Code that is single-threaded and not re-entrant can even share a single static instance of the builder, resetting it between invocations, so that the internal buffer gets reused (see the sketch after this list).
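
As a sketch of that last pattern (names invented here; note that on the old Mono runtime, resetting via Length is the safe way to clear, since StringBuilder.Clear() may not be available):

// using System.Text;
// shared scratch builder – safe only because this code is single-threaded
// and not re-entrant:
private static readonly StringBuilder builder = new StringBuilder(256);

private static string MakeLabel (int score) {
  builder.Length = 0;          // reset contents, but keep the internal buffer
  builder.Append("Score: ");
  builder.Append(score);
  return builder.ToString();   // the one string allocation we actually need
}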


Was it worth it?

I was inspired to collect all of these after a recent bout of optimizations, where I got rid of some pretty bad memory allocation spikes by digging in and simplifying code. In one particularly bad case, a single frame allocated ~1MB of temporary objects just by using the wrong data structures and iterators. Relieving memory pressure is especially important on mobile, since your texture memory and your game memory both have to share the same, very limited pool.

In the end, this list is not a set of rules set in stone – they’re just opportunities. I actually really like Linq, foreach, and other productivity extensions, and I use them often (maybe too often). These optimizations only really matter for code that runs very frequently or deals with a ton of data; most of the time they’re not necessary.

Ultimately, the standard approach to optimization is right: we should write good code first, then profile, and only then optimize actual observed hot spots, because each optimization reduces flexibility. And we all know what Knuth had to say about premature optimization. 🙂


(Go to Part 2)