Fun with large memory allocations. Well, not really fun at all.

I have an application that involves voxelizing meshes into a large 3D array of floats and then performing image-processing-style operations on that array.

I do this on systems with a lot of cores (44, 48, and 128) and a lot of memory, so these operations run really fast, even including things like generating distance fields on large arrays. A 2K x 2K x 2K volume is plenty fast to process, even though it is 32 gigabytes of RAM.

When benchmarking this code on a new system, I noticed that despite all of my code being heavily multithreaded, on really large problem sizes it would spend a huge amount of time at the beginning (many seconds) with the process manager showing only one core in use. My compile-run cycle was being dominated by this: 7 seconds to compile, 3 seconds to run the algorithm I was working on, plus 10 seconds of mysterious startup time!

So I investigated with the debugger:

I first tried building a debug version, even though I knew it would probably be too slow to be practical on 32GB arrays. I hit F5, waited a bit until it was churning away on the startup delay, and broke in with the debugger. It was not in my code, but rather in a not-so-well-coded memory-filling loop in the debug memory allocator, which was scrubbing my 32GB voxel array. I cursed that – my own filling of the array takes nowhere near that long, and uses multiple threads doing non-caching stores to avoid trashing the cache. I decided I’d debug the release build instead. In retrospect, I should have been suspicious, since even a byte-at-a-time single-threaded loop should be able to fill this memory in a second or two.

So I broke into the running release version during the slow part. It was actually in a big loop in my binary this time, iterating over a large block of memory and writing one ‘0’ byte every 16 bytes. This was compiler-generated code for operator new[], executing a constructor on a huge array. But the class was just a plain 16-byte struct with no constructors (or so I thought) that I later initialized via threaded code.

The struct being set up described a block of scheduled operations with dependencies, which would then be used to execute the math on 128 cores. When fed a large problem and a lot of CPUs to break it across, this could also be between 16GB and 32GB. No worries, I’ve got 256GB of RAM, and enough CPU power to write 16GB of data in way less than a second. But why was compiler-generated code executing a constructor on it?

Well, it turns out that std::atomic_int8_t, unlike an ordinary int8_t, has a constructor that sets it to zero. I had thought of atomics as just ints that supported the locked operations, but they also have this property. In my case, I wasn’t happy about it – I already initialize them much faster, using multiple threads, and 0 isn’t even the value I set them to.
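To make the situation concrete, here is a sketch of the shape of the problem; the real WorkItem_t layout is more involved, and the field names here are made up:

    #include <atomic>
    #include <cstdint>
    #include <cstddef>

    // Hypothetical stand-in for the real 16-byte scheduling struct.
    struct WorkItem_t {
        std::atomic<int8_t> pendingDeps; // has a constructor: zero-initializes (guaranteed since C++20)
        int8_t payload[15];              // plain bytes: no initialization needed
    };

    // Because the atomic member gives WorkItem_t a non-trivial default
    // constructor, this runs that constructor on every element, on one
    // thread (the write-one-'0'-byte-every-16-bytes loop from the debugger):
    size_t numItems = size_t(1) << 30;   // 16GB worth of 16-byte items
    WorkItem_t* items = new WorkItem_t[numItems];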

My options were either to just use int8_t instead of atomic_int8_t and cast when I needed to do the actual atomic decrement, or to allocate the memory as an array of uint8_t’s and cast the result to WorkItem_t *. I went with the latter and made a generic version of it:
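Something along these lines (a sketch; the helper names are my own):

    #include <cstdint>
    #include <cstddef>

    // Allocate space for count objects of type T without running any
    // constructors. The caller is responsible for initializing the memory
    // (and T had better be trivially destructible, since no destructors run).
    template <typename T>
    T* NewUninitialized(size_t count)
    {
        // new[] of raw bytes runs no constructors; the default new
        // alignment (typically 16) covers a 16-byte struct fine.
        return reinterpret_cast<T*>(new uint8_t[count * sizeof(T)]);
    }

    template <typename T>
    void DeleteUninitialized(T* ptr)
    {
        delete[] reinterpret_cast<uint8_t*>(ptr);
    }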

I put a benchmark timer around the new[] call for WorkItem_t, and bypassing the constructor reduced the time from 25 seconds to basically zero. Problem solved! (?)

I was also annoyed that not only was it executing a constructor I didn’t need, it was executing it on one thread over 32GB of memory, even though I had 256 HW threads. What if I actually needed the constructor? What if the constructor was expensive? I certainly would want the init to use all my cores. So I added this definition:
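Roughly this (a sketch; it matches the “FastNew” entry in the results below):

    #include <cstdint>
    #include <cstddef>
    #include <new>
    #include <thread>
    #include <vector>

    // Sketch of FastNew: allocate raw memory, then run the element
    // constructors on all hardware threads instead of just one.
    template <typename T>
    T* FastNew(size_t count)
    {
        T* items = reinterpret_cast<T*>(new uint8_t[count * sizeof(T)]);
        size_t numThreads = std::thread::hardware_concurrency();
        std::vector<std::thread> threads;
        threads.reserve(numThreads);
        for (size_t t = 0; t < numThreads; t++)
        {
            threads.emplace_back([=] {
                // Each thread placement-news its contiguous slice.
                size_t begin = count * t / numThreads;
                size_t end = count * (t + 1) / numThreads;
                for (size_t i = begin; i < end; i++)
                    new (&items[i]) T();
            });
        }
        for (auto& th : threads)
            th.join();
        return items;
    }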

Now that I had that, I dropped my code into a little test benchmark program that timed different ways of allocating and initializing 32GB worth of this struct. My tests were run on two systems:

  • System A: Windows, dual-socket 3rd-generation AMD Epyc system. 128 cores, 256 HW threads, and 256GB of RAM with high memory bandwidth via fully populating all the NUMA nodes.
  • System B: Linux, AWS EC2 dual Xeon system with 48 cores, 96 threads, and 192GB of RAM.


The results were:

  • plain old operator new[]: A: 28s, B: 13.55s
  • FastNew (threaded constructors): A: 10s, B: 2.79s

I also decided to try bypassing the constructor and then filling the array with memset, and I wrote a parallel memset to try out as well:
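The parallel memset is the same carve-into-chunks idea as FastNew (a sketch):

    #include <cstring>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Split the range into one contiguous chunk per hardware thread and
    // memset each chunk concurrently.
    void ParallelMemset(void* dest, int value, size_t nBytes)
    {
        size_t numThreads = std::thread::hardware_concurrency();
        std::vector<std::thread> threads;
        threads.reserve(numThreads);
        char* base = static_cast<char*>(dest);
        for (size_t t = 0; t < numThreads; t++)
        {
            size_t begin = nBytes * t / numThreads;
            size_t end = nBytes * (t + 1) / numThreads;
            threads.emplace_back([=] { memset(base + begin, value, end - begin); });
        }
        for (auto& th : threads)
            th.join();
    }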

These produced timings on both systems similar to the vanilla operator new[] and to the parallel constructor, respectively, as you would expect. So basically I could only get the startup time down to 10s on Windows (unacceptable) and 2.79s on Linux (still bad).

But these numbers didn’t make any sense now that I had reduced everything to a simple memset. Even though every 4th iteration of the init loop would cause a cache miss on read and an eventual writeback (because I don’t have 32GB of cache!), the numbers didn’t add up. Both these systems can write memory a lot faster than that. The AMD system has 8 channels of memory at >20GB/s each, and the Xeon is no slouch either. What’s going on? I thought about pathological TLB misses as well (fixing this would require using “large pages”) but that wouldn’t be enough to make it this slow either.

Ah-hah, it must be page-faulting. I have way more than 32GB of RAM, so it couldn’t be swapping. I figured that the OS must not actually be allocating the memory up front, but instead handing out physical RAM when the first access to each region causes a page fault. If my data is 32GB and pages are 4K (or are they 8K?) in size, that would mean 8 million page faults. That certainly could add up to something!

So I changed my code to memset the newly allocated memory _twice_ instead of just once. Boom! The second (single-threaded) memset only took 2s instead of the 24s the first one took.
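The experiment was essentially this (a sketch):

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <cstddef>

    static double TimeMemset(void* p, size_t n)
    {
        auto t0 = std::chrono::steady_clock::now();
        memset(p, 0, n);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main()
    {
        size_t nBytes = size_t(32) * 1024 * 1024 * 1024;      // 32GB
        char* buffer = new char[nBytes];                       // allocated but not yet touched
        printf("first:  %.2fs\n", TimeMemset(buffer, nBytes)); // pays the page faults
        printf("second: %.2fs\n", TimeMemset(buffer, nBytes)); // pages already mapped
        delete[] buffer;
        return 0;
    }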

Ok, so even when I eliminate the memory initialization, I’m paying the same page-fault cost later, on first access. The 24s of page-fault handling just gets amortized over the whole run time, and since my code is threaded it probably only added 10s or so to the runtime. But that’s on an algorithm that runs in less than 10s if the memory has already been paged in. I’m spending more time letting the OS handle page faults than I am doing math, all because the OS is acting as a very slow gatekeeper in front of my 256GB of RAM!

What I would want it to do is just allocate all the pages up front in a tight loop instead of giving them to my app one at a time in an exception handler. I looked at the docs for Windows VirtualAlloc, figuring there must be a way to do that. Nope. It explicitly states that committed memory is assigned to DRAM pages upon first processor access, without a flag to override this :-(.

I tried VirtualAlloc over new[] anyway, since it would let me try large pages:
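A sketch of the VirtualAlloc path; the MEM_LARGE_PAGES part comes into play later:

    #include <windows.h>

    // VirtualAlloc in place of operator new[]. Large pages require the
    // SeLockMemoryPrivilege and a size rounded up to the large-page size.
    void* AllocBig(size_t nBytes, bool largePages)
    {
        DWORD flags = MEM_RESERVE | MEM_COMMIT;
        if (largePages)
        {
            flags |= MEM_LARGE_PAGES;
            size_t page = GetLargePageMinimum();      // typically 2MB
            nBytes = (nBytes + page - 1) & ~(page - 1); // round up
        }
        return VirtualAlloc(nullptr, nBytes, flags, PAGE_READWRITE);
    }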

This didn’t provide any significant speedup over operator new[], as you would expect. The equivalent of VirtualAlloc on Linux is mmap, and it looked like it had some relevant flags, so I threw that into the mix as well:
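A sketch of the mmap path, including the flag that looked promising:

    #include <sys/mman.h>
    #include <cstddef>

    // mmap as the allocator. MAP_POPULATE is the flag that is supposed to
    // pre-fault the pages at allocation time instead of on first access.
    void* AllocBig(size_t nBytes, bool populate)
    {
        int flags = MAP_PRIVATE | MAP_ANONYMOUS;
        if (populate)
            flags |= MAP_POPULATE;
        void* p = mmap(nullptr, nBytes, PROT_READ | PROT_WRITE, flags, -1, 0);
        return (p == MAP_FAILED) ? nullptr : p;
    }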

This, as expected, yielded the same speed as operator new[]. But I wanted to try out the exciting MAP_POPULATE flag, which says it pre-populates the page tables instead of relying on faulting pages in. Exactly what I need! Unfortunately, adding that flag didn’t change the numbers at all. I read some things on the web that implied it was actually a no-op? :-(.

So the only thing left to try was “huge pages”. Normal virtual memory page sizes are either 4K or 8K. Huge pages are either 2MB or 1GB, which is a massive reduction in the number of pages required to hold my 32GB of data. On both Windows and Linux, you have to do something to enable them, as they are normally disabled. On Linux it was “sysctl -w vm.nr_hugepages=N”, where N is how many 2MB pages you want available. I asked for 32768, which adds up to 64GB. I added the MAP_HUGETLB flag to mmap (see the sketch after the numbers), and it gave me 2x:

  • mmap + memset: 14s
  • mmap + huge pages: 8s
  • mmap + parallel memset: 2.8s
  • mmap + huge pages + parallel memset: 0.407s
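The huge-page variant is just one more flag (a sketch; it assumes the vm.nr_hugepages pool from above has been set up):

    #include <sys/mman.h>
    #include <cstddef>

    // mmap with 2MB huge pages. The length should be a multiple of the
    // huge-page size, so round it up.
    void* AllocHuge(size_t nBytes)
    {
        const size_t kHugePageSize = 2 * 1024 * 1024;
        nBytes = (nBytes + kHugePageSize - 1) & ~(kHugePageSize - 1);
        void* p = mmap(nullptr, nBytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        return (p == MAP_FAILED) ? nullptr : p;
    }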

I didn’t try out 1GB pages, but I’d expect them to make the OS overhead of allocating the memory approach zero.

Now, what about huge pages on Windows? It turns out that they are a terrible pain to enable, involving changing user permissions, rebooting, and running your app as administrator. I really wanted to see if they helped, so I followed instructions online for doing all this and added the proper flag to VirtualAlloc(). It didn’t work, and I didn’t feel like googling around the web trying to figure it out. Most of the articles on large pages in Windows were old, referring to ancient versions of the OS, pointing at dead web pages, etc.

So the bottom line if you need a huge memory array is:

  • To get a 35x speedup on large memory allocations in Linux, use mmap with huge pages, and use threads to initially touch/fill the memory.
  • To get a 2.8x speedup on Windows, allocate the memory without initialization and then use threads to initially touch/fill the memory.

If your application is going to fill the memory later but you just want to pull it into the MMU tables, the right answer is probably to allocate the memory uninitialized and then spawn asynchronous jobs that walk through it performing reads, just to make the page faults happen. You don’t have to wait for these threads to complete before starting work; they are just asynchronous prefetchers.
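A sketch of such a prefetcher; the stride and thread count here are arbitrary:

    #include <cstddef>
    #include <thread>

    // Background threads stride through the buffer doing one read per
    // page, purely to trigger the page faults early. Nothing waits on
    // them; they just warm up the mapping while real work proceeds.
    void PrefetchPagesAsync(const void* mem, size_t nBytes, size_t numThreads = 16)
    {
        const size_t kPageSize = 4096;
        const volatile char* base = static_cast<const volatile char*>(mem);
        for (size_t t = 0; t < numThreads; t++)
        {
            std::thread([=] {
                size_t begin = nBytes * t / numThreads;
                size_t end = nBytes * (t + 1) / numThreads;
                for (size_t i = begin; i < end; i += kPageSize)
                    (void)base[i]; // volatile read; won't be optimized away
            }).detach();
        }
    }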

So I’m left with an unavoidable 10s of startup time on a 128-core Windows box. Can anyone demonstrate allocating a 32GB array and filling it in under 10s on Windows?

