I needed to write a basic worker thread pool: a simple system that lets me queue jobs, have those jobs executed by worker threads, and wait on job completion.
I looked at a few public ones, including the Intel TBB library, and some fairly simple portable C++11 versions. I ended up writing my own because:
- It’s not that difficult, especially with C++11’s thread-related library functions.
- I wanted to be able to iterate on it in the long run, including using the Intel TSX primitives and coding a work-stealing system.
- I wanted a feature that isn’t present in the libraries I looked at: assigning jobs based on NUMA nodes.
- For right now, I’d be happy with a really simple, not especially efficient one. I figured I could just grind it out almost as fast as I can type, using std::thread and C++11 synchronization primitives, and later on worry about making it lock-free, managing memory, handling NUMA, etc.
The last point turned out to be true. Using std::thread, a mutex-guarded queue, and condition variables for signalling/waiting, it was really simple, and worked the first time that it compiled.
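For reference, here is a minimal sketch of that kind of pool. The class and member names are illustrative rather than my actual implementation, and the wait-on-completion part is omitted for brevity:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal mutex-guarded job queue with condition-variable signalling.
class CSimpleThreadPool
{
public:
    explicit CSimpleThreadPool( size_t nThreads )
    {
        for ( size_t i = 0; i < nThreads; i++ )
        {
            m_threads.emplace_back( [this] { WorkerLoop(); } );
        }
    }

    ~CSimpleThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock( m_mutex );
            m_bShutdown = true;
        }
        m_condition.notify_all();
        for ( auto &t : m_threads )
            t.join();
    }

    void QueueJob( std::function<void()> job )
    {
        {
            std::lock_guard<std::mutex> lock( m_mutex );
            m_jobs.push( std::move( job ) );
        }
        m_condition.notify_one();           // wake one sleeping worker
    }

private:
    void WorkerLoop()
    {
        for ( ;; )
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock( m_mutex );
                // sleep until there is work or the pool is shutting down
                m_condition.wait( lock, [this] { return m_bShutdown || !m_jobs.empty(); } );
                if ( m_bShutdown && m_jobs.empty() )
                    return;
                job = std::move( m_jobs.front() );
                m_jobs.pop();
            }
            job();                          // run the job outside the lock
        }
    }

    std::vector<std::thread> m_threads;
    std::queue<std::function<void()>> m_jobs;
    std::mutex m_mutex;
    std::condition_variable m_condition;
    bool m_bShutdown = false;
};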
But when I started using it, there was a bit of a mystery! My development system is unusual in that it is a dual-socket Xeon system with 44 physical cores (88 logical when you count hyperthreading). This is good, as I’m writing code for the cloud that is intended to be ridiculously parallel. It’s also good for multi-threaded builds 🙂
When I started testing my thread pool with some numeric benchmarks, I noticed that I only got a roughly 30x speedup, no matter how simple the jobs were. I had previously observed the same behavior at home on the same system when I was working on threaded code at Valve. At the time I figured there must be something in the Valve libraries preventing code from creating the right number of threads or that it was doing something unexpected with processor affinity. I also thought that the less-than-linear scaling could be clock-throttling as all the idle processors start heating up. But then, after seeing the exact same thing from my simple code, I decided to investigate….
Long story short: until recently, Windows only supported a maximum of 64 logical processors. Looking at the API for affinity masks, etc., you can see why: they use a 64-bit mask to represent the processors. However, modern versions of Windows do support more than 64 processors. A bunch of googling revealed that this is done by dividing the processors into “processor groups”. These assignments are made at boot time:
- Any system with 64 logical processors or fewer will end up with a single processor group (0) containing all of them.
- Systems with >64 logical processors will have more than one processor group, with the assignment of the processors done by the OS at boot time. There are a set of rules for how to divide them up that are controllable by some boot parameters.
- My system chose to divide them into two equal groups of 44 logical processors, with each group containing all of the processors associated with one socket/NUMA group.
- When starting a process, the process is assigned a group. Windows appears to use some heuristic to decide which group. If you launch your app multiple times, sometimes it will be assigned to one group, sometimes the other.
- A thread will ALWAYS run on a processor in its assigned group.
- When starting a thread, unless you specify otherwise, that thread is assigned the group of the process that started it.
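You can see this layout directly. Here is a small standalone sketch (separate from the thread pool code) that asks Windows for the group layout and reports which group the current thread, and therefore by default the process, landed in; error handling is omitted:

#include <windows.h>
#include <cstdio>

int main()
{
    // How many processor groups did the OS create at boot?
    WORD nGroups = GetActiveProcessorGroupCount();
    printf( "processor groups: %d\n", (int) nGroups );
    for ( WORD g = 0; g < nGroups; g++ )
    {
        printf( "  group %d has %d logical processors\n",
                (int) g, (int) GetMaximumProcessorCount( g ) );
    }

    // Which group was this thread assigned to?
    GROUP_AFFINITY affinity = {};
    if ( GetThreadGroupAffinity( GetCurrentThread(), &affinity ) )
    {
        printf( "this thread is running in group %d\n", (int) affinity.Group );
    }
    return 0;
}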
I first wondered why Windows split the processors on my machine into equal-sized groups instead of putting 64 in group 0 and the rest in group 1 (which would maximize the performance of threaded apps that don’t know about groups). The way it does it now means that most threaded apps only use half of the cores, when they could have used 64 of the 88. However, a 64/24 split would have resulted in two issues:
- 24 cores would be completely unused by apps that aren’t processor-group aware
- Starting more than one threaded program would still result in idle cores. The way it works now, at least all your cores will be used by threaded apps if you start more than one of them.
Fortunately, that’s all moot, as it’s not hard to distribute worker threads across processor groups. Sadly, that means I had to add some system-specific code for Windows, so it’s still not quite possible to write a usable, totally portable thread pool using only C++11.
void CThreadPool::DistributeThreads( void )
{
#if OS_WINDOWS_64
    //!!BUG!! need to skip this code for old windows versions
    int nNumGroups = GetActiveProcessorGroupCount();
    if ( nNumGroups > 1 )
    {
        Log( "System has %d processor groups", nNumGroups );
        for ( int i = 0; i < nNumGroups; i++ )
        {
            Log( " group %d has %d processors", i, (int) GetMaximumProcessorCount( (WORD) i ) );
        }
        // Spread the worker threads across the groups: fill group 0 up to its
        // processor count, then group 1, and so on.
        int nCurGroup = 0;
        int nNumRemaining = (int) GetMaximumProcessorCount( (WORD) nCurGroup );
        for ( int i = 0; i < (int) m_threads.size(); i++ )
        {
            auto hndl = m_threads[i].native_handle();
            GROUP_AFFINITY oldaffinity;
            if ( GetThreadGroupAffinity( hndl, &oldaffinity ) )
            {
                //Log( "thread %d, old msk = %llx, old grp = %d", i, (unsigned long long) oldaffinity.Mask, (int) oldaffinity.Group );
                GROUP_AFFINITY affinity = oldaffinity;
                if ( affinity.Group != nCurGroup )
                {
                    affinity.Group = (WORD) nCurGroup;
                    auto bSucc = SetThreadGroupAffinity( hndl, &affinity, nullptr );
                    if ( !bSucc )
                    {
                        Log( "failed to set group affinity err=%x", (int) GetLastError() );
                    }
                    else
                    {
                        //Log( "Set group for thread %d to %d", i, nCurGroup );
                    }
                }
                // Count this thread against the current group whether or not it had
                // to be moved, then advance to the next group once this one is full.
                --nNumRemaining;
                if ( nNumRemaining == 0 )
                {
                    nCurGroup = min( nCurGroup + 1, nNumGroups - 1 );
                    nNumRemaining = (int) GetMaximumProcessorCount( (WORD) nCurGroup );
                }
            }
        }
    }
#endif
}
Making this fix raised my multithreaded speedup in a simple numeric test from 30x to 64x!
This seems like a mistake in Visual C++’s implementation of std::thread. These changes to the library would make simple C++ threading work on Windows machines with more than 64 logical processors:
- std::thread::hardware_concurrency() should return the total number of processors in all groups. Instead, it returned 44 (the number of logical processors in group 0).
- Creating a std::thread should keep track of the number of threads created so far and assign newly created threads to non-default processor groups once that count exceeds the number of processors in the default group. When more threads are created than the total number of processors in all groups, it can cycle back to the first group.
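Until the library changes, an application can compute the real total itself. A minimal sketch of one workaround, assuming Windows 7 or later (the helper name is mine, not part of any library):

#include <thread>
#ifdef _WIN32
#include <windows.h>
#endif

// Return the number of logical processors across ALL processor groups,
// falling back to hardware_concurrency() on other platforms.
unsigned TotalLogicalProcessorCount()
{
#ifdef _WIN32
    // ALL_PROCESSOR_GROUPS asks for the active processor count summed over
    // every group, not just the group this process was assigned to.
    DWORD nCount = GetActiveProcessorCount( ALL_PROCESSOR_GROUPS );
    if ( nCount > 0 )
        return (unsigned) nCount;
#endif
    return std::thread::hardware_concurrency();
}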
Hi there, I’m one of the maintainers of MSVC++’s std::thread.
At the moment we are not transparently exposing processor groups because there isn’t really a way to do that transparently. There isn’t a way to ask the system to assign groups according to which is under the least load, and inside std::thread we don’t have any idea what you’re going to use the threads for. In particular, depending on when threads are created and terminated, no matter what we do we’re likely to end up with an unbalanced allocation of resources. Being group-aware really does take an understanding of your workload.
As a result, we’ve chosen to make std::thread model CreateThread, and hardware_concurrency model GetNativeSystemInfo, which is the substrate Windows exposes to most applications. This way we can document what we’re doing in a “sane” way, and users don’t get strange behavior when we guess their workload wrong and assign groups incorrectly, as group assignment is a clearly documented limitation.
If in the future Windows adds a “I don’t care what group this thread goes in” API we will change this model.
Billy O’Neal
Visual C++ Libraries
Have you looked into using Windows’ own ThreadPool?
In https://github.com/stlab/libraries/blob/develop/stlab/concurrency/default_executor.hpp you can find our C++ abstraction on top of it.
I would assume that there is no problem in using as many cores as are available.
I’m not sure that it does. I was told that the C++ library’s parallel STL functions use that thread pool, and those all suffer from the processor group problem: all of them only use 1/4 of the threads on my 128-core / 256-thread AMD system 😦