Crazy Parallel Madness

This was originally sent to me by a friend who works on massively parallel medical imaging software:

I thought I'd take a little time out of my day to rail, once again, against the incompetency of those [software developers at Microsoft]. Consider the following innocent looking bit of code:

  #include <omp.h>
  #include <vector>

  int main(int argc, char* argv[])
  {
    #pragma omp parallel
    {
      std::vector<int> A(1024 * 1024);   // 4 MB per thread, zero-filled by the constructor
    }
  }

For the OpenMP-uneducated, the inner code block will be executed in parallel by one thread per CPU on your system. In my case that is 8 threads (dual quad-core). If you run this bit of code in VTune and look at which black hole your clock-cycles disappear down, you'll find an unusually large number of them being gobbled up by "ntoskrnl.exe". And if you dive down into this file, you'll find that a good portion of those cycles are attributable to a kernel function named ExfAcquirePushLockExclusive().
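
If you want to see the team size for yourself, something along these lines prints it using the OpenMP runtime calls:

  #include <omp.h>
  #include <cstdio>

  int main()
  {
    #pragma omp parallel
    {
      // omp_get_num_threads() reports the team size, which defaults to one
      // thread per logical CPU (8 on the dual quad-core mentioned above).
      #pragma omp critical
      std::printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
  }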

What happens in the above code segment? Eight threads are created, each running on a separate core. All eight proceed to allocate and then zero-fill 4MB worth of memory. The zero-fill occurs in this case because std::vector always initializes its contents. Because Microsoft writes their software for the [average consumer] who is loath to spend the $25 it would take to outfit their system with more than 256 MB of RAM, the NT kernel conveniently doesn't assign you any physical memory when you allocate the 4MB array. Instead it waits until you actually write to a page in that array. Our code segment, then, is actually eight threads executing:

  Allocate 4MB
  Loop from 1 to 1024
      Page fault resulting in the allocation of one page (4KB) of physical memory
      Write 1024 zeroes to that page
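
Stripped of the OpenMP and std::vector machinery, each thread is in effect running something like this (assuming 4 KB pages and 4-byte ints, so 1024 ints per page and 1024 pages in the 4MB block):

  int main()
  {
    int* block = new int[1024 * 1024];   // 4 MB of address space; no physical pages yet
    for (int page = 0; page < 1024; ++page)
    {
      int* p = block + page * 1024;
      // The first write to each page triggers a demand-zero page fault;
      // this is where the contended kernel lock gets taken.
      for (int i = 0; i < 1024; ++i)
        p[i] = 0;                        // the zero-fill std::vector's constructor performs
    }
    delete[] block;
  }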

The coup de grâce is that those [Microsoft developers] decided it would be just fine if each page fault required the locking of some sort of internal kernel structure that is shared between all the cores. I don't know exactly which structure, because the details of the kernel are, of course, hidden from my prying eyes. But I do know the end result - massive lock contention and performance that sucks ass.

Now the above example is obviously contrived. But Bill spent a substantial bit of time in November digging into why, when optimizing the 3D volume loading/decompression of our software, he kept seeing a good 30% of the CPU cycles swallowed up by this particular black hole. So this particular issue is not simply academic. I'm writing about it now because one of my colleagues just ran into a slightly different manifestation of exactly the same problem. His trials have freshly aggravated my own wound.

My solution, while simple in execution, is insane in its necessity. Whenever I have a sizable data structure, or set of data structures, that is to be filled rapidly by multiple concurrent threads, I must, after allocation, perform what I've coined a "page touching" operation on it. This is exactly what the name implies: I have a single thread march over the entire extent of the memory, at page-sized intervals, and write a single zero value into each page. After the page touching, my parallel algorithm can proceed to fill the data structure without the performance loss that results from the lock contention.
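
In code, the page-touching pass looks something like this (a sketch; the raw new[] buffer and the 128 MB size are illustrative, and a 4 KB page size is assumed):

  #include <omp.h>

  int main()
  {
    // Illustrative sizes: 32M ints (128 MB), 1024 4-byte ints per 4 KB page.
    const int count       = 32 * 1024 * 1024;
    const int intsPerPage = 1024;

    // Allocate without zero-filling; the pages are demand-zero and are not
    // backed by physical memory until the first write.
    int* data = new int[count];

    // "Page touching": a single thread writes one value per page, so every
    // page is faulted in serially before any parallel work starts.
    for (int i = 0; i < count; i += intsPerPage)
      data[i] = 0;

    // The parallel fill can now proceed without piling eight cores onto the
    // kernel's page-fault lock.
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
      data[i] = i;

    delete[] data;
  }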
