Crazy Parallel Madness

This was originally sent to me by a friend who works on massively parallel medical imaging software:

I thought I'd take a little time out of my day to rail, once again, against the incompetence of those [software developers at Microsoft]. Consider the following innocent-looking bit of code:

  #include <omp.h>
  #include <vector>

  int main(int argc, char* argv[])
  {
    #pragma omp parallel
    {
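      // each thread constructs, and therefore zero-fills, its own 4 MB vector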
      std::vector<int> A(1024 * 1024);
    }
  }

For the OpenMP-uneducated, the inner code block will be executed in parallel by one thread per CPU on your system. In my case that is 8 threads (dual quad-core). If you run this bit of code in VTune and look at which black hole your clock cycles disappear down, you'll find an unusually large number of them being gobbled up by "ntoskrnl.exe". And, if you dive down into that file, you'll find that a good portion of those cycles are attributable to a kernel function named ExfAcquirePushLockExclusive().

What happens in the above code segment? Eight threads are created, each running on a separate core. All eight proceed to allocate and then zero-fill 4MB worth of memory. The zero-fill occurs in this case because std::vector always initializes its contents. Because Microsoft writes their software for the [average consumer], who is loath to spend the $25 it would take to outfit their system with more than 256 MB of RAM, the NT kernel conveniently doesn't assign you any physical memory when you allocate the 4MB array. Instead it waits until you actually write to a page in that array. Our code segment, then, is actually eight threads executing:

  Allocate 4MB
  Loop from 1 to 1024
      Page fault resulting in the allocation of one page (4KB) of physical memory
      Write 1024 zeroes to that page
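
Spelled out in C++, each thread's zero-fill amounts to roughly the following sketch. The function name is illustrative, and the 4 KB page size and 4-byte int are assumptions about the platform rather than anything stated in the original code:

  #include <cstddef>

  // Roughly what each of the eight threads does while its vector zero-fills.
  // Every 1024th int starts a fresh 4 KB page, so the first store to that
  // page takes a fault; eight threads doing this at once all contend for
  // the same shared kernel lock.
  void zero_fill(int* buffer)                  // 4 MB of committed, never-written memory
  {
    const std::size_t pages = 1024;            // 4 MB / 4 KB
    const std::size_t ints_per_page = 1024;    // 4 KB / sizeof(int)

    for (std::size_t page = 0; page < pages; ++page)
      for (std::size_t i = 0; i < ints_per_page; ++i)
        buffer[page * ints_per_page + i] = 0;  // first write to each page faults
  }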

The coup de grâce is that those [Microsoft developers] decided it would be just fine if each page fault required the locking of some sort of internal kernel structure that is shared between all the cores. I don't know exactly which structure, because the details of the kernel are, of course, hidden from my prying eyes. But I do know the end result: massive lock contention and performance that sucks ass.

Now the above example is obviously contrived. But Bill spent a substantial amount of time in November digging into why, when optimizing the 3D volume loading/decompression of our software, he kept seeing a good 30% of the CPU cycles swallowed up by this particular black hole. So this issue is not simply academic. I'm writing about it now because one of my colleagues just ran into a slightly different manifestation of exactly the same problem. His trials have freshly aggravated my own wound.

The solution, while simple to execute, is insane in its necessity. Whenever I have a sizable data structure (or structures) that will be filled rapidly by multiple concurrent threads, I must, after allocation, perform what I've coined a "page touching" operation on it. This is exactly what the name implies… I have a single thread march over the entire extent of the memory, at page-sized intervals, and write a single zero value into each page. After the page touching, my parallel algorithm can proceed to fill the data structure without the performance loss that results from the lock contention.
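
For concreteness, here is a minimal sketch of what such a page-touching pass might look like. The buffer size and fill values are illustrative, the 4 KB page size is an assumption, and a raw uninitialized buffer is used deliberately: a std::vector would zero-fill, and therefore touch, every page in its constructor anyway.

  #include <omp.h>
  #include <memory>

  int main()
  {
    const int n = 64 * 1024 * 1024;                // 64M ints, roughly 256 MB
    const int ints_per_page = 4096 / sizeof(int);  // assuming 4 KB pages

    // new int[n] leaves the ints uninitialized, so no page has been written;
    // the kernel hands back demand-zero pages that fault on first write.
    std::unique_ptr<int[]> data(new int[n]);

    // "Page touching": a single thread writes one value into every page,
    // taking all the page faults serially and without lock contention.
    for (int i = 0; i < n; i += ints_per_page)
      data[i] = 0;

    // The parallel fill can now proceed without page-fault serialization.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      data[i] = i;

    return 0;
  }

The touch loop pays the page-fault cost once, up front, on a single thread; the parallel fill that follows then writes into pages that are already backed by physical memory.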
