I was forwarded a link today from a friend that explains a bug in a binary sort implementation.
Here is the blog entry
First it's important to note that the bug is not in the binary sort algorithm, but just in the short sighted implementations of binary sort that pervade the industry. When binary sort was first proposed as an implementation, we were using 8-bit processors and the only big iron systems out there were multi-million dollar systems that could barely process a million records, let alone 2.2 billion.
The bug, as pointed out in the blog entry, is in the calculation of the pivot. The implementations use (high+low)/2 as their pivot, which then fails when you are sorting billions of records because high+low quickly overflows.
The real simple solution is to use the arithmetic lessons we learned in the 3rd grade: pivot = high/2 + low/2. This will work because you're not adding the two out of range values together, so you won't have an arithmetic overflow during your pivot calculation. The easier calculation is "pivot = high >> 1 + low >> 1".
The real solution is to use a 64-bit system and use long ints (QWORD) for your index variables intead of DWORDS (32-bit). Any other solution is just a band-aid that will eventually experience an overflow again.
Here is the blog entry
First it's important to note that the bug is not in the binary sort algorithm, but just in the short sighted implementations of binary sort that pervade the industry. When binary sort was first proposed as an implementation, we were using 8-bit processors and the only big iron systems out there were multi-million dollar systems that could barely process a million records, let alone 2.2 billion.
The bug, as pointed out in the blog entry, is in the calculation of the pivot. The implementations use (high+low)/2 as their pivot, which then fails when you are sorting billions of records because high+low quickly overflows.
The real simple solution is to use the arithmetic lessons we learned in the 3rd grade: pivot = high/2 + low/2. This will work because you're not adding the two out of range values together, so you won't have an arithmetic overflow during your pivot calculation. The easier calculation is "pivot = high >> 1 + low >> 1".
The real solution is to use a 64-bit system and use long ints (QWORD) for your index variables intead of DWORDS (32-bit). Any other solution is just a band-aid that will eventually experience an overflow again.