I was pleased to accept the Software Developer position at Bloomberg L.P. on my third work term. It was a fast paced work term that introduced me to the world of cross platform C++ library development on 32- and 64-bit versions of AIX, Linux, OSX, Solaris, and Windows. Alisdair Meredith, my mentor as well as Chair of the Library Working Group on the C++ Standards Committee, was the best possible guide I could have asked for. I also specifically requested to work on an open source team, which means all of my code contributions can now be seen here (the linked folder was entirely implemented by me).
During my term, I implemented a new modular hashing system which improved hash table performance by 11x and prevented denial of service attacks. I also implement a “flat map” data structure that takes advantage of CPU caches to improve associative lookups. Beyond developing library code, I also created extremely thorough documentation and testing to ensure my code was rock solid and usable. The following paragraphs explore my experience more in depth.
My first project of the term was to implement a “flat map” data structure. The concept of the data structure is that key-value pairs are stored in a sorted vector, eliminating the need for the extra pointers required for a map (underlying implementation is a binary tree) storing the same data. Eliminating the pointers saves memory, and allows the elements of the flat map to be packed closely together. This close packing allows more key-value pairs to fit into CPU caches, decreasing the time to find elements during a binary search. I benchmarked the flat map vs. map and unordered map (underlying implementation is a hash table). The benchmarks showed only marginal improvements of flat map over map. A more interesting result of the benchmarks, however, was that for large collections, unordered map transitioned from O(1) behaviour to O(n) behaviour.
With the support of my mentor, I chose to investigate the anomalous unordered map performance. I identified a poor string hashing algorithm as the culprit. In a test of 1,000,000 strings, the algorithms caused a median of 64 keys to hash to the same hash value. In comparison, every other hashing algorithm tested on the data set resulted in a median of 1 key per hash value (no collisions).
The poor string hashing algorithm was indicative of a problem with hashing in C++ in general: type implementers should not have to write their own hashing algorithms, they should be using industry standard algorithms. Along with addressing this flaw with hashing, I also examined other developer requests for improvements to hashing. One request was for a secure hashing algorithm to prevent a malicious attack from performing a Denial of Service (DoS) attack on hash tables by crafting keys that collide to the same bucket. Another request was for a way for developers to combine multiple hash values into a single value.
I analyzed a number of C++ Standard proposals, as well as the Boost ‘hash_combine’ function, and settled on Standard Proposal N3980 – “Types Don’t Know #”. The chosen proposal allows type users to select any implemented algorithm and apply it to any supported type. I created my own proposal for implementing N3980 at Bloomberg, and it was accepted by my mentor, my manager, and numerous stakeholders.
I implemented Proposal N3980, which I refer to as “modular hashing” because of how it makes hashing more modular. I included SpookyHash as the default hashing algorithm and SipHash as the secure hashing algorithm. Many modifications were required to make the system cross platform and compatible with C++03. Finally, when the system was completed, I ran the same benchmarks as I did initially, and confirmed that the poor unordered map behaviour had been removed.