Wednesday 23 March 2011

What's better than std::map and std::set?

This is a rather dry subject, but I think essential when you think that exabytes of data around the world is held in these structures.  If enough people read this then I think that I would have contributed considerably to reducing global CO2 emissions!  I hope that you will find this a useful launch pad for giving an extra boost to your solution.
Are there faster alternatives and less memory hungry alternatives that we could use as a drop in replacement to std::map and std::set?
The things that I commonly use std::map and std::set for are:
  • Allow me to store objects in a retrievable collection.
  • Confirm that an object exists in the collection.
  • Retrieve objects by entering the key to the data.
  • Iterate through the structure in the order specified by the less function (that you can optionally provide).
  • Merge collections.
  • Get a count of objects in the collections.
  • Store objects by value or by pointer.
  • and others...
The probem is that these collection classes use memory themselves, aside from the value, each record need memory equal to the size of the key + size of two other pointers to two other adjacent records.
Secondly these two stl classes are slow than some other solutions when searching for data in a large collection especially for strings.
What are the alternatives?  Well there are lots but they basically come down to a short list that do:

Red/black or b-tree

What stl:map/set is made of, traditionally the most flexible, but performance gets hurt when data gets big.  Some of these structures also have problems with node balancing that can have a serious performance hit especially when data is entered already sorted.  Its strong point is that they are very fast in iteration, as the key does not need to be unpacked along with the value.

Hash tables

Fast lookup, but you cannot iterate through nearby key values.  Has performance hits when you have chosen a poor hash algoritum for your data, or the hash table needs to be resized, often this can be fixed by predeclaring how big the table is likely to be.  You cannot iterate through similar values though so is not very useful in similar word searches etc.  The structure of the hash table can also be inefficent with memory as the table may be poorly tuned to the requirements of the data.
If you want the definitive rundown on hashed collection classes have a look at the tommyds site.

Patricia, Ternary and Ned tries, DWAG

Yes tries not trees!  The key (e.g. a string) is split into an array of numbers or charaters and values are found by following a breadcrumb trail of these characters in the memory structure.

Judy array

A highly compact exotic datastructure, that I do not fully understand (maybe one day), it took the bloke years to get it right, and named it after his sister.  I have written an STL style wrapper around it, along with iterators.  The keys are sorted.
Performance Test
Test harness
Below are the results of my tests, I do think that some of my timing results are suspect, and there is a bit of apple and pear comparisons going here however you are welcome to compile my tests and see for your self.  This was done on a lenovo x200 laptop with 64 bit windows 7, core 2 duo, 8 gig of ram running as a 32 bit process.   The collections were made to store 1,000,000 strings containing values "0" to "999,999".
I have tested against
A set of 32 bit values
  • std::set<const_char*>
  • xt::judy_set<const_char*>
A set of strings
  • std::set<std::string>
  • xt::judy_string_set
A string map to 32 bit integers
  • nPatriciaTrie
  • std::map<std::string,int>
  • stdext::hash_map<std::string,int>
  • ToolBox::Trie
  • uxn::patl::trie_map<std::string,int>
  • xt::judy_string_map<int>
  • xt::judy_string_val_map<int>
The variance of the amount of memory used is considerable, xt::judy_set<const_char*> used 1/10 th of the memory of std::set<const_char*>.  If memory is an issue then Judy is the way to go, it can be much faster if you start paging memory to disk.
Anything that required the contents of the key got a speed penalty.  If your string class did internal reference counting then this helped considerably with the std:: maps and sets, as the key was not copied.
Iteration was the red/black btree main strength, as the key was uncompressed.  This was especially the case when the key value was a std::string, as the other collections had to make them from scratch.
There is only one good reasonably fast all rounder that can do exciting things such as hamming and levenshtein distances is the PATRICIA C++ library.  You can also have a look at the Ternary Search Tree but performance seemed and issue.

Name clear copy find iterate load Memory used
nPatriciaTrie

421
1466 45596
std::map<std::string,int> 375 749 3416 63 5335 55136
std::set<const_char*> 468 531 562 62 1092 31408
std::set<std::string> 405 749 3573 250 4868 54368
stdext::hash_map<std::string,int> 483 1295 1466 32 2418 48480
tommy_hash_tester<char*,int>

31
890 39536
ToolBox::Trie

983
1060 21684
uxn::patl::trie_map<std::string,int> 390 1778 1591 172 2558 54904
xt::judy_set<const_char*> 47 951 499 484 577 3396
xt::judy_string_map<int> 655 0 2168 1950 2683 10248
xt::judy_string_set 671 2902 2044 2059 2714 12500
xt::judy_string_val_map<int> 2184 3058 1997 1950 2762 26212

Links 

Patricia Trie Template Class

nPatriciaTrie is fast but is not a drop in replacement to stl.  Cannot iterate through collections.

Wikipedia Trie

ToolBox::Trie reasonably fast but is not a drop in replacement to stl.  Cannot iterate through collections.

Practical Algorithm Template Library - C++ library on PATRICIA trie

uxn::patl::trie_map<std::string,int>, the true swiss army knife of a map collections, is reasonably fast (and perhaps could go faster with different allocators).  The only one worth looking at for things like hamming and levenshtein distances.

A wrapper to this judy array code

xt::judy_set<const_char*> Judy is not the fastest, but easily the most efficient with memory.

Roguewave sourcepro 9

RWTPtrHashMap<std::string,int,silly_hash,std::equal_to<std::string> > was not a true alternative to stl::map as the interface was too different.  It also did not run too fast with 1000000 records in it.  Perhaps there is a tweek to make it faster, but why pay for something when you can get it for free?

tommy_hash_collection

Very fast, does not work with string keys, which is my main interest.  Their web site contains some excellant benchmarking data, probably far more accurate then mine!  A good source if you are interested in some in depth info on hashed indexes.

jdktrie

I had problems with this one and had to hack a few lines to make it run, I will look at this another time.

Ternary Search Tree 

[containers::structured_set Probably as] powerful as uxn::patl::trie_map but on my machine it was very slow.

pb_ds trie

Would not compile as there is too much dependency on the GCC libraries that would have required me to rewrite my test harness.  Has an bsd style licence but dependent on GPL code, not sure if I could use this in commercial software.

Ned Tries

Currently not tested

google-sparsehash

Currently not tested

boost::spirit

There is a trie implementation in there somewhere, not tested.

DAWG

Directed acyclic word graph

No comments:

Post a Comment