Edward Boggis-Rolfe: What's better than std::map and std::set?

This is a rather dry subject, but I think it is essential when you consider that exabytes of data around the world are held in these structures. If enough people read this, I think I will have contributed considerably to reducing global CO2 emissions! I hope that you will find this a useful launch pad for giving an extra boost to your solution.
Are there faster and less memory-hungry alternatives that we could use as a drop-in replacement for std::map and std::set?
The things that I commonly use std::map and std::set for are:

Storing objects in a retrievable collection.
Confirming that an object exists in the collection.
Retrieving objects by their key.
Iterating through the structure in the order specified by the less function (which you can optionally provide).
Merging collections.
Getting a count of the objects in a collection.
Storing objects by value or by pointer.
and others...
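
As a quick illustration of the list above, here is a minimal sketch of those everyday operations on a std::map. The function names make_example and sorted_keys and the sample data are made up for this example.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sample data: store by key, then merge in a second collection.
std::map<std::string, int> make_example() {
    std::map<std::string, int> ages;
    ages["carol"] = 41;                      // store an object under a key
    ages.insert({"alice", 30});
    ages.insert({"bob", 25});
    std::map<std::string, int> more{{"dave", 19}};
    ages.insert(more.begin(), more.end());   // merge another collection
    return ages;
}

// Collects the keys in iteration order, which is the order given by std::less.
std::vector<std::string> sorted_keys(const std::map<std::string, int>& m) {
    std::vector<std::string> keys;
    for (const auto& kv : m) keys.push_back(kv.first);
    return keys;
}
```

Existence checks use count() or find(), retrieval uses find() or operator[], and size() gives the number of stored objects.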

The problem is that these collection classes use memory themselves. Aside from the value, each record needs memory equal to the size of the key plus the size of at least two pointers to adjacent records (typical red/black tree implementations also store a parent pointer and a colour flag).
Secondly, these two STL classes are slower than some other solutions when searching for data in a large collection, especially for strings.
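
To put a rough number on the memory point, here is a sketch with a hypothetical node layout. TreeNode and overhead_per_node are illustrative names only; real standard library nodes differ in detail, and the allocator adds its own bookkeeping on top.

```cpp
#include <cstddef>
#include <string>

// Hypothetical node, loosely modelled on a red/black tree node.
template <typename Key, typename Value>
struct TreeNode {
    TreeNode* left;
    TreeNode* right;
    TreeNode* parent;  // assumption: many implementations keep this for iteration
    bool      red;     // balancing flag
    Key       key;
    Value     value;
};

// Bytes per record that are not the value itself (key, pointers, flag, padding).
template <typename Key, typename Value>
constexpr std::size_t overhead_per_node() {
    return sizeof(TreeNode<Key, Value>) - sizeof(Value);
}
```

With std::string keys the figure is a lower bound, because long strings also hold their characters in a separate heap block.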
What are the alternatives? Well, there are lots, but they basically come down to a short list:

Red/black or b-tree

This is what std::map and std::set are made of. Traditionally the most flexible option, but performance suffers as the data gets big. Some of these structures also have problems with node balancing, which can cause a serious performance hit, especially when the data is entered already sorted. Their strong point is that they are very fast to iterate, as the key does not need to be unpacked along with the value.
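
Because the keys are held in sorted order, a tree can also walk through neighbouring keys cheaply. The sketch below shows one way to list all keys sharing a prefix with std::map; keys_with_prefix is a made-up helper name.

```cpp
#include <map>
#include <string>
#include <vector>

// Returns every key that starts with `prefix`, in sorted order.
std::vector<std::string> keys_with_prefix(const std::map<std::string, int>& m,
                                          const std::string& prefix) {
    std::vector<std::string> out;
    // lower_bound finds the first key >= prefix; all keys sharing the
    // prefix follow it contiguously, so stop at the first mismatch.
    for (auto it = m.lower_bound(prefix);
         it != m.end() && it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        out.push_back(it->first);
    }
    return out;
}
```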

Hash tables

Fast lookup, but you cannot iterate through nearby key values, so they are not very useful for similar-word searches and the like. Performance takes a hit when you have chosen a poor hash algorithm for your data, or when the hash table needs to be resized; the latter can often be fixed by declaring in advance how big the table is likely to be. The hash table can also be inefficient with memory if it is poorly tuned to the requirements of the data.
If you want the definitive rundown on hashed collection classes, have a look at the tommyds site.
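
Declaring the table size in advance might look like the sketch below. It uses std::unordered_map (the C++11 successor to the stdext::hash_map tested later) as a stand-in, which is my assumption rather than what the tests used; build_table and reserve_avoided_rehash are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Fills a hash table with the keys "0" .. "n-1", sizing it up front.
std::unordered_map<std::string, int> build_table(std::size_t n) {
    std::unordered_map<std::string, int> table;
    table.reserve(n);  // allocate enough buckets for n elements now
    for (std::size_t i = 0; i < n; ++i)
        table.emplace(std::to_string(i), static_cast<int>(i));
    return table;
}

// True if loading n keys after reserve(n) left the bucket count unchanged,
// i.e. no resize (rehash) happened during the load.
bool reserve_avoided_rehash(std::size_t n) {
    std::unordered_map<std::string, int> table;
    table.reserve(n);
    const std::size_t buckets = table.bucket_count();
    for (std::size_t i = 0; i < n; ++i)
        table.emplace(std::to_string(i), static_cast<int>(i));
    return table.bucket_count() == buckets;
}
```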

Patricia, Ternary and Ned tries, DAWG

Yes, tries, not trees! The key (e.g. a string) is split into an array of numbers or characters, and values are found by following a breadcrumb trail of these characters through the memory structure.
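
The breadcrumb idea can be sketched with a deliberately naive trie. SimpleTrie is a hypothetical class written for this illustration; it is not any of the libraries tested here, and a real Patricia trie compresses chains of single-child nodes, which this sketch does not.

```cpp
#include <map>
#include <memory>
#include <string>

// Naive trie mapping strings to ints: one node per character of the key.
class SimpleTrie {
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;  // one edge per character
        bool has_value = false;
        int value = 0;
    };
    Node root_;

public:
    void insert(const std::string& key, int value) {
        Node* node = &root_;
        for (char c : key) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());  // lay down a missing breadcrumb
            node = child.get();
        }
        node->has_value = true;
        node->value = value;
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* find(const std::string& key) const {
        const Node* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;  // trail ends early
            node = it->second.get();
        }
        return node->has_value ? &node->value : nullptr;
    }
};
```

Note that the key is never stored whole: it is implied by the path, which is why iterating has to rebuild each key from scratch.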

Judy array

A highly compact, exotic data structure that I do not fully understand (maybe one day). It took the bloke years to get it right, and he named it after his sister. I have written an STL-style wrapper around it, along with iterators. The keys are sorted.
Performance Test
Test harness
Below are the results of my tests. I do think that some of my timing results are suspect, and there is a bit of an apples-and-pears comparison going on here; however, you are welcome to compile my tests and see for yourself. This was done on a Lenovo X200 laptop (Core 2 Duo, 8 GB of RAM) with 64-bit Windows 7, running as a 32-bit process. The collections were made to store 1,000,000 strings containing values "0" to "999,999".
I have tested against:
A set of 32-bit values
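
For readers who want to try something similar, here is a rough sketch of how the load and find timings could be measured. It is not the harness used for the results in this post: Timings, make_keys and time_map are hypothetical names, the keys are generated with std::to_string (no thousands separators), and std::chrono is assumed to be available.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Timings {
    long long load_ms;   // time to insert every key
    long long find_ms;   // time to look every key up again
    std::size_t found;   // sanity check: how many lookups succeeded
};

// Generates the keys "0", "1", ..., "n-1".
std::vector<std::string> make_keys(std::size_t n) {
    std::vector<std::string> keys;
    keys.reserve(n);
    for (std::size_t i = 0; i < n; ++i) keys.push_back(std::to_string(i));
    return keys;
}

// Works with any map-like type offering operator[], find() and end().
template <typename Map>
Timings time_map(const std::vector<std::string>& keys) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::milliseconds;
    Map m;
    const auto t0 = clock::now();
    for (std::size_t i = 0; i < keys.size(); ++i) m[keys[i]] = static_cast<int>(i);
    const auto t1 = clock::now();
    std::size_t found = 0;
    for (const auto& k : keys) found += (m.find(k) != m.end()) ? 1 : 0;
    const auto t2 = clock::now();
    return Timings{std::chrono::duration_cast<ms>(t1 - t0).count(),
                   std::chrono::duration_cast<ms>(t2 - t1).count(),
                   found};
}
```

Calling time_map<std::map<std::string, int>>(make_keys(1000000)) gives one row's worth of load and find numbers; checking the found count guards against the compiler optimising the lookups away.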

std::set<const_char*>
xt::judy_set<const_char*>

A set of strings

std::set<std::string>
xt::judy_string_set

A string map to 32-bit integers

nPatriciaTrie
std::map<std::string,int>
stdext::hash_map<std::string,int>
ToolBox::Trie
uxn::patl::trie_map<std::string,int>
xt::judy_string_map<int>
xt::judy_string_val_map<int>

The variation in the amount of memory used is considerable: xt::judy_set<const_char*> used roughly one tenth of the memory of std::set<const_char*> (3396 against 31408 in the table below). If memory is an issue then Judy is the way to go, and because it is so compact it can be much faster once the other collections start paging memory to disk.
Anything that needed the contents of the key paid a speed penalty. If your string class does internal reference counting, this helps considerably with the std:: maps and sets, as the key is not copied.
Iteration was the red/black tree's main strength, as the key is stored uncompressed. This was especially the case when the key was a std::string, as the other collections had to build the strings from scratch.
There is only one good, reasonably fast all-rounder that can do exciting things such as Hamming and Levenshtein distances: the PATRICIA C++ library. You can also have a look at the Ternary Search Tree, but performance seemed to be an issue.
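
For anyone unfamiliar with the two distances mentioned, here is a plain sketch of what they measure. This is the textbook definition only, written as free functions with names of my choosing; it says nothing about how the PATRICIA library searches by distance inside the trie.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hamming distance: number of positions at which two equal-length strings differ.
std::size_t hamming(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());  // only defined for equal lengths
    std::size_t diff = 0;
    for (std::size_t i = 0; i < a.size(); ++i) diff += (a[i] != b[i]) ? 1 : 0;
    return diff;
}

// Levenshtein distance: minimum number of single-character insertions,
// deletions and substitutions needed to turn a into b (two-row dynamic programme).
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // "" -> b[0..j)
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;                                           // a[0..i) -> ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```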

Name	clear	copy	find	iterate	load	Memory used
nPatriciaTrie	-	-	421	-	1466	45596
std::map<std::string,int>	375	749	3416	63	5335	55136
std::set<const_char*>	468	531	562	62	1092	31408
std::set<std::string>	405	749	3573	250	4868	54368
stdext::hash_map<std::string,int>	483	1295	1466	32	2418	48480
tommy_hash_tester<char*,int>	-	-	31	-	890	39536
ToolBox::Trie	-	-	983	-	1060	21684
uxn::patl::trie_map<std::string,int>	390	1778	1591	172	2558	54904
xt::judy_set<const_char*>	47	951	499	484	577	3396
xt::judy_string_map<int>	655	0	2168	1950	2683	10248
xt::judy_string_set	671	2902	2044	2059	2714	12500
xt::judy_string_val_map<int>	2184	3058	1997	1950	2762	26212

Links

Patricia Trie Template Class

nPatriciaTrie is fast, but it is not a drop-in replacement for the STL containers: you cannot iterate through the collection.

Wikipedia Trie

ToolBox::Trie is reasonably fast, but it is not a drop-in replacement for the STL containers: you cannot iterate through the collection.

Practical Algorithm Template Library - C++ library on PATRICIA trie

uxn::patl::trie_map<std::string,int> is the true Swiss army knife of map collections and is reasonably fast (perhaps it could go faster with different allocators). It is the only one worth looking at for things like Hamming and Levenshtein distances.

A wrapper to this judy array code

xt::judy_set<const_char*>: Judy is not the fastest, but it is easily the most efficient with memory.

Roguewave sourcepro 9

RWTPtrHashMap<std::string,int,silly_hash,std::equal_to<std::string> > was not a true alternative to std::map, as the interface was too different. It also did not run very fast with 1,000,000 records in it. Perhaps there is a tweak to make it faster, but why pay for something when you can get it for free?

Edward Boggis-Rolfe

Wednesday 23 March 2011

Links

tommy_hash_collection

jdktrie

Ternary Search Tree

pb_ds trie

Ned Tries

google-sparsehash

boost::spirit

DAWG
