Is Using Hash Tables a Good Thing?
One of these days, I was looking at patterns of coding in some of the applications that I work with — written in Java and C#. I was then interested in seeing why our group happened to use hash tables so frequently in these languages. This was puzzling to me because I am used to write software in C (for many years now), and I hardly ever used, or found the necessity of using, hash tables. Of course, I am not considering the classic examples (such as symbol tables), but programming in general. And I can definitely say that the programs I write in C are not any slower because of this missing data structure.
After thinking a little bit, it became easier to see the reason why Java and C# not only provide, but need to
use hash tables in so many places. The fact is that hash tables are frequently used where pointers, structures, and vectors should suffice.
Uses of Hash Tables In Java
A hash table is frequently used as a loosely defined structure, which can hold whatever fields you want. Thus, one first use of hash tables is as non-typed storage that can hold any kind of object. This can also be a problem in C, but for better or worse, we have void pointers that can be used to hold anything. I am not saying that it is a good solution: it also means that we need to be extra careful not to blow up the whole program when accessing those values, but there is certainly a way to hold values or any desired type.
In a more general way, simple vectors of pointers can be used to do most of what a hash tables does. Your data just needs to store a value indicating where the second element is, in addition to its own data. For example, suppose you want a hash table from data type T1 to data type T2. Create a vector of pointers for elements of type T2. Use this vector as a list, growing it with malloc/realloc as necessary (sometimes it is easy to determine the maximum possible size, which makes it even more efficient). Then, for each added element t1 of type T1, use a new position on the vector to store the address of the element of type T2. Also, store the index of that position in t1.
Now, if you need to look up any object, just use the stored index into the array of pointers to recover t2. This is done in constant time, as we wanted. It is even easier and more efficient than using a hash table. The only disadvantage is that you need to add a new field to the object t1, which is pretty easy to do in C, but could hurt the feelings of someone that thinks in an object oriented way.
Clearly, there are situations in which this is not so easy to do. For example, if we are manipulating simple values such as integers, doubles, even strings, instead of structures, then you don’t have a place where to add the extra field. In that case, you might consider creating a hash table, even though I think it would still be easier in most cases to create a structure with the original value plus the index, and use the structure instead of the value object (and by the way, that is also more “object oriented”).
Using Data Instead of Code
Of course, there is an even simpler way: just put a pointer to the second object in the first object. While this is easier, there can be value in having an additional layer of indirection, so somebody else could manipulate the references without going through the objects of type T1.
We can also notice that this is an example of the general technique of using more data instead of algorithms. In the scheme described above, the indirection work is taken from the hash table code into the object itself. Now, it is not the hash table that knows how to find the object, instead the object itself knows the secret of how the other piece of data can be found. The first object just passes that secret to the vector in order to retrieve the object of type T2.
This is not to say that hash tables are not useful. They can be used in highly dynamic situations, in which we don’t know in advance which relationships will exist between objects. However, it must be said that in 99% of the circumstances this is not the case, and using hash tables to implement simple algorithms might just add overhead to the program. It can also mean that you are not putting the references for your objects in the place where they should go logically, which is a sign of weakness in the original design.
Given all the issues related to the use of hash table in Java and C#, I will certainly start to be more careful of the places where I use them, from now on.
Similar Posts:
About the Author
Carlos Oliveira holds a PhD in Systems Engineering and Optimization from University of Florida. He works as a software engineer, with more than 10 years of experience in developing high performance, commercial and scientific applications in C++, Java, and Objective-C. His most Recent Book is Practical C++ Financial Programming.
Hashtables are mostly used for caches.
Imagine small DNS server, which has to answer many domain->ip requests. How do you plan to have fast lookup of already resolved entries by name without using some kind of dictionary ?
Second cache usage is database frontend layer – instead of going for the same data over and over to database, you can cache it on application level. Again, some kind of map is very helpful here.
“All programming is an exercise in caching”. And if you have sparse domain, maps/dictionaries are the right tool.
By Artur on Nov 20, 2008
have you heard of decoupling? it’s a useful design principle.
The reason why maps are used so frequently is because the are very convenient and almost never the source of peformance issues.
premature optimization is the root of all evil.
By Henrik on Nov 20, 2008
Your attitude reminds me of Beating the Averages (www.paulgraham.com/avg.html ), specifically the part about Blub. If you haven’t read it, please do so, it will be enlightening.
By Tassos Bassoukos on Nov 20, 2008
If I understand the data structure you’re proposing, it’s as complicated as a hash table and less efficient. It’s only a constant time lookup if you’re in some sort of special case where you can calculate the index in T1 first in order to get the associated T2, or where you are iterating through all values. “Calculating the index” should ring a bell, in this context – it’s what hashing provides general solution for. But I don’t see you offering that… in the general case, you’ll have to perform a binary search on the T1 vector – O(Lg(N)) – before you can get your T2 element in constant time.
By Chase Saunders on Nov 20, 2008
@ Henrik,
I agree that hash maps have good performance, and I am glad they exist. I am just talking against the overuse of a good feature. A problem that I see is that we sometimes add indirection to code when it is not necessary.
The other problem I see is that data uses should be located as close as possible to other related data. Hash tables make it easy to spread data over your code, so
we need to be careful to identify when this is just making your code more complicated.
By coliveira on Nov 20, 2008
@ Chase,
Yes, you are right in general. But I am just describing the situation when you are mapping an object (not a string or integer) to another. You already know what object you are looking for, so there is no need to do a search for that object in the vector.
Again I am not saying that we shouldn’t use hash tables, but describing an alternative and thinking that they might be overused in current code.
By coliveira on Nov 20, 2008