Is Using Hash Tables a Good Thing?

One of these days, I was looking at patterns of coding in some of the applications that I work with — written in Java and C#.  I was then interested in seeing why our group happened to use hash tables so frequently in these languages. This was puzzling to me because I am used to write software in C (for many years now), and I hardly ever used, or found the necessity of using, hash tables. Of course, I am not considering the classic examples (such as symbol tables), but programming in general. And I can definitely say that the programs I write in C are not any slower because of this missing data structure.  

After thinking a little bit, it became easier to see the reason why Java and C# not only provide, but need to
use hash tables in so many places.  The fact is that hash tables are frequently used where pointers, structures, and vectors should suffice.  

Uses of Hash Tables In Java

 

A hash table is frequently used as a loosely defined structure, which can hold whatever fields you want. Thus, one first use of hash tables is as non-typed storage that can hold any kind of object. This can also be a problem in C, but for better or worse, we have void pointers that can be used to hold anything. I am not saying that it is a good solution: it also means that we need to be extra careful not to blow up the whole program when accessing those values, but there is certainly a way to hold values or any desired type.  

In a more general way, simple vectors of pointers can be used to do most of what a hash tables does. Your data just needs to store a value indicating where the second element is, in addition to its own data. For example, suppose you want a hash table from data type T1 to data type T2. Create a vector of pointers for elements of type T2. Use this vector as a list, growing it with malloc/realloc as necessary (sometimes it is easy to determine the maximum possible size, which makes it even more efficient). Then, for each added element t1 of type T1, use a new position on the vector to store the address of the element of type T2.  Also, store the index of that position in t1.  

Now, if you need to look up any object, just use the stored index into the array of pointers to recover t2. This is done in constant time, as we wanted.  It is even easier and more efficient than using a hash table. The only disadvantage is that you need to add a new field to the object t1, which is pretty easy to do in C, but could hurt the feelings of someone that thinks in an object oriented way.  

Clearly, there are situations in which this is not so easy to do. For example, if we are manipulating simple values such as integers, doubles, even strings, instead of structures, then you don’t have a place where to add the extra field. In that case, you might consider creating a hash table, even though I think it would still be easier in most cases to create a structure with the original value plus the index, and use the structure instead of the value object (and by the way, that is also more “object oriented”).  

Using Data Instead of Code

Of course, there is an even simpler way: just put a pointer to the second object in the first object. While this is easier, there can be value in having an additional layer of indirection, so somebody else could manipulate the references without going through the objects of type T1.  

We can also notice that this is an example of the general technique of using more data instead of algorithms. In the scheme described above, the indirection work is taken from the hash table code into the object itself. Now, it is not the hash table that knows how to find the object, instead the object itself knows the secret of how the other piece of data can be found. The first object just passes that secret to the vector in order to retrieve the object of type T2.  

This is not to say that hash tables are not useful. They can be used in highly dynamic situations, in which we don’t know in advance which relationships will exist between objects. However, it must be said that in 99% of the circumstances this is not the case, and using hash tables to implement simple algorithms might just add overhead to the program. It can also mean that you are not putting the references for your objects in the place where they should go logically, which is a sign of weakness in the original design.  

Given all the issues related to the use of hash table in Java and C#, I will certainly start to be more careful of the places where I use them, from now on.

Implementing a trie data structure in C

There are some data structures that have wide applicability in several situations. One such example are trie data structures, which can be used in as varied situations as in compilers, text editors, and related applications.

This afternoon I had some spare time, so I decided to implement a trie data structure (something that I always considered interesting). A trie is just a kind of tree that can be used to store string keys. To improve search and insert speed, however, a trie is implemented in a way such that the stored data is shared between nodes in the data structure. According to the wikipedia entry [1]:

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

The Implementation

For fun, I decided to write the implementation in C, using Knuth’s CWEB system. The idea is that I can read a text file, split the file into individual words, and add each word to the trie data structure. Currently, the main restriction of the implementation is that it handles only the characters from A to Z. This means that it is good to store words, such as in a dictionary, but it doesn’t work to store identifiers in a C-like language, for example (since these can have numbers and underscores).

One easy operation to do with such a data structure is to count the number of distinct words. For example, considering only the words that we can handle in this program, I could identify 3138 distinct words in the file texprogram.tex, the original source code of the TeX program by Knuth. This was executed in less than 0.2s, so it looks pretty quick for a first implementation.

The program also has an option to print the distinct words found. You just need to execute

./trie fileName -p

and the program will print all distinct words in the file “fileName”.

Future Improvements

One of the uses for a trie is to intern strings (think of the Java intern() method). The next time I have some time, I would like to extend this program to allow C strings to be interned easily within a trie data structure.

With such an implementation, one could use the trie to return a pointer for each passed string. This way, instead of using strcmp to compare two strings, one can just compare the pointers stored in the trie, which is extremely fast.

My code has been tested on UNIX systems, and you can read the complete source code and documentation in PDF (see instructions for the online version at the end of this post). Let me know if you have any improvements.

Further Reading

References:

[1] http://en.wikipedia.org/wiki/Trie

Get The Complete Code

Given the great interest generated by the implementation of this algorithm, I have released the complete implementation in literate programming format (PDF file). Click in the button bellow to get more information about this data structure, with all details necessary to make it work in C.

  • By clicking the button bellow you will be redirected to PayPal.
  • After you make the payment, an email will be sent to your address with a download link.
  • The download will be in PDF format, with the complete listing in C using literate programming documentation.

Add to Cart

Writing a Win32 Application in SBCL Common Lisp

The sucess of dynamic languages such as Python and Ruby in the web has created a renewed interest in Lisp. After all, Lisp is the grandfather of all dynamic languages. 

For Windows users, however, it has been traditionally difficult to find good implementations of Common Lisp. The main implementations of Lisp have some difficulties that make them unsuitable to use on Windows. For example, CMU-CL runs only on UN*X systems. Allegro Lisp (from Franz Inc.) runs well on Windows, but is not free. CLisp runs on Windows using cygwin, but it is not well integrated on Windows, and is an interpreted Common Lisp.

Recently, though, SBCL, a variant of CMU-CL has been ported to Windows. There are several things that make SBCL attractive as a development compiler.

  • It is really fast
  • It is stable, with a large community of dedicated Lisp users
  • Generates executables, so you can distribute your program without saying in which language it was created — good if you want to sell your programs.

To find out if SBCL is really suitable for general development on Windows, I decided to create a graphical Win32 application. If that works, then it should be safe to use SBCL to write Windows applications. In this article, I will explain how to use SBCL to write such a Win32 application. 

Installing SBCL

The first step is to install the SBCL sytem. You can get it from http://www.sbcl.org/ and use the installer. The installer will even put the executable directory of sbcl in your path.

To create windows applications, however, you will need a windows calling library, created by Alastair Bridgewater. With this little piece of glue code, you can directly call the Win32 API functions. You can also create a callback function in Common Lisp, and have it passed into the Windows functions.  This is how a window procedure looks like:

 (defun window-procedure (hwnd imsg wparam lparam)

(cond

    ((= imsg wm_paint)   (on-window-paint hwnd lparam) 0)
    ((= imsg wm_lbuttondown) (messagebox hwnd "test" "gedit" 0) 0)
    ((= imsg wm_lbuttondown) (on-mouse-down hwnd lparam) 0)
    ((= imsg wm_destroy) (postquitmessage 0) 0)
    (t (defwindowproc hwnd imsg wparam lparam))))

The Drawing Application

The application I created is a simple drawing program. It just draws a line whenever we click, extending from the previous mouse location to the current mouse location. Here is the portion of code that captures the mouse location:

(defvar *point1* (make-my-point :x 10 :y 10))
(defvar *point2* (make-my-point :x 100 :y 100))
(defun on-mouse-down (hwnd lparam)
  ; save current point
  (setf (my-point-x *point1*) (my-point-x *point2*))
  (setf (my-point-y *point1*) (my-point-y *point2*))
  ; set second point
  (set-point-from-windows-msg lparam *point2*)
  (invalidaterect hwnd nil t)))

Notice that we receive the mouse coodinates in the lparam argument, which is in encoded format (as usual in the Win32 API). We need to define the function set-point-from-windows-msg to get the real position of the new point. This function can be defined as follows:

(defun set-point-from-windows-msg (lparam point)
  "lparam is a 32 bit value encoding two 16 bit words.
  we create a point from lparam using these two words."
  (let ((s 16))
    (setf (my-point-x point) (mod lparam #x10000))
    (setf (my-point-y point) (ash lparam (- s)))))

Here we are using two Common Lisp arithmetic functions:

  • mod returns the module of the division of the first number by the second number. In this case, it returns the low bytes in a 32 bit word.
  • ash shifts the integer value by the number of positions in the second argument. In this case, it shifts lparam 16 positions to the right, and returns the high bytes in a 32 bit word.
Now, we can just paint the two points when necessary (the update is automatically called using the function invalidaterect).
(defun paint-line (hdc hwnd rect rect-addr)
  (moveto hdc (my-point-x *point1*) (my-point-y *point1*))
  (lineto hdc (my-point-x *point2*) (my-point-y *point2*)))

(defun on-window-paint (hwnd lparam)
  (with-alien ((ps (struct paintstruct)) (rect (struct rect)))
              (let ((hdc (beginpaint hwnd (addr ps))))
                (paint-line hdc hwnd rect (addr rect))
                (endpaint hwnd (addr ps)))))

Running the Application

You can download the application from here, including the win32 glue files. To run the application, you just need to call

sbcl geditor.lisp 

The program is simple, but it shows that SBCL can handle most features required to write windows programs. The SBCL port for Windows is, however, still experimental (notice the warning printed when sbcl is run). I expect, however, that it will evolve into a great option for Lisp programmers that need to access win32 features.