Implementing a trie data structure in C
There are some data structures that have wide applicability in several situations. One such example are trie data structures, which can be used in as varied situations as in compilers, text editors, and related applications.
This afternoon I had some spare time, so I decided to implement a trie data structure (something that I always considered interesting). A trie is just a kind of tree that can be used to store string keys. To improve search and insert speed, however, a trie is implemented in a way such that the stored data is shared between nodes in the data structure. According to the wikipedia entry :
In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.
For fun, I decided to write the implementation in C, using Knuth’s CWEB system. The idea is that I can read a text file, split the file into individual words, and add each word to the trie data structure. Currently, the main restriction of the implementation is that it handles only the characters from A to Z. This means that it is good to store words, such as in a dictionary, but it doesn’t work to store identifiers in a C-like language, for example (since these can have numbers and underscores).
One easy operation to do with such a data structure is to count the number of distinct words. For example, considering only the words that we can handle in this program, I could identify 3138 distinct words in the file texprogram.tex, the original source code of the TeX program by Knuth. This was executed in less than 0.2s, so it looks pretty quick for a first implementation.
The program also has an option to print the distinct words found. You just need to execute
./trie fileName -p
and the program will print all distinct words in the file “fileName”.
One of the uses for a trie is to intern strings (think of the Java intern() method). The next time I have some time, I would like to extend this program to allow C strings to be interned easily within a trie data structure.
With such an implementation, one could use the trie to return a pointer for each passed string. This way, instead of using strcmp to compare two strings, one can just compare the pointers stored in the trie, which is extremely fast.
My code has been tested on UNIX systems, and you can read the complete source code and documentation in PDF (see instructions for the online version at the end of this post). Let me know if you have any improvements.
- Data Structures and Algorithms (by Aho, Ullman and Hopcroft) is a classic book with a very good discussion of this data structure.
- Algorithms in C++ (by Sedgewick) is also excellent, and it has C++ code for most algorithms.
Get The Complete Code
Given the great interest generated by the implementation of this algorithm, I have released the complete implementation in literate programming format (PDF file). Click in the button bellow to get more information about this data structure, with all details necessary to make it work in C.
- By clicking the button bellow you will be redirected to PayPal.
- After you make the payment, an email will be sent to your address with a download link.
- The download will be in PDF format, with the complete listing in C using literate programming documentation.
About the Author
Carlos Oliveira holds a PhD in Systems Engineering and Optimization from University of Florida. He works as a software engineer, with more than 10 years of experience in developing high performance, commercial and scientific applications in C++, Java, and Objective-C. His most Recent Book is Practical C++ Financial Programming.