Implementing a trie data structure in C

There are some data structures that have wide applicability in several situations. One such example are trie data structures, which can be used in as varied situations as in compilers, text editors, and related applications.

This afternoon I had some spare time, so I decided to implement a trie data structure (something that I always considered interesting). A trie is just a kind of tree that can be used to store string keys. To improve search and insert speed, however, a trie is implemented in a way such that the stored data is shared between nodes in the data structure. According to the wikipedia entry [1]:

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

The Implementation

For fun, I decided to write the implementation in C, using Knuth’s CWEB system. The idea is that I can read a text file, split the file into individual words, and add each word to the trie data structure. Currently, the main restriction of the implementation is that it handles only the characters from A to Z. This means that it is good to store words, such as in a dictionary, but it doesn’t work to store identifiers in a C-like language, for example (since these can have numbers and underscores).

One easy operation to do with such a data structure is to count the number of distinct words. For example, considering only the words that we can handle in this program, I could identify 3138 distinct words in the file texprogram.tex, the original source code of the TeX program by Knuth. This was executed in less than 0.2s, so it looks pretty quick for a first implementation.

The program also has an option to print the distinct words found. You just need to execute

./trie fileName -p

and the program will print all distinct words in the file “fileName”.

Future Improvements

One of the uses for a trie is to intern strings (think of the Java intern() method). The next time I have some time, I would like to extend this program to allow C strings to be interned easily within a trie data structure.

With such an implementation, one could use the trie to return a pointer for each passed string. This way, instead of using strcmp to compare two strings, one can just compare the pointers stored in the trie, which is extremely fast.

My code has been tested on UNIX systems, and you can read the complete source code and documentation in PDF (see instructions for the online version at the end of this post). Let me know if you have any improvements.

Get The Complete Code

Given the great interest generated by the implementation of this algorithm, I have released the complete implementation in literate programming format (PDF file). Click in the button bellow to get more information about this data structure, with all details necessary to make it work in C.

By clicking the button bellow you will be redirected to PayPal.
After you make the payment, an email will be sent to your address with a download link.
The download will be in PDF format, with the complete listing in C using literate programming documentation.

About the Author

Carlos Oliveira holds a PhD in Systems Engineering and Optimization from University of Florida. He works as a software engineer, with more than 10 years of experience in developing high performance, commercial and scientific applications in C++, Java, and Objective-C. His most Recent Book is Practical C++ Financial Programming.

Published on November 16th, 2008

4 Responses to “Implementing a trie data structure in C”

Dear Dr. Oliveira.

I am new at c++. I was searching for some information about tries and found your wonderful article at http://coliveira.net/software/implementing-a-trie-data-structure-in-c/ .

This is a fascinating topic, although a bit too advanced for me. My extent of knowledge of C++ is that I cant even spell it!

In any case, I downloaded your source code. I was devastated to learn it was in C (not C++ which I know bare basics for) and is for UNIX. The second obstacle I overcame. I created a LINUX VM, found some references online on make, configure and install, so it works.

The first issue (having the source code written in C) I am still very much is struggling with. I found some UNIX software online (C to C++ converter) and will try it tomorrow.

Can you answer the following questions, please:

1) Do you have this code in C++ instead of C?

2) I created a simple text file with a few words and your code prints its contents to screen. However, I have larger files (in MBs and GBs) that contain hundreds of lines with numbers like:

794946673453453454599999934534534534566342

794946673453453568450004568945686456456456

…

and your program cant read them (0 values returned or something like that). These files were created with Big Integer C++ library for another project. Why cant these file be read? Is it because they consist of numbers? Or because they are too long to be interpreted? How do I take care of this? I think BigInt is only written for C++, not C.

3) I have read the pdf for your source code but I cant find a reference on how to retrieve a particular value (lets say “does 794946673453453454599999934534534534566342 exist in in this trie”? How would I do that? Did I miss it in the doc?

4) My understanding of the program is that removes redundant parts of the strings. Lets say first and second line have this string in common: 794946673453453, the rest is different. The program creates a new branch at the first different character. is this correct?

Thanks you so very much for your help.

Sincerely,

Brugio.

By Brugio on Sep 19, 2011
Respected Sir,

I am trying to implement B-trie(Disk resident trie).
what i know about disk rsident is disk-resident means that the data structure is stored and accessed on non-volatile memory, such as a hard disk or a solid state drive. Main memory
is typically only used as a buffer; updates are often performed in a write-through manner — meaning that an update is not held in memory for an indefinite amount of time,it is instead immediately forwarded to disk (via the operating system).
I read all the algorithms of b-trie from http://www.naskitis.com/naskitis-vldbj09.pdf . even search and insert implemented in memory.

How it should be implemented on disk ? pls guide me with the flow of application ?

i am unable to understand the concept how to store it on disk ?

By Sahil Singla on Dec 21, 2011
Hi

I got an question in VIVA as what is meant by TRIE data structure, really i hadnt heard such a data structure and i thought that mam is kidding but now i found that there is something like that, thanks for your xplntn…Goood

By ash on Jan 11, 2012