First impressions with the Amazon Kindle

I recently received a new Amazon Kindle as a gift, and was very positively surprised by the product. It provides an intuitive way to read books, and even better, it provides online access to lots of content.

The first form of online content is, of course, the Amazon kindle store. The primary reason why Amazon provides free wireless (EVDO) access is to be able to easily sell books in their store. However, you don’t really need to pay for everything on Amazon. First of all, you can browser all the titles in the store. For most of them, you can get a free excerpt of the book. Frequently, the excepts are significant parts of the book, so you can decide if it really makes sense to buy it or not. For example, I downloaded yesterday the except of two books: a probability analysis book (mathematics) offered the first two chapter for free. The second book was a novel, with the first few chapter for free (in the case of a novel, of course, it make sense to offer enough to make the reader want to see the rest).

The kindle also has an (experimental) web browser, which will let you see very basic web pages. Its main restriction is that Javascript does not work properly. Since most modern websites are dynamic, this makes it very hard to have a good web experience (some people even hate it). However, a few sites are still usable: wikipedia is available, as well as Google search, and some news web sites.

The main functionality of the device, however, works really well. You can quickly browse books, and the battery lasts very long. The kindle uses a new type of display that consumes very little power, and is easier on the eyes then most computer screens.

It has been nice having access to so much information in one device. The only thing that I believe would really improve the experience with Kindle is to have more options for document formats. Right now, it accepts the mobi format (which is Kindle’s native format), along with text files. Other document formats can be converted using a free service from Amazon that can be accessed by email. However, accessing more formats directly, such as doc and pdf, would make life much easier.

  • Digg
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • E-mail this story to a friend!
  • HackerNews
  • Reddit
  • StumbleUpon
  • Twitter

Links for Wednesday 3

Here is a list of the links that I found more interesting today:

  • Digg
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • E-mail this story to a friend!
  • HackerNews
  • Reddit
  • StumbleUpon
  • Twitter

Editing multiple files in a single buffer

It is a fact of programming that software written in modern programming language is divided into multiple files. That is all programmers use text editors such as vim, emacs, and eclipse, that have facility to work with and move easily between files.  With all these facilities, however, there are times when what you really want is to be able to concentrate and work the functionality you are implementing on a single file.  I frequently had this desire, and solving it seamed so easy, although I didn’t see anyone doing this.  So, I decided to implement a simple scheme by which I can edit multiple files in the same editor buffer.

For example, suppose that you are writing a C++ application. You need to create one file for each class. Also, for each class you need an implementation and an interface file. The number of files quickly grows, so it becomes harder to track the files you are using and you need to change between buffers frequently. All of this contributes to make it harder to visualize you program.

The simple solution I am now using is the following: create only one file (with an extension .cc in the case of a C++ source). Then, divide the file into sections, where each section starts with a few magic characters (in my case these characters are “@****”), followed by the name of the file.  For example, the separator “@**** class1.h”, when used in a line by itself, means that the next lines are part of the file class1.h.  In order to create real files from the original buffer, I use a small python script that reads the main file, parses the magic characters and writes lines to the appropriate file. The script is called “multifile.py”, and it is invoked as

python multifile.py <name of multipart file>

For example, I usually compile code written in C using

python multifile.py mycode.c

This way, I can edit multiple file with easy using the same original vim buffer. Whenever I want to create a new file (a header file for example), I just add a new section with the file name and start typing.  The next time I run my python script, the file will be created.

Moreover, I can easily add existing files to the edited buffer. I just need to add a new section with the desired name, then use the :r vim command to load the file in the following lines.  

The only concern you might have is that “make” will be confused by what is going on — since you are going to write your files back to disk at each edit iteration.  However, notice that the main buffer needs only to contain the files that you are editing at this moment. The remaining of the files in the application will be untouched as always. So you are overwriting only the files that are currently part of your edits. This also means that you should remove from your working set all files that you are not really editing — you can always load back a file into the editing buffer as described above.

I find this method to be simple and efficient. It has allowed me to work on project much quickly, while minimize the number of files that I need to have open at any moment.

Here is a list of the script I am using. While it can be improved in many ways, it does what I need in order to implement the scheme described above. Feel free to provide suggestions for improvements.


import sys
if len(sys.argv) < 2:
   print 'usage: multifile fileName'
   quit()

stdout = sys.stdout
for line in open(sys.argv[1], "r"):
   if line[:5] == "@****":
      words = line.split('*')
      if len(words) < 5:
         print 'error parsing', line
      file = str(words[-1:][0]).strip()
      sys.stdout = stdout
      print 'writing file', file
      sys.stdout = open(file, "w")
   else: print line,

  • Digg
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • E-mail this story to a friend!
  • HackerNews
  • Reddit
  • StumbleUpon
  • Twitter

Is Using Hash Tables a Good Thing?

One of these days, I was looking at patterns of coding in some of the applications that I work with — written in Java and C#.  I was then interested in seeing why our group happened to use hash tables so frequently in these languages. This was puzzling to me because I am used to write software in C (for many years now), and I hardly ever used, or found the necessity of using, hash tables. Of course, I am not considering the classic examples (such as symbol tables), but programming in general. And I can definitely say that the programs I write in C are not any slower because of this missing data structure.  

After thinking a little bit, it became easier to see the reason why Java and C# not only provide, but need to
use hash tables in so many places.  The fact is that hash tables are frequently used where pointers, structures, and vectors should suffice.  

Uses of Hash Tables In Java

 

A hash table is frequently used as a loosely defined structure, which can hold whatever fields you want. Thus, one first use of hash tables is as non-typed storage that can hold any kind of object. This can also be a problem in C, but for better or worse, we have void pointers that can be used to hold anything. I am not saying that it is a good solution: it also means that we need to be extra careful not to blow up the whole program when accessing those values, but there is certainly a way to hold values or any desired type.  

In a more general way, simple vectors of pointers can be used to do most of what a hash tables does. Your data just needs to store a value indicating where the second element is, in addition to its own data. For example, suppose you want a hash table from data type T1 to data type T2. Create a vector of pointers for elements of type T2. Use this vector as a list, growing it with malloc/realloc as necessary (sometimes it is easy to determine the maximum possible size, which makes it even more efficient). Then, for each added element t1 of type T1, use a new position on the vector to store the address of the element of type T2.  Also, store the index of that position in t1.  

Now, if you need to look up any object, just use the stored index into the array of pointers to recover t2. This is done in constant time, as we wanted.  It is even easier and more efficient than using a hash table. The only disadvantage is that you need to add a new field to the object t1, which is pretty easy to do in C, but could hurt the feelings of someone that thinks in an object oriented way.  

Clearly, there are situations in which this is not so easy to do. For example, if we are manipulating simple values such as integers, doubles, even strings, instead of structures, then you don’t have a place where to add the extra field. In that case, you might consider creating a hash table, even though I think it would still be easier in most cases to create a structure with the original value plus the index, and use the structure instead of the value object (and by the way, that is also more “object oriented”).  

Using Data Instead of Code

Of course, there is an even simpler way: just put a pointer to the second object in the first object. While this is easier, there can be value in having an additional layer of indirection, so somebody else could manipulate the references without going through the objects of type T1.  

We can also notice that this is an example of the general technique of using more data instead of algorithms. In the scheme described above, the indirection work is taken from the hash table code into the object itself. Now, it is not the hash table that knows how to find the object, instead the object itself knows the secret of how the other piece of data can be found. The first object just passes that secret to the vector in order to retrieve the object of type T2.  

This is not to say that hash tables are not useful. They can be used in highly dynamic situations, in which we don’t know in advance which relationships will exist between objects. However, it must be said that in 99% of the circumstances this is not the case, and using hash tables to implement simple algorithms might just add overhead to the program. It can also mean that you are not putting the references for your objects in the place where they should go logically, which is a sign of weakness in the original design.  

Given all the issues related to the use of hash table in Java and C#, I will certainly start to be more careful of the places where I use them, from now on.

  • Digg
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • E-mail this story to a friend!
  • HackerNews
  • Reddit
  • StumbleUpon
  • Twitter

Implementing a trie data structure in C

There are some data structures that have wide applicability in several situations. One such example are trie data structures, which can be used in as varied situations as in compilers, text editors, and related applications.

This afternoon I had some spare time, so I decided to implement a trie data structure (something that I always considered interesting). A trie is just a kind of tree that can be used to store string keys. To improve search and insert speed, however, a trie is implemented in a way such that the stored data is shared between nodes in the data structure. According to the wikipedia entry [1]:

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

The Implementation

For fun, I decided to write the implementation in C, using Knuth’s CWEB system. The idea is that I can read a text file, split the file into individual words, and add each word to the trie data structure. Currently, the main restriction of the implementation is that it handles only the characters from A to Z. This means that it is good to store words, such as in a dictionary, but it doesn’t work to store identifiers in a C-like language, for example (since these can have numbers and underscores).

One easy operation to do with such a data structure is to count the number of distinct words. For example, considering only the words that we can handle in this program, I could identify 3138 distinct words in the file texprogram.tex, the original source code of the TeX program by Knuth. This was executed in less than 0.2s, so it looks pretty quick for a first implementation.

The program also has an option to print the distinct words found. You just need to execute

./trie fileName -p

and the program will print all distinct words in the file “fileName”.

Future Improvements

One of the uses for a trie is to intern strings (think of the Java intern() method). The next time I have some time, I would like to extend this program to allow C strings to be interned easily within a trie data structure.

With such an implementation, one could use the trie to return a pointer for each passed string. This way, instead of using strcmp to compare two strings, one can just compare the pointers stored in the trie, which is extremely fast.

My sample code can be found in the software section, and you can read the source code in PDF (an online version is at the end of this post). Let me know if you have any improvements.

Further Reading

References:

[1] http://en.wikipedia.org/wiki/Trie

trie data structure implementation

  • Digg
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • E-mail this story to a friend!
  • HackerNews
  • Reddit
  • StumbleUpon
  • Twitter