Saturday, June 2, 2012

A Boggling Return to C


My first real programming class, back in high school, was taught in C. I found it difficult and I remember spending the whole year thinking that I really had no aptitude for programming. I think I got an OK grade in the end, but I felt like I was really struggling compared to the other students in the class. It wasn't until a year or two later, taking CS classes in college and finding them easy, that I figured a few things out. First, that C programming class was incredibly intensive. By the end of the school year, I'd implemented, from scratch, a 3D wireframe graphics engine. Second, I felt like I was struggling compared to my friends in the class, because my friends were complete prodigies who had already been coding in other languages for years.
I haven't really coded in C for the last few years though. Aside from a couple classes that happened to involve a bit of C as an aside, college CS classes were getting on the Java bandwagon, I'd discovered Perl as a great language to just get stuff done. Since then it's just been Perl, Python, and JavaScript with excursions into other high-level functional languages for fun (ML, Haskell, Lisp, Erlang, etc).
Recently, I had the urge to get back to my roots and relearn C. I'm not sure exactly why. I think I've just had a few too many moments lately where I've felt like software I was using or working on was slower and more bloated feeling than it ought to be. I know writing in C won't automatically fix that, but I've had the urge to write small, efficient programs. To make my own data structures and libraries where every clock cycle and byte of memory used are essential to the problem being solved.
Any time I try to learn a new language, I like to start with a toy program or two. Something simple enough to get my head around, but meaty enough to not be completely trivial.
For my first little toy project this time, I've dusted of my copy of K&R and written a program that, given a Boggle board, will solve it, finding all the english words that appear.
I implemented this before in Python about a year ago for fun. You can see that program here.
The C version is functional but will probably make more experienced C programmers cringe. C syntax is pretty straightforward and not hard to pick back up, but I've got very little experience with the standard libraries available and I'm still re-familiarizing myself with the idioms of developing in that language.
I also took a fundamentally different approach this time, since I've had a year of background thinking on the most efficient way to do this.
Both versions rely on the standard unix 'words' file (/usr/share/dict/words on most Linux boxen) as a master dictionary. They also both expect a board to be defined in a textfile in a very basic manner, eg:
tntb
gdar
elte
sish
I'm not going to go over the rules of Boggle, since you can google it.
The approach I used in Python was to first come up with a reduced alphabet consisting only of the letters that actually show up on the board. In the example case above: abdegilnrst. Then I go through each line of the words file, skipping it if there's an apostrophe, a capital letter (no proper nouns in Boggle), or any letters in the word do not appear in the restricted alphabet. The ones that are left are potentially findable words. Then I do a fairly straightforward search at each position of the board, going neighbor to neighbor to see if that word appears. Then at the end, I take all the words that were found and sort them by length.
That's a really rough, fairly brute-force approach. I won't apologize. That's kind of how I like to do things in Python. It probably took me an hour to write that code and it runs fast enough that I could use it to cheat and get really high scores on online Boggle games (start a new game, quickly type the board into a text file, run the Python script, which took a couple seconds, then spend the next minute or two typing in the words it found, starting with the longest, highest scoring ones first). Basically a total success as far as I'm concerned. Python absolute excels for this approach to problem solving where programmer-time matters a lot and there's a certain performance threshold of "good enough" that, once you reach, you don't gain by exceeding. With Python and a couple basic built-in data structures like lists and hash tables, there are few problems you can't tackle by just taking a very straightforward approach and adding a few quick filters to reduce the search-space drastically. Sophistication is possible, but rarely necessary.
But C is different to me. I feel like since C usually requires more lines of code and more care to get a program working but offers far more control and efficiency, it requires a different approach. It's worth taking the time to get it right to begin with and avoid wasting cycles and memory.
I've had a few ideas in my head lately for things that I want to build that I think will actually form a nice foundation for other work. Things that would really benefit from being small, fast, efficient and "right" from the ground up. This is what's got me interested in getting back into C.
Back to Boggle.
The C version takes a completely different approach, that I thought up recently and seems elegant. It's based on a Prefix Tree, or "Trie" data structure.
What I do is first go through the words file (again throwing away words that have apostrophes or capital letters) and building a Trie out of it. The result is a Trie structure representing every word in the english language. Checking to see if some word exists is O(n) where n is the number of letters in the word, and further will always be very small in this application. Next, we go through each position on the board as a starting point and then go through all the possible word paths from that point, checking against the master word Trie.
Now, the nice thing about using the Trie instead of just a plain hash table lookup or binary search or something, is that in addition to being able to tell if a given word exists, we can also find out whether there are any words that start with a given prefix just as quickly. That means that as we check every possible path through the board, we can eliminate subpaths very quickly when the Trie tells us that there can't possibly be any valid words that start with that prefix. Eg, starting at the 't' in the upper left corner of the example board and looking at all the paths that can originate from there, we can pretty much instantly eliminate them all since there are no words in the english language that start with 'tn', 'tg', or 'td'. This is the sort of reasoning that a human mind can do quickly, and a Trie is how we can get a computer to be about as clever in this case. This rapid pruning of the search space pays dividends quickly.
The code obviously isn't perfect. I've actually completely ignored the case of 'Qu', which Boggle puts as a single tile. It wouldn't be hard to make my code handle that one, but I left it out as it wouldn't fundamentally change the approach; it just adds a bit of extra complexity that would make it less clear.
One enhancement I could possibly make would be to do the same filtering step I did in the Python version, coming up with the restricted alphabet of only letters that appear on the board first and skipping any words in the english dictionary that have any letters besides those. That would make the Trie construction step faster and the resulting, smaller Trie would use less memory and be faster to search. But building the Trie of the entire dictionary already only takes 82 milliseconds on my laptop (the second part of the program, walking all the subpaths through the board and checking for matches takes about 20 milliseconds for a 5x5 board. By way of comparison, the Python program takes 1.5 seconds to run, so that's about a 10X speedup).
The other enhancement which I might do once I've thought about it a bit more is to serialize a built Trie to a file on disk that could be read in quickly and completely skip the step of building the same Trie every time the program is run.
Writing the C version took me a few evenings, with a lot of stupid mistakes and sidetracks, but it was fun. I like looking at the code and knowing that any inefficiency in it is because I didn't design it well enough, not because I'm using a massive library or framework that has a thousand features that I don't actually need for this particular problem.

No comments:

Post a Comment