Thursday, May 31, 2012

A simple example of Huffman coding on a string


You’ve probably heard of David Huffman and his popular compression algorithm. If you haven’t, you’ll find plenty of information on the Internet. I will not bore you with history or math lessons in this article; instead, I’m going to show you a practical example of the algorithm applied to a character string. The application will only print to the console the code values for the input symbols, and reconstruct the original symbols from a given code.

The source code attached to this article shows how Huffman coding works, so you can get a basic understanding of it even if you have difficulty with the underlying mathematics. In a future article (I hope) we’ll talk about how to apply this to arbitrary files to produce their compressed format (a simple file archiver like WinZip or WinRAR).
The idea behind Huffman coding is based on the frequency of each symbol in a sequence. The most frequent symbol in the sequence gets a very short code, and the least frequent symbol gets a very long one. When we encode the input, the frequent symbols therefore take less space than they used to, while the rare symbols take more; but because they are rare, that costs little overall. For this application I chose the symbol to be 8 bits long, so each symbol is a character (char).
We could just as easily have chosen a 16-bit symbol, grouping two characters together, or 10 bits, or 20, and so on. We choose the size of the symbol and the way we use it depending on the input we expect; for example, if I expected to encode raw video files, I would choose the symbol to be the size of a pixel. Keep in mind that increasing or decreasing the symbol size affects the length of the code for each symbol, because the bigger the size, the more distinct symbols there can be: there are fewer ways to arrange ones and zeroes in 8 bits than in 16. You’ll want to pick the symbol size based on how the ones and zeroes are likely to repeat themselves in your data.
For this algorithm you need a basic understanding of the binary tree and priority queue data structures. The source code actually reuses the priority queue code from a previous article.
Let’s say we have the string “beep boop beer!”, which in its current form occupies 1 byte of memory per character, or 15*8 = 120 bits in total. After encoding, the string will occupy 40 bits. (In this application we output to the console a string of 40 ‘0’ and ‘1’ characters representing the encoded string in bits. To make it actually occupy 40 bits we would need to pack that string into real bits using bitwise operations, which we won’t discuss here.)
The string “beep boop beer!” is a very good example to illustrate the process. To obtain the code for each element based on its frequency, we need to build a binary tree in which each leaf contains a symbol (a character from the string). The tree is built from the leaves to the root, so the least frequent elements end up farther from the root than the more frequent ones. You’ll soon see why we do this.
To build the tree this way, we use a priority queue with a slight twist: the element with the lowest priority is the most important, so the least frequent elements are the first ones we take out of the queue. We need this in order to build the tree from the leaves to the root.
First we calculate the frequency of each character:
Character | Frequency
----------+----------
  'b'     |     3
  'e'     |     4
  'p'     |     2
  ' '     |     2
  'o'     |     2
  'r'     |     1
  '!'     |     1
After calculating the frequencies, we create a binary tree node for each character and insert them all into the priority queue, with the frequency as the priority:
We now take the first two elements from the queue and link them by creating a new binary tree node that has both of them as successors, so the two characters become siblings, and we add their priorities together. We then insert the new node into the queue, with the sum of its successors’ priorities as its priority. (The numbers represent the priority, i.e. the frequency.)
We repeat the same steps and get the following:
After we link the last two elements, we get the final tree:
Now, to obtain the code for each symbol we just traverse the tree until we reach that symbol, appending a 0 to the code for every step to the left and a 1 for every step to the right.
Doing this, we get the following codes:
Character | Code
----------+------
  'b'     | 00
  'e'     | 11
  'p'     | 101
  ' '     | 011
  'o'     | 010
  'r'     | 1000
  '!'     | 1001
To decode a string of bits, we traverse the tree one bit at a time: a 0 takes a left step and a 1 takes a right step, until we hit a leaf, which is the symbol we were looking for. For example, decoding the string “101 11 101 11” with our tree gives the string “pepe”.
It’s very important that no code is a prefix of another symbol’s code. In our example, since 00 is the code for ‘b’, 000 cannot be the code for any other symbol, because there would be a conflict: after following the first two bits we would already have reached ‘b’, so we could never reach the symbol for 000.
A practical aspect of implementing this algorithm is to build a Huffman table as soon as we have the tree. The table is basically a linked list or an array that maps each symbol to its code, and it makes encoding much more efficient: looking a symbol up by traversing the tree while computing its code is expensive, because we don’t know where exactly that symbol is located in the tree. As a rule, we use the Huffman table for encoding and the Huffman tree for decoding.
The input string : beep boop beer!
The input string in binary : 0110 0010 0110 0101 0110 0101 0111 0000 0010 0000 0110 0010 0110 1111 0110 1111 0111 0000 0010 0000 0110 0010 0110 0101 0110 0101 0111 0010 0010 0001
The encoded string : 0011 1110 1011 0001 0010 1010 1100 1111 1000 1001
As you can see, there is a major difference between the ASCII version of the string and the Huffman-coded version.
The source code behaves as described above. You’ll find more details in the comments in the code.
All the sources have been compiled and verified using the C99 standard. Happy programming.
To be clearer, since there have been comments about this: this article only illustrates how the algorithm works. To use it in a real scenario, you need to attach the Huffman tree you built to the start or the end of the encoded string, and the receiver must know how to interpret it in order to decode the message. A good way to do this is to build a string of bits by traversing the tree in any order you like (I prefer post-order), concatenating a 0 for each node that is not a leaf and a 1 for each leaf, followed by the bits of the original symbol (in our case, the 8-bit ASCII char value). Placing this string of bits at the start of the encoded message is ideal. Once the receiver has built the tree, he will know how to decode the message and obtain the original.

Quotes about programming languages


What computer scientists, authors and programmers think of popular programming languages.
All languages
Tony Hoare quote about programming languages
"There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies."
Tony (C.A.R.) Hoare.
Computing professor, implemented Algol 60, researcher at Microsoft Research.

Algol 60 (later inherited by C)
"I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years."
Tony (C.A.R.) Hoare.
Basic
"It is practically impossible to teach good programming style to students that [sic] have had prior exposure to BASIC; as potential programmers they are mentally mutilated beyond hope of regeneration."
E. W. Dijkstra in "The Threats to Computing Science" .
Edsger Wybe Dijkstra contributed to the first Algol 60 compiler. Known for Dijkstra's algorithm and numerous contributions to computer science.
C
"A C program is like a fast dance on a newly waxed dance floor by people carrying razors."
Waldi Ravens. Programmer.
"In My Egotistical Opinion, most people's C programs should be indented six feet downward and covered with dirt."
Blair P. Houghton. Programmer.
"Going from programming in Pascal to programming in C, is like learning to write in Morse code."
J.P. Candusso. Programmer.
"One of the main causes of the fall of the Roman Empire was that, lacking zero, they had no way to indicate successful termination of their C programs."
Robert Firth. Author of programming books.
"Writing in C or C++ is like running a chain saw with all the safety guards removed."
Bob Gray. Author.
"It's 5.50 a.m.... Do you know where your stack pointer is?"
Anonymous.
C++
"C makes it easy to shoot yourself in the foot. In C++ it's harder, but when you do, you blow off your whole leg."
Bjarne Stroustrup. Creator of C++.
"The evolution of languages: FORTRAN is a non-typed language. C is a weakly typed language. Ada is a strongly typed language. C++ is a strongly hyped language."
Ron Sercely. Programmer.
"I invented the term 'Object-Oriented', and I can tell you I did not have C++ in mind."
Alan Kay. Creator of Smalltalk.
"The latest new features in C++ are designed to fix the previously new features in C++."
David Jameson. Author.
"Fifty years of programming language research, and we end up with C++ ?"
Richard A. O'Keefe. Computer scientist.
"Ever spend a little time reading comp.lang.c++ ? That's really the best place to learn about the number of C++ users looking for a better language."
R. William Beckwith.
"C++ has its place in the history of programming languages. Just as Caligula has his place in the history of the Roman Empire."
Robert Firth.
"Java is C++ without the guns, knives, and clubs."
James Gosling, co-inventor of Java.
"C++ is a horrible language. Even if the choice of C were to do *nothing* but keep the C++ programmers out, that in itself would be a huge reason to use C."
Linus Torvalds, inventor of Linux.
COBOL
"The use of COBOL cripples the mind; its teaching should therefore be regarded as a criminal offense."
E.W. Dijkstra.
Fortran
"FORTRAN is not a flower but a weed — it is hardy, occasionally blooms, and grows in every computer."
Alan J. Perlis. Computer scientist and professor (Yale).
"FORTRAN, the infantile disorder, by now nearly 20 years old, is hopelessly inadequate for whatever computer application you have in mind today: it is now too clumsy, too risky, and too expensive to use."
E. W. Dijkstra.
"FORTRAN was the language of choice for the same reason that three-legged races are popular."
Ken Thompson. Co-creator of B, Unix, Plan 9 and the Go programming language.
Java
"If Java had true garbage collection, most programs would delete themselves upon execution."
Robert Sewell. Programmer.
Lisp
"Lisp isn't a language, it's a building material."
Alan Kay.
Perl
"Perl is the only language that looks the same before and after RSA encryption."
Keith Bostic. Programmer, created Sleepycat, contributed to free BSD unices.
PHP
"PHP is a minor evil perpetrated and created by incompetent amateurs, whereas Perl is a great and insidious evil, perpetrated by skilled but perverted professionals."
Jon Ribbens. Programmer.
Bash and other shells
"It is easier to port a shell than a shell script."
Larry Wall. Creator of Perl.

And finally...

"There are only two kinds of programming languages: those people always bitch about and those nobody uses."
Bjarne Stroustrup.
Unfortunately, I believe he is right. There have always been two schools, though: one favoring clear and safe languages (Pascal, and later the scripting languages), the other favoring languages that make hacking easy but give unpredictable results.

Multithreading Problems In Game Design

A couple years ago, when I first started designing a game engine to unify Box2D and my graphics engine, I thought this was a superb opportunity to join all the cool kids and multithread it. I mean all the other game developers were talking about having a thread for graphics, a thread for physics, a thread for audio, etc. etc. etc. So I spent a lot of time teaching myself various lockless threading techniques and building quite a few iterations of various multithreading structures. Almost all of them failed spectacularly for various reasons, but in the end they were all too complicated.


I eventually settled on a single worker thread that was sent off to start working on the physics at the beginning of a frame render. At the beginning of each subsequent frame I would check whether the physics was done, and if so, sync the physics and graphics and start another physics iteration. It was a very clean solution, but fundamentally flawed: for one thing, it introduces an inescapable frame of input lag.

Single Thread Low Load
  FRAME 1   +----+
            |    |
. Input1 -> |    |
            |[__]| Physics
            |[__]| Render
. FRAME 2   +----+ INPUT 1 ON BACKBUFFER
. Input2 -> |    |
. Process ->|    |
            |[__]| Physics
. Input3 -> |[__]| Render
. FRAME 3   +----+ INPUT 2 ON BACKBUFFER, INPUT 1 VISIBLE
.           |    |
.           |    |
. Process ->|[__]| Physics
            |[__]| Render
  FRAME 4   +----+ INPUT 3 ON BACKBUFFER, INPUT 2 VISIBLE

Multi Thread Low Load
  FRAME 1   +----+
            |    |
            |    |
. Input1 -> |    |
.           |[__]| Render/Physics START
. FRAME 2   +----+    
. Input2 -> |____| Physics END
.           |    |
.           |    |
. Input3 -> |[__]| Render/Physics START
. FRAME 3   +----+ INPUT 1 ON BACKBUFFER
.           |____|
.           |    | PHYSICS END
.           |    |
            |____| Render/Physics START
  FRAME 4   +----+ INPUT 2 ON BACKBUFFER, INPUT 1 VISIBLE

The multithreading, by definition, results in any given physics update only being reflected in the next rendered frame, because the entire point of multithreading is to start rendering the current frame immediately, as soon as you start calculating the physics. This causes a number of latency issues, and in addition it requires a separate "physics update" function that is executed only during the physics/graphics sync. As I soon found out, this is a massive architectural complication, especially when you try to add scripting or let other languages use your engine.

There is another, more subtle problem with dedicated threads for graphics/physics/audio/AI/anything: it doesn't scale. Let's say you have a platformer - the AI will be contained inside the game logic, and the vast majority of your CPU time will be spent in graphics or physics, or possibly both. That means your game effectively has only two threads doing any meaningful amount of work. Modern processors have 8 logical cores [1], and the best one currently available has 12. You're using two of them. You introduced all this complexity and input lag just so you could use 16.6% of the processor instead of 8.3%.

Instead of trying to create a thread for each individual component, you need to go deeper: parallelize each individual component separately, then tie them together in a single-threaded design. This has the added bonus of being vastly more friendly to single-threaded CPUs that can't thread things (like certain phones), because the parallelization happens at a lower level and is invisible to the overall architecture of the library. So instead of having a graphics thread and a physics thread, you simply call the physics update, then the graphics update, and inside each update you spawn enough worker threads to match the number of cores you have and concurrently process as much as possible. This eliminates the latency problems and the complicated library design, and it scales forever. Even if your initial implementation of concurrency can't handle 32 cores, because the concurrency is encapsulated inside the engine, you can go back and change it later without ever modifying the programs that use the graphics engine.

Consequently, don't try to multithread your games. It isn't worth it. Separately parallelize each individual component instead and write your game engine single-threaded; only use additional threads for asynchronous activities like resource loading.

[1] The processors actually have only 4 or 6 physical cores, but use hyperthreading techniques so that 8 or 12 logical cores are presented to the operating system. From a software point of view, however, this is immaterial.

Monday, May 28, 2012

Preloading a segfault


Linux (through its dynamic linker) offers a mechanism for loading a predefined shared library before any other library. This feature can be used to override certain functions in other shared libraries (for instance, to provide a different malloc implementation) or, more generally, to get your own code to execute in the context of a different process. There are, of course, security restrictions in place that prevent preloading your own code into setuid programs and the like. In this post we present a use of this feature which makes for quite a frustrating prank.

As it happens, the GNU C library (glibc) comes with a library called libSegFault, whose purpose is to assist with debugging. Let us consider a different kind of segfault library: one which has some chance of causing a segmentation fault when it is loaded.
// libsegfault: g++ -fPIC -shared segfault.cpp -o libsegfault.so

#include <stdlib.h>
#include <time.h>

void segfault () __attribute__ ((constructor));

void segfault () {
    srand(time(NULL));
    if (rand() % 10 < 1)
        *static_cast<char*>(NULL) = 0;
}
On average, one in every ten invocations of the segfault() function will cause a segmentation fault (by writing through the NULL pointer). In our case, we have used GCC's constructor attribute to make sure our function is executed when the library is loaded.
To wrap things up, setting the LD_PRELOAD environment variable to point at this library (e.g. exporting LD_PRELOAD=/path/to/libsegfault.so in a shell) creates a very strange system: every command the user executes (even basic shell commands, such as ‘ls’ or ‘cat’) has a 10% chance of ending in a segmentation fault.
This prank was pulled on a colleague of mine, and I can’t say he liked it much. Still, the dynamic-loader features presented in this article can be of use in actual work as well, and that is what I hope most of you will take from this post.