Huffman Coding

In a discussion forum, someone recently asked this question:

Assume a training center XYZ has 80,000 students. All students are put into one of four batches, namely A, B, C and D. XYZ maintains the student records and needs to compress them. Suppose A appears 40,000 times, B appears 15,000 times, C appears 20,000 times and D appears 5,000 times. Using the minimum average bits per letter method, find the prefix codes for A, B, C and D respectively.

Answer choices:
(a) 0, 10, 111, 001
(b) 1, 000, 01, 001
(c) 1, 001, 000, 00
(d) 0, 111, 10, 001

Here’s my answer.

In 1952, an MIT student named David Huffman was working on compressing data. He observed that if a document had the number 50 repeated 1000 times, you could simply map 50 to 1 and write 1 a thousand times. While decompressing, you replace every 1 with 50 again. Simple, isn’t it? Well, you could have done even better by writing a simple C program that prints 50 a thousand times, which would be much smaller than writing “50” 1000 times; to decompress, simply execute the C program. Anyway, MIT students are not as brilliant as you people. So, let us continue with Huffman’s logic.

His objective was to replace every unique item in the document with a binary code so that, at the end of the day, the encoded document is smaller than the original. In this case, A repeats 40,000 times (most frequent). Hence, let us map it to a single bit, say 1. It could be 0 too! Let us assume 1 for now. C appears 20,000 times (next most frequent), so we could assign 0 to it. But wait, we have four items. If we assign 0 and 1, there is nothing left to assign. B comes next and D is last. So, we cannot assign 0 to C.

This is where things get complicated. How do we code C, B and D so that we get the most compression? The answer lies in prefix coding: we assign codes such that no code is a prefix of any other. So, A = 1, C = 01, B = 001, D = 000. Let us say the XYZ record looks like AAAABCAAD. The encoded record will look like 11110010111000. Can you decode it now? Let us start from the left. 1 = A, so we have AAAA. We have no mapping for 0, so we try 00. We have no mapping for 00 either, so we check 001. 001 = B. Wow! So, the record is AAAAB so far. Next is 0, no mapping; 01 is C, so we get AAAABC. Then 1 = A, twice, giving AAAABCAA. Finally, 000 = D. Thus it works! If one code were a prefix of another, this would not have worked. Note that the longest code is 000. Its prefixes are 0 and 00, and neither of them is a valid code. Similarly, 001 is a code, but its prefixes are again 0 and 00, which are not codes.
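The left-to-right decoding described above can be written down almost verbatim. Here is a minimal C sketch using the codes and the bit string from the example; it keeps appending bits to a buffer until the buffer matches one of the four codes, which is unambiguous precisely because no code is a prefix of another.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *codes[]   = { "1", "01", "001", "000" };  /* A, C, B, D */
    const char  symbols[] = { 'A', 'C', 'B', 'D' };
    const char *bits = "11110010111000";   /* AAAABCAAD, encoded above */
    char buf[8] = "";                      /* bits read since the last match */

    for (const char *p = bits; *p; p++) {
        size_t len = strlen(buf);
        buf[len] = *p;
        buf[len + 1] = '\0';
        for (int i = 0; i < 4; i++) {
            if (strcmp(buf, codes[i]) == 0) {   /* a complete code matched */
                putchar(symbols[i]);
                buf[0] = '\0';                  /* start reading the next code */
                break;
            }
        }
    }
    putchar('\n');                              /* prints AAAABCAAD */
    return 0;
}

The same loop works for any prefix code. If one code were a prefix of another (say both 0 and 001 were codes), the first match would fire too early and the output would be wrong, which is exactly the point made above.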

Now, if we choose to assign A = 0, we can assign C = 10, B = 110 and D = 111. This also works. Also, note that swapping the values of B and D will not affect the length of the final document. Clearly, there are many ways to arrive at a valid prefix code. In options (a), (c) and (d), a code appears as the prefix of another code, so they are clearly wrong. In option (a), 001 has the prefix 0, which is a code. In (c), 000 has the prefix 00, which is a code. In (d), 001 has the prefix 0, which is a code. So, the answer must be (b).

If you have understood this far, you can now draw a simple Huffman tree to get to the codes. This is what @Suvradip Das has done.
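For those who would rather let a program do the drawing, here is a small C sketch that builds the Huffman tree for the given frequencies by repeatedly merging the two least frequent nodes. Which child of a node gets 0 and which gets 1 is a free choice, so the bit patterns it prints differ from option (b), but the code lengths (1 for A, 2 for C, 3 for B and D) are the same, and that is what determines the compressed size.

#include <stdio.h>
#include <stdlib.h>

#define NSYM 4

struct node {
    long freq;
    char sym;                     /* '\0' for internal nodes */
    struct node *left, *right;
};

static struct node *make(long freq, char sym, struct node *l, struct node *r)
{
    struct node *n = malloc(sizeof *n);
    n->freq = freq; n->sym = sym; n->left = l; n->right = r;
    return n;
}

/* Walk the finished tree and print the code of every leaf. */
static void print_codes(const struct node *n, char *buf, int depth)
{
    if (!n->left && !n->right) {
        buf[depth] = '\0';
        printf("%c -> %s\n", n->sym, buf);
        return;
    }
    buf[depth] = '0'; print_codes(n->left,  buf, depth + 1);
    buf[depth] = '1'; print_codes(n->right, buf, depth + 1);
}

int main(void)
{
    /* One single-leaf tree per batch, with the frequencies from the question. */
    struct node *forest[NSYM] = {
        make(40000, 'A', NULL, NULL), make(15000, 'B', NULL, NULL),
        make(20000, 'C', NULL, NULL), make( 5000, 'D', NULL, NULL),
    };
    int count = NSYM;

    while (count > 1) {
        /* Find the indices of the two least frequent trees. */
        int lo1 = 0, lo2 = 1;
        if (forest[lo2]->freq < forest[lo1]->freq) { lo1 = 1; lo2 = 0; }
        for (int i = 2; i < count; i++) {
            if (forest[i]->freq < forest[lo1]->freq) { lo2 = lo1; lo1 = i; }
            else if (forest[i]->freq < forest[lo2]->freq) { lo2 = i; }
        }
        /* Merge them into one tree and shrink the forest by one. */
        struct node *merged = make(forest[lo1]->freq + forest[lo2]->freq, '\0',
                                   forest[lo1], forest[lo2]);
        if (lo1 > lo2) { int t = lo1; lo1 = lo2; lo2 = t; }
        forest[lo2] = forest[count - 1];
        forest[lo1] = merged;
        count--;
    }

    char buf[NSYM];
    print_codes(forest[0], buf, 0);   /* A -> 0, D -> 100, B -> 101, C -> 11 */
    return 0;
}

With realistic alphabets one would use a priority queue instead of the linear scans, but for four symbols this straightforward version keeps the idea visible.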

This simple trick was a research paper, and we now use it extensively. Some people think the best research is cryptic, full of so much jargon that nobody understands it. That is so untrue. This is an example of good research. Good research is simple like this.

On a side note, if 80,000 students were to rely on XYZ for GATE coaching, it would indeed be a sad state for the country. We want well-educated minds, not well-trained minds. God save us! Thankfully, they perhaps mean all students across branches and across the country, which gives some solace. No offence to XYZ. But the problem is that our educational institutions are not doing their job. That is the sad part.

 


What is the size of int in DOS?

This is a common question that arises in the minds of so many who are learning to program in a high-level language such as C. Thanks for asking. While programming, we need a way to make our program “remember” values, so we use “variables”. For example, the statement “int i = 10;” defines an int variable and assigns the value 10 to it. This then allows us to manipulate the value in statements like “i = i + 20;” or “i = i * 2;”. Thus we are able to build on this idea and write more complicated programs.

When we go for placements, or even for GATE-style exams, the concept of space complexity is of utmost importance. Space complexity denotes the amount of memory our program will occupy, and based on this, programs are sometimes classified as good or bad. Sorting is an example: you can sort in-place, i.e., without significant additional memory! The variables we declare and use are the primary reasons for memory usage. The more variables you declare, the more memory (RAM) your program needs. Let us assume you have one program which uses 100 bit-type variables and another which uses 100 strings. It is a no-brainer that the latter needs more memory than the former. Each data type uses a defined amount of memory. Now that we understand the background of variables, data types and the space they occupy, let us get to the question.
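To make the snippets above concrete, here is the statement from the paragraph wrapped in a complete program; the printf at the end is only there to show the final value.

#include <stdio.h>

int main(void)
{
    int i = 10;          /* define an int variable and store 10 in it */
    i = i + 20;          /* i is now 30 */
    i = i * 2;           /* i is now 60 */
    printf("%d\n", i);   /* prints 60 */
    return 0;
}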

How much memory does an int occupy on a DOS system? Let me assume for now that we are talking about C programming. DOS is a family of disk operating systems, and there are multiple flavors of it, namely MS-DOS, FreeDOS, Apple DOS and so on. Refer to the Wikipedia page on DOS for a more detailed discussion. For this discussion, let us consider MS-DOS, the Microsoft DOS that we know better. Interestingly, there are different flavors of DOS based on the underlying system architecture. Here comes the notion of a “32-bit” machine: it means that the registers and addresses are 32 bits wide. Even FAT32 uses 32-bit fields, which is why you cannot store one single file larger than 4 GB on a FAT32 USB drive even if the drive itself is, say, 16 GB in size. Anyway, that was just a side detail.

An int is a primitive data type. To learn about the memory it occupies, we refer to C99 or a similar standard. The standard only requires size relations between the data types and a minimum size for each. It asks compiler designers to ensure only that int is capable of representing at least the range [−32,767, +32,767]; therefore, it is at least 16 bits in size. But if the machine has 32-bit registers, we cannot sensibly allocate half a register, so nowadays int is usually 32 bits. Earlier, when most C books were written, we had 16-bit processors, so compilers stuck to the minimum in the standard and gave int 16 bits. Nowadays, it depends on the processor architecture and the compiler. If you are on a 64-bit machine, the standard would even allow a 64-bit int, though most compilers there still keep int at 32 bits and make long or long long 64 bits wide. You can check this by yourself using sizeof() in C.

So, the final answer is: assuming the version of MS-DOS and the compiler use a 32-bit architecture, sizeof(int) is very likely to give 4 as the result (i.e., 4 bytes = 32 bits). Note the word “likely”. It means that the compiler designer can come up with a different size; as long as int is at least 16 bits and the size relations between types are respected, they are allowed to do so. If no information such as DOS or the system architecture is given, we go with the C99 standard, which requires an int to be at least 16 bits, i.e., 2 bytes. Hope this answers your question.
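If you want to check this on your own machine, a minimal test program is shown below. What it prints depends entirely on the compiler and target: an old 16-bit DOS compiler such as Turbo C typically prints 2 for int, while typical 32-bit and 64-bit compilers print 4.

#include <stdio.h>

int main(void)
{
    /* Cast to unsigned so the format specifier also works on old compilers
       that predate C99's %zu. */
    printf("sizeof(int)  = %u bytes\n", (unsigned) sizeof(int));
    printf("sizeof(long) = %u bytes\n", (unsigned) sizeof(long));
    return 0;
}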

One final bit to add here: do not memorize these sizes. It is not required. Remember the ideas, but not the exact number of bits occupied by each data type. That is not necessary and, as I have explained above, the memorized number is often incorrect anyway.

Moret’s “Theory of Computation” has a beautiful “Introduction”

Bernard M. Moret’s book on “The Theory of Computation” has the best Introduction I have ever read in any book. It has everything an Introduction should have: a chronological survey of key problems solved, key challenges, how these challenges can be categorised and what came out of these. After reading the intro, I can’t resist reading the rest of the book.

I was surprised to learn how mathematicians and logicians sowed the seeds of the current digital era. The work of Gödel, Church and Turing, for instance, is beyond doubt a must-know for every “computer” scientist. If the theory of complexity and the theory of computability can be developed so nicely, there is no research in the world that cannot be well analyzed and presented.

Possessive Forms in English

Possessive forms are always confusing for the average English writer. The book “The Elements of Style” starts with this issue. Here’s a small note with appropriate references.
Where to use ‘s?
  1. Charles’s friend
  2. Buns’s poems
Where not to use ‘s?
  1. Jesus’, (if old names end with us, es or is. Seems Charles is not an old name.)
  2. for conscience’ sake, for righteousness’ sake, Moses’ Laws, Isis’ temple.
  3. plurals – boys', girls', parents' (note: parent's and parents' mean different things; the former is singular possessive and the latter is plural possessive)
  4. its, yours, ours, theirs, hers.
Special cases:
  1. It’s means “it is”.
  2. its means “belonging to it” (possessive).
  3. There is no ‘ or s after his.
Some readings: