In a discussion forum, someone recently asked this question:

Assume a training center XYZ has 80,000 students. All students are put into one of the four batches namely, A,B,C and D. XYZ maintains the student records. It needs to compress it. Suppose A appear 40000 times, B appears 15000 times, C appears 20000 times and D appears 5000 times. Using minimum average bits per letter method, find the prefix code for A,B,C and D respectively.

Answer choices:

(a) 0, 10, 111, 001

(b) 1, 000, 01, 001

(c) 1, 001, 000, 00

(d) 0, 111, 10, 001

Here’s my answer.

In 1952, an MIT student named Huffman was working on compressing data. He found that if a document had the number 50 repeating 1000 times, you can simply map 50 to 1 and put 1 1000 times. While decompressing again replace every 1 with 50. Simple, isn’t it? Well you could have done it even better by just writing a simple C program to print 50 1000 times which will be much smaller than printing “50” 1000 times. To decompress, simply execute the C program. Anyways, MIT students are not as brilliant as you people. So, let us continue with Huffman’s logic. His objective is to take every unique item in the document with a binary number so that the document is smaller than the original document at the end of the day. So, in this case, A repeats 40000 times (most frequent). Hence, let us map it to a single bit, say 1. It could be 0 too! Let us assume 1 for now. C appears 20K times (next most frequent). So, we could assign 0 to it. But wait, we have four items. If we assign 0 and 1, there is nothing more left to assign. B comes next and D is last. So, we cannot assign 0 to C.

This is where things get complicated. How to code C, B and D such that we get most compression. The answer seems to be in prefix coding. We assign a unique prefix to each element. So, A=1, C=01, B=001, D=000. Let us say the XYZ record looks like AAAABCAAD. The code will look like 11110010111000. Can you decode it now? Let us start from the left. 1 = A. So, we have AAAA. We have no mapping for 0. So we try 00. We have no mapping for 00 also. So, we check 001. 001 = B. Wow! So, the record is AAAAB so far. Next is 0. No mapping. 01 is C. So, we get AAAABC. Then 1 = A. So, AAAABCAA. Finally, 000 = D. Thus it works! If the prefix were not unique, this would not have worked. Note that the longest item is 000. Its prefix are 0 and 00. None of them are valid codes. Similarly, 001 is a code. But its prefix are again 0 and 00 which are not codes.

Now, if we choose to assign A = 0, we assign C = 10, B=110 and D=111. This should also work. Also, note that switching B and D values will not affect the length of the final document. Clearly, there are many ways to arrive at unique prefixes. In options (a) and (c) prefixes appear as codes. So, they are clearly wrong. In option (a) 001 has prefix as 0 which is a code. In (c) 000 has 00 as prefix which is a code. In (d) 001 has 0 as prefix which is a code. So, the answer must be (b).

If you understood this far, you can now draw a simple huffman tree to get to the codes. This is what @Suvradip Das has done.

This simple trick was a research paper and we now use it so extensively. Some people think best research is cryptic which has so many jargons that nobody understands. That is so untrue. This is an example of good research. Good research is simple like this.

On a side note, if 80,000 students were to rely on XYZ for GATE coaching, it will indeed be a sad state for the country. We want well educated minds not well trained minds. God save us! Thankfully, they perhaps mean all students across branches and across the country which gives some solace.No offence to XYZ. But, the problem is that our educational institutions are not doing their job. That is the sad part.