On Decidability

Today, someone mentioned that they find decidability in TOC very hard to understand, and my mind could not resist the temptation to type this out.

TOC deals with the key notion of an “algorithm”. The question we are interested in answering is: which problems can have an ‘algorithm’ and which cannot? In other words, which problems are solvable and which are not. This forms the theoretical basis for the entire world of software development that follows in courses like Algorithms, Data Structures, Programming, Software Engineering and so on.

If you were asked to answer this key question, how would you approach it? There begins the beauty of this subject, TOC. First, we have to build models of machines which can solve something. Then, we have to build models of ‘software’ that solves something. Finally, we need some formalism to prove that, depending on the nature of the machine, certain problems are unsolvable.

We first introduce the major idea that everything in this world can be modeled using a set. Yes, the usual set notation that we know, such as A = {1,2,3}. We build and represent machines as sets. We represent input and output as sets. This idea, by itself, deserves a Nobel prize! Take a few days to let it sink deep into your head.

Next, we show that there are only a few classes of machines we could build. Everything else that we can think of is equivalent in power to these machines! I would have given a second Nobel prize for this idea. If there are only a few such machines, what are they? What do they look like? How do they process the input? How is an algorithm encoded? After answering these usual questions, the last major question we ask is: “What is it that this machine cannot solve?”

My dear friends, this last question is your hello world to decidability. If there is one “wow” thing in computer science which stays in my heart forever, it would be the way TOC develops these ideas. Decidability is an extremely easy topic if you have absorbed these ideas correctly and completely. There is no need to mug up any table. There is no need to watch any videos. Just take a nice book, find a calm place, and dive into the beauty. If you have a friend or a parent who will get you a coffee in between, what more do we need in life!

Beautiful computer science.


Beware of your words! :-)

Today, I was reading about the Thumb instruction set of the ARM family. I landed on this line, which made me quite curious: “All Thumb instructions are encoded into a 16-bit half-word format”.

We know of bits and bytes, even words. If we have half-words, there must be some history to it. First of all, a word is (put loosely) just a fixed-sized piece of data handled as one unit by the processor. In the 1970s, DEC’s PDP-11 used 16 bits for these chunks, so 16 bits became a word; its successor, the VAX, kept the 16-bit word and referred to 32 bits as a longword. The x86 family offered three different word lengths (16, 32 and 64), yet continued to designate 16 bits as the word. Documentation written prior to the 1970s did not agree with a 16-bit word though. Microsoft’s Windows API refers to 16 bits as WORD regardless of whether you use a 32-bit or a 64-bit processor, where the native word size changes accordingly. It calls 32 bits a Double Word or DWORD and 64 bits a QWORD (quad word). By now, backward compatibility has become a huge issue, so we stick with this terminology.

Well, ARM, like me, is quite different! For ARM, we have a 16-bit half-word!! Next time you get confused with WORD, at least know that it is not your fault.

Unfortunately, a little knowledge of our hardware and the associated computer organization is required for us to become good programmers. Yet, at universities, we jump into coding a little too early!

Beautiful computer science.

Huffman Coding

In a discussion forum, someone recently asked this question:

Assume a training center XYZ has 80,000 students. All students are put into one of four batches, namely A, B, C and D. XYZ maintains the student records and needs to compress them. Suppose A appears 40000 times, B appears 15000 times, C appears 20000 times and D appears 5000 times. Using the minimum-average-bits-per-letter method, find the prefix codes for A, B, C and D respectively.

Answer choices:
(a) 0, 10, 111, 001
(b) 1, 000, 01, 001
(c) 1, 001, 000, 00
(d) 0, 111, 10, 001

Here’s my answer.

In 1952, an MIT student named Huffman was working on compressing data. He found that if a document had the number 50 repeating 1000 times, you could simply map 50 to 1 and write 1 a thousand times; while decompressing, replace every 1 with 50. Simple, isn’t it? Well, you could have done even better by writing a simple C program that prints 50 a thousand times, which would be much smaller than printing “50” 1000 times. To decompress, simply execute the C program. Anyway, MIT students are not as brilliant as you people, so let us continue with Huffman’s logic. His objective was to map every unique item in the document to a binary number so that, at the end of the day, the document is smaller than the original. So, in this case, A repeats 40000 times (most frequent). Hence, let us map it to a single bit, say 1. It could be 0 too! Let us assume 1 for now. C appears 20K times (next most frequent), so we could assign 0 to it. But wait, we have four items. If we assign 0 and 1, there is nothing left to assign to the other two. B comes next and D is last. So, we cannot assign 0 to C.

This is where things get complicated. How do we code C, B and D such that we get the most compression? The answer lies in prefix coding: we assign the codes so that no code is a prefix of another. So, A=1, C=01, B=001, D=000. Let us say the XYZ record looks like AAAABCAAD. The encoded record will look like 11110010111000. Can you decode it now? Let us start from the left. 1 = A. So, we have AAAA. We have no mapping for 0, so we try 00. We have no mapping for 00 either, so we check 001. 001 = B. Wow! So, the record is AAAAB so far. Next is 0. No mapping. 01 is C. So, we get AAAABC. Then 1 = A, twice. So, AAAABCAA. Finally, 000 = D. Thus it works! If the codes were not prefix-free, this would not have worked. Note that the longest code is 000. Its prefixes are 0 and 00, and neither of them is a valid code. Similarly, 001 is a code, but its prefixes are again 0 and 00, which are not codes.

Now, if we choose to assign A = 0, we assign C = 10, B = 110 and D = 111. This should also work. Also, note that swapping the values of B and D will not affect the length of the final document. Clearly, there are many ways to arrive at a valid prefix code. In options (a), (c) and (d), a prefix of one code appears as another code, so they are clearly wrong: in (a), 001 has the prefix 0, which is a code; in (c), 000 has the prefix 00, which is a code; in (d), 001 has the prefix 0, which is a code. So, the answer must be (b).

If you have understood this far, you can now draw a simple Huffman tree to get to the codes. This is what @Suvradip Das has done.

This simple trick was a research paper, and we now use it extensively. Some people think the best research is cryptic, with so much jargon that nobody understands it. That is so untrue. This is an example of good research. Good research is simple like this.

On a side note, if 80,000 students were to rely on XYZ for GATE coaching, it would indeed be a sad state for the country. We want well-educated minds, not well-trained minds. God save us! Thankfully, they perhaps mean all students across branches and across the country, which gives some solace. No offence to XYZ. But the problem is that our educational institutions are not doing their job. That is the sad part.


What is the size of int in DOS?

This is a common question that arises in the minds of many who are learning to program in a high-level language such as C. Thanks for asking. While programming, we need a way to make our program “remember” values, so we use “variables”. For example, the statement “int i = 10;” defines an int variable and assigns the value 10 to it. Further, this allows us to manipulate the value in statements like “i = i + 20;” or “i = i * 2;”. Thus we are able to build on this idea and write more complicated programs.

When we go for placements, or even for GATE-style exams, the concept of space complexity is of utmost importance. Space complexity denotes the amount of memory our program will occupy, and based on it, programs are sometimes classified as good or bad. Sorting is an example: you can sort in-place, i.e., without significant additional memory! The variables we declare and use are the primary reason for memory usage. The more variables you declare, the more memory (RAM) your program needs. Say one program uses 100 bit-type variables and another uses 100 strings. It is a no-brainer to claim that the latter needs more memory than the former, because each data type uses a defined amount of memory. Now that we understand the background of variables, data types and the space they occupy, let us get to the question.

How much memory does an int occupy on a DOS system? Let me assume for now that we are talking about C programming. DOS is a family of disk operating systems; there are multiple flavors, namely MS-DOS, FreeDOS, Apple DOS and so on. Refer to the Wikipedia page on DOS for a more detailed discussion. For this discussion, let us consider MS-DOS, the Microsoft DOS that we know best. Interestingly, there are different flavors of DOS based on the underlying system architecture. Here comes the notion of a “32-bit” machine, which means that the registers and addresses are 32 bits wide. Even FAT32 uses 32-bit fields, which is why you cannot store one single file larger than 4 GB on your USB drive even if the drive is, say, 16 GB in size. Anyway, that was just a side detail.

An integer is a primitive data type. To learn about the memory occupied by an integer, we refer to C99 or any such standard. The standard only specifies size relations between the data types and a minimum size for each. It requires compiler designers to ensure only that an int is capable of holding at least the range [−32767, +32767]. Therefore, it is at least 16 bits in size. Earlier, when most C books were written, we had 16-bit processors, so compilers stuck to the standard and provided 16 bits for int. On today’s 32-bit machines, int is typically 32 bits, since that matches the registers conveniently. Interestingly, even on most 64-bit machines int remains 32 bits (the common LP64 and LLP64 models), although a compiler is free to make it 64. You can check this yourself using sizeof() in C.

So, the final answer is: a classic MS-DOS compiler such as Turbo C targets a 16-bit architecture, so sizeof() is very likely to give 2 as the result (i.e., 2 bytes = 16 bits) for an int, while a 32-bit compiler would likely give 4. Note the word “likely”. It means that the compiler designer can come up with a different size. As long as it is at least 16 bits and adheres to the size relations, they are allowed to do so. If no information such as DOS or system architecture is given, we go with the C99 standard, which requires an int to be at least 16 bits, i.e., 2 bytes. Hope this answers your question.

One final bit to add: do not memorize these sizes. It is not required. Remember the ideas, not the exact number of bits occupied by each data type. That is not necessary and is often incorrect, as I have explained above.

Moret’s “Theory of Computation” has a beautiful “Introduction”

Bernard M. Moret’s book “The Theory of Computation” has the best Introduction I have ever read in any book. It has everything an Introduction should have: a chronological survey of the key problems solved, the key challenges, how these challenges can be categorised, and what came out of them. After reading the intro, I can’t resist reading the rest of the book.

I was surprised to learn how mathematicians and logicians sowed the seeds for the current digital era. The works of Gödel, Church and Turing, for instance, are beyond doubt a must-know for every “computer” scientist. If the theory of complexity and the theory of computability can be developed so nicely, there is no research in the world that cannot be well analyzed and presented.

Possessive Forms in English

Possessive forms are always confusing for the average English writer. The book “The Elements of Style” starts with this very issue. Here’s a small note with appropriate references.
Where to use ‘s?
  1. Charles’s friend
  2. Burns’s poems
Where not to use ‘s?
  1. Jesus’ (ancient names ending in -us, -es or -is take only the apostrophe; Charles, it seems, is not an ancient name.)
  2. for conscience’ sake, for righteousness’ sake, Moses’ Laws, Isis’ temple.
  3. plurals – boys’, girls’, parents’ (note: parent’s and parents’ mean different things… former is singular and latter is plural)
  4. its, yours, ours, theirs, hers.
Special cases:
  1. It’s means “it is”.
  2. its means “belonging to it” (possessive).
  3. There is no ‘ or s after his.