Keeping pace with programming languages!

Keep a watch on programming languages, folks! They are evolving very fast.

In Java, the following code:

collections.sort (student, new Comparator<Student>() {
public int compare(Student laurel, Student hardy) {
return laurel.getWeight().compareTo(hardy.getWeight());
}
});

can now be written in simply one line:

student.sort(comparing(Student::getWeight));

Wonderful!

Counting in Probability

Question: In the card game bridge, the 52 cards are dealt out equally to 4 players – called East, West, North and South. If North and South have total of 8 spades among them, what is the probability that East has 3 of the remaining 5 spades?

Discussion: East gets 3 spades out of remaining 5 in \binom{5}{3} ways. The sample space should have all possible ways East can get spades. All possible outcomes are \binom{5}{0} + \binom{5}{1} + \binom{5}{2} + \binom{5}{3} + \binom{5}{4} + \binom{5}{5} i.e., East gets no spade, East gets exactly one spade, …and so on till East gets all 5 remaining spades.

Therefore the desired probability is \frac{\binom{5}{3}}{\binom{5}{0}+\binom{5}{1}+\binom{5}{2}+\binom{5}{3}+\binom{5}{4}+\binom{5}{5}} = 0.3125. But this does not agree with the answer Sheldon Ross gets! So, where is the problem?

The problem is in counting. The number of ways three spades can be selected from five, in this case is not exactly \binom{5}{3}. It is in fact, \binom{5}{3}\binom{21}{10} because there could be more ways to rearrange the rest of the cards. I cannot ignore this factor since it is different proportionately when compared with the elements in the denominator. In the denominator, the number of ways East can get no spades is not \binom{5}{0}. Instead, it is \binom{5}{0}\binom{21}{13} because now he can choose 13 cards from remaining 21 non-spade cards. Similarly, when he has one spade, he can rearrange rest of the cards in \binom{21}{12} ways. Of course, East has five choices to pick that one particular spade. Therefore, East has \binom{5}{1}\binom{21}{12} ways to have one spade!

So, this works:

\frac{\binom{5}{3}\binom{21}{10}}{\binom{5}{0}\binom{21}{13}+\binom{5}{1}\binom{21}{12}+\binom{5}{2}\binom{21}{11}+\binom{5}{3}\binom{21}{10}+\binom{5}{4}\binom{21}{9}+\binom{5}{5}\binom{21}{8}} = 0.339.

Two morals from this story: 1) In most introductory probability questions, we make the mistake of counting incorrectly. 2) There are often multiple ways of solving the same problem. Try the problem yourself in your own way and double check with the answer to ensure your approach is correct. Do not read probability book like a novel.

How to parse code-like elements from free-form text?

Partial programs such as uncompilable incomplete code snippets appear in discussion forums, emails, and such informal communication media. A wealth of information is available in such places and we want to parse such partial programs from informal documentation. Lightweight regular expressions can be used based on our knowledge of naming conventions of API elements or other programming constructs. Miler is a technique based on the regex idea. But Miler’s precision is only 33% and varies based on programming language.

Another tool used in this problem of parsing parts of source code is Island Parser. The idea is to see certain parts of code (as Islands) and parse them out ignoring text and rest of content (the water). To parse a snippet, you do not need to know the whole grammar. Unimportant parts can be defined in very relaxed terms such as just a collection of characters. Parsers based on such grammars are known as island parsers. ACE tool uses island parsers that are heuristics based implemented as a bunch of ordered regular expressions. But instead of depending on a collection of source code elements as in the normal regex-based parsers, ACE uses large collections of documents as input. In ACE tool, parts of language that specify control flow are ignored (such as if, for, while). ACE uses island parser to capture code-like elements such as fully qualified API names. In Java, API names are of the form SomeType.someMethod(). For example, SAXParseException.getLineNumber(). Knowledge of such heuristics can help identify code-like elements from text.

Once extracted, ACE attempts to map these items to language elements such as package, class, method, type and variables. It uses specification document to match known items to parsed items. If a match cannot be found, the parsed items are dropped.

Island parsers as implemented in ACE can only find code-like elements which are remarkably different in presentation than normal text. For instance, there is no way we can differentiate a variable “flag” from a word in free-form text, “flag”. ACE website as of today claims that it works on postgres form of stackoverflow only. While the idea should apply to any free-form text, if you wish to play around with this state of the art, you must be ready to make your hands dirty with some setup of their source code.

Hope the programming language design community takes note of this problem and makes it easier to write high quality island parsers.

Abstract Syntax Trees in Code Search

Most code search engines index only the identifiers, comments and such words and hence fail to return impressive results. Stanford researchers  show that instead of indexing keywords,  indexing subtrees lead to an effective search.

AST’s of production grade software systems end up being too large. They claim that the average number of AST nodes, even in simple university programming assignments is close to 150!  While the size is scary, the structure fits the purpose.  AST construction tools are easily available. That becomes an added advantage.

Grammar drives programming language syntax. Parsers leverage grammar to detect errors, extract parts of source code, do type checking, etc.  Since typical programming language grammars are context-free, their derivations are inherently trees. Thus, abstract syntax trees came into existence.

While the original purpose of AST was not much beyond parsing and direct applications of parsing, researchers inspect how semantic analyses can be built over ASTs. Interesting ideas evolved such as replacing subtrees for program repair, duplicate subtrees for clone detection, and semantics preserving AST transformation for synthesizing source code.

Code search has been lagging behind on these lines to leverage the richness of AST information. The gory details of source code can now be abstracted out and meaningful subtrees can be indexed. Hope these ideas make it to production-quality webscale search engines soon.

Answer these for good research

Here are some questions to ask yourself if you want to do good research.
Key points to ponder:
  1. What have you done?
    1. What is the big theme of your research?
    2. How do the small pieces connect for a bigger thesis?
    3. Have you done one work which is cool? Why is it cool? or What is cool about it?
    4. How impactful is your research? Where can we apply the ideas?
  2. When?
    1. Do you have a plan? Do you know when will you be done with your research?
  3. How?
    1. What skills did you accumulate as part of your research efforts?
    2. Do not use too many jargons while explaining.
    3. How do you know you have done a good job? Experimental evaluation.
  4. Literature
    1. How does your work fit into the big picture?
    2. Who has done what and where is the gap?
    3. How do you organize the literature?
    4. What is the scope of your research?
    5. Why the area of research is so cool?
  5. Presentation
    1. Have sufficient backup slides. I spent a lot of time on my backups.
    2. Keep to the flow. You may have to cut off a lot of interesting stuff. Sometimes, our mind does not allow us to do it. It may be cool to show it. I have done it and I want to talk about it. But, it just does not fit the flow. The right thing to do is to chop it off and move it to backup slides.
    3. Sometimes mentioning that “this is very interesting. yet, considering time, I will skim over…” might buy you more time to talk in detail 🙂
    4. Show lot of energy while talking.
  6. Golden rule: I think, above all, show confidence and joy while talking about your research. Rest will be automatically taken care of 🙂
These are not in any order. These are not exhaustive.

The Great Thirukkural

இன்னா செய்தாரை ஒறுத்தல் அவர் நாண

நன்னயம் செய்து விடல்

English Translation: Don’t get into tit-for-tats. When someone does anything bad to you, in return, do something good to them.

There are such great versus in Thirukkural. Thirukkural consists of 133 chapters with each chapter consisting of 10 such couplets.

How to study?

Some notes on student success:

  • Someone’s got to tell you money is not extremely important, at least for now.
  • Success repeats.
  • There is one quality in all achievers, “Gratitude”!
  • Success is not bought. It is built, little by little.
  • Phased repetition is more important than one time slog.
  • Focus. Have a direction and stick to it.
  • First 15 hours or so of study gives you an idea, does not make you an expert.
  • After the first 100 hours of preparation, you will know that you know nothing. Keep patience and confidence. Stick to your direction.
  • Practice makes you perfect. There is this 10 year rule. Listen to Angela Lee Duckworth.
  • Mind works at a different speed. Writing perhaps slows you down enough to give your mind the time to think. So, write down important things. Revise.
  • When you can no longer think, stop reading. Take a break. Aptitude is what you are training for. Not, memory.
  • Most key events such as exams happen in the morning hours. So, keep your body cycle such that you are at your best during these hours.
  • Find pleasure in the process of achievement, and not in the achievement itself. That way, you will have many hours of satisfaction instead of just a few moments.