How to extract and use constraints from source code?

Most static analysis techniques over source code start with constructing an Abstract Syntax Tree (AST), or constructing call graphs.  Eclipse JDT has tools necessary to do these tasks. It is straightforward to traverse through the AST to collect information about the lines, methods, loops, conditions, variables, etc., used in the source code. Here’s a simple example where you can just insert your code whenever Eclipse JDT finds a method declaration node in the AST. If this code template looks unfamiliar to you, you must read more about Visitor pattern (design patterns). A full example is available at the Vogella site.
public class MethodVisitor extends ASTVisitor {
    List methods = new ArrayList();

    public boolean visit(MethodDeclaration node) {
        ...your code here...
Features thus extracted from source code can be used for a variety of purposes. Here, we discuss the process of converting them to First Order Logic (FOL) constraints. For example, consider we extracted a predicate IdentifierLength(id, l) where id is the identifier (consider an identifier as a fancy term for the name of a variable) and l is the number of characters in the identifier. With such predicates and the standard logical operators, it is trivial to define FOL constraints.
Satisfiability Modulo Theories (SMT) problem is in the intersection of computer science and mathematics which deals with such FOL formulae. Visualize an SMT instance as an FOL formula, and our problem at hand is to find whether such a set of formulae is satisfiable. Although SMT provides a much richer set of tools to model decision problems, to keep things simple, let us discuss about boolean satisfiability problems (SAT). All we need to do is create boolean SAT instances and feed them to an SMT solver.
There are many SMT solvers readily available. Z3 is probably a heavily used SMT solver. There are even simpler solvers built on top of Z3 such as Boogie. The input to Z3 is a simple script where you specify the FOL and the predicates. For example,
(assert (> IdentifierLength 10))
specifies that for this FOL to be true the IdentifierLength must be greater than 10.  Let another formula be
(assert (< IdentifierCaseChanges 2))
which means that the variable IdentifierCaseChanges should be less than 2. Of course, these IdentifierLength and IdentifierCaseChanges are what we define as functions with information extracted from source code. I talk about source code here. You may apply the same idea over text as well. Once you have the predicates, just apply
and Z3 will find if there is at least one interpretation such that all asserted formulae are true. Full tutorial is available here.
So, next time, when you are hunting for a class project, try this out! It should work for courses like Artificial Intelligence, Intelligent Systems, Program Analysis, Information Retrieval, Software Engineering, and so on! Of course, talk to your instructor though 🙂

Keeping pace with programming languages!

Keep a watch on programming languages, folks! They are evolving very fast.

In Java, the following code:

collections.sort (student, new Comparator<Student>() {
public int compare(Student laurel, Student hardy) {
return laurel.getWeight().compareTo(hardy.getWeight());

can now be written in simply one line:



Counting in Probability

Question: In the card game bridge, the 52 cards are dealt out equally to 4 players – called East, West, North and South. If North and South have total of 8 spades among them, what is the probability that East has 3 of the remaining 5 spades?

Discussion: East gets 3 spades out of remaining 5 in \binom{5}{3} ways. The sample space should have all possible ways East can get spades. All possible outcomes are \binom{5}{0} + \binom{5}{1} + \binom{5}{2} + \binom{5}{3} + \binom{5}{4} + \binom{5}{5} i.e., East gets no spade, East gets exactly one spade, …and so on till East gets all 5 remaining spades.

Therefore the desired probability is \frac{\binom{5}{3}}{\binom{5}{0}+\binom{5}{1}+\binom{5}{2}+\binom{5}{3}+\binom{5}{4}+\binom{5}{5}} = 0.3125. But this does not agree with the answer Sheldon Ross gets! So, where is the problem?

The problem is in counting. The number of ways three spades can be selected from five, in this case is not exactly \binom{5}{3}. It is in fact, \binom{5}{3}\binom{21}{10} because there could be more ways to rearrange the rest of the cards. I cannot ignore this factor since it is different proportionately when compared with the elements in the denominator. In the denominator, the number of ways East can get no spades is not \binom{5}{0}. Instead, it is \binom{5}{0}\binom{21}{13} because now he can choose 13 cards from remaining 21 non-spade cards. Similarly, when he has one spade, he can rearrange rest of the cards in \binom{21}{12} ways. Of course, East has five choices to pick that one particular spade. Therefore, East has \binom{5}{1}\binom{21}{12} ways to have one spade!

So, this works:

\frac{\binom{5}{3}\binom{21}{10}}{\binom{5}{0}\binom{21}{13}+\binom{5}{1}\binom{21}{12}+\binom{5}{2}\binom{21}{11}+\binom{5}{3}\binom{21}{10}+\binom{5}{4}\binom{21}{9}+\binom{5}{5}\binom{21}{8}} = 0.339.

Two morals from this story: 1) In most introductory probability questions, we make the mistake of counting incorrectly. 2) There are often multiple ways of solving the same problem. Try the problem yourself in your own way and double check with the answer to ensure your approach is correct. Do not read probability book like a novel.

How to parse code-like elements from free-form text?

Partial programs such as uncompilable incomplete code snippets appear in discussion forums, emails, and such informal communication media. A wealth of information is available in such places and we want to parse such partial programs from informal documentation. Lightweight regular expressions can be used based on our knowledge of naming conventions of API elements or other programming constructs. Miler is a technique based on the regex idea. But Miler’s precision is only 33% and varies based on programming language.

Another tool used in this problem of parsing parts of source code is Island Parser. The idea is to see certain parts of code (as Islands) and parse them out ignoring text and rest of content (the water). To parse a snippet, you do not need to know the whole grammar. Unimportant parts can be defined in very relaxed terms such as just a collection of characters. Parsers based on such grammars are known as island parsers. ACE tool uses island parsers that are heuristics based implemented as a bunch of ordered regular expressions. But instead of depending on a collection of source code elements as in the normal regex-based parsers, ACE uses large collections of documents as input. In ACE tool, parts of language that specify control flow are ignored (such as if, for, while). ACE uses island parser to capture code-like elements such as fully qualified API names. In Java, API names are of the form SomeType.someMethod(). For example, SAXParseException.getLineNumber(). Knowledge of such heuristics can help identify code-like elements from text.

Once extracted, ACE attempts to map these items to language elements such as package, class, method, type and variables. It uses specification document to match known items to parsed items. If a match cannot be found, the parsed items are dropped.

Island parsers as implemented in ACE can only find code-like elements which are remarkably different in presentation than normal text. For instance, there is no way we can differentiate a variable “flag” from a word in free-form text, “flag”. ACE website as of today claims that it works on postgres form of stackoverflow only. While the idea should apply to any free-form text, if you wish to play around with this state of the art, you must be ready to make your hands dirty with some setup of their source code.

Hope the programming language design community takes note of this problem and makes it easier to write high quality island parsers.

Abstract Syntax Trees in Code Search

Most code search engines index only the identifiers, comments and such words and hence fail to return impressive results. Stanford researchers  show that instead of indexing keywords,  indexing subtrees lead to an effective search.

AST’s of production grade software systems end up being too large. They claim that the average number of AST nodes, even in simple university programming assignments is close to 150!  While the size is scary, the structure fits the purpose.  AST construction tools are easily available. That becomes an added advantage.

Grammar drives programming language syntax. Parsers leverage grammar to detect errors, extract parts of source code, do type checking, etc.  Since typical programming language grammars are context-free, their derivations are inherently trees. Thus, abstract syntax trees came into existence.

While the original purpose of AST was not much beyond parsing and direct applications of parsing, researchers inspect how semantic analyses can be built over ASTs. Interesting ideas evolved such as replacing subtrees for program repair, duplicate subtrees for clone detection, and semantics preserving AST transformation for synthesizing source code.

Code search has been lagging behind on these lines to leverage the richness of AST information. The gory details of source code can now be abstracted out and meaningful subtrees can be indexed. Hope these ideas make it to production-quality webscale search engines soon.

Answer these for good research

Here are some questions to ask yourself if you want to do good research.
Key points to ponder:
  1. What have you done?
    1. What is the big theme of your research?
    2. How do the small pieces connect for a bigger thesis?
    3. Have you done one work which is cool? Why is it cool? or What is cool about it?
    4. How impactful is your research? Where can we apply the ideas?
  2. When?
    1. Do you have a plan? Do you know when will you be done with your research?
  3. How?
    1. What skills did you accumulate as part of your research efforts?
    2. Do not use too many jargons while explaining.
    3. How do you know you have done a good job? Experimental evaluation.
  4. Literature
    1. How does your work fit into the big picture?
    2. Who has done what and where is the gap?
    3. How do you organize the literature?
    4. What is the scope of your research?
    5. Why the area of research is so cool?
  5. Presentation
    1. Have sufficient backup slides. I spent a lot of time on my backups.
    2. Keep to the flow. You may have to cut off a lot of interesting stuff. Sometimes, our mind does not allow us to do it. It may be cool to show it. I have done it and I want to talk about it. But, it just does not fit the flow. The right thing to do is to chop it off and move it to backup slides.
    3. Sometimes mentioning that “this is very interesting. yet, considering time, I will skim over…” might buy you more time to talk in detail 🙂
    4. Show lot of energy while talking.
  6. Golden rule: I think, above all, show confidence and joy while talking about your research. Rest will be automatically taken care of 🙂
These are not in any order. These are not exhaustive.

The Great Thirukkural

இன்னா செய்தாரை ஒறுத்தல் அவர் நாண

நன்னயம் செய்து விடல்

English Translation: Don’t get into tit-for-tats. When someone does anything bad to you, in return, do something good to them.

There are such great versus in Thirukkural. Thirukkural consists of 133 chapters with each chapter consisting of 10 such couplets.