Jiye to jiye

How am I to live without you?
My heart settles nowhere without you.
How do I say what this life would be without you?
It would be like a punishment, like a curse.


How to extract and use constraints from source code?

Most static analysis techniques over source code start by constructing an Abstract Syntax Tree (AST) or a call graph. Eclipse JDT has the tools necessary for these tasks. It is straightforward to traverse the AST and collect information about the lines, methods, loops, conditions, variables, etc., used in the source code. Here is a simple example where you insert your code to run whenever Eclipse JDT finds a method declaration node in the AST. If this code template looks unfamiliar, read up on the Visitor design pattern. A full example is available at the Vogella site.
import java.util.ArrayList;
import java.util.List;

import org.eclipse.jdt.core.dom.ASTVisitor;
import org.eclipse.jdt.core.dom.MethodDeclaration;

public class MethodVisitor extends ASTVisitor {
    List<MethodDeclaration> methods = new ArrayList<>();

    @Override
    public boolean visit(MethodDeclaration node) {
        // ...your code here... e.g., record every method declaration:
        methods.add(node);
        return true; // true => keep visiting this node's children
    }
}
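To drive the visitor, parse a source string (or file contents) into a CompilationUnit and hand it the visitor. A minimal sketch, assuming the JDT core libraries are on the classpath:

import org.eclipse.jdt.core.dom.AST;
import org.eclipse.jdt.core.dom.ASTParser;
import org.eclipse.jdt.core.dom.CompilationUnit;

public class ParseDriver {
    public static void main(String[] args) {
        String source = "class A { void m() { } }"; // any Java source text
        ASTParser parser = ASTParser.newParser(AST.JLS8);
        parser.setKind(ASTParser.K_COMPILATION_UNIT);
        parser.setSource(source.toCharArray());
        CompilationUnit unit = (CompilationUnit) parser.createAST(null);

        MethodVisitor visitor = new MethodVisitor();
        unit.accept(visitor);
        System.out.println(visitor.methods.size() + " method(s) found");
    }
}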
Features thus extracted from source code can be used for a variety of purposes. Here, we discuss converting them to First Order Logic (FOL) constraints. For example, suppose we extracted a predicate IdentifierLength(id, l), where id is the identifier (a fancy term for, among other things, the name of a variable) and l is the number of characters in that identifier. With such predicates and the standard logical operators, it is straightforward to define FOL constraints.
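For instance, a hypothetical readability rule saying that no identifier may exceed 20 characters can be written as \forall id \, \forall l \; (IdentifierLength(id, l) \rightarrow l \leq 20).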
The Satisfiability Modulo Theories (SMT) problem sits at the intersection of computer science and mathematics and deals with exactly such FOL formulae. Visualize an SMT instance as an FOL formula; our problem at hand is to find whether a given set of formulae is satisfiable. Although SMT provides a much richer set of tools to model decision problems, to keep things simple, let us discuss boolean satisfiability (SAT) problems. All we need to do is create boolean SAT instances and feed them to an SMT solver.
There are many SMT solvers readily available. Z3 is probably the most heavily used one. There are also higher-level tools built on top of Z3, such as Boogie. The input to Z3 is a simple script in which you specify the formulae and the predicates. For example,
(assert (> IdentifierLength 10))
specifies that for this formula to be true, IdentifierLength must be greater than 10. Let another formula be
(assert (< IdentifierCaseChanges 2))
which means that the variable IdentifierCaseChanges must be less than 2. Of course, IdentifierLength and IdentifierCaseChanges are functions we define over information extracted from source code. I talk about source code here, but you may apply the same idea to text as well. Once you have the predicates, just apply
(check-sat)
and Z3 will determine whether there is at least one interpretation under which all the asserted formulae are true. A full tutorial is available here.
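Putting the running example together, a complete script for the z3 command-line tool looks like this (the constant declarations are the only parts not shown above):

(declare-const IdentifierLength Int)
(declare-const IdentifierCaseChanges Int)
(assert (> IdentifierLength 10))
(assert (< IdentifierCaseChanges 2))
(check-sat)  ; z3 answers: sat
(get-model)  ; prints one satisfying assignment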
So, next time you are hunting for a class project, try this out! It should work for courses like Artificial Intelligence, Intelligent Systems, Program Analysis, Information Retrieval, Software Engineering, and so on. Of course, talk to your instructor first 🙂

Keeping pace with programming languages!

Keep an eye on programming languages, folks! They are evolving very fast.

In Java, the following code:

Collections.sort(students, new Comparator<Student>() {
    @Override
    public int compare(Student laurel, Student hardy) {
        return laurel.getWeight().compareTo(hardy.getWeight());
    }
});

can now be written in just one line (with a static import of java.util.Comparator.comparing):

students.sort(comparing(Student::getWeight));
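Here is a self-contained sketch of the new style; the Student class and the weights are made up for illustration:

import static java.util.Comparator.comparing;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortDemo {
    static class Student {
        private final String name;
        private final Integer weight;
        Student(String name, Integer weight) { this.name = name; this.weight = weight; }
        Integer getWeight() { return weight; }
        @Override public String toString() { return name; }
    }

    public static void main(String[] args) {
        List<Student> students = new ArrayList<>(Arrays.asList(
                new Student("Laurel", 60), new Student("Hardy", 135)));
        students.sort(comparing(Student::getWeight)); // lightest first
        System.out.println(students); // [Laurel, Hardy]
    }
}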

Wonderful!

Counting in Probability

Question: In the card game bridge, the 52 cards are dealt out equally to 4 players, called East, West, North, and South. If North and South have a total of 8 spades between them, what is the probability that East has 3 of the remaining 5 spades?

Discussion: East gets 3 spades out of the remaining 5 in \binom{5}{3} ways. The sample space should contain all possible ways East can get spades. All possible outcomes are \binom{5}{0} + \binom{5}{1} + \binom{5}{2} + \binom{5}{3} + \binom{5}{4} + \binom{5}{5}, i.e., East gets no spade, East gets exactly one spade, and so on, until East gets all 5 remaining spades.

Therefore the desired probability is \frac{\binom{5}{3}}{\binom{5}{0}+\binom{5}{1}+\binom{5}{2}+\binom{5}{3}+\binom{5}{4}+\binom{5}{5}} = 0.3125. But this does not agree with the answer Sheldon Ross gets! So, where is the problem?

The problem is in the counting. The number of ways three spades can be selected from five is not, in this case, simply \binom{5}{3}. It is in fact \binom{5}{3}\binom{21}{10}, because East's remaining 10 cards can be chosen from the 21 non-spade cards in \binom{21}{10} ways. We cannot ignore this factor, since it varies in proportion across the terms of the denominator. In the denominator, the number of ways East can get no spades is not \binom{5}{0}; it is \binom{5}{0}\binom{21}{13}, because he then chooses all 13 of his cards from the 21 non-spade cards. Similarly, when he has one spade, he can choose the rest of his cards in \binom{21}{12} ways, and he has five choices for that one spade. Therefore, East has \binom{5}{1}\binom{21}{12} ways to hold exactly one spade!

So, this works:

\frac{\binom{5}{3}\binom{21}{10}}{\binom{5}{0}\binom{21}{13}+\binom{5}{1}\binom{21}{12}+\binom{5}{2}\binom{21}{11}+\binom{5}{3}\binom{21}{10}+\binom{5}{4}\binom{21}{9}+\binom{5}{5}\binom{21}{8}} = 0.339.
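A sanity check: the denominator is simply the number of ways to deal East any 13 of the 26 cards that North and South do not hold. By Vandermonde's identity, \sum_{k=0}^{5}\binom{5}{k}\binom{21}{13-k} = \binom{26}{13} = 10400600, so the answer collapses to the familiar hypergeometric form \frac{\binom{5}{3}\binom{21}{10}}{\binom{26}{13}} = \frac{10 \times 352716}{10400600} \approx 0.339.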

Two morals from this story: 1) In most introductory probability questions, the mistake we make is counting incorrectly. 2) There are often multiple ways of solving the same problem. Try the problem yourself in your own way and double-check against the answer to ensure your approach is correct. Do not read a probability book like a novel.

How to parse code-like elements from free-form text?

Partial programs, such as uncompilable or incomplete code snippets, appear in discussion forums, emails, and similar informal communication media. A wealth of information is available in such places, and we want to parse these partial programs out of informal documentation. Lightweight regular expressions can be used, based on our knowledge of the naming conventions of API elements and other programming constructs. Miler is a technique based on this regex idea, but Miler's precision is only 33%, and it varies with the programming language.

Another tool used for parsing parts of source code is an island parser. The idea is to treat certain parts of the code as islands and parse them out, ignoring the text and the rest of the content (the water). To parse a snippet, you do not need to know the whole grammar; the unimportant parts can be defined in very relaxed terms, such as just a collection of characters. Parsers based on such grammars are known as island parsers. The ACE tool uses heuristics-based island parsers implemented as an ordered set of regular expressions. But instead of taking a collection of source code elements as input, as normal regex-based parsers do, ACE works over large collections of documents. ACE ignores the parts of the language that specify control flow (such as if, for, while) and uses its island parser to capture code-like elements such as fully qualified API names. In Java, these are of the form SomeType.someMethod(), for example SAXParseException.getLineNumber(). Knowledge of such heuristics helps identify code-like elements in text.
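To make the heuristic concrete, here is a toy sketch (not ACE's actual implementation) of a regex island that keeps Type.method() references and treats everything else as water:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApiNameIsland {
    // Island: a CamelCase type name, a dot, a camelCase method name, "()".
    private static final Pattern API_REFERENCE =
            Pattern.compile("\\b[A-Z][A-Za-z0-9]*\\.[a-z][A-Za-z0-9]*\\(\\)");

    public static void main(String[] args) {
        String post = "Wrap it in a try block; SAXParseException.getLineNumber() tells you where it failed.";
        Matcher m = API_REFERENCE.matcher(post);
        while (m.find()) {
            System.out.println(m.group()); // prints: SAXParseException.getLineNumber()
        }
    }
}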

Once these items are extracted, ACE attempts to map them to language elements such as packages, classes, methods, types, and variables. It uses specification documents to match known items to parsed items. If a match cannot be found, the parsed item is dropped.

Island parsers as implemented in ACE can only find code-like elements whose presentation is noticeably different from normal text. For instance, there is no way to differentiate a variable "flag" from the English word "flag" in free-form text. As of today, the ACE website claims that it works only on the Postgres dump of Stack Overflow. While the idea should apply to any free-form text, if you wish to play with this state of the art, be ready to get your hands dirty setting up their source code.

Hope the programming language design community takes note of this problem and makes it easier to write high-quality island parsers.

Abstract Syntax Trees in Code Search

Most code search engines index only the identifiers, comments, and similar words, and hence fail to return impressive results. Stanford researchers show that indexing subtrees, instead of keywords, leads to more effective search.

ASTs of production-grade software systems end up being very large. The researchers note that the average number of AST nodes, even in simple university programming assignments, is close to 150! While the size is scary, the structure fits the purpose. AST construction tools are also readily available, which is an added advantage.

Grammar drives programming language syntax. Parsers leverage the grammar to detect errors, extract parts of source code, do type checking, and so on. Since typical programming language grammars are context-free, their derivations are inherently trees. Thus abstract syntax trees came into existence.

While the original purpose of the AST was not much beyond parsing and its direct applications, researchers have explored how semantic analyses can be built over ASTs. Interesting ideas have evolved, such as replacing subtrees for program repair, finding duplicate subtrees for clone detection, and semantics-preserving AST transformations for synthesizing source code.

Code search has lagged behind in leveraging the richness of AST information. The gory details of source code can now be abstracted away, and meaningful subtrees can be indexed. Hope these ideas make it to production-quality, web-scale search engines soon.
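As a rough illustration of what indexing subtrees could look like, here is a minimal sketch that reuses the Eclipse JDT visitor idea to emit one key per subtree. The key format here is made up for illustration; the actual representation in the research differs:

import java.util.ArrayList;
import java.util.List;

import org.eclipse.jdt.core.dom.ASTNode;
import org.eclipse.jdt.core.dom.ASTVisitor;

public class SubtreeKeyVisitor extends ASTVisitor {
    // One coarse index key per subtree: node kind plus subtree source length.
    final List<String> keys = new ArrayList<>();

    @Override
    public void preVisit(ASTNode node) {
        String kind = node.getClass().getSimpleName(); // e.g., MethodDeclaration
        keys.add(kind + ":" + node.getLength());       // hypothetical key format
    }
}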