How to parse code-like elements from free-form text?

Partial programs such as uncompilable incomplete code snippets appear in discussion forums, emails, and such informal communication media. A wealth of information is available in such places and we want to parse such partial programs from informal documentation. Lightweight regular expressions can be used based on our knowledge of naming conventions of API elements or other programming constructs. Miler is a technique based on the regex idea. But Miler’s precision is only 33% and varies based on programming language.

Another tool used in this problem of parsing parts of source code is Island Parser. The idea is to see certain parts of code (as Islands) and parse them out ignoring text and rest of content (the water). To parse a snippet, you do not need to know the whole grammar. Unimportant parts can be defined in very relaxed terms such as just a collection of characters. Parsers based on such grammars are known as island parsers. ACE tool uses island parsers that are heuristics based implemented as a bunch of ordered regular expressions. But instead of depending on a collection of source code elements as in the normal regex-based parsers, ACE uses large collections of documents as input. In ACE tool, parts of language that specify control flow are ignored (such as if, for, while). ACE uses island parser to capture code-like elements such as fully qualified API names. In Java, API names are of the form SomeType.someMethod(). For example, SAXParseException.getLineNumber(). Knowledge of such heuristics can help identify code-like elements from text.

Once extracted, ACE attempts to map these items to language elements such as package, class, method, type and variables. It uses specification document to match known items to parsed items. If a match cannot be found, the parsed items are dropped.

Island parsers as implemented in ACE can only find code-like elements which are remarkably different in presentation than normal text. For instance, there is no way we can differentiate a variable “flag” from a word in free-form text, “flag”. ACE website as of today claims that it works on postgres form of stackoverflow only. While the idea should apply to any free-form text, if you wish to play around with this state of the art, you must be ready to make your hands dirty with some setup of their source code.

Hope the programming language design community takes note of this problem and makes it easier to write high quality island parsers.


Abstract Syntax Trees in Code Search

Most code search engines index only the identifiers, comments and such words and hence fail to return impressive results. Stanford researchers  show that instead of indexing keywords,  indexing subtrees lead to an effective search.

AST’s of production grade software systems end up being too large. They claim that the average number of AST nodes, even in simple university programming assignments is close to 150!  While the size is scary, the structure fits the purpose.  AST construction tools are easily available. That becomes an added advantage.

Grammar drives programming language syntax. Parsers leverage grammar to detect errors, extract parts of source code, do type checking, etc.  Since typical programming language grammars are context-free, their derivations are inherently trees. Thus, abstract syntax trees came into existence.

While the original purpose of AST was not much beyond parsing and direct applications of parsing, researchers inspect how semantic analyses can be built over ASTs. Interesting ideas evolved such as replacing subtrees for program repair, duplicate subtrees for clone detection, and semantics preserving AST transformation for synthesizing source code.

Code search has been lagging behind on these lines to leverage the richness of AST information. The gory details of source code can now be abstracted out and meaningful subtrees can be indexed. Hope these ideas make it to production-quality webscale search engines soon.

Answer these for good research

Here are some questions to ask yourself if you want to do good research.
Key points to ponder:
  1. What have you done?
    1. What is the big theme of your research?
    2. How do the small pieces connect for a bigger thesis?
    3. Have you done one work which is cool? Why is it cool? or What is cool about it?
    4. How impactful is your research? Where can we apply the ideas?
  2. When?
    1. Do you have a plan? Do you know when will you be done with your research?
  3. How?
    1. What skills did you accumulate as part of your research efforts?
    2. Do not use too many jargons while explaining.
    3. How do you know you have done a good job? Experimental evaluation.
  4. Literature
    1. How does your work fit into the big picture?
    2. Who has done what and where is the gap?
    3. How do you organize the literature?
    4. What is the scope of your research?
    5. Why the area of research is so cool?
  5. Presentation
    1. Have sufficient backup slides. I spent a lot of time on my backups.
    2. Keep to the flow. You may have to cut off a lot of interesting stuff. Sometimes, our mind does not allow us to do it. It may be cool to show it. I have done it and I want to talk about it. But, it just does not fit the flow. The right thing to do is to chop it off and move it to backup slides.
    3. Sometimes mentioning that “this is very interesting. yet, considering time, I will skim over…” might buy you more time to talk in detail 🙂
    4. Show lot of energy while talking.
  6. Golden rule: I think, above all, show confidence and joy while talking about your research. Rest will be automatically taken care of 🙂
These are not in any order. These are not exhaustive.

The Great Thirukkural

இன்னா செய்தாரை ஒறுத்தல் அவர் நாண

நன்னயம் செய்து விடல்

English Translation: Don’t get into tit-for-tats. When someone does anything bad to you, in return, do something good to them.

There are such great versus in Thirukkural. Thirukkural consists of 133 chapters with each chapter consisting of 10 such couplets.

Practical challenges in computing recall

Most experiments are designed on controlled corpus i.e., the precision and recall of the corpus are already known either manually or through some other means (not the same as the experimental tool/automation itself). Thus, these are smaller samples of the real corpus. An Oracle can now be implemented to compute recall. Sampling works in most cases. However, it has its own limitations too. For example, samples can suffer from a serious threat to validity. With another sample, the results could be different. Creating several large samples in several circumstances could be infeasible. Let us review some techniques followed by researchers in these contexts to compute recall.

Gold Sets or Benchmarks

Using an existing benchmark: One way to address this issue is to use a carefully selected representative dataset such as sf100 ( While this is large and unbiased, the issue could be that this is still too large for certain recall computation tasks. Such benchmarks are also referred to as “Gold Set”. Moreover, such benchmarks are rare and specialized that these may not suit your purpose all the time.

Creating your own benchmark: In Shepherd et al.’s paper on “Using Natural Language Program Analysis to Locate and Understand Action-Oriented Concerns”, he hires a new person to prepare the gold set along with relevant results. Another person verifies the results. Both these people discuss and reconcile wherever there were disagreements. This gold set is then released to the community.

Comparative Evaluation instead of Recall

In papers such as “Improving Bug Localization using Structured Information Retrieval”, a comparative result is given instead of recall. They claim that their approach finds x% of bugs more than another tool.


If there is only one result expected. Computing MRR is more appropriate than Recall. It is easier to do so in top-10 or top-k results.

More on this …soon.



Searching for UI

Over the last few months, I have been studying code search. A beautiful application of Code Search is the work done by Steven P. Reiss of Brown University on searching for user interfaces. He searches for the UI structure and APIs using Java and Swing/AWT knowledge. Further, he applies some transformations to avoid duplicates and score the search results. Basic idea is to extract UI code from the search results (from Ohloh, GitHub, etc) and build a new class file with standard identifier naming conventions. More transformations are applied to clean the code. For example, data providers are replaced with dummy providers. This transformed code is made compilable and then scored. Since the code is compilable, the user interfaces can now be shown as search results! What a beauty! Now, you can search for user interfaces and the results can be viewed as user interface snapshots. A very neat idea, well experimented. He finds 79 relevant usable results for a query “name mail phone jlist” query. The system is available for playing around at

Whats with queries in text search?

Am sure, if you are reading this blog, you must have at least “googled” once in your life. You must have issued “queries”. If you are more experienced with searches, you must have even wondered how to query effectively to get to your expected results sooner! The world of queries looks as though they are simple short text. They turn out to be much more valuable in providing insights and data that is necessary to improve your search experiences. Someone asked me if “understanding queries” is a worthy research topic and here are my quick thoughts.

Queries contain concepts. Concepts or senses instead of the actual terms used for queries work as a good abstraction in pulling up relevant results. Fechun Peng (MS Bing) reports significant DCG and CTR on application of this idea. Extracting concepts have even led to good query classification systems ie., queries can now be mapped as navigational, transactional, belonging to a particular domain such as finance, local, etc.

Queries could be long or short and each produces its own set of challenges from the perspective of intent understanding. Associating user profiles have been one way of reading queries accurately. Yet, the length of queries have had a significant impact on relevance. Some short queries easily lose context. “Michael Jordan” is the classic text book example. There are several famous MJs in the world.

Query terms are usually not independent. There are temporal, spatial, aggregational and several such relationships between the terms. Emperical results have shown that automatically weighted term strengths (on their impact over matching) give good relevance impact.

Query segmentation, not just limited to entity extraction, is a humungous open problem. Queries carry entities, task identification, events and what not! Several external data sources such as Wikipedia and dictionaries have been used to perform effective segmentation. Probabilistic models and Liguistic techniques have also been deeply explored.

We have known unstructured information from wikipedia being moved to structured representation in Yago and DBPedia. However, have you known of attempts to convert queries to structred text? This is true. Michael Bendersky of UMass has attempted just the same.

After knowing this much, its hard for anyone to wonder if studying the advances in handling queries for textual information retrieval is still a deep enough research topic!