Correctness, Precision, Accuracy – What is the difference?

Correctness and completeness are typically used to evaluate software quality. The idea can be extended to just about any task. Correctness usually reflects whether the program works as expected for all test inputs. Completeness refers to the availability of all features requested as per the requirements. In software testing parlance, a program is required to be correct and complete to the best extent possible.

This terminology sounds strikingly similar to precision and recall. In the Information Retrieval community, precision is defined as the fraction of relevant results among all the retrieved results. Recall is the fraction of relevant results retrieved over all relevant results present in the system.
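The two definitions above can be sketched in a few lines. This is a minimal illustration with made-up result IDs, not code from any particular retrieval system:

```python
def precision_recall(retrieved, relevant):
    """IR-style precision and recall from two sets.

    retrieved: set of result IDs returned by the system
    relevant:  set of all relevant result IDs in the corpus
    """
    hits = retrieved & relevant  # relevant results that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 3 of the 4 retrieved results are relevant,
# but the corpus holds 6 relevant results in total.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
# p = 3/4 = 0.75, r = 3/6 = 0.5
```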

In our research work on ANNE, we tag code snippets with natural language phrases. For example, “[ ]” is tagged as “array”. In this context, instead of the information retrieval term precision, we chose to use “correctness” for the tagging task. It is more intuitive since we are not searching for anything here. Similarly, “completeness” refers to whether all entities are discovered and tagged, which makes more sense than recall.

The Machine Learning community, especially classifier builders, uses “Accuracy”, which is not exactly the same as a combination of precision and recall such as the F-score. Let us say that you are developing a binary classifier to predict if a tumor is malignant or benign. You would calculate accuracy as the fraction of correct classifications over the entire set that was classified. This sounds exactly like precision. The only difference is that we do not have a restricted set of results; for instance, there is no Accuracy@10 the way we have Precision@10.
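To make the distinction concrete, here is a sketch of the accuracy computation for the tumor example. The predictions and labels are invented for illustration:

```python
def accuracy(predictions, labels):
    """Fraction of correct classifications over everything classified."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy binary tumor classifier output: 1 = malignant, 0 = benign.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 0, 1, 1, 0]
acc = accuracy(preds, truth)  # 6 of 8 correct -> 0.75
```

Note that every classified item counts, not just a top-k slice of results, which is why there is no natural “Accuracy@10”.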

I believe the ideas are similar; only the terminology is different. Depending on the task at hand, we apply our best judgment to select the most appropriate measure.


How to spend your time during MTech studies?

If each one of us asks, “Where should I go so that I make the best use of my time?”, we will start doing things that are most productive for us and for this universe. This is the right question, and it will transform you into someone great.

Now that you have decided to go for MTech, let us answer the second best question to ask, which is, “How do I utilize my time during MTech?”. Before anything, my first advice is not to worry too much about what comes after MTech. You are in. You will rock. Believe in yourself. You should have a goal. But you should not have worries like “Do I have a future?”. So, your first job is to set a goal. You need a plan to achieve that goal. Do not have a plan B. That will drag you down. Always have Plan A and only Plan A. You may keep changing your Plan A. That is alright. But at any moment, you have only one plan. All you have to do is 1) Make a goal. 2) Plan how to achieve it. 3) Achieve it. I know this does not help much. So, let us discuss in a bit more detail.

Some goals are bad. For example, “I want to work for Microsoft Research” is a bad goal. You are putting your happiness in Microsoft’s hands. Do not do that. “I want to do high quality research” is a good goal. If you do it in Microsoft, it is a blessed company. Else, it is its bad fate. So, all goals should be inward focused.

Having fixed this goal, the next question is “How to be a good researcher?”. MTech is not for preparing yourself towards this goal; PhD is. From a research perspective, MTech prepares you for a PhD. The goal of an MTech program is to give you the necessary breadth in CS, much more advanced than a BTech program, so that you can do your PhD well. Some of you are not good at theoretical CS. Some of you are not good at programming. Some of you may not have studied a few areas of your liking in depth. MTech is an opportunity to fill these gaps by taking two more years. Having done a BTech, and having prepared for GATE, you should know what your strengths and weaknesses are. Use your MTech time to remove weaknesses and add strengths. This means that you will sign up for courses, select electives, and do projects accordingly. So, for someone with a theoretical CS weakness or interest, the electives to chase would be Randomized Algorithms, Advanced Topics in ToC, Approximation Algorithms, Number Theory and so on. If the goal is different, say, “I want a job with more than X salary”, the plan would be very different. I am not the best person to advise for this goal, so I will stay out of it.

Irrespective of what your goal is, to become a master, you need to perform on a larger stage. You should be a good communicator. You need to be good at all forms of communication – spoken, written, body language and presentation. It should be your constant endeavour to observe and learn from all possible sources to improve your skills here. The fortunate ones can take courses inside and outside the university. But if that can’t happen, learn from the better ones. You will always find great orators around you. Observe them. There is no better way than to try and fail. Don’t be shy. Just do it.

Hope this helps.

How to parse code-like elements from free-form text?

Partial programs, such as uncompilable or incomplete code snippets, appear in discussion forums, emails, and similar informal communication media. A wealth of information is available in such places, and we want to parse such partial programs from informal documentation. Lightweight regular expressions can be used, based on our knowledge of the naming conventions of API elements or other programming constructs. Miler is a technique based on this regex idea, but Miler’s precision is only 33% and varies with the programming language.

Another tool used for this problem of parsing parts of source code is the island parser. The idea is to treat certain parts of the code as islands and parse them out, ignoring the text and the rest of the content (the water). To parse a snippet, you do not need to know the whole grammar: unimportant parts can be defined in very relaxed terms, such as just a collection of characters. Parsers based on such grammars are known as island parsers. The ACE tool uses heuristics-based island parsers implemented as a set of ordered regular expressions. But instead of depending on a collection of source code elements as normal regex-based parsers do, ACE uses large collections of documents as input. In ACE, the parts of the language that specify control flow (such as if, for, while) are ignored. ACE uses its island parser to capture code-like elements such as fully qualified API names. In Java, API names are of the form SomeType.someMethod(), for example, SAXParseException.getLineNumber(). Knowledge of such heuristics can help identify code-like elements in text.
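A regex heuristic of this flavor can be sketched in a few lines. This is only an illustration of the SomeType.someMethod() naming heuristic, not ACE’s actual implementation, and the sample forum text is invented:

```python
import re

# Java API references often look like SomeType.someMethod(): a capitalised
# type name, a dot, a camelCase method name, and parentheses.
API_CALL = re.compile(r'\b([A-Z][A-Za-z0-9]*)\.([a-z][A-Za-z0-9]*)\(\)')

def extract_api_names(text):
    """Return (type, method) pairs found in free-form text."""
    return API_CALL.findall(text)

post = ("I keep getting an error; SAXParseException.getLineNumber() "
        "returns -1 even though StringBuilder.toString() works fine.")
pairs = extract_api_names(post)
# [('SAXParseException', 'getLineNumber'), ('StringBuilder', 'toString')]
```

A real island parser would layer many such ordered patterns and fall through to a “water” rule that skips anything unrecognized.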

Once extracted, ACE attempts to map these items to language elements such as packages, classes, methods, types and variables. It uses the specification document to match parsed items against known items. If a match cannot be found, the parsed item is dropped.

Island parsers as implemented in ACE can only find code-like elements whose presentation is remarkably different from normal text. For instance, there is no way to differentiate a variable “flag” from the word “flag” in free-form text. The ACE website, as of today, claims that it works only on the postgres dump of Stack Overflow. While the idea should apply to any free-form text, if you wish to play around with this state of the art, you must be ready to get your hands dirty setting up their source code.

Hope the programming language design community takes note of this problem and makes it easier to write high quality island parsers.

Abstract Syntax Trees in Code Search

Most code search engines index only identifiers, comments and similar words, and hence fail to return impressive results. Stanford researchers show that indexing subtrees instead of keywords leads to an effective search.

ASTs of production-grade software systems end up being too large. The researchers report that the average number of AST nodes, even in simple university programming assignments, is close to 150! While the size is scary, the structure fits the purpose. AST construction tools are easily available, which is an added advantage.
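The “scary size” is easy to see for yourself with an off-the-shelf AST tool. The sketch below uses Python’s standard ast module on a made-up assignment-sized function; even this tiny program yields a few dozen nodes:

```python
import ast

def count_ast_nodes(source):
    """Count all AST nodes in a piece of Python source."""
    tree = ast.parse(source)
    return sum(1 for _ in ast.walk(tree))

# A typical beginner-assignment-sized function (invented example).
assignment = """
def average(xs):
    total = 0
    for x in xs:
        total += x
    return total / len(xs)
"""
n = count_ast_nodes(assignment)
```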

Grammar drives programming language syntax. Parsers leverage grammar to detect errors, extract parts of source code, do type checking, etc.  Since typical programming language grammars are context-free, their derivations are inherently trees. Thus, abstract syntax trees came into existence.

While the original purpose of the AST was not much beyond parsing and its direct applications, researchers have explored how semantic analyses can be built over ASTs. Interesting ideas have evolved, such as replacing subtrees for program repair, detecting duplicate subtrees for clone detection, and semantics-preserving AST transformations for synthesizing source code.

Code search has been lagging behind in leveraging the richness of AST information. The gory details of source code can now be abstracted away and meaningful subtrees can be indexed. I hope these ideas make it to production-quality, web-scale search engines soon.

Answer these for good research

Here are some questions to ask yourself if you want to do good research.
Key points to ponder:
  1. What have you done?
    1. What is the big theme of your research?
    2. How do the small pieces connect for a bigger thesis?
    3. Have you done one work which is cool? Why is it cool? or What is cool about it?
    4. How impactful is your research? Where can we apply the ideas?
  2. When?
    1. Do you have a plan? Do you know when you will be done with your research?
  3. How?
    1. What skills did you accumulate as part of your research efforts?
    2. Do not use too much jargon while explaining.
    3. How do you know you have done a good job? Experimental evaluation.
  4. Literature
    1. How does your work fit into the big picture?
    2. Who has done what and where is the gap?
    3. How do you organize the literature?
    4. What is the scope of your research?
    5. Why is the area of research so cool?
  5. Presentation
    1. Have sufficient backup slides. I spent a lot of time on my backups.
    2. Keep to the flow. You may have to cut off a lot of interesting stuff. Sometimes, our mind does not allow us to do it. It may be cool to show it. I have done it and I want to talk about it. But, it just does not fit the flow. The right thing to do is to chop it off and move it to backup slides.
    3. Sometimes mentioning that “this is very interesting. yet, considering time, I will skim over…” might buy you more time to talk in detail 🙂
    4. Show a lot of energy while talking.
  6. Golden rule: I think, above all, show confidence and joy while talking about your research. The rest will be automatically taken care of 🙂
These are not in any order. These are not exhaustive.

The Great Thirukkural

இன்னா செய்தாரை ஒறுத்தல் அவர் நாண

நன்னயம் செய்து விடல்

English Translation: Don’t get into tit-for-tats. When someone does anything bad to you, in return, do something good to them.

There are such great verses in Thirukkural. Thirukkural consists of 133 chapters, with each chapter containing 10 such couplets.

Practical challenges in computing recall

Most experiments are designed on a controlled corpus, i.e., the precision and recall over the corpus are already known, either manually or through some other means (not the experimental tool/automation itself). Thus, these are smaller samples of the real corpus. An oracle can now be implemented to compute recall. Sampling works in most cases, but it has its own limitations. For example, samples can suffer from a serious threat to validity: with another sample, the results could be different. Creating several large samples in several circumstances could be infeasible. Let us review some techniques researchers follow in these contexts to compute recall.
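Once such an oracle exists, the recall computation itself is trivial. The sketch below assumes a manually labelled sample; the helper and the document IDs are illustrative, not from any real study:

```python
def recall_against_gold(tool_results, gold_set):
    """Recall of a tool on a manually labelled sample (the oracle).

    tool_results: set of items the tool reported on the sample
    gold_set:     set of items known, by manual inspection, to be
                  the relevant items in that same sample
    """
    if not gold_set:
        return 0.0
    return len(tool_results & gold_set) / len(gold_set)

# Suppose manual inspection of a sample found 8 relevant items,
# of which the tool retrieved 6.
gold = {f"doc{i}" for i in range(8)}
found = {f"doc{i}" for i in range(6)}
r = recall_against_gold(found, gold)  # 6/8 = 0.75
```

The hard part, as discussed above, is not this division but obtaining a gold set that is representative of the real corpus.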

Gold Sets or Benchmarks

Using an existing benchmark: One way to address this issue is to use a carefully selected representative dataset such as SF100. While this is large and unbiased, the issue could be that it is still too large for certain recall computation tasks. Such benchmarks are also referred to as “gold sets”. Moreover, such benchmarks are so rare and specialized that they may not suit your purpose all the time.

Creating your own benchmark: In Shepherd et al.’s paper, “Using Natural Language Program Analysis to Locate and Understand Action-Oriented Concerns”, the authors hire a new person to prepare the gold set along with the relevant results. Another person verifies the results. The two then discuss and reconcile wherever there are disagreements. This gold set is then released to the community.

Comparative Evaluation instead of Recall

In papers such as “Improving Bug Localization using Structured Information Retrieval”, a comparative result is given instead of recall: the authors claim that their approach finds x% more bugs than another tool.


If only one result is expected, computing MRR (Mean Reciprocal Rank) is more appropriate than recall. It is easy to do so over the top-10 or top-k results.
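MRR is simple to compute once each query has exactly one expected answer: average the reciprocal of the rank at which that answer appears. The queries and rankings below are invented for illustration:

```python
def mean_reciprocal_rank(ranked_results, expected):
    """MRR over a set of queries, each expecting exactly one result.

    ranked_results: list of ranked result lists, one per query
    expected:       list of the single correct answer per query
    """
    total = 0.0
    for results, answer in zip(ranked_results, expected):
        if answer in results:
            total += 1.0 / (results.index(answer) + 1)  # rank is 1-based
    return total / len(expected)

# Three queries; the correct answer appears at rank 1, rank 3,
# and not at all in the top-k, respectively.
runs = [["a", "b"], ["x", "y", "a"], ["p", "q"]]
answers = ["a", "a", "a"]
mrr = mean_reciprocal_rank(runs, answers)  # (1 + 1/3 + 0) / 3 ≈ 0.444
```

An answer missing from the top-k simply contributes zero, which is what makes MRR convenient over truncated top-10 or top-k result lists.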

More on this …soon.