Abstract Syntax Trees in Code Search

Most code search engines index only the identifiers, comments and such words and hence fail to return impressive results. Stanford researchers  show that instead of indexing keywords,  indexing subtrees lead to an effective search.

AST’s of production grade software systems end up being too large. They claim that the average number of AST nodes, even in simple university programming assignments is close to 150!  While the size is scary, the structure fits the purpose.  AST construction tools are easily available. That becomes an added advantage.

Grammar drives programming language syntax. Parsers leverage grammar to detect errors, extract parts of source code, do type checking, etc.  Since typical programming language grammars are context-free, their derivations are inherently trees. Thus, abstract syntax trees came into existence.

While the original purpose of AST was not much beyond parsing and direct applications of parsing, researchers inspect how semantic analyses can be built over ASTs. Interesting ideas evolved such as replacing subtrees for program repair, duplicate subtrees for clone detection, and semantics preserving AST transformation for synthesizing source code.

Code search has been lagging behind on these lines to leverage the richness of AST information. The gory details of source code can now be abstracted out and meaningful subtrees can be indexed. Hope these ideas make it to production-quality webscale search engines soon.


Answer these for good research

Here are some questions to ask yourself if you want to do good research.
Key points to ponder:
  1. What have you done?
    1. What is the big theme of your research?
    2. How do the small pieces connect for a bigger thesis?
    3. Have you done one work which is cool? Why is it cool? or What is cool about it?
    4. How impactful is your research? Where can we apply the ideas?
  2. When?
    1. Do you have a plan? Do you know when will you be done with your research?
  3. How?
    1. What skills did you accumulate as part of your research efforts?
    2. Do not use too many jargons while explaining.
    3. How do you know you have done a good job? Experimental evaluation.
  4. Literature
    1. How does your work fit into the big picture?
    2. Who has done what and where is the gap?
    3. How do you organize the literature?
    4. What is the scope of your research?
    5. Why the area of research is so cool?
  5. Presentation
    1. Have sufficient backup slides. I spent a lot of time on my backups.
    2. Keep to the flow. You may have to cut off a lot of interesting stuff. Sometimes, our mind does not allow us to do it. It may be cool to show it. I have done it and I want to talk about it. But, it just does not fit the flow. The right thing to do is to chop it off and move it to backup slides.
    3. Sometimes mentioning that “this is very interesting. yet, considering time, I will skim over…” might buy you more time to talk in detail 🙂
    4. Show lot of energy while talking.
  6. Golden rule: I think, above all, show confidence and joy while talking about your research. Rest will be automatically taken care of 🙂
These are not in any order. These are not exhaustive.

The Great Thirukkural

இன்னா செய்தாரை ஒறுத்தல் அவர் நாண

நன்னயம் செய்து விடல்

English Translation: Don’t get into tit-for-tats. When someone does anything bad to you, in return, do something good to them.

There are such great versus in Thirukkural. Thirukkural consists of 133 chapters with each chapter consisting of 10 such couplets.

How to study?

Some notes on student success:

  • Someone’s got to tell you money is not extremely important, at least for now.
  • Success repeats.
  • There is one quality in all achievers, “Gratitude”!
  • Success is not bought. It is built, little by little.
  • Phased repetition is more important than one time slog.
  • Focus. Have a direction and stick to it.
  • First 15 hours or so of study gives you an idea, does not make you an expert.
  • After the first 100 hours of preparation, you will know that you know nothing. Keep patience and confidence. Stick to your direction.
  • Practice makes you perfect. There is this 10 year rule. Listen to Angela Lee Duckworth.
  • Mind works at a different speed. Writing perhaps slows you down enough to give your mind the time to think. So, write down important things. Revise.
  • When you can no longer think, stop reading. Take a break. Aptitude is what you are training for. Not, memory.
  • Most key events such as exams happen in the morning hours. So, keep your body cycle such that you are at your best during these hours.
  • Find pleasure in the process of achievement, and not in the achievement itself. That way, you will have many hours of satisfaction instead of just a few moments.

Practical challenges in computing recall

Most experiments are designed on controlled corpus i.e., the precision and recall of the corpus are already known either manually or through some other means (not the same as the experimental tool/automation itself). Thus, these are smaller samples of the real corpus. An Oracle can now be implemented to compute recall. Sampling works in most cases. However, it has its own limitations too. For example, samples can suffer from a serious threat to validity. With another sample, the results could be different. Creating several large samples in several circumstances could be infeasible. Let us review some techniques followed by researchers in these contexts to compute recall.

Gold Sets or Benchmarks

Using an existing benchmark: One way to address this issue is to use a carefully selected representative dataset such as sf100 (http://www.evosuite.org/experimental-data/sf100/). While this is large and unbiased, the issue could be that this is still too large for certain recall computation tasks. Such benchmarks are also referred to as “Gold Set”. Moreover, such benchmarks are rare and specialized that these may not suit your purpose all the time.

Creating your own benchmark: In Shepherd et al.’s paper on “Using Natural Language Program Analysis to Locate and Understand Action-Oriented Concerns”, he hires a new person to prepare the gold set along with relevant results. Another person verifies the results. Both these people discuss and reconcile wherever there were disagreements. This gold set is then released to the community.

Comparative Evaluation instead of Recall

In papers such as “Improving Bug Localization using Structured Information Retrieval”, a comparative result is given instead of recall. They claim that their approach finds x% of bugs more than another tool.


If there is only one result expected. Computing MRR is more appropriate than Recall. It is easier to do so in top-10 or top-k results.

More on this …soon.



How to title your thesis?

Four simple rules to keep in mind while naming your thesis are:

  1. Avoid redundancy.
  2. Title can be broader but never narrower.
  3. A title worth to be a survey paper will be good.
  4. Complete, catchy and crisp.

Following is one approach to arrive at a title:

  1. List down the connecting ideas that determine your work. Usually there are three to four ideas. For instance, I
    1. Improve code search.
    2. Leverage naturalness of source code.
    3. Use natural language descriptions around source code.
  2. See if any of these are too narrow. If yes, make them broader. For instance,
    1. “natural language description” are highly specialized form of “documentation”. In other words, documentation can be in any format.  So, let us make it “Use documentation”.
  3. Look at survey titles in your area of research to find some naming styles. I went to google scholar and tried the query “TSE code search survey” In my case, here are some examples that I liked:
    1. Feature location is source code: A taxonomy and survey.
    2. A survey of software reuse libraries
    3. Exemplar: A source code search engine for finding highly relevant applications
    4. Comparing two methods of sending out questionnaires; E-mail versus mail
    5. Tracelet-based code search in executables
    6. … and so on
  4. Now, the third one looks like an extension of a single conference paper idea. So, I drop it. For the rest, I abstract and note down the styles as follows:
    1. X in Y: A taxonomy and survey.
    2. A survey of X.
    3. Comparing two methods of X; x1 versus x2.
    4. X-based Y in Z.
    5. X’ing Y-based applications via automated combination of Z techniques.
    6. Learning from X to improve Y.
    7. Comparison and evaluaiton of X tools and techniques: A qualitative approach.
    8. X based recommendation for Y.
    9. Effective X based on Y model.
    10. Exploring the X patterns of Y in Z.
    11. … and so on.
  5. Ok! There are a lot. So, let us find what type of these abstractions will suit us. Clearly, I do no comparative evaluation. So, it won’t suit me. I have to combine the key ideas of “software engineering applications”, “modeling source code”, “using documentation”, “leveraging naturalness” and “code search”. So, let us narrow down and look for such patterns:
    1. Leveraging documentation and exploiting the naturalness of source code in improving code search. (too long)
    2. Enhanced retrieval of source code by leveraging big code and big data. (too heavy – big code, big data, retrieval)
    3. Enhancing code search by automatically mining related documentation. (not bad but too simple).
    4. Improving code search using relevant documentation (much better than 3 but still simple).
    5. Exploiting retrieval models for analysis of source code. (sounds good)
    6. Models of source code to support retrieval based applications.
    7. Leveraging naturalness and relevant documentation in source code representations.
    8. Source code representations for search.  (too short – misses key points)
    9. Improving code search using retrieval models.
    10. Adapting text retrieval models for analysis of source code: Benefits and Challenges.
  6. Note that the above step makes me think what exactly am I doing?
    1. There is an implied priority in the order of phrases. For example, In “Models of source code to support retrieval based applications”, the emphasis is more in modeling source code. Naturally, it is expected that the survey will cover state of the art code models. This fits my work.  In “Adapting text retrieval models for analysis of source code”, it sounds like I am going to cover text retrieval models in depth, and perhaps no source code models. I do both to some extent actually!
  7. Let us now pick a few and think deeper. To aid our work, let’s group our ideas as perspectives.
    1. Perspectives on modeling source code
      1. Models of source code to support retrieval based applications.
      2. Source code representations for search.
    2. IR perspective
      1. Improving code search using retrieval models.
      2. Enhancing code search by automatically mining related documentation.
      3. Building retrieval based applications by leveraging naturalness in source code.
    3. Naturalness perspective
      1. Leveraging statistical properties of source code in improving code search.
      2. Leveraging statistical properties of source code in retrieval (based applications).
      3. Leveraging statistical properties of source code for effective code search.
      4. Leveraging naturalness of source code in building retrieval based applications.
    4. Intelligence perspective
      1. Knowledge discovery from Big Code and relevant documentation.
      2. Leveraging large scale source code repositories for building search-based applications.
  8. Ok! So, what should I do now? Best way to go ahead would be to discuss this with few people around and decide which one I would be most comfortable with.

Good luck!

Doing a PhD

PhD students often have several questions about conducting research, job opportunities after PhD, etc. Having talked to several students, professors and researchers. Here is a compilation of wisdom obtained on these lines. There is no specific right answer and there are always exceptions. So, take these with caution. Also, most of these apply to computer science, big data, data science, ML kind of background.
  1. Positioning: Typically, the inverted triangle approach is followed to find research gaps and select an area to focus. As an example, here’s how a colleague of mine shaped his work during his PhD: Image Analysis –> Biometrics –> Fingerprint recognition –> Latent Fingerprint Analysis. Note that there may be many ways to draw the hierarchy to reach to Latent Fingerprint Analysis. There is no rule or any right way to select one of them. However, having clarity on this hierarchy is important for few reasons:
    1. After PhD, how would you sell yourself? As Latent Fingerprint Analysis expert? It is too narrow to find job opportunities. How about Fingerprint Recognition expertise? Still too narrow. Our country may not have sufficient job opportunities. Much broader levels may work; but is still hard. Moreover, at much broader levels, how good are we?. So, as much as we gain depth in our research field, a solid breadth is also required. Moral: Be a domain expert, area expert and not just a problem expert.
    2. Finding right problems to solve. Time is too short to focus on everything.
  2. Dependence on Advisor: Be independent. It is your PhD. PhD is all about training you to be an independent researcher.
  3. PhD Training: PhD is all about training yourself for independent research. Doing high quality research requires skills in terms of:
    1. Area survey.
    2. Finding the right problem.
    3. Literature review.
    4. Problem definition.
    5. Solution approach.
    6. … all sections of the paper.
  4. Timing for Job Application: At least 6 months goes in the application process if you are applying to academia. Keep an eye on the requirements. xx conf papers, yy journal papers, zz TRs are important for UGC norms.
  5. Does brand value matter? Unfortunately, yes. Internship and post-docs at good places are probably important for this reason. Credibility of profile is very important. Good publications, a reputed post-doc, competitive skills, etc will help you.
  6. Why should I do internship?
    1. Brand value to resume.
    2. Learn different styles of writing, working, environment etc.
    3. Exposure to real world.
    4. Adapting to newer problems and people.
    5. Make contacts.
Skill Set
  1. What skill sets are you building? Develop skill sets during PhD period. In this case,
    1. Technical: Feature analysis, Image Segmentation, Noise removal, data enhancement, deep learning libraries, ML, general problem solving, etc.
    2. Managerial: Worked with other students on BTP, IP, individually, etc.
    3. Teaching: TA awards, etc.
    4. Coding: Java, Hadoop, R, etc.
    5. Financial: Acquiring funding – Writing research proposals.
    6. Communication
    7. Networking: In the domain of work, build contacts.
  2. PhD in Computer Science: Implies that you can solve problems in computer science. You are not a PhD in Latent Fingerprint Analysis. Keep this in mind. Think CS, Do CS. Keep learning CS.
  3. Making tangible contributions: Create products, tools, proof of concepts. Publish papers. Pass competitive exams.
Where should I spend my time?
  1. Improve skills on which you are already good at? Or, Build new skills? Prioritize. Have a clear map based on direction you want to take in future. Needs clarity on vision.
  2. Keep honing your skills.
  3. Manage breadth and depth in parallel. Do not get bogged down too much in depth alone.
  4. Presentations are just a tool to communicate your ideas. Do not overspend your time on preparing ppts. Work on your skills and thinking process.
Industry Expectations
  1. Your research topic is “blah blah”. What else have you done apart from this? What skills do you bring? Show a flavor of breadth you bring in. Can you code?
  2. Analyzing a real problem. Typically, a project which the interviewer is part of, is presented in interview as a toy problem and you are tested on how you would approach such problems. In a way, this tests your “ability to think from scratch”.
Managing Complexity
  1. There are too many things to learn. Too little time with us. Clear thinking, good breadth, analytical skills, presence of mind, and communication skills can help you here.
  2. Do not over defend your work. Every work has its limitations.

More on this… soon.