How to Title Case?

Most conferences, especially the ACM conferences I submit to, suggest that we use title case for titles and headings.

There seems to be no agreement on the right way to title case. It turns out that there are several styles, such as APA, Chicago, AP, MLA and so on.

The best way seems to be to go with the site. Check the APA and Chicago styles for your title. Use a title where both these styles agree. For instance, I wanted the title “Code Variants and their Retrieval using Knowledge Discover based Approaches”. I figured out that the APA and Chicago versions of this title, after title casing, are both “Code Variants and Their Retrieval Using Knowledge Discover Based Approaches”. So, I went with it.
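
As a rough sketch of what such title-casing tools do, the Java snippet below capitalizes every word except a small set of short articles, conjunctions and prepositions, and always capitalizes the first and last word. The class name `TitleCaseSketch` and the word list `MINOR_WORDS` are assumptions made for illustration. This is a simplified rule of thumb, not a faithful implementation of the APA or Chicago rules, which differ in details such as how longer prepositions are treated.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: a simplified title-casing rule of thumb,
// NOT a complete implementation of APA or Chicago style.
final class TitleCaseSketch {

    // Assumed list of "minor" words that stay lowercase mid-title.
    private static final Set<String> MINOR_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "and", "but", "or", "nor", "for",
            "as", "at", "by", "in", "of", "on", "to"));

    static String titleCase(String title) {
        String trimmed = title.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            boolean firstOrLast = (i == 0 || i == words.length - 1);
            boolean minor = MINOR_WORDS.contains(word.toLowerCase(Locale.ROOT));
            if (i > 0) {
                out.append(' ');
            }
            if (minor && !firstOrLast) {
                out.append(word.toLowerCase(Locale.ROOT));
            } else {
                // Capitalize only the first character and leave the rest
                // untouched, so that acronyms such as "ACM" survive.
                out.append(Character.toUpperCase(word.charAt(0)))
                   .append(word.substring(1));
            }
        }
        return out.toString();
    }
}
```

Calling `TitleCaseSketch.titleCase("Code Variants and their Retrieval using Knowledge Discover based Approaches")` returns the same title that I went with above. For real submissions, still check the result against the style guide of the venue.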


The Devil Named Masters Thesis

A Masters thesis introduces you to serious research. The purpose is to expose you to the tasks associated with research. It is training that helps you appreciate steps like finding a problem, scoping the work, conducting a literature survey, studying related work, conducting experiments, devising algorithms, implementing your ideas, conducting comparative evaluation and finally being able to convey all this well to both novice and expert audiences.

The specific field or the problem you choose to solve is not very important from the perspective of this training. The focus throughout should be on getting trained in the above. The problem itself should not be your great worry.

But that said, it is important to choose a problem that sets you up for success. Since you are likely to continue with similar ideas later for a PhD, you want to take up something that interests you. Even if you do not go into research, you will be asked to explain your thesis at hundreds of job interviews. So, you should be good in the related field. Many of us at this stage do not have a favorite subject. That is sad. Even if we do, it is a favorite subject mostly because we understood everything and see no problems! The way we study is such that we look for solutions, not for problems. Ask a GATE topper. He will tell you, “Give me any problem in ToC, I can solve it!” This thinking is not suitable for research. Instead, you now need to develop the skill of finding a problem with every solution. Let us take an FSM (Finite State Machine). A very simple model, right? There seems to be nothing wrong with it. There seems to be nothing missing in it!? Think again. What if you want to add the notion of time to an FSM? That led to timed automata or temporal models. Welcome to the first step of research: finding problems in the most elegant solutions, which you understand thoroughly.

Suppose you have a favorite subject, you thoroughly understand a part of it, and you can see possible extensions, or fundamental issues with a solution for which there seems to be no alternative. The next thing is to understand whether this is a problem you can solve within roughly 6 to 12 months. Given that there is so much uncertainty and lack of skills in the forthcoming steps, your estimate is going to be wrong. You are very likely to come up with an overambitious problem which you will not end up solving. This is the top reason why most Masters theses fail or get delayed.

Let us assume you managed to find a problem whose scope is very reasonable. You may be surprised to see that the solution takes you to a field which you are not comfortable with. My work on finding similarities between code snippets is now taking me to Category Theory! I had identified myself as a systems researcher so far, and now it is impossible to skip theory.

Being able to learn quickly and turn out good results is something you young guys specialize in. You can live without sleep, food and easily without bathing for several days. Although I do not recommend these, the point is that your stamina and strength will take you to the solution.

If you defy all these odds and still think you are doing a good job, your Indian English with a WhatsApp accent will kill all your chances of a publication. Inspirational and scientific writing needs a lot of practice. Most of you hesitate to write. Even after pleading and begging for comments, all I get is one or two lines from a very small subset of students here. Good luck to the rest! You will learn the hard way.

Another issue with publishing your work is related to how many people are solving similar problems. Crowded areas are competitive. Stay out of the crowd wherever possible. There are fields like Programming Language Design, Computer Architectures, Transport Layer Problems which have very few researchers in the country. Areas such as AI, ML and Data Science are flooded. Do not enter crowded areas unless you know what you are doing. One plausible reason could be that you want a job in AI. In that case AI is worth the struggle.

All these make the Masters thesis “The perfect devil to beat”. At the end, you get that feeling of having learned useful skills. Yesterday, as I was listening to Kasparov (former chess champion), he said, “I like the masters of the past because they bring something new to the game”. I believe our Masters thesis is no different. We bring something new to our field. We should do such things for which we are respected.

So, let me now come to the short answer to your question. Given the complexity involved, you need a good guide to steer you to success. So, in the broad areas of your interest, find the best people and get associated with them. The rest will fall into place automatically.

I am sorry for such a disappointing answer to a lovely question. But that is the best I could give with my limited knowledge. Good luck.

PhD Life

PhD life is like a train that takes you to a new station every day. You look through the window, anxious about whether you will find something new, something novel as we say! Moreover, the taste of the same tea from different hands at different stations is so different! This is what happens to us when we find the same solution to the same problem through a different route.

How to extract and use constraints from source code?

Most static analysis techniques over source code start by constructing an Abstract Syntax Tree (AST) or a call graph. Eclipse JDT has the tools necessary for these tasks. It is straightforward to traverse the AST to collect information about the lines, methods, loops, conditions, variables, etc., used in the source code. Here’s a simple example where you can insert your own code to run whenever Eclipse JDT finds a method declaration node in the AST. If this code template looks unfamiliar to you, read up on the Visitor design pattern. A full example is available at the Vogella site.
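import java.util.*;
import org.eclipse.jdt.core.dom.*;
public class MethodVisitor extends ASTVisitor {
    List<MethodDeclaration> methods = new ArrayList<>();

    public boolean visit(MethodDeclaration node) {
        // ...your code here...
        return true; // true tells JDT to keep visiting the child nodes
    }
}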
Features thus extracted from source code can be used for a variety of purposes. Here, we discuss the process of converting them to First Order Logic (FOL) constraints. For example, suppose we extracted a predicate IdentifierLength(id, l), where id is the identifier (consider an identifier as a fancy term for the name of a variable) and l is the number of characters in the identifier. With such predicates and the standard logical operators, it is trivial to define FOL constraints.
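
To make the predicate concrete, here is a minimal sketch of how the two identifier features used in this post (IdentifierLength, and IdentifierCaseChanges which appears further down) could be computed once the identifiers have been collected from the AST. The class `IdentifierFeatures` and its method names are hypothetical. Counting a "case change" as a pair of adjacent letters whose case differs is also an assumption, since the post does not define it.

```java
// Hypothetical feature extractors for identifiers collected from an AST.
final class IdentifierFeatures {

    // IdentifierLength(id, l): l is the number of characters in id.
    static int identifierLength(String id) {
        return id.length();
    }

    // IdentifierCaseChanges: assumed here to be the number of adjacent
    // letter pairs whose case differs (e.g. "count" has 0, "myVar" has 2).
    static int identifierCaseChanges(String id) {
        int changes = 0;
        for (int i = 1; i < id.length(); i++) {
            char prev = id.charAt(i - 1);
            char curr = id.charAt(i);
            // Digits and underscores are ignored; only letter pairs count.
            if (Character.isLetter(prev) && Character.isLetter(curr)
                    && Character.isUpperCase(prev) != Character.isUpperCase(curr)) {
                changes++;
            }
        }
        return changes;
    }
}
```

These two methods could be called from inside the visitor for every identifier it encounters, and the resulting numbers are what the constraints are written over.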
The Satisfiability Modulo Theories (SMT) problem lies at the intersection of computer science and mathematics and deals with such FOL formulae. Visualize an SMT instance as an FOL formula; our problem at hand is to find whether such a set of formulae is satisfiable. Although SMT provides a much richer set of tools to model decision problems, to keep things simple, let us restrict the discussion to boolean satisfiability (SAT) problems. All we need to do is create boolean SAT instances and feed them to an SMT solver.
There are many SMT solvers readily available. Z3 is a heavily used SMT solver. There are also tools built on top of Z3, such as Boogie. The input to Z3 is a simple script where you specify the FOL formulae and the predicates. For example,
(assert (> IdentifierLength 10))
specifies that for this formula to be true, IdentifierLength must be greater than 10. Let another formula be
(assert (< IdentifierCaseChanges 2))
which means that the variable IdentifierCaseChanges should be less than 2. Of course, IdentifierLength and IdentifierCaseChanges are what we define as functions, using information extracted from source code. I talk about source code here, but you may apply the same idea to text as well. Once you have the predicates, just apply
(check-sat)
and Z3 will find whether there is at least one interpretation such that all asserted formulae are true. A full tutorial is available here.
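
Putting the pieces together, the sketch below writes out a complete SMT-LIB script for a single identifier: it declares the two integer constants, binds them to values extracted from source code, adds the two constraints from above and ends with (check-sat). The class `ConstraintScript` and the method `forIdentifier` are hypothetical names, and the declarations are an addition here, since the fragments above leave them out. The output is plain text that can be saved to a file and given to the Z3 command-line tool.

```java
// Hypothetical generator for a small SMT-LIB v2 script.
final class ConstraintScript {

    // length and caseChanges are feature values already extracted from
    // source code for one identifier; both are assumed to be non-negative.
    static String forIdentifier(int length, int caseChanges) {
        StringBuilder script = new StringBuilder();
        // Declare the symbols before using them in assertions.
        script.append("(declare-const IdentifierLength Int)\n");
        script.append("(declare-const IdentifierCaseChanges Int)\n");
        // Facts: bind each symbol to the extracted value.
        script.append("(assert (= IdentifierLength ").append(length).append("))\n");
        script.append("(assert (= IdentifierCaseChanges ").append(caseChanges).append("))\n");
        // Constraints taken from the text above.
        script.append("(assert (> IdentifierLength 10))\n");
        script.append("(assert (< IdentifierCaseChanges 2))\n");
        // Ask the solver whether all assertions can hold together.
        script.append("(check-sat)\n");
        return script.toString();
    }
}
```

For an identifier of length 16 with no case changes, Z3 should answer sat for the generated script. For a length of 5, the fact contradicts the first constraint and the answer should be unsat, which flags the identifier as violating the rule.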
So, next time, when you are hunting for a class project, try this out! It should work for courses like Artificial Intelligence, Intelligent Systems, Program Analysis, Information Retrieval, Software Engineering, and so on! Of course, talk to your instructor though 🙂

How to title your thesis?

Four simple rules to keep in mind while naming your thesis are:

  1. Avoid redundancy.
  2. The title can be broader than the work, but never narrower.
  3. A title worthy of a survey paper is a good one.
  4. Complete, catchy and crisp.

Following is one approach to arrive at a title:

  1. List the connecting ideas that define your work. Usually there are three to four ideas. For instance, I:
    1. Improve code search.
    2. Leverage naturalness of source code.
    3. Use natural language descriptions around source code.
  2. See if any of these are too narrow. If yes, make them broader. For instance,
    1. “Natural language descriptions” are a highly specialized form of “documentation”. In other words, documentation can be in any format. So, let us make it “Use documentation”.
  3. Look at survey titles in your area of research to find some naming styles. I went to Google Scholar and tried the query “TSE code search survey”. In my case, here are some examples that I liked:
    1. Feature location in source code: A taxonomy and survey.
    2. A survey of software reuse libraries
    3. Exemplar: A source code search engine for finding highly relevant applications
    4. Comparing two methods of sending out questionnaires; E-mail versus mail
    5. Tracelet-based code search in executables
    6. … and so on
  4. Now, the third one looks like an extension of a single conference paper idea. So, I drop it. For the rest, I abstract and note down the styles as follows:
    1. X in Y: A taxonomy and survey.
    2. A survey of X.
    3. Comparing two methods of X; x1 versus x2.
    4. X-based Y in Z.
    5. X’ing Y-based applications via automated combination of Z techniques.
    6. Learning from X to improve Y.
    7. Comparison and evaluation of X tools and techniques: A qualitative approach.
    8. X based recommendation for Y.
    9. Effective X based on Y model.
    10. Exploring the X patterns of Y in Z.
    11. … and so on.
  5. Ok! There are a lot. So, let us find which of these abstractions will suit us. Clearly, I do no comparative evaluation. So, the comparison styles won’t suit me. I have to combine the key ideas of “software engineering applications”, “modeling source code”, “using documentation”, “leveraging naturalness” and “code search”. So, let us narrow down and look for such patterns:
    1. Leveraging documentation and exploiting the naturalness of source code in improving code search. (too long)
    2. Enhanced retrieval of source code by leveraging big code and big data. (too heavy – big code, big data, retrieval)
    3. Enhancing code search by automatically mining related documentation. (not bad but too simple).
    4. Improving code search using relevant documentation (much better than 3 but still simple).
    5. Exploiting retrieval models for analysis of source code. (sounds good)
    6. Models of source code to support retrieval based applications.
    7. Leveraging naturalness and relevant documentation in source code representations.
    8. Source code representations for search.  (too short – misses key points)
    9. Improving code search using retrieval models.
    10. Adapting text retrieval models for analysis of source code: Benefits and Challenges.
  6. Note that the above step makes me think about what exactly I am doing.
    1. There is an implied priority in the order of phrases. For example, in “Models of source code to support retrieval based applications”, the emphasis is more on modeling source code. Naturally, it is expected that the survey will cover state of the art code models. This fits my work. In “Adapting text retrieval models for analysis of source code”, it sounds like I am going to cover text retrieval models in depth, and perhaps no source code models. I actually do both to some extent!
  7. Let us now pick a few and think deeper. To aid our work, let’s group our ideas as perspectives.
    1. Perspectives on modeling source code
      1. Models of source code to support retrieval based applications.
      2. Source code representations for search.
    2. IR perspective
      1. Improving code search using retrieval models.
      2. Enhancing code search by automatically mining related documentation.
      3. Building retrieval based applications by leveraging naturalness in source code.
    3. Naturalness perspective
      1. Leveraging statistical properties of source code in improving code search.
      2. Leveraging statistical properties of source code in retrieval (based applications).
      3. Leveraging statistical properties of source code for effective code search.
      4. Leveraging naturalness of source code in building retrieval based applications.
    4. Intelligence perspective
      1. Knowledge discovery from Big Code and relevant documentation.
      2. Leveraging large scale source code repositories for building search-based applications.
  8. Ok! So, what should I do now? The best way to go ahead would be to discuss this with a few people around and decide which one I would be most comfortable with.

Good luck!

Doing a PhD

PhD students often have several questions about conducting research, job opportunities after a PhD, etc. Having talked to several students, professors and researchers, I have compiled here the wisdom obtained along these lines. There is no specific right answer and there are always exceptions. So, take these with caution. Also, most of these apply to a computer science, big data, data science or ML kind of background.
  1. Positioning: Typically, the inverted triangle approach is followed to find research gaps and select an area to focus on. As an example, here’s how a colleague of mine shaped his work during his PhD: Image Analysis –> Biometrics –> Fingerprint recognition –> Latent Fingerprint Analysis. Note that there may be many ways to draw the hierarchy to reach Latent Fingerprint Analysis. There is no rule or any right way to select one of them. However, having clarity on this hierarchy is important for a few reasons:
    1. After PhD, how would you sell yourself? As a Latent Fingerprint Analysis expert? It is too narrow to find job opportunities. How about Fingerprint Recognition expertise? Still too narrow. Our country may not have sufficient job opportunities. Much broader levels may work, but it is still hard. Moreover, at much broader levels, how good are we? So, as much as we gain depth in our research field, a solid breadth is also required. Moral: Be a domain expert and an area expert, not just a problem expert.
    2. Finding right problems to solve. Time is too short to focus on everything.
  2. Dependence on Advisor: Be independent. It is your PhD. PhD is all about training you to be an independent researcher.
  3. PhD Training: PhD is all about training yourself for independent research. Doing high quality research requires skills in terms of:
    1. Area survey.
    2. Finding the right problem.
    3. Literature review.
    4. Problem definition.
    5. Solution approach.
    6. … all sections of the paper.
  4. Timing for Job Application: At least 6 months go into the application process if you are applying to academia. Keep an eye on the requirements. xx conf papers, yy journal papers, zz TRs are important for UGC norms.
  5. Does brand value matter? Unfortunately, yes. Internships and post-docs at good places are probably important for this reason. Credibility of profile is very important. Good publications, a reputed post-doc, competitive skills, etc. will help you.
  6. Why should I do an internship?
    1. Brand value to resume.
    2. Learn different styles of writing, working, environment etc.
    3. Exposure to real world.
    4. Adapting to newer problems and people.
    5. Make contacts.
Skill Set
  1. What skill sets are you building? Develop skill sets during PhD period. In this case,
    1. Technical: Feature analysis, Image Segmentation, Noise removal, data enhancement, deep learning libraries, ML, general problem solving, etc.
    2. Managerial: Worked with other students on BTP, IP, individually, etc.
    3. Teaching: TA awards, etc.
    4. Coding: Java, Hadoop, R, etc.
    5. Financial: Acquiring funding – Writing research proposals.
    6. Communication
    7. Networking: In the domain of work, build contacts.
  2. PhD in Computer Science: Implies that you can solve problems in computer science. You are not a PhD in Latent Fingerprint Analysis. Keep this in mind. Think CS, Do CS. Keep learning CS.
  3. Making tangible contributions: Create products, tools, proof of concepts. Publish papers. Pass competitive exams.
Where should I spend my time?
  1. Improve skills you are already good at? Or build new skills? Prioritize. Have a clear map based on the direction you want to take in future. This needs clarity of vision.
  2. Keep honing your skills.
  3. Manage breadth and depth in parallel. Do not get bogged down too much in depth alone.
  4. Presentations are just a tool to communicate your ideas. Do not overspend your time on preparing ppts. Work on your skills and thinking process.
Industry Expectations
  1. Your research topic is “blah blah”. What else have you done apart from this? What skills do you bring? Show a flavor of breadth you bring in. Can you code?
  2. Analyzing a real problem. Typically, a project which the interviewer is part of is presented in the interview as a toy problem, and you are tested on how you would approach such problems. In a way, this tests your “ability to think from scratch”.
Managing Complexity
  1. There are too many things to learn. Too little time with us. Clear thinking, good breadth, analytical skills, presence of mind, and communication skills can help you here.
  2. Do not over defend your work. Every work has its limitations.

More on this… soon.

Publishing in top conferences

What does it take to publish in top conferences? This is a question that comes to every new PhD student’s mind. My short answer is “forget the world, enjoy the problem at hand”. However, here are a few things that you should never neglect:

  1. Don’t subdue your curiosity. Experiment liberally and see the results for yourself.
  2. Whenever you have a very interesting result, know how to write it as a research paper.
  3. Keep reading. Do not underestimate the need for the right vocabulary and knowledge of the state of the art.
  4. Know that you belong to a community. Your community has a style of writing, a focus area, state of the art, benchmarks and so on. Know these and make incremental and interesting contributions.

A much better summary on this is available here. Do check it out!