How to extract and use constraints from source code?

Most static analysis techniques over source code start by constructing an Abstract Syntax Tree (AST) or a call graph. Eclipse JDT has the tools necessary for these tasks. It is straightforward to traverse the AST and collect information about the lines, methods, loops, conditions, variables, etc., used in the source code. Here’s a simple example where you insert your own code wherever Eclipse JDT finds a method declaration node in the AST. If this code template looks unfamiliar to you, read up on the Visitor design pattern. A full example is available at the Vogella site.
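import java.util.ArrayList;
import java.util.List;
import org.eclipse.jdt.core.dom.ASTVisitor;
import org.eclipse.jdt.core.dom.MethodDeclaration;

public class MethodVisitor extends ASTVisitor {
    // Method declarations collected during the traversal.
    List<MethodDeclaration> methods = new ArrayList<>();

    @Override
    public boolean visit(MethodDeclaration node) {
        methods.add(node); // ...your code here... (recording the node is just one option)
        return super.visit(node); // true by default: continue into child nodes
    }
}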
Features thus extracted from source code can be used for a variety of purposes. Here, we discuss the process of converting them to First Order Logic (FOL) constraints. For example, suppose we extracted a predicate IdentifierLength(id, l), where id is an identifier (a fancy term for the name of a variable) and l is the number of characters in it. With such predicates and the standard logical operators, it is straightforward to define FOL constraints.
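To make the predicate concrete, here is a minimal sketch in plain Java (no JDT needed) that builds IdentifierLength(id, l) facts from a list of names and checks one constraint over them. The class name `IdentifierFacts`, the helper `allAtMost` and the constraint itself are assumptions made for illustration; they are not part of JDT or of any solver.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IdentifierFacts {
    /** Builds IdentifierLength(id, l) facts: one (identifier, length) pair per name. */
    static Map<String, Integer> identifierLength(List<String> identifiers) {
        Map<String, Integer> facts = new LinkedHashMap<>();
        for (String id : identifiers) {
            facts.put(id, id.length());
        }
        return facts;
    }

    /** Hypothetical constraint: for all id, IdentifierLength(id, l) implies l <= maxLength. */
    static boolean allAtMost(Map<String, Integer> facts, int maxLength) {
        return facts.values().stream().allMatch(l -> l <= maxLength);
    }

    public static void main(String[] args) {
        // In practice the names would come from the AST visitor, not a literal list.
        Map<String, Integer> facts = identifierLength(List.of("i", "totalCount", "methods"));
        System.out.println(facts);
        System.out.println(allAtMost(facts, 8));
    }
}
```

Here the constraint is evaluated directly in Java; the point of the next step is to hand such constraints to a solver instead.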
The Satisfiability Modulo Theories (SMT) problem lies at the intersection of computer science and mathematics, and deals with such FOL formulae. Think of an SMT instance as an FOL formula; our problem at hand is to find whether a set of such formulae is satisfiable. Although SMT provides a much richer set of tools to model decision problems, to keep things simple, let us limit the discussion to simple problems in the style of Boolean satisfiability (SAT). All we need to do is create such instances and feed them to an SMT solver.
There are many SMT solvers readily available. Z3 is probably one of the most heavily used. There are also higher-level tools built on top of Z3, such as Boogie. The input to Z3 is a simple script (in the SMT-LIB format) where you declare the symbols and assert the FOL formulae. For example,
(declare-const IdentifierLength Int) ; every symbol must be declared before use
(assert (> IdentifierLength 10))
specifies that, for this formula to be true, IdentifierLength must be greater than 10. Let another formula be
(declare-const IdentifierCaseChanges Int)
(assert (< IdentifierCaseChanges 2))
which means that IdentifierCaseChanges should be less than 2. Of course, IdentifierLength and IdentifierCaseChanges are symbols that we define ourselves from information extracted from source code. I talk about source code here, but you may apply the same idea to text as well. Once you have the predicates asserted, just add
(check-sat)
and Z3 will determine whether there is at least one interpretation under which all asserted formulae are true. A full tutorial is available here.
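Putting the pieces together, the sketch below generates a complete script for one identifier. Several things here are assumptions of mine rather than anything Z3 prescribes: the class name `SmtScript`, the definition of a case change as a lowercase letter followed by an uppercase one, and the choice to bind the measured values with equality assertions before adding the two constraints from the text.

```java
class SmtScript {
    /** Assumed definition: number of lowercase-to-uppercase transitions in the name. */
    static int caseChanges(String id) {
        int changes = 0;
        for (int i = 1; i < id.length(); i++) {
            if (Character.isLowerCase(id.charAt(i - 1)) && Character.isUpperCase(id.charAt(i))) {
                changes++;
            }
        }
        return changes;
    }

    /** Emits an SMT-LIB script that binds the measured values and adds the two constraints. */
    static String forIdentifier(String id) {
        return String.join("\n",
            "(declare-const IdentifierLength Int)",
            "(declare-const IdentifierCaseChanges Int)",
            "; facts measured from the identifier " + id,
            "(assert (= IdentifierLength " + id.length() + "))",
            "(assert (= IdentifierCaseChanges " + caseChanges(id) + "))",
            "; constraints from the text",
            "(assert (> IdentifierLength 10))",
            "(assert (< IdentifierCaseChanges 2))",
            "(check-sat)");
    }

    public static void main(String[] args) {
        System.out.println(forIdentifier("identifierLength"));
    }
}
```

Saving the printed script to a file and passing that file to Z3 should report sat when the identifier meets both constraints and unsat otherwise.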
So, next time, when you are hunting for a class project, try this out! It should work for courses like Artificial Intelligence, Intelligent Systems, Program Analysis, Information Retrieval, Software Engineering, and so on! Of course, talk to your instructor though 🙂

How to title your thesis?

Four simple rules to keep in mind while naming your thesis are:

  1. Avoid redundancy.
  2. The title can be broader than your work, but never narrower.
  3. A title worthy of a survey paper is a good one.
  4. Complete, catchy and crisp.

Following is one approach to arrive at a title:

  1. List the connecting ideas that determine your work. Usually there are three to four ideas. For instance, I:
    1. Improve code search.
    2. Leverage naturalness of source code.
    3. Use natural language descriptions around source code.
  2. See if any of these are too narrow. If yes, make them broader. For instance,
    1. “Natural language description” is a highly specialized form of “documentation”. In other words, documentation can be in any format. So, let us make it “Use documentation”.
  3. Look at survey titles in your area of research to find some naming styles. I went to Google Scholar and tried the query “TSE code search survey”. In my case, here are some examples that I liked:
    1. Feature location in source code: A taxonomy and survey.
    2. A survey of software reuse libraries
    3. Exemplar: A source code search engine for finding highly relevant applications
    4. Comparing two methods of sending out questionnaires; E-mail versus mail
    5. Tracelet-based code search in executables
    6. … and so on
  4. Now, the third one looks like an extension of a single conference paper idea. So, I drop it. For the rest, I abstract and note down the styles as follows:
    1. X in Y: A taxonomy and survey.
    2. A survey of X.
    3. Comparing two methods of X; x1 versus x2.
    4. X-based Y in Z.
    5. X’ing Y-based applications via automated combination of Z techniques.
    6. Learning from X to improve Y.
    7. Comparison and evaluation of X tools and techniques: A qualitative approach.
    8. X based recommendation for Y.
    9. Effective X based on Y model.
    10. Exploring the X patterns of Y in Z.
    11. … and so on.
  5. Ok! There are a lot. So, let us find which of these abstractions suit us. Clearly, I do no comparative evaluation, so the comparison styles won’t suit me. I have to combine the key ideas of “software engineering applications”, “modeling source code”, “using documentation”, “leveraging naturalness” and “code search”. So, let us narrow down and look for such patterns:
    1. Leveraging documentation and exploiting the naturalness of source code in improving code search. (too long)
    2. Enhanced retrieval of source code by leveraging big code and big data. (too heavy – big code, big data, retrieval)
    3. Enhancing code search by automatically mining related documentation. (not bad but too simple).
    4. Improving code search using relevant documentation (much better than 3 but still simple).
    5. Exploiting retrieval models for analysis of source code. (sounds good)
    6. Models of source code to support retrieval based applications.
    7. Leveraging naturalness and relevant documentation in source code representations.
    8. Source code representations for search.  (too short – misses key points)
    9. Improving code search using retrieval models.
    10. Adapting text retrieval models for analysis of source code: Benefits and Challenges.
  6. Note that the above step makes me ask: what exactly am I doing?
    1. There is an implied priority in the order of phrases. For example, in “Models of source code to support retrieval based applications”, the emphasis is more on modeling source code. Naturally, it is expected that the survey will cover state of the art code models. This fits my work. In “Adapting text retrieval models for analysis of source code”, it sounds like I am going to cover text retrieval models in depth, and perhaps no source code models. I do both to some extent actually!
  7. Let us now pick a few and think deeper. To aid our work, let’s group our ideas as perspectives.
    1. Perspectives on modeling source code
      1. Models of source code to support retrieval based applications.
      2. Source code representations for search.
    2. IR perspective
      1. Improving code search using retrieval models.
      2. Enhancing code search by automatically mining related documentation.
      3. Building retrieval based applications by leveraging naturalness in source code.
    3. Naturalness perspective
      1. Leveraging statistical properties of source code in improving code search.
      2. Leveraging statistical properties of source code in retrieval (based applications).
      3. Leveraging statistical properties of source code for effective code search.
      4. Leveraging naturalness of source code in building retrieval based applications.
    4. Intelligence perspective
      1. Knowledge discovery from Big Code and relevant documentation.
      2. Leveraging large scale source code repositories for building search-based applications.
  8. Ok! So, what should I do now? The best way to go ahead would be to discuss this with a few people around me and decide which one I would be most comfortable with.

Good luck!

Doing a PhD

PhD students often have several questions about conducting research, job opportunities after a PhD, etc. Having talked to several students, professors and researchers, here is a compilation of the wisdom I gathered along these lines. There is no specific right answer and there are always exceptions. So, take these with caution. Also, most of these apply to a computer science, big data, data science or ML kind of background.
Research
  1. Positioning: Typically, the inverted triangle approach is followed to find research gaps and select an area to focus on. As an example, here’s how a colleague of mine shaped his work during his PhD: Image Analysis –> Biometrics –> Fingerprint recognition –> Latent Fingerprint Analysis. Note that there may be many ways to draw the hierarchy that leads to Latent Fingerprint Analysis. There is no rule or right way to select one of them. However, having clarity on this hierarchy is important for a few reasons:
    1. After PhD, how would you sell yourself? As a Latent Fingerprint Analysis expert? That is too narrow to find job opportunities. How about Fingerprint Recognition expertise? Still too narrow. Our country may not have sufficient job opportunities. Much broader levels may work, but it is still hard. Moreover, at much broader levels, how good are we? So, as much as we gain depth in our research field, a solid breadth is also required. Moral: Be a domain expert and an area expert, not just a problem expert.
    2. Finding right problems to solve. Time is too short to focus on everything.
  2. Dependence on Advisor: Be independent. It is your PhD. PhD is all about training you to be an independent researcher.
  3. PhD Training: PhD is all about training yourself for independent research. Doing high quality research requires skills in terms of:
    1. Area survey.
    2. Finding the right problem.
    3. Literature review.
    4. Problem definition.
    5. Solution approach.
    6. … all sections of the paper.
  4. Timing for Job Application: At least 6 months go into the application process if you are applying to academia. Keep an eye on the requirements. xx conf papers, yy journal papers, zz TRs are important for UGC norms.
  5. Does brand value matter? Unfortunately, yes. Internship and post-docs at good places are probably important for this reason. Credibility of profile is very important. Good publications, a reputed post-doc, competitive skills, etc will help you.
  6. Why should I do an internship?
    1. Brand value to resume.
    2. Learn different styles of writing, working, environment etc.
    3. Exposure to real world.
    4. Adapting to newer problems and people.
    5. Make contacts.
Skill Set
  1. What skill sets are you building? Develop skill sets during PhD period. In this case,
    1. Technical: Feature analysis, Image Segmentation, Noise removal, data enhancement, deep learning libraries, ML, general problem solving, etc.
    2. Managerial: Worked with other students on BTP, IP, individually, etc.
    3. Teaching: TA awards, etc.
    4. Coding: Java, Hadoop, R, etc.
    5. Financial: Acquiring funding – Writing research proposals.
    6. Communication
    7. Networking: In the domain of work, build contacts.
  2. PhD in Computer Science: Implies that you can solve problems in computer science. You are not a PhD in Latent Fingerprint Analysis. Keep this in mind. Think CS, Do CS. Keep learning CS.
  3. Making tangible contributions: Create products, tools, proof of concepts. Publish papers. Pass competitive exams.
Where should I spend my time?
  1. Improve the skills you are already good at, or build new ones? Prioritize. Have a clear map based on the direction you want to take in future. This needs clarity of vision.
  2. Keep honing your skills.
  3. Manage breadth and depth in parallel. Do not get bogged down too much in depth alone.
  4. Presentations are just a tool to communicate your ideas. Do not overspend your time on preparing ppts. Work on your skills and thinking process.
Industry Expectations
  1. Your research topic is “blah blah”. What else have you done apart from this? What skills do you bring? Show a flavor of breadth you bring in. Can you code?
  2. Analyzing a real problem. Typically, a project which the interviewer is part of, is presented in interview as a toy problem and you are tested on how you would approach such problems. In a way, this tests your “ability to think from scratch”.
Managing Complexity
  1. There are too many things to learn. Too little time with us. Clear thinking, good breadth, analytical skills, presence of mind, and communication skills can help you here.
  2. Do not over defend your work. Every work has its limitations.

More on this… soon.

Publishing in top conferences

What does it take to publish in top conferences? It is a question that comes to every new PhD student’s mind. My short answer is “forget the world, enjoy the problem at hand”. However, here are a few things that you should never neglect:

  1. Don’t subdue your curiosity. Experiment liberally and see the results for yourself.
  2. Whenever you have a very interesting result, know how to write it as a research paper.
  3. Keep reading. Do not under-estimate the need for right vocabulary and knowledge of the state of the art.
  4. Know that you belong to a community. Your community has a style of writing, a focus area, state of the art, benchmarks and so on. Know these and make incremental and interesting contributions.

A much better summary on this is available here. Do check it out!

Success as a PhD student

A PhD is different from any other academic venture. To enjoy a PhD, you must have an agenda, a purpose behind what you do. In my opinion, experiencing academic research at a professional level is a good enough reason (and there exist many more good reasons) to do a PhD. I believe the real treasure of knowledge is in academia, and hence will focus on this as the reason to do a PhD. As an academic, you need to start enjoying student interactions, delivering classes, learning things more deeply than is strictly necessary, and trying out interesting ideas. In short, it’s apparently about “Learn, Do and Distribute Knowledge”.

To learn effectively, start with the motivation: why is the topic of study important? If it were not important, you would not be studying it. Since it is important, not knowing why it is so important is a crime. Knowing the applications of a subject gives us a thinking scheme that syncs well with the subject. With sufficient motivation, the selection of reading resources becomes the second key factor. Depending upon the level of reading, there are several pieces of advice here. At least, do not spend more time accumulating resources than reading them. At the same time, ensure you study the most popular or most cited material first. These are the ones you will want to talk about, given an opportunity. It’s ok to study little, but it’s mandatory to study well.

To do things effectively, ensure you have a schedule. Start with the easy things on your to-do list and work up to the hard ones. I was just telling a friend today that, just like our body, the mind also needs exercise (and warm-ups) to function at its full potential. Do not worry about making impressions. When we try to impress someone, we end up doing several unnecessary things, and that only diverts us from our pursuit of knowledge. Each one of us may have our own ways to waste time. While I see some sleeping it off, some just try to do things when their mind is not at its best. Some play while others chat. Facebook, movies and so many such distractions. At times, relaxation is also important. At least be aware of how you spend your time and ensure you are happy about the way you distribute your most valuable resource: time.

Knowledge distribution typically happens through writing and teaching. Teaching is an art. Only the most passionate can teach very well. Being knowledgeable is not a sufficient condition to teach. Sometimes, being extremely knowledgeable is not even a necessary condition. To take several minds to an insight, all it takes is to show them the path. Of course, knowledge is certainly important and in most cases helps in teaching effectively. The best forms of writing keep deep roots in existing knowledge. Hence, good deep study, communication skills, passion to teach, patience to listen to students and willingness to design courses are important for success. With YouTube and online MOOCs getting popular, knowledge distribution is finding more interesting and far-reaching mechanisms.

Obviously, a PhD requires going beyond the surface level to know what’s happening. Deeper knowledge is hard to gain. To gain it, it’s important that you feel relaxed and not stressed. You need to make study a continuous process. Take time to think. James Hayton talks about the need to treat study with the same importance as though it were a highly paid job. I tend to agree. He talks about the need to be fit and focussed. Eat well and sleep well. These prepare you for a long tenure as a good student, scholar and teacher. Remember, to attain success you don’t have to be extraordinarily smart. You just need to have passion and persistence. Good luck!

PhD or Not – That is the question!

Many people wonder if they should pursue a PhD or just take up a job. After all, in today’s world, where information is freely available (the internet: Wikipedia, newspapers, magazines, etc.), the learning process has accelerated and tools such as laptops have made life comfortable at every stage. In a competitive world, if you have proved yourself with a good master’s level education at a nice institute or managed to get a decent job in the area of your liking, why do a PhD?!

I had a similar opinion when I was doing my Masters. Later, I went on to work at several leading organizations (in my area of interest) like Microsoft and Yahoo. I met two different kinds of people. The first kind derived pleasure from discipline, money-making, traveling, spending time with family, etc. Such people had no great ambition regarding work or work-related activities, even though they were sincere and capable. Their passion and ambition was to earn a good living, support their near and dear ones financially and lead a safe and decent life. I too belonged to this kind for several years. Over time, I started meeting some people who possessed extraordinary skills or knowledge or both. These were people who had sacrificed their life for “one” thing. Most famous people belonged to this group. They too supported their families very well. Many of them even had more wealth than the “working” class. Many of them traveled more than the working class. More than anything, the second kind seemed to be content and happy with their activities, people, environment, etc. Of course, it’s not all that rosy, and there are people who sacrificed themselves to something and never had the potential or luck to make it big.

A PhD is one platform to focus on such “one” thing. It gives you an opportunity to “be” someone. You are supposed to be a knowledge source (and not a sink) if you have a doctorate. Too many problems that the human race faces have no identified solution. Someone has got to work on them and contribute to the betterment.

You should not do a PhD for the sake of finding a better job or living. You should do a PhD when you find a passion to solve a problem or answer a question. Or at the least, you should pursue a PhD when you believe you will enjoy solving “one” awesome problem all by yourself. Even in industry, problems are solved. Many new inventions are made. However, your part in these inventions will be close to negligible. You will have duties to discharge. In a PhD, your duty is to solve the “one” problem that you yourself have located. Trust me, it’s fun.

This is not to say that everyone with a passion for problem solving should do a PhD. The timing of joining a PhD program is crucial. You should have some area of your study or work that impressed you to a great extent. You should see lots of problems in that area. You should be unhappy with the current state. Along with that, you should have the luxury to take a few years off from your valuable time and be prepared for a low income. Typically, the least loss of income happens early in your career or just after your Masters. Hence, I believe, there are lots of PhD conversions after Masters. However, joining a PhD without sufficient passion and potential will only put you into the wrong topic, with the wrong advisor and in the wrong institute. Guess what, you will not find it worthwhile.

Mathematics in Research

Mathematical modelling has become very mature! More mature than any other field of science, to be honest. Today, I was reading about extracting and understanding “tables” (yes, the tr, td stuff, but not necessarily limited to HTML) in text. There, I came across Wang’s notation, which represents a table as an ordered pair (C, delta), where C is the definition (read: header) and delta carries the values. Here’s an example table:

(Car, (compact, null), (luxury, null))
delta(Car.compact) = “Nano”.
delta(Car.luxury) = “BMW”.

Embley’s notation is as follows:

(
(Car,compact,Nano)
(Car,luxury,BMW)
)
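As a rough sketch of how the two notations relate, the Java below stores the example as a pair (C, delta) and flattens it into (category, label, value) triples. It follows the notations only as they are written out in this post; the class `WangTable`, its field names and the dotted key format are my own assumptions, not definitions taken from either author.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WangTable {
    // C: the header definition, here a category mapped to its sub-labels.
    final Map<String, List<String>> categories = new LinkedHashMap<>();
    // delta: maps a qualified label such as "Car.compact" to its value.
    final Map<String, String> delta = new LinkedHashMap<>();

    /** Flattens (C, delta) into (category, label, value) triples. */
    List<List<String>> toTriples() {
        List<List<String>> triples = new ArrayList<>();
        for (Map.Entry<String, List<String>> category : categories.entrySet()) {
            for (String label : category.getValue()) {
                // A label without a value yields null in the triple.
                String value = delta.get(category.getKey() + "." + label);
                triples.add(Arrays.asList(category.getKey(), label, value));
            }
        }
        return triples;
    }

    /** The Car example from the text. */
    static WangTable carExample() {
        WangTable table = new WangTable();
        table.categories.put("Car", List.of("compact", "luxury"));
        table.delta.put("Car.compact", "Nano");
        table.delta.put("Car.luxury", "BMW");
        return table;
    }

    public static void main(String[] args) {
        System.out.println(carExample().toTriples());
    }
}
```

Running `main` prints the two triples (Car, compact, Nano) and (Car, luxury, BMW), which is the same content as the second notation.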

Vectors, matrices and sets are fairly common general-purpose modelling tools. For people who have lost touch with basic mathematics, it’s hard initially to get the hang of these and think mathematically. Most high-level models are based on them. For instance, consider probabilistic models such as the Hidden Markov Model (HMM), which is defined using set notation. Every researcher (in computer science, to the best of my knowledge) ends up using these models.

I see some people expressing their hatred of mathematics. Mathematics, as it is introduced in primary and high schools, is described without a purpose. Thus, it gets hard to appreciate the motivation to learn it. Only when I started doing research did I find that maths provides methods to deal with abstractions in a mature way. For instance, Singular Value Decomposition helps in reducing matrices to compact low-rank approximations. This is a valuable tool in search technologies. Without it, we would be reinventing the same wheel.
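For readers who want to see what the reduction looks like, here is the standard statement, written for a generic real m × n matrix A (the symbols are the usual textbook ones, not notation from this post). Keeping only the k largest singular values gives the best rank-k approximation of A in the Frobenius norm, which is the idea behind using a truncated SVD on a term-document matrix in search.

```latex
% Singular Value Decomposition of a real m x n matrix A of rank r
A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Rank-k truncation (k < r): keep only the k largest singular values
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}

% Error of the truncation in the Frobenius norm
\lVert A - A_k \rVert_F^2 = \sum_{i=k+1}^{r} \sigma_i^2
```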

I will remain a learner of mathematics for life and keep reading about its advances. I must keep in mind that without strong fundamentals, higher-order mathematics does not make any sense. For example, if you do not understand the law of large numbers, probability makes no sense. If you don’t understand probability, randomized algorithms make no sense. If you do not understand probability and randomized algorithms, parts of search technologies and information retrieval will not make much sense.

A very happy new year to you! My resolution this year is to enjoy the beauty of mathematics by giving it the time it deserves.