How to study?

Some notes on student success:

  • Someone’s got to tell you money is not extremely important, at least for now.
  • Success repeats.
  • There is one quality in all achievers, “Gratitude”!
  • Success is not bought. It is built, little by little.
  • Phased repetition is more important than a one-time slog.
  • Focus. Have a direction and stick to it.
  • The first 15 hours or so of study give you an idea; they do not make you an expert.
  • After the first 100 hours of preparation, you will realize how little you know. Be patient and confident. Stick to your direction.
  • Practice makes perfect. There is the well-known 10-year rule. Listen to Angela Lee Duckworth.
  • The mind works at a different speed. Writing perhaps slows you down enough to give your mind the time to think. So, write down important things. Revise.
  • When you can no longer think, stop reading. Take a break. Aptitude is what you are training, not memory.
  • Most key events such as exams happen in the morning hours. So, keep your body cycle such that you are at your best during these hours.
  • Find pleasure in the process of achievement, and not in the achievement itself. That way, you will have many hours of satisfaction instead of just a few moments.

Practical challenges in computing recall

Most experiments are designed on a controlled corpus, i.e., one whose precision and recall are already known, established either manually or through some other means (not the experimental tool/automation itself). Such corpora are typically smaller samples of the real corpus, so an oracle can be implemented to compute recall. Sampling works in most cases. However, it has its own limitations. For example, samples can suffer from a serious threat to validity: with another sample, the results could be different. Creating several large samples under several circumstances could be infeasible. Let us review some techniques researchers follow in these contexts to compute recall.
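The oracle idea above can be sketched in a few lines: once a (sampled) gold set of relevant items is known, precision and recall are simple set operations. This is a minimal illustration with made-up document IDs, not any particular tool's implementation.

```python
# Sketch: computing precision and recall against a manually built gold set.
# The document IDs below are hypothetical example data.

def precision_recall(retrieved, gold_set):
    """Precision = fraction of retrieved items that are relevant;
    recall = fraction of relevant (gold) items that were retrieved."""
    retrieved = set(retrieved)
    gold_set = set(gold_set)
    hits = retrieved & gold_set
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold_set) if gold_set else 0.0
    return precision, recall

# Suppose the oracle marks 4 documents as relevant, and the tool
# retrieves 5 documents, 3 of which are in the gold set.
p, r = precision_recall(["d1", "d2", "d3", "d9", "d10"],
                        ["d1", "d2", "d3", "d4"])
# p == 0.6 (3/5), r == 0.75 (3/4)
```

Note that recall is computable here only because the gold set is complete for the sample; on the full, unlabeled corpus the denominator of recall is unknown, which is exactly the problem the techniques below work around.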

Gold Sets or Benchmarks

Using an existing benchmark: One way to address this issue is to use a carefully selected representative dataset such as SF100. While such a dataset is large and unbiased, it may still be too large for certain recall computation tasks. Such benchmarks are also referred to as a “Gold Set”. Moreover, such benchmarks are rare and specialized enough that they may not suit your purpose all the time.

Creating your own benchmark: In Shepherd et al.’s paper “Using Natural Language Program Analysis to Locate and Understand Action-Oriented Concerns”, the authors hired a new person to prepare the gold set along with the relevant results. Another person then verified the results, and the two discussed and reconciled wherever there were disagreements. This gold set was then released to the community.

Comparative Evaluation instead of Recall

In papers such as “Improving Bug Localization using Structured Information Retrieval”, a comparative result is given instead of recall: the authors claim that their approach finds x% more bugs than another tool.


If only one result is expected, computing MRR (Mean Reciprocal Rank) is more appropriate than recall. It is also easier to do so over the top-10 or top-k results.
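MRR over top-k results can be sketched as follows: for each query, score 1/rank of the first relevant result within the top-k (0 if it is absent), then average across queries. The queries and ranked lists here are hypothetical example data.

```python
# Sketch: Mean Reciprocal Rank (MRR) over top-k result lists,
# assuming exactly one expected answer per query (hypothetical data).

def mean_reciprocal_rank(ranked_lists, expected, k=10):
    """Average of 1/rank of the expected answer within the top-k
    results of each query; a miss contributes 0."""
    total = 0.0
    for results, answer in zip(ranked_lists, expected):
        top_k = results[:k]
        if answer in top_k:
            total += 1.0 / (top_k.index(answer) + 1)  # ranks start at 1
    return total / len(ranked_lists)

# Three queries: answers found at rank 1, rank 2, and not in the top-10.
mrr = mean_reciprocal_rank(
    [["a", "b"], ["x", "y", "z"], ["p", "q"]],
    ["a", "y", "missing"])
# mrr == (1 + 0.5 + 0) / 3 == 0.5
```

Unlike recall, this needs no complete gold set: only the single expected answer per query must be known, which is why it is practical for top-k evaluations.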

More on this …soon.



How to title your thesis?

Four simple rules to keep in mind while naming your thesis are:

  1. Avoid redundancy.
  2. The title can be broader than the work, but never narrower.
  3. A title that could serve as the title of a survey paper is a good sign.
  4. Complete, catchy and crisp.

Following is one approach to arrive at a title:

  1. List the connecting ideas that define your work. Usually there are three or four ideas. For instance, I:
    1. Improve code search.
    2. Leverage naturalness of source code.
    3. Use natural language descriptions around source code.
  2. See if any of these are too narrow. If yes, make them broader. For instance,
    1. “natural language descriptions” are a highly specialized form of “documentation”. In other words, documentation can be in any format. So, let us make it “Use documentation”.
  3. Look at survey titles in your area of research to find some naming styles. I went to Google Scholar and tried the query “TSE code search survey”. In my case, here are some examples that I liked:
    1. Feature location in source code: A taxonomy and survey.
    2. A survey of software reuse libraries
    3. Exemplar: A source code search engine for finding highly relevant applications
    4. Comparing two methods of sending out questionnaires; E-mail versus mail
    5. Tracelet-based code search in executables
    6. … and so on
  4. Now, the third one looks like an extension of a single conference paper idea. So, I drop it. For the rest, I abstract and note down the styles as follows:
    1. X in Y: A taxonomy and survey.
    2. A survey of X.
    3. Comparing two methods of X; x1 versus x2.
    4. X-based Y in Z.
    5. X’ing Y-based applications via automated combination of Z techniques.
    6. Learning from X to improve Y.
    7. Comparison and evaluation of X tools and techniques: A qualitative approach.
    8. X based recommendation for Y.
    9. Effective X based on Y model.
    10. Exploring the X patterns of Y in Z.
    11. … and so on.
  5. Ok! There are a lot. So, let us find which of these abstractions will suit us. Clearly, I do no comparative evaluation, so that style won’t suit me. I have to combine the key ideas of “software engineering applications”, “modeling source code”, “using documentation”, “leveraging naturalness” and “code search”. So, let us narrow down and look for such patterns:
    1. Leveraging documentation and exploiting the naturalness of source code in improving code search. (too long)
    2. Enhanced retrieval of source code by leveraging big code and big data. (too heavy – big code, big data, retrieval)
    3. Enhancing code search by automatically mining related documentation. (not bad but too simple).
    4. Improving code search using relevant documentation (much better than 3 but still simple).
    5. Exploiting retrieval models for analysis of source code. (sounds good)
    6. Models of source code to support retrieval based applications.
    7. Leveraging naturalness and relevant documentation in source code representations.
    8. Source code representations for search.  (too short – misses key points)
    9. Improving code search using retrieval models.
    10. Adapting text retrieval models for analysis of source code: Benefits and Challenges.
  6. Note that the above step makes me think about what exactly I am doing.
    1. There is an implied priority in the order of phrases. For example, in “Models of source code to support retrieval based applications”, the emphasis is more on modeling source code. Naturally, it is expected that the survey will cover state-of-the-art code models. This fits my work. In “Adapting text retrieval models for analysis of source code”, it sounds like I am going to cover text retrieval models in depth, and perhaps no source code models. Actually, I do both to some extent!
  7. Let us now pick a few and think deeper. To aid our work, let’s group our ideas as perspectives.
    1. Perspectives on modeling source code
      1. Models of source code to support retrieval based applications.
      2. Source code representations for search.
    2. IR perspective
      1. Improving code search using retrieval models.
      2. Enhancing code search by automatically mining related documentation.
      3. Building retrieval based applications by leveraging naturalness in source code.
    3. Naturalness perspective
      1. Leveraging statistical properties of source code in improving code search.
      2. Leveraging statistical properties of source code in retrieval (based applications).
      3. Leveraging statistical properties of source code for effective code search.
      4. Leveraging naturalness of source code in building retrieval based applications.
    4. Intelligence perspective
      1. Knowledge discovery from Big Code and relevant documentation.
      2. Leveraging large scale source code repositories for building search-based applications.
  8. Ok! So, what should I do now? The best way forward is to discuss this with a few people around me and decide which one I would be most comfortable with.

Good luck!

Doing a PhD

PhD students often have several questions about conducting research, job opportunities after the PhD, and so on. Having talked to several students, professors and researchers, here is a compilation of the wisdom gathered on these lines. There is no single right answer and there are always exceptions, so take these with caution. Also, most of these apply to a computer science, big data, data science, or ML kind of background.
  1. Positioning: Typically, the inverted-triangle approach is followed to find research gaps and select an area to focus on. As an example, here’s how a colleague of mine shaped his work during his PhD: Image Analysis –> Biometrics –> Fingerprint Recognition –> Latent Fingerprint Analysis. Note that there may be many ways to draw a hierarchy that reaches Latent Fingerprint Analysis. There is no rule or single right way to select one of them. However, having clarity on this hierarchy is important for a few reasons:
    1. After the PhD, how would you sell yourself? As a Latent Fingerprint Analysis expert? That is too narrow to find job opportunities. How about Fingerprint Recognition expertise? Still too narrow; our country may not have sufficient job opportunities. Much broader levels may work, but it is still hard. Moreover, at much broader levels, how good are we? So, as much as we gain depth in our research field, a solid breadth is also required. Moral: Be a domain expert and an area expert, not just a problem expert.
    2. Finding the right problems to solve. Time is too short to focus on everything.
  2. Dependence on Advisor: Be independent. It is your PhD. PhD is all about training you to be an independent researcher.
  3. PhD Training: PhD is all about training yourself for independent research. Doing high quality research requires skills in terms of:
    1. Area survey.
    2. Finding the right problem.
    3. Literature review.
    4. Problem definition.
    5. Solution approach.
    6. … all sections of the paper.
  4. Timing for Job Applications: At least 6 months go into the application process if you are applying to academia. Keep an eye on the requirements: xx conf papers, yy journal papers, zz TRs are important for UGC norms.
  5. Does brand value matter? Unfortunately, yes. Internship and post-docs at good places are probably important for this reason. Credibility of profile is very important. Good publications, a reputed post-doc, competitive skills, etc will help you.
  6. Why should I do internship?
    1. Brand value to resume.
    2. Learn different styles of writing, working, environment etc.
    3. Exposure to real world.
    4. Adapting to newer problems and people.
    5. Make contacts.
Skill Set
  1. What skill sets are you building? Develop skill sets during PhD period. In this case,
    1. Technical: Feature analysis, Image Segmentation, Noise removal, data enhancement, deep learning libraries, ML, general problem solving, etc.
    2. Managerial: Worked with other students on BTP, IP, individually, etc.
    3. Teaching: TA awards, etc.
    4. Coding: Java, Hadoop, R, etc.
    5. Financial: Acquiring funding – Writing research proposals.
    6. Communication
    7. Networking: In the domain of work, build contacts.
  2. PhD in Computer Science: Implies that you can solve problems in computer science. You are not a PhD in Latent Fingerprint Analysis. Keep this in mind. Think CS, Do CS. Keep learning CS.
  3. Making tangible contributions: Create products, tools, proof of concepts. Publish papers. Pass competitive exams.
Where should I spend my time?
  1. Improve the skills you are already good at? Or build new skills? Prioritize. Have a clear map based on the direction you want to take in the future. This needs clarity of vision.
  2. Keep honing your skills.
  3. Manage breadth and depth in parallel. Do not get bogged down too much in depth alone.
  4. Presentations are just a tool to communicate your ideas. Do not overspend your time on preparing ppts. Work on your skills and thinking process.
Industry Expectations
  1. Your research topic is “blah blah”. What else have you done apart from this? What skills do you bring? Show a flavor of breadth you bring in. Can you code?
  2. Analyzing a real problem. Typically, a project the interviewer is part of is presented in the interview as a toy problem, and you are tested on how you would approach such problems. In a way, this tests your “ability to think from scratch”.
Managing Complexity
  1. There are too many things to learn. Too little time with us. Clear thinking, good breadth, analytical skills, presence of mind, and communication skills can help you here.
  2. Do not over defend your work. Every work has its limitations.

More on this… soon.

Moret’s “Theory of Computation” has a beautiful “Introduction”

Bernard M. Moret’s book “The Theory of Computation” has the best Introduction I have ever read in any book. It has everything an Introduction should have: a chronological survey of key problems solved, the key challenges, how these challenges can be categorised, and what came out of them. After reading the intro, I can’t resist reading the rest of the book.

I was surprised to learn how mathematicians and logicians sowed the seeds of the current digital era. The work of Gödel, Church and Turing, for instance, is beyond doubt a must-know for every “Computer” scientist. If the theory of complexity and the theory of computability can be developed so nicely, there is no research in the world that cannot be well analyzed and presented.

Searching for UI

Over the last few months, I have been studying code search. A beautiful application of code search is the work done by Steven P. Reiss of Brown University on searching for user interfaces. He searches for the UI structure and APIs using Java and Swing/AWT knowledge. Further, he applies some transformations to avoid duplicates and to score the search results. The basic idea is to extract UI code from the search results (from Ohloh, GitHub, etc.) and build a new class file with standard identifier naming conventions. More transformations are applied to clean the code. For example, data providers are replaced with dummy providers. This transformed code is made compilable and then scored. Since the code is compilable, the user interfaces can now be shown as search results! What a beauty! Now, you can search for user interfaces and view the results as UI snapshots. A very neat idea, well experimented. He finds 79 relevant usable results for the query “name mail phone jlist”. The system is available online to play around with.

231 Pages – Cover to Cover – Read in one sitting – 1.5 hours! You can also do it.

Today, I read the book “How to Read Better and Faster” (author: Norman Lewis) in one sitting, within about one and a half hours. I have always believed that I am a slow reader and that I need to increase my reading speed. This book caught my eye at my college library. Today was a peaceful day, with almost the entire college taking the day off. So, I decided to have a go at it.
The comfort of my lab and a purpose to read gave me the initial fuel necessary for speed reading. I looked for the central ideas on speed reading. I felt that I synced up with the author’s style of writing very quickly, perhaps in the first chapter itself. Thankfully, the author kept to the same writing style and used very lucid vocabulary. The beginning of each chapter summarized the earlier chapters and hinted at what to expect in that chapter. Clear titles and subtitles made the speed reading easier. I picked up the concept of “no regression” from this book. This refers to the practice of not going back to re-read something that I believe I have missed. Further, I picked up the concept of “varying the pace” of my reading. I read the initial summary a little more slowly than the middle of the book. I saw the author cajoling, threatening, and providing lots of evidence whenever he made a point. I was least interested in this part. So, I skimmed through the middle of chapters and slowed down in areas where some concepts were emphasized.
When my purpose was clear, I noticed that I did not perform “lip movements” or “hearing the words”, which the author claims are deterrent habits for speed reading. Whenever I felt that I had lost the author, I saw this happening. In this book, there were hardly one or two occasions on which I had such problems. But it is not uncommon for me with other books (novels, etc.). Perhaps the point is that as soon as we sync up with the author’s writing style and vocabulary and get a basic idea of the topic, we should be able to read better and faster. If I feel that I do too much lip movement and hearing of words, I should just stop and think about what is wrong or missing in my understanding instead of forcing myself to complete the reading.
A key point that the author emphasized was comprehension. There is no point in wasting our time reading something that does not add any value to us or that cannot extend our knowledge. In this context, it is important to mix the above techniques appropriately. Always read with a purpose. Decide what to gain and what to leave. Know whether you are reading for pleasure or for business.
The author also emphasized practice. From my experience, I know that I am a slow learner. I always take time to learn; I excel in my second attempt. In my undergrad days, I used to wonder how my brilliant classmates could understand the material on the first attempt! Now, I realize it is just my nature. I must skim first and get the idea, then read again and again to master the thoughts. However, I understand that it now takes me fewer iterations to gain real mastery over a subject. My comprehension came a long way to help me with the study/research rigor of my current PhD. I realize that what matters is not how much we can learn in a short while, but how much we can retain for a long while! Iterations of learning help much more than the read-forget cycle, which unfortunately is heavily encouraged by our school education system.
For someone like me, whose daily life includes a massive amount of reading, it is important to structure one’s reading behavior. Time is, of course, limited, and we want to make the most of every minute spent reading. I strongly recommend that anyone who feels their reading habits are inefficient read this book, “How to Read Better and Faster”. Happy reading!