While browsing some of the faculty profiles, I noticed that their publications were classified into journal and conference papers. It was interesting to understand why we need this bifurcation and what each one is for!
It turns out that journal papers are longer versions that contain details about experiments, research and so on, while conference papers crisply capture the essence of the work, with key supporting data, under a strict page limit. This page is a good read. For someone like me who derives motivation from people who read and listen to my work, conferences are just perfect. It's an opportunity to quickly get my work to the masses and get feedback. At the same time, crisp notes help me digest the state of the art much more quickly and just in time. Journals are probably good for understanding the history of our field, or for when we have a groundbreaking story that's worth the effort and time (and also something that will endure over time and still be impactful!).
Inductive Logic Programming (ILP) systems aim to explain observed behavior using examples. This is a well-researched and well-documented area, with applications ranging from the drug industry to search technologies.
In the paper “Drug Design Using Inductive Logic Programming”, machine learning is used as a pattern recognition technology. The fundamental hypothesis is that you can start with a few examples, look at their common attributes and build a rule based on them. Then you iterate on this idea with more examples and refine the rules. In some sense, I believe this is what “least general generalization” (lgg) (see the ILP literature for more details) refers to.
When I imagine its application to search technologies or information retrieval, there are plenty of opportunities. For instance, consider categorization. Two websites can be categorized as hotels based on a few parameters that we read (such as whether the name contains “Inn”, “Eating Joint” or “Take Away”). Now, with more and more such examples, supervised learning should allow us to construct the best possible rules. Nice knowing ILP!
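To make the idea above concrete, here is a toy sketch of refining a rule from examples. This is only a crude, propositional analogue of ILP's least general generalization, not a real ILP system; the attribute names and examples are made up for illustration.

```python
# Toy analogue of generalizing a rule from positive examples:
# the "rule" is the set of attributes shared by every example.

def generalize(examples):
    """Refine a rule down to the attributes common to all examples."""
    rule = set(examples[0])
    for ex in examples[1:]:
        rule &= set(ex)  # drop attributes not shared by this example
    return rule

def matches(rule, candidate):
    """A candidate satisfies the rule if it has all the rule's attributes."""
    return rule <= set(candidate)

# Hypothetical positive examples of "hotel" websites.
hotels = [
    {"has_rooms", "name_contains_inn", "has_checkin"},
    {"has_rooms", "has_checkin", "has_pool"},
]
rule = generalize(hotels)  # {"has_rooms", "has_checkin"}
print(matches(rule, {"has_rooms", "has_checkin", "has_spa"}))  # True
print(matches(rule, {"has_menu", "has_takeaway"}))             # False
```

Each new example can only shrink the rule, which is the "iterate and refine" loop described above in its simplest form.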
Search engines of this era are very sophisticated. The power of machine learning over massive web-scale data makes them very competitive. However, we still face several open problems that challenge the existing technologies. Some such problems, as I perceive them, are listed below:
- Entity Extraction from the web: Entity attributes such as company name, address, phone, etc. lie in unstructured form and need to be extracted. Even with Web 2.0 and structured semantic pages, this remains a huge problem. Imagine extracting an entity and its information from free-form text or a webpage!
- Entity Deduplication: We could buy data from multiple providers. Consider a news item that says “Tendulkar gets out at 99”. The same news item may come from several providers. If we are a news website such as Yahoo! or MSN, how do we de-duplicate or conflate these news items that convey the same meaning but are authored in different ways?
- Entity Categorization: Given a web page, can you categorize it? The same applies to entities – e.g., Amitabh Bachan is an actor, Ford is an automotive company and so on.
- Popularity: Given two entities, which one is more popular than the other? Can popularity be static? Is Taj Krishna more popular than Westin?
- Semantic Similarity: If the query is “best levis outlet in Hyderabad”, is there a way to cluster all levis outlets in Hyderabad and later rank them?
- Query Alteration: How can queries be altered for the best benefit of overall search experience?
- Query Language: site:blahblah, filetype:pdf are some common annotations that we might have used. What problems do they solve? Is there a better way of solving them? Do we need these?
- Intent Extraction from Queries: How to disambiguate query intent? How to identify multiple query intents?
- Classification: Queries can be of multiple types: Name (as in Airtel), Category (as in telecom providers), Navigational Intent (as in youtube), Task oriented (as in 32 degrees to Fahrenheit) and so on.
- Interplay of distance & popularity: How far can I go for a coffee shop vis-a-vis a university?
- How should local search be measured? NDCG and Precision & Recall have their limitations, and factors like content richness should matter.
- Presentation: Map vs Search Results – Should there be a difference?
- Crowdsourcing local data.
- Noise correction & Quality improvement of local content
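For the deduplication problem above, one classic starting point is comparing word shingles with Jaccard similarity. This is only a minimal sketch with made-up headlines and an arbitrary threshold; real conflation systems use much more (MinHash at scale, entity matching, source metadata).

```python
# Near-duplicate detection for news items via Jaccard similarity
# over word shingles (contiguous word k-grams).

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

s1 = shingles("Tendulkar gets out at 99 in the first innings")
s2 = shingles("Tendulkar gets out at 99 as India struggle")
s3 = shingles("Ford unveils a new automotive plant in Chennai")

# The 0.2 threshold is arbitrary; it would be tuned on labeled pairs.
print(jaccard(s1, s2) > 0.2)  # True  - likely the same story
print(jaccard(s1, s3) > 0.2)  # False - unrelated stories
```

Differently authored versions of the same story still share long runs of words, which is exactly what shingling rewards.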
Knowing the state of the art in this field would be an amazingly interesting experience!
- Start one project or take one unsolved problem
- Define breadth & depth to solve this problem
- Divide the problem; Write on the challenges, motivation, survey, state of the art;
- Experiment on intuitions; write on experiments
- Combine all the work into a thesis. There could still be open, yet-to-be-solved problems.
Create your own projects (simple/small is fine). Examples:
- JS for data visualization over R.
- Identify facts from internet
- Summarize from books
Good to have it hosted / uploaded for visibility
Simple mindmap of the field of data analytics.
Computing the chromatic number is an NP-complete problem. The chromatic number is the minimum number of colors required to color a graph so that no two adjacent vertices have the same color. See MathWorld…
This is my mindmap of what an introduction to graph theory consists of. I will explore some of these topics in the near future.
- Vertex, Path, Connectivity, Trees, Forests, Cycles
- Matching & Coloring
- Chromatic Number
- Based on connectivity: Isomorphic, Null, Bipartite Graphs, Planar (Dual, Infinite) Graphs, k-partite & k-colorable, Directed (Digraph)
- Based on Walks: Eulerian (closed walks that cover every edge), Hamiltonian (cycles that visit every vertex exactly once)
- Based on shapes: Stars, Platonic, etc…
- Euler Tour & Hamilton Cycles
- Hall’s theorem, Euler’s Formula, Brooks’ theorem, matroid theory
- Well-known problems
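As a tiny illustration of the walk-based classes in the mindmap, Euler's theorem says a connected graph has an Euler tour (a closed walk using every edge exactly once) iff every vertex has even degree. A minimal sketch, with made-up example graphs and connectivity assumed rather than checked:

```python
# Euler's theorem: a connected graph has an Euler tour iff
# every vertex has even degree. We only check the degree
# condition here; connectivity is assumed.

def has_euler_tour(adj):
    """True iff every vertex in the adjacency-list graph has even degree."""
    return all(len(neighbors) % 2 == 0 for neighbors in adj.values())

square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # 4-cycle
path = {0: [1], 1: [0, 2], 2: [1]}                     # simple path

print(has_euler_tour(square))  # True: every degree is 2
print(has_euler_tour(path))    # False: the endpoints have odd degree
```

This is the degree-parity argument Euler used on the Königsberg bridges, which is why it sits at the root of the "walks" branch above.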