Search engines of this era are very sophisticated. The power of machine learning applied over massive, web-scale data makes them very competitive. However, several open problems still challenge existing technology. Some such problems, as I perceive them, are listed below:
- Entity Extraction from the web: Entity attributes such as company name, address, phone, etc. lie in unstructured form and need to be extracted. Even with Web 2.0 and structured semantic pages, this remains a huge problem. Imagine extracting an entity and its attributes from free-form text or a webpage!
- Entity Deduplication: We could buy data from multiple providers. Consider a news item that says “Tendulkar gets out at 99”. The same news item may come from several providers. If we run a news website such as Yahoo! or MSN, how do we de-duplicate or conflate these news items, which convey the same meaning but are authored in different ways?
- Entity Categorization: Given a web page, can you categorize it? The same applies to entities: e.g., Amitabh Bachchan is an actor, Ford is an automaker, and so on.
- Popularity: Given two entities, which one is more popular? Can popularity be static? Is the Taj Krishna more popular than the Westin?
- Semantic Similarity: If the query is “best levis outlet in Hyderabad”, is there a way to cluster all levis outlets in Hyderabad and later rank them?
- Query Alteration: How can queries be altered for the best benefit of overall search experience?
- Query Language: site:blahblah, filetype:pdf are some common annotations that we might have used. What problems do they solve? Is there a better way of solving them? Do we need these?
- Intent Extraction from Queries: How to disambiguate query intent? How to identify multiple query intents?
- Classification: Queries can be of multiple types: name (as in Airtel), category (as in telecom providers), navigational intent (as in youtube), task-oriented (as in 32 degrees to Fahrenheit), and so on.
- Interplay of distance & popularity: How far can I go for a coffee shop vis-a-vis a university?
- How should local search be measured? NDCG and Precision & Recall have their limitations, and factors like content richness should matter.
- Presentation: Map vs Search Results – Should there be a difference?
- Crowdsourcing local data.
- Noise correction & Quality improvement of local content
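As one concrete angle on the deduplication problem above, near-duplicate news items can be flagged with word shingles and Jaccard similarity. A minimal sketch, in which the shingle size, threshold, and sample texts are all illustrative assumptions, not a production pipeline:

```python
# Near-duplicate detection via word-shingle Jaccard similarity.
# One common first approach to deduplication; shingle size and
# thresholds here are arbitrary illustrative choices.

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = shingles("Tendulkar gets out at 99 in the second innings")
s2 = shingles("Tendulkar gets out at 99 in second innings today")
s3 = shingles("Ford launches a new automotive plant in Chennai")

print(jaccard(s1, s2))  # high overlap: likely the same story
print(jaccard(s1, s3))  # no overlap: unrelated stories
```

Real systems layer MinHash/LSH on top of this idea so that pairwise comparison scales beyond a handful of documents.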
Knowing the state of the art in this field would be an amazingly interesting experience!
- Start one project or take one unsolved problem
- Define breadth & depth to solve this problem
- Divide the problem; Write on the challenges, motivation, survey, state of the art;
- Experiment on intuitions; write on experiments
- Combine all the work into a thesis. There could still be open, yet-to-be-solved problems.
Create your own projects (simple/small is fine). Examples:
- JS for data visualization instead of R.
- Identify facts from internet
- Summarize from books
Good to have it hosted / uploaded for visibility
Simple mindmap of the field of data analytics.
The chromatic number of a graph is the minimum number of colors required to color it so that no two adjacent vertices have the same color. Computing it is an NP-complete problem. See mathworld…
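While computing the chromatic number exactly is NP-complete, a greedy heuristic colors a graph quickly and yields an upper bound on it. A minimal sketch; the 5-cycle example and vertex ordering are illustrative:

```python
# Greedy graph coloring: produces a proper coloring quickly, giving an
# UPPER BOUND on the chromatic number (exact computation is NP-complete).

def greedy_coloring(adj):
    """adj: dict mapping vertex -> set of neighbours.
    Returns dict mapping vertex -> color (0, 1, 2, ...)."""
    colors = {}
    for v in adj:
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:          # smallest color not used by a neighbour
            c += 1
        colors[v] = c
    return colors

# A 5-cycle: odd cycles have chromatic number 3.
cycle5 = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
coloring = greedy_coloring(cycle5)

# The coloring is proper: no edge joins two same-colored vertices.
assert all(coloring[u] != coloring[v] for u in cycle5 for v in cycle5[u])
print(max(coloring.values()) + 1)  # number of colors greedy used
```

Note that greedy's answer depends on vertex order and can be far from optimal on adversarial graphs; that gap is exactly why the exact problem is hard.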
This is my mindmap of what an introduction to graph theory consists of. I will explore some of these topics in the near future.
- Vertex, Path, Connectivity, Trees, Forests, Cycles
- Matching & Coloring
- Chromatic Number
- Based on connectivity: Isomorphic, Null, Bipartite Graphs, Planar (Dual, Infinite) Graphs, k-partite & k-colorable, Directed (Digraph)
- Based on Walks: Eulerian (closed walks that cover every edge and return to the initial vertex), Hamiltonian (cycles that visit every vertex exactly once)
- Based on shapes: Stars, Platonic, etc…
- Euler Tour & Hamilton Cycles
- Hall’s Theorem, Euler’s Formula, Brooks’ Theorem, matroid theory
- Well-known problems
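Several of the notions above (bipartite graphs, 2-colorability, cycles) meet in one classic fact: a graph is bipartite iff BFS can 2-color it, iff it has no odd cycle. A minimal sketch with illustrative example graphs:

```python
# Bipartiteness test by BFS 2-coloring: a graph is bipartite
# iff it is 2-colorable iff it contains no odd cycle.
from collections import deque

def is_bipartite(adj):
    """adj: dict mapping vertex -> iterable of neighbours (undirected)."""
    color = {}
    for start in adj:                 # handle disconnected graphs
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]   # opposite side
                    queue.append(v)
                elif color[v] == color[u]:
                    return False              # odd cycle found
    return True

square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # even cycle
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}            # odd cycle
print(is_bipartite(square), is_bipartite(triangle))     # True False
```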
Big O is not the only notation used to describe algorithm complexity; there are Theta and Omega as well. O and Omega each come in big and little variants (there is no little-theta).
As I read from online sources:
f(x) = O(g(x)) (big-oh) means that the growth rate of f(x) is asymptotically less than or equal to the growth rate of g(x).
f(x) = Ω(g(x)) (big-omega) means that the growth rate of f(x) is asymptotically greater than or equal to the growth rate of g(x).
f(x) = o(g(x)) (little-oh) means that the growth rate of f(x) is asymptotically strictly less than the growth rate of g(x).
f(x) = ω(g(x)) (little-omega) means that the growth rate of f(x) is asymptotically strictly greater than the growth rate of g(x).
f(x) = Θ(g(x)) (theta) means that the growth rate of f(x) is asymptotically equal to the growth rate of g(x).
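One common way to make these statements precise is via limits of the ratio f(x)/g(x) as x → ∞:

```latex
\begin{aligned}
f(x) = o(g(x))      &\iff \lim_{x\to\infty} \frac{f(x)}{g(x)} = 0 \\
f(x) = O(g(x))      &\iff \limsup_{x\to\infty} \frac{f(x)}{g(x)} < \infty \\
f(x) = \Theta(g(x)) &\iff 0 < \liminf_{x\to\infty} \frac{f(x)}{g(x)}
                          \le \limsup_{x\to\infty} \frac{f(x)}{g(x)} < \infty \\
f(x) = \Omega(g(x)) &\iff \liminf_{x\to\infty} \frac{f(x)}{g(x)} > 0 \\
f(x) = \omega(g(x)) &\iff \lim_{x\to\infty} \frac{f(x)}{g(x)} = \infty
\end{aligned}
```

(The liminf/limsup forms cover cases where the plain limit does not exist; this uses the Knuth/Hardy–Littlewood convention standard in algorithm analysis.)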
What are P, NP, NP Hard and NP Complete problems?
P = problems solvable in polynomial time, i.e., with a polynomial worst-case upper bound. Examples of polynomial time: O(1), O(n lg n), O(n^2), O(n^3), and in general O(n^k) for constant k.
NP (Non-Deterministic Polynomial time) = problems whose solutions are verifiable in polynomial time. For example, finding a suitable path may be difficult; however, given a candidate path, verifying that it has at most k nodes is easy.
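The verification idea can be made concrete: checking a proposed certificate takes polynomial time even when finding one may not. A minimal sketch; the graph, function name, and bound are illustrative:

```python
# NP means "verifiable in polynomial time": given a CERTIFICATE
# (here, a proposed path), checking it is easy even if finding
# one is hard. This check runs in linear time in the path length.

def verifies_short_path(adj, path, s, t, k):
    """Check that `path` is a valid s-t path in `adj` with at most k nodes."""
    if not path or len(path) > k:
        return False
    if path[0] != s or path[-1] != t:
        return False
    # Every consecutive pair must be an actual edge.
    return all(v in adj[u] for u, v in zip(path, path[1:]))

adj = {0: {1}, 1: {2}, 2: {3}, 3: set()}
print(verifies_short_path(adj, [0, 1, 2, 3], 0, 3, 4))  # True: valid path
print(verifies_short_path(adj, [0, 2, 3], 0, 3, 4))     # False: 0->2 not an edge
```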
NP Complete = a problem that is itself in NP and to which every problem in NP can be reduced in polynomial time. If any one NP-complete problem is solved in polynomial time, every problem in NP can be.
NP Hard = problems at least as hard as every problem in NP; an NP-hard problem need not itself be in NP. There is a million-dollar Clay Millennium Prize on the related P vs NP question, whose resolution would settle the status of all these hard problems at once.
Turing’s “Halting Problem” is not solvable by any computer, no matter how much time is given! Another example: given a Boolean expression (as in 3-SAT), does there exist an assignment of values under which the expression evaluates to true? Apparently, such satisfiability questions can be solved by brute force (guessing the values), but no polynomial-time algorithm is known for them.
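The brute-force approach to satisfiability illustrates the gap nicely: each candidate assignment is checked quickly, but there are 2^n of them. A minimal sketch, where the clause encoding and example formula are illustrative assumptions:

```python
# Brute-force SAT: try all 2^n assignments. Each individual check is
# polynomial (that is the NP "verification" step); the exponential
# number of candidates is exactly the gap P vs NP asks about.
from itertools import product

def brute_force_sat(clauses, n):
    """clauses: list of clauses in CNF; literal i means variable i is
    true, -i means variable i is false. Variables are numbered 1..n.
    Returns a satisfying assignment dict, or None if unsatisfiable."""
    for bits in product([False, True], repeat=n):
        assign = {i + 1: bits[i] for i in range(n)}
        # A CNF formula is true iff every clause has some true literal.
        if all(any(assign[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assign
    return None

# (x1 v !x2) ^ (!x1 v x2 v x3) ^ (!x3)
clauses = [[1, -2], [-1, 2, 3], [-3]]
print(brute_force_sat(clauses, 3) is not None)  # True: satisfiable
print(brute_force_sat([[1], [-1]], 1))          # None: x1 ^ !x1 unsatisfiable
```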