Whacky and Newspaper Article Categorization

Whacky is a project to capture user information from social networks. It mines Twitter and other linked accounts to collect as much information as is available. Applications of such a technology seem exciting, for example in understanding leakage of personally identifiable information.

Dr. Haimonti Dutta’s work on newspaper article categorisation was also an inspiration. Although at first glance this looks relatively simple, it would be interesting to know which features were used to categorise short articles. Obviously, multiple categories may apply to a single article, so ranking the candidate categories should be a mandatory part of the system.

Both of these project descriptions are reachable through the IIIT Delhi website. Have a read.


Training Data – Mutual Independence – Is this a myth?

The assumption that one training sample is totally unrelated to another seems to be a myth. Modelling these relationships will lead to the next generation of AI applications, claims the University of Washington in the description of their “Statistical Relational Learning” project.

This paper nicely introduces not only SRL but also several other related terms, such as:

  • Markov Model: Assuming that the next state depends solely upon the present state (a set of variables), we predict the next state by modelling the present state.
  • Monte Carlo Methods: Iterate “random sampling & observation” many times and heuristically reach an acceptable approximation (a small sketch follows this list).
  • First Order Logic: Also known as predicate logic; in contrast to propositional logic, the universe is assumed to be a set of objects, relations and functions. For example: count(riversInIndia) > count(riversInSriLanka).
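
Here is a minimal Monte Carlo sketch in Python, estimating pi by repeated random sampling; the function name and sample count are my own illustration, not something from the paper.

```python
# A minimal Monte Carlo sketch: estimate pi by sampling random points in the
# unit square and counting how many land inside the quarter circle.
import random

def estimate_pi(num_samples=1_000_000):
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()    # a random point in the unit square
        if x * x + y * y <= 1.0:                   # does it fall inside the quarter circle?
            inside += 1
    return 4.0 * inside / num_samples              # ratio of areas approximates pi

print(estimate_pi())  # prints a value close to 3.14159
```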

In my subsequent posts, I will discuss these items.

ML Projects at CMU

Browsing through the projects at the CMU ML Department, the NELL project titled “Read the Web” particularly caught my attention. Considering the web as a massive collection of facts, learning from it is a very interesting problem. This is a good read on Exploratory Learning. While dealing with huge data, it is natural to expect several clusters to exist. The paper on EL claims to extend existing semi-supervised learning (SSL) algorithms using a variant of Expectation Maximization. The paper also gave me a summary of well-known SSL techniques such as:

  • Naive Bayes: Assuming features are independent of each other, the presence of each feature contributes independently to the probability of the item’s classification.
  • K-Means: A heuristic for the NP-hard problem of clustering items around their nearest mean (a small sketch follows this list).
  • Von Mises-Fisher: A distribution on the unit hypersphere, often used for clustering directional data such as normalised text vectors.
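
Below is a minimal K-Means sketch (Lloyd’s algorithm) on 2-D points; the data, names and iteration count are my own illustration.

```python
# A minimal K-Means sketch (Lloyd's algorithm) on 2-D points.
import random

def kmeans(points, k, iterations=20):
    centroids = random.sample(points, k)                 # start from k random points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # assign each point to its nearest centroid
            idx = min(range(k),
                      key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        for i, cluster in enumerate(clusters):           # move each centroid to its cluster's mean
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids

data = [(1, 1), (1.2, 0.8), (5, 5), (5.1, 4.9), (9, 1), (8.8, 1.2)]
print(kmeans(data, k=3))
```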

Another interesting project was to detect novel topics at time t, given all the topics known up to time t – 1.

Sorting Algorithms – Reference

Found this image here as a good way to refresh my knowledge of sorting algorithms:


Non-Comparative Sorting Techniques

(Source: to a large extent, Wikipedia)

Bucket Sort

Variants: Counting sort (bucket size = 1)

 

  1. Set up an array of initially empty “buckets”.
  2. Scatter: Go over the original array, putting each object in its bucket.
  3. Sort each non-empty bucket.
  4. Gather: Visit the buckets in order and put all elements back into the original array.

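A minimal bucket sort sketch in Python following the four steps above; it assumes numeric keys in [0, 1), and the bucket count is an arbitrary choice of mine.

```python
# A minimal bucket sort sketch for keys in [0, 1); the bucket count is arbitrary.
def bucket_sort(values, num_buckets=10):
    buckets = [[] for _ in range(num_buckets)]        # 1. set up empty buckets
    for v in values:                                  # 2. scatter each value into its bucket
        buckets[int(v * num_buckets)].append(v)
    for b in buckets:                                 # 3. sort each non-empty bucket
        b.sort()
    return [v for b in buckets for v in b]            # 4. gather the buckets in order

print(bucket_sort([0.42, 0.32, 0.23, 0.52, 0.25, 0.47, 0.51]))
# [0.23, 0.25, 0.32, 0.42, 0.47, 0.51, 0.52]
```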

Radix Sort
  1. Take the least significant digit (or group of bits, both being examples of radices) of each key.
  2. Group the keys based on that digit, but otherwise keep the original order of keys. (This is what makes the LSD radix sort a stable sort).
  3. Repeat the grouping process with each more significant digit.

Input:                  170, 45, 75, 90, 802, 2, 24, 66

After sorting by 1s:    170, 90, 802, 2, 24, 45, 75, 66

After sorting by 10s:   802, 2, 24, 45, 66, 170, 75, 90

After sorting by 100s:  2, 24, 45, 66, 75, 90, 170, 802

Each digit place can be sorted in a single pass.
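
Here is a minimal LSD radix sort sketch in Python that reproduces the passes above; the way of counting digit positions is my own choice.

```python
# A minimal LSD radix sort sketch for non-negative integers.
def radix_sort(values):
    digits = len(str(max(values)))                    # number of digit positions to process
    for d in range(digits):
        buckets = [[] for _ in range(10)]             # one bucket per decimal digit
        for v in values:
            buckets[(v // 10 ** d) % 10].append(v)    # group by the d-th least significant digit
        values = [v for b in buckets for v in b]      # stable gather preserves the earlier order
    return values

print(radix_sort([170, 45, 75, 90, 802, 2, 24, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```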

Journal or Conference-Paper?

While browsing some of the faculty profiles, I noticed the classification of their publications into journal and conference papers. It was interesting to understand why we need this sort of bifurcation and what each one is for!

It turns out that journal papers are longer versions that contain details about experiments, research methodology and so on. Conference papers crisply capture the essence of the work, with key supporting data, under a strict page limit. This page is a good read. For someone like me who derives motivation from people reading and listening to my work, conferences are just perfect: they are an opportunity to quickly get my work out to the masses and get feedback. At the same time, crisp notes help me digest the state of the art much more quickly and just in time. Journals are probably good for understanding the history of our field, or when we have a ground-breaking story that is worth the effort and time (and that will also endure and still be impactful!).

 

Extracting Hypothesis from Examples (ILP)

Inductive Logic Programming (ILP) systems look to explain behavior using examples. It is a well-researched and well-documented area, with applications ranging from the drug industry to search technologies.

In the paper on “Drug Design Using Inductive Logic Programming”, Machine Learning is used as a pattern recognition technology. The fundamental hypothesis is that you can start with a few examples, look at their common attributes and build a rule based on them; then iterate on this idea with more examples and refine the rules. In some sense, I believe this is what “least general generalization” (lgg; see the ILP literature for more details) refers to.

When I imagine its application to search technologies or information retrieval, there are plenty of opportunities. For instance, consider categorization. Two websites can be categorized as hotels based on a few parameters that we read (such as whether the name contains “Inn”, “Eating Joint” or “Take Away”). Now, with more and more such examples, supervised learning should allow us to construct the best possible rules (a tiny sketch of this idea follows). Nice knowing ILP!
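
Here is a toy, hypothetical illustration of that generalise-from-examples idea: each example is just a set of attributes, and the “rule” keeps only the attributes shared by all positive examples, in the spirit of an lgg over attribute sets. None of these names or attributes come from the paper.

```python
# Toy sketch: learn a rule as the intersection of attributes shared by all
# positive examples, then test whether a new candidate matches that rule.
def learn_rule(positive_examples):
    rule = set(positive_examples[0])
    for example in positive_examples[1:]:
        rule &= set(example)                  # refine: drop attributes not shared by every example
    return rule

def matches(rule, candidate):
    return rule <= set(candidate)             # a candidate matches if it carries every rule attribute

hotel_examples = [
    {"name_contains_inn", "has_address", "has_checkin_time", "has_pool"},
    {"name_contains_inn", "has_address", "has_checkin_time"},
]
rule = learn_rule(hotel_examples)
print(rule)                                                     # attributes common to all examples
print(matches(rule, {"name_contains_inn", "has_address",
                     "has_checkin_time", "has_restaurant"}))    # True
```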

Challenges in local listings search

Search engines of this era are very sophisticated. The power of machine learning over massive web-scale data makes them very competitive. However, we still face several open problems that challenge the existing technologies. Some such problems, as I perceive them, are listed below:

  • Entity Extraction from the web: Entity attributes such as company name, address, phone, etc. lie in unstructured form and need to be extracted. Even with Web 2.0 and structured semantic pages, this problem remains a huge one. Imagine extracting an entity and its information from free-form text or a webpage!
  • Entity Deduplication: We could buy data from multiple providers. Consider a news item that says “Tendulkar gets out at 99”. The same news item may come from several providers. If we are a news website such as Yahoo! or MSN, how do we de-duplicate or conflate news items that convey the same meaning but are authored in different ways? (A small sketch follows this list.)
  • Entity Categorization: Given a web page, can you categorize it? The same applies to entities – e.g., Amitabh Bachchan is an actor, Ford is an automotive company, and so on.
  • Popularity: Given two entities, which one is more popular than the other? Can popularity be static? Is Taj Krishna more popular than Westin?
  • Semantic Similarity: If the query is “best levis outlet in Hyderabad”, is there a way to cluster all levis outlets in Hyderabad and later rank them?
  • Query Alteration: How can queries be altered for the best benefit of overall search experience?
  • Query Language: site:blahblah, filetype:pdf are some common annotations that we might have used. What problems do they solve? Is there a better way of solving them? Do we need these?
  • Intent Extraction from Queries: How to disambiguate query intent? How to identify multiple query intents?
  • Classification: Queries can be of multiple types: Name (as in Airtel), Category (as in telecom providers), Navigational Intent (as in youtube), Task oriented (as in 32 degrees to Fahrenheit) and so on.
  • Interplay of distance & popularity: How far can I go for a coffee shop vis-a-vis a university?
  • How should local search be measured? NDCG, Precision & Recall have their limitations, and factors like content richness should matter.
  • Presentation: Map vs Search Results – Should there be a difference?
  • Crowdsourcing local data.
  • Noise correction & Quality improvement of local content
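
For the deduplication bullet above, here is a toy, hypothetical near-duplicate detector: two news snippets are conflated when the Jaccard similarity of their word 3-grams crosses a threshold. The shingle size and threshold are arbitrary choices of mine, not a claim about how Yahoo! or MSN actually do it.

```python
# Toy near-duplicate detection via Jaccard similarity of word 3-grams (shingles).
def shingles(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def are_duplicates(text1, text2, threshold=0.3):
    return jaccard(shingles(text1), shingles(text2)) >= threshold

print(are_duplicates("Tendulkar gets out at 99 in the final over",
                     "In the final over Tendulkar gets out at 99"))   # True
```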

Knowing the state of the art in this field would be an amazingly interesting experience!