B and B+ Trees

B Trees and B+ Trees are balanced n-ary search trees. Every node other than the root must hold at least a minimum number of keys (determined by the tree's order); the root is exempt from this minimum and may hold as few as one key.

These are indexing structures, commonly used in databases and file systems for fast retrieval. Think of each node holding a large number of keys; when a node grows beyond a threshold, it is split and more nodes are added.

The B+ Tree is a specialization of the B Tree: only the leaf nodes of a B+ Tree contain values, while all other nodes contain only keys.
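To make this concrete, here is a minimal Python sketch of lookup in a B+ Tree. The Node layout, the names and the two-leaf example are my own assumptions for illustration; this is not a full implementation (no insertion or splitting):

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None, next_leaf=None):
        self.keys = keys            # sorted keys in this node
        self.children = children    # child nodes (internal nodes only)
        self.values = values        # payloads (leaf nodes only)
        self.next_leaf = next_leaf  # leaf-to-leaf link for range scans

def search(node, key):
    # Descend through key-only internal nodes to the correct leaf.
    while node.children is not None:
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    # Only the leaf holds values; look the key up there.
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# Tiny example: the root routes keys < 10 left, keys >= 10 right.
left = Node(keys=[2, 5], values=["a", "b"])
right = Node(keys=[10, 17], values=["c", "d"])
left.next_leaf = right
root = Node(keys=[10], children=[left, right])
print(search(root, 17))  # -> "d"
```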

On the downside, the root node becomes a single point of failure, as is the case with any tree.

Found this to be a good read.


Whacky and Newspaper Article Categorization

Whacky is a project to capture user information from social networks. It mines Twitter and other linked accounts to collect as much information as is available. Applications of such a technology seem awesome, for example in understanding the leakage of personally identifiable information.

Dr. Haimonti Dutta’s work on newspaper article categorisation was also an inspiration. Although at first glance this looks relatively simple, it would be interesting to know what features were used to categorise short articles. Obviously, multiple categories may apply to an article, so ranking the candidate categories should be a mandatory part.

Both project descriptions are reachable through the IIIT Delhi website. Have a read.

Training Data – Mutual Independence – Is this a myth?

The assumption that one training sample is totally unrelated to another seems to be a myth. Modelling these relationships will lead to the next generation of AI applications, claims the University of Washington in the description of its “Statistical Relational Learning” project.

This paper nicely introduces not only SRL but also several related terms, such as:

  • Markov Model: Assuming that the next state depends solely upon the present state (a set of variables), we predict the next state by modelling the present state.
  • Monte Carlo Methods: Iterate “random sampling & observation” several times and heuristically reach acceptable approximations (see the sketch after this list).
  • First Order Logic: Also known as predicate logic; in contrast to propositional logic, we assume the universe to be a set of objects, relations and functions. For example: count(riversInIndia) > count(riversInSrilanka).
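As a tiny illustration of the Monte Carlo idea above, here is a minimal Python sketch that approximates π by random sampling; the sample count is an arbitrary choice of mine:

```python
import random

# Estimate pi by throwing random points at the unit square and
# counting how many land inside the quarter circle of radius 1.
def estimate_pi(samples=1_000_000):
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # (area of quarter circle) / (area of square) = pi / 4
    return 4.0 * inside / samples

print(estimate_pi())  # approaches 3.1415... as samples grow
```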

In my subsequent posts, I will discuss these items.

ML Projects at CMU

Browsing through the projects at the CMU ML Department, the NELL project, titled “Read the Web”, particularly caught my attention. Considering the web as a massive collection of facts, learning from it is a very interesting problem. This is a good read on Exploratory Learning (EL). While dealing with huge data, it’s natural to expect several clusters to exist. This paper on EL claims to extend existing semi-supervised learning (SSL) algorithms based on “Expectation Maximization”. This paper also gave me a summary of well-known SSL techniques, such as:

  • Naive Bayes: Assuming features are independent of each other, existence of each of the features contributes to the probability of the item’s classification.
  • K-Means: The NP-hard problem of clustering items around their nearest means (see the sketch after this list).
  • Von Mises-Fisher: A distribution on the unit hypersphere, used for modelling and clustering directional data.
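For the K-Means entry above, here is a minimal Python sketch of Lloyd's algorithm on one-dimensional points; the data, k and iteration count are made-up choices:

```python
import random

# Lloyd's algorithm: alternately assign points to their nearest
# center and move each center to the mean of its cluster.
def k_means(points, k, iterations=100):
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

print(k_means([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], k=2))  # roughly [1.0, 5.0]
```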

Another interesting project was detecting novel topics at time t, given all the topics available up to time t − 1.

Sorting Algorithms – Reference

Found this image here as a good way to refresh my knowledge of sorting algorithms:


Non-Comparative Sorting Techniques

(Source: to a large extent, Wikipedia)

Bucket Sort

Variants: Counting sort (a bucket sort with bucket size = 1)

  1. Set up an array of initially empty “buckets”.
  2. Scatter: Go over the original array, putting each object in its bucket.
  3. Sort each non-empty bucket.
  4. Gather: Visit the buckets in order and put all elements back into the original array.
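
The four steps above map almost line for line onto code. A minimal Python sketch, assuming the inputs are floats in [0, 1) and an arbitrary bucket count:

```python
def bucket_sort(values, num_buckets=10):
    buckets = [[] for _ in range(num_buckets)]  # 1. set up empty buckets
    for v in values:                            # 2. scatter
        buckets[int(v * num_buckets)].append(v)
    for b in buckets:                           # 3. sort each bucket
        b.sort()
    return [v for b in buckets for v in b]      # 4. gather

print(bucket_sort([0.42, 0.32, 0.23, 0.52, 0.25, 0.47]))
# [0.23, 0.25, 0.32, 0.42, 0.47, 0.52]
```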


Radix Sort
  1. Take the least significant digit (or group of bits, both being examples of radices) of each key.
  2. Group the keys based on that digit, but otherwise keep the original order of keys. (This is what makes the LSD radix sort a stable sort).
  3. Repeat the grouping process with each more significant digit.

Original list: 170, 45, 75, 90, 802, 2, 24, 66

After sorting on the ones digit: 170, 90, 802, 2, 24, 45, 75, 66

After sorting on the tens digit: 802, 2, 24, 45, 66, 170, 75, 90

After sorting on the hundreds digit: 2, 24, 45, 66, 75, 90, 170, 802

Each digit place can be sorted in a single pass.
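
Here is a minimal Python sketch of LSD radix sort matching the passes above, using base-10 digits as the radix (non-negative integers assumed):

```python
def radix_sort(keys):
    place = 1
    while max(keys) // place > 0:
        # One stable pass: bucket the keys by the digit at `place`,
        # preserving the order within each bucket.
        buckets = [[] for _ in range(10)]
        for k in keys:
            buckets[(k // place) % 10].append(k)
        keys = [k for b in buckets for k in b]
        place *= 10
    return keys

print(radix_sort([170, 45, 75, 90, 802, 2, 24, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```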

Journal or Conference Paper?

While browsing some faculty profiles, I noticed the classification of their publications into journal and conference papers. It was interesting to understand why we need this sort of bifurcation and what each one is for!

Turns out that journal papers are longer versions that contain details about experiments, research methodology and so on, while conference papers crisply capture the essence of the work, with key supporting data, under a strict page limit. This page is a good read. For someone like me who derives motivation from people who read and listen to my work, conferences are just perfect. It’s an opportunity to quickly get my work to the masses and get feedback. At the same time, crisp notes help me digest the state of the art much more quickly and just in time. Journals are probably good for understanding the history of our field, or when we have a ground-breaking story that’s worth the effort and time (and also something that will endure and still be impactful!).

 

Extracting Hypothesis from Examples (ILP)

Inductive Logic Programming (ILP) systems look to explain behavior using examples. This is a well-researched and well-documented area, with applications ranging from the drug industry to search technologies.

In the paper “Drug Design Using Inductive Logic Programming”, machine learning is used as a pattern-recognition technology. The fundamental hypothesis is that you can start with a few examples, look at their common attributes and build a rule based on them. Now, iterate on this idea with more examples and refine the rules. In some sense, I believe this is what “least general generalization” (lgg) (see the ILP literature for more details) refers to.
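To show the intuition, here is a toy Python sketch of lgg over two ground atoms with the same predicate. Real ILP lgg operates on full clauses and terms; the predicate and argument names here are made up:

```python
def lgg(atom_a, atom_b):
    """Generalize two ground atoms: keep matching arguments and
    replace each mismatched pair with a shared fresh variable."""
    pred_a, args_a = atom_a
    pred_b, args_b = atom_b
    assert pred_a == pred_b and len(args_a) == len(args_b)
    variables = iter("XYZUVW")  # pool of fresh variable names
    seen = {}                   # same mismatch -> same variable
    out = []
    for a, b in zip(args_a, args_b):
        if a == b:
            out.append(a)
        else:
            if (a, b) not in seen:
                seen[(a, b)] = next(variables)
            out.append(seen[(a, b)])
    return (pred_a, tuple(out))

print(lgg(("category", ("Sunshine Inn", "hotel")),
          ("category", ("Moonlight Inn", "hotel"))))
# -> ('category', ('X', 'hotel'))
```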

When I imagine its application to search technologies or information retrieval, there are plenty of opportunities. For instance, consider categorization: two websites can be categorized as hotels based on a few parameters that we read (such as whether the name contains “Inn”, “Eating Joint” or “Take Away”). Now, with more and more such examples, supervised learning should allow us to construct the best possible rules. Nice knowing ILP!