Indexing to Search to Insight (& Artificial Intelligence)

What a Long Strange Trip it’s Been

We just completed a few assignments for clients seeking to improve the performance of search in key applications within their organizations. These assignments reminded me of the long and winding road search has been on for the past 30 years, together with some valuable lessons we’ve learned along the way.

I thought it would be useful to share these lessons in a series of blog posts where we will compare and examine the following:

Indexing engines (the first generation of search engines)
Modern search engines
Insight engines (the next generation of search engines, which use artificial intelligence)

First Generation Search: Indexing

If you can believe it, back in the good old days before the internet (i.e. prior to the release of Netscape Navigator in 1994), full-text search was not widely available. Search was available in online information systems (e.g. Dialog), but inside organizations, databases still relied on string matching (e.g. the SQL LIKE operator). Because string matching has to scan the text of every record, word search was very slow, especially when the database was very large.

This slowness gave rise to a series of products and capabilities that solved the problem by indexing words, so that a query looks up each word in a prebuilt index instead of scanning every record. Over time, these indexing capabilities were embedded in database products, enabling fast keyword retrieval over databases of virtually any size.
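To make the difference concrete, here is a minimal Python sketch contrasting a record-by-record scan with a word index. It is only an illustration of the general idea, not how any particular product works; the function names and the three sample documents are made up for this example. Note also that a LIKE-style scan matches substrings, while the index matches whole words.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def linear_scan(docs, term):
    """Roughly what LIKE '%term%' does: examine every record on every query."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if term.lower() in text.lower())

def build_index(docs):
    """Build the index once: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def lookup(index, term):
    """Answer a query with a single dictionary lookup, whatever the corpus size."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical sample data, for illustration only.
docs = {
    1: "Annual industry handbook",
    2: "Quarterly market report for the industry",
    3: "Meeting notes",
}
index = build_index(docs)
print(linear_scan(docs, "industry"))
print(lookup(index, "industry"))
```

The scan does work proportional to the total amount of text on every query, whereas the index pays that cost once up front and then answers each query with a lookup.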

We recently worked with a customer that had used an indexing product to build a document repository of about 100,000 PDFs. The application had been doing its job for many years, but more recently users and the content administrators (information professionals in the research center) had noticed a few problems. Most importantly, they were frustrated that the relevance ranking of the search results seemed poor.

How Is Relevance Determined?

Indexing products (including advanced ones like Microsoft SQL Server Full-Text Search) generally compute relevance using a combination of the following:

The frequency of the search terms within a record (i.e. how often the search terms occur in a document; more occurrences means more relevant).
The ‘density’ of the search terms within a record (i.e. the number of occurrences of the search terms divided by the total number of words in the record).
How common the search terms are in the database as a whole (matching on a rare word is more important than matching on a common word).

This type of relevance ranking algorithm works pretty well, and when combined with Boolean search operators and parametric/fielded search, can yield excellent results. But it is far from perfect.
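The three signals listed above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the formula used by any specific product: the logarithmic rarity weight and the way the three signals are combined in `score` are choices made for this example, and the sample documents are invented.

```python
import math
import re

def tokenize(text):
    """Split text into lowercase words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def signals(term, doc_id, docs):
    """Return (frequency, density, rarity) for one search term in one document."""
    words = tokenize(docs[doc_id])
    # Signal 1: how often the term occurs in the document.
    frequency = words.count(term)
    # Signal 2: occurrences divided by the document's total word count.
    density = frequency / len(words) if words else 0.0
    # Signal 3: terms found in fewer documents get a higher weight.
    # The logarithm is an assumption borrowed from common practice.
    containing = sum(1 for text in docs.values() if term in tokenize(text))
    rarity = math.log(len(docs) / containing) if containing else 0.0
    return frequency, density, rarity

def score(query, doc_id, docs):
    """Combine the signals per term; this particular combination is illustrative."""
    total = 0.0
    for term in tokenize(query):
        frequency, density, rarity = signals(term, doc_id, docs)
        total += (frequency + density) * rarity
    return total

# Hypothetical sample data, for illustration only.
docs = {
    "a": "search search engine",
    "b": "engine index",
    "c": "index words",
}
ranked = sorted(docs, key=lambda d: score("search engine", d, docs), reverse=True)
print(ranked)
```

In an indexing product, the equivalent of `score` is fixed inside the product, so the weighting of each signal is not something the application owner gets to change.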

Problems with Relevance

In fact, the customer cited three issues they were experiencing with relevance:

Problem #1 - Big documents (like the thousand-page annual industry handbook) always seemed to be at the top of the search results.
Problem #2 - Documents where the search term was in the title field were not deemed more relevant than documents where the search term was on page 24.
Problem #3 - More recent documents were not deemed more relevant than much older documents on the same topic.

A general problem with indexing products is that the built-in relevance algorithm is a take-it-or-leave-it proposition: it generally cannot be tuned or adjusted. This is a major difference between indexing products and modern search engines.

When we discussed next steps with the client, we quickly moved into a discussion of search engines, which is the subject of the next blog post.