Knowledge Management

Insight Engines (Next Generation Search Engines using AI)

This is part three of a three-part blog post. (See part-one and part-two here.)

“Insight Engine” is a new term popularized by research firms such as Gartner and Forrester, and represents the next step in information discovery. The general idea of an Insight engine is to dramatically improve user access to relevant content and to make access to that content frictionless. While definitions vary, I define these engines as using AI techniques to offer three core capabilities generally lacking in search engines:

  • Increasing user access to all your content no matter where it resides
  • Improving the relevance of search results by using personal relevancy algorithms
  • Enabling next generation push/alerts

These capabilities can move enterprise search and knowledge management from a state where users spend significant time every day trying to find information to one where information is almost magically at one's disposal. Let’s take a closer look.

Access to Content

Insight engines dramatically change the dynamics of accessing content through the use of natural language processing. As we have all experienced, every search engine has its own unique syntax. Sometimes that difference is subtle, such as Google vs Bing. But other times the query syntax is incomprehensible to most users (e.g. native SQL queries) and thus the content becomes inaccessible. Insight engines tackle this problem by:

  1. Allowing users to enter a query using natural language. For example - “How many units did we sell today?”
  2. Parsing the query (using natural language processing) and then reformulating it in the native query syntax of the repository.
  3. Returning the results in a format useful to the user (e.g. a spreadsheet).
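The three steps above can be sketched in a few lines of Python. This is a toy illustration, not how any vendor does it: the `sales` table, its columns, and the tiny `METRICS` vocabulary are all assumptions invented for the example, and a keyword lookup stands in for real natural language processing.

```python
import re
import sqlite3

# Hypothetical vocabulary: maps a word in the question to a SQL aggregate.
METRICS = {"units": "SUM(units)"}


def to_sql(question, today):
    """Step 2: parse the question and rewrite it as a parameterized SQL query."""
    words = re.findall(r"[a-z]+", question.lower())
    metric = next((METRICS[w] for w in words if w in METRICS), None)
    if metric is None:
        raise ValueError("no known metric in question")
    sql = f"SELECT {metric} FROM sales"
    params = ()
    if "today" in words:
        # The date is passed in so the caller controls what "today" means.
        sql += " WHERE sale_date = ?"
        params = (today,)
    return sql, params


# A throwaway in-memory repository standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-05-01", 3), ("2024-05-01", 2), ("2024-04-30", 7)],
)

# Step 1: the user's natural language question.
sql, params = to_sql("How many units did we sell today?", today="2024-05-01")

# Step 3: run the native query and hand back the result.
total = conn.execute(sql, params).fetchone()[0]
```

Formatting the answer (for example, as a spreadsheet) is left out to keep the sketch short.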

With this type of power, users can query any repository for any data, giving them a true 360-degree view of the available content and thus enabling better decision making.

In the knowledge management world, this is a home run.

Improving Relevance

As we discussed in the previous post, context can be a game changer in relevancy as demonstrated by Google’s use of PageRank to dramatically improve the relevancy of internet search. Unfortunately, PageRank does not apply to enterprise search, and so to date, enterprise search engines have largely ignored context. Insight engines change that by using AI to analyze user behavior (e.g. previous queries, previous downloads, time spent on various articles, etc.) and then incorporating this user-specific context when computing search relevance. The results can be stunning, as each user now has a personal relevancy algorithm, one that is built on an analysis of data they have consumed in the past. This is similar to when you search for “cameras” on the internet and then later in the day you get personalized camera ads, but on steroids.

For you geeks out there, this is a form of Learning to Rank (LTR) where relevancy is customized by an AI model. The training data is your past search and content consumption behavior and the model seeks to predict what content you will like the best.
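To make the idea concrete, here is a minimal sketch of personalized re-ranking. It is not a trained Learning to Rank model: a hand-weighted blend of the engine's base score and a per-user topic affinity stands in for one. The topic tags, the `alpha` weight, and the function names are assumptions made up for illustration.

```python
from collections import Counter


def build_profile(history):
    """Turn the topic tags of previously consumed documents into weights that sum to 1."""
    counts = Counter(tag for tags in history for tag in tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def personalized_rank(results, profile, alpha=1.0):
    """Re-rank (doc_id, base_score, tags) tuples using the user's topic affinity."""
    def score(item):
        _, base, tags = item
        affinity = sum(profile.get(tag, 0.0) for tag in tags)
        return base * (1.0 + alpha * affinity)

    ranked = sorted(results, key=score, reverse=True)
    return [doc_id for doc_id, _, _ in ranked]


# What this user has read in the past (hypothetical data).
history = [["cameras", "lenses"], ["cameras"], ["tripods"]]
profile = build_profile(history)

# What the search engine returned, best base score first.
results = [
    ("printer-guide", 1.0, ["printers"]),
    ("camera-review", 0.8, ["cameras"]),
    ("lens-faq", 0.7, ["lenses"]),
]
ranking = personalized_rank(results, profile)
```

For this user, the camera review overtakes the printer guide even though its base score is lower, because half of their reading history is about cameras.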

In the knowledge management world, this is a second home run.

Next Generation Push / Alerts

Insight engines also tackle another huge issue. A great deal of research has been done documenting that knowledge workers spend a significant portion of the day looking for information and often meet with failure. As if that isn’t bad enough, imagine how much information is missed altogether because we didn’t spend all day every day updating our searches and watching newsfeeds for changes in our industry, etc.

Insight engines seek to significantly reduce the need for users to search by enabling next-generation push/alerts. Insight engines use AI to analyze users’ past content consumption and then, in the background, find new relevant content and deliver it to the user. This means users get awesome, up-to-date, targeted content without even asking for it.
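A rough sketch of that background matching is below. Plain word counts and cosine similarity stand in for whatever AI model a real product would use, and the `threshold` value, the function names, and the sample documents are all assumptions for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def alerts(consumed, new_docs, threshold=0.3):
    """Return ids of new documents that resemble what the user already read."""
    profile = Counter()
    for text in consumed:
        profile.update(vectorize(text))
    scored = [(cosine(profile, vectorize(text)), doc_id)
              for doc_id, text in new_docs.items()]
    return [doc_id for s, doc_id in sorted(scored, reverse=True) if s >= threshold]


consumed = ["solar panel efficiency report", "solar storage pricing"]
new_docs = {
    "a": "new solar panel pricing trends",
    "b": "quarterly cafeteria menu",
}
to_push = alerts(consumed, new_docs)
```

Only document "a" clears the threshold here, so it is the only one pushed to the user.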

In the knowledge management world, this is a third home run.

A Bit of Caution

Insight engines are definitely here. But please know that:

  • Not all these capabilities are available from all Insight engine vendors.
  • Not all of these capabilities are as far along as the hype would lead you to believe.

But, if you are looking for that next leap forward, you should certainly look at Insight engines and vendors including: Coveo, Attivio, Sinequa, Lucidworks, as well as offerings from the big boys Microsoft, IBM, and HP.

Have you looked at Insight engines? If so, drop me an email and tell me your story.

Modern Search Engines

This is part two of a three-part blog post. (See part-one here.)

As we discussed in the previous post, index engines were developed to make searching across large textual repositories fast. But once high-speed retrieval was achieved, a new problem occurred – users were unable to find the most relevant/interesting documents within a large set of search results. The obvious answer/solution was to rank the documents by relevancy and present the most relevant results first.

As we saw in the previous post, indexing products often have problems with relevancy. That shortcoming gave rise to modern search engines, which improve relevancy by using two key techniques. Let’s take a look.


Enhancing Relevance with Context

Google is by far the best example of using context to improve relevance. Early internet search engines largely ranked results by counting the number of times the search terms appeared on the page. Google took a new approach to relevance by introducing the page’s importance (called PageRank) into ranking search results. (PageRank is loosely based on how many other websites linked to that page.) The inclusion of context into the relevance ranking had a huge and dramatic effect enabling Google to leapfrog their competitors Yahoo and Lycos, largely because users found Google’s searches so much better.

While PageRank is an awesome piece of context on the web, it does not work inside an organization. So enterprise search engines developed other techniques to improve relevance.

Enhancing Relevance with Tuning

Modern enterprise search engines attempt to address the problem of poor relevance by making the relevance calculation tunable for an organization’s specific set of circumstances. For example, Elasticsearch (a widely used open source search engine based on Lucene) has many options for tuning relevancy. A few examples (all of which were critical to the organization described in the previous post) include:

Commonly Used Adjustments/Techniques

  • Field Boosting – used to boost the relevance of documents when the search term is in a field such as “Title” as opposed to buried on page 24.
  • Time Boosting – used to make more recent items more relevant and therefore is very helpful in applications like news or research.
  • Search Term frequency saturation – used to ensure that a large document does not dominate all others just because it contains more search terms.

Specialized Adjustments/Techniques

  • Location Boosting – used to increase the relevance of items that are close.
  • Price Boosting – used to increase or decrease relevancy based on price (often critical for e-commerce applications).
  • Boosting by Popularity – used to increase the relevance based on data from another field such as a popularity rating  (like Google, context is being used to increase relevance).
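To show how the three commonly used adjustments interact, here is a small scoring sketch. It is not Elasticsearch's implementation: the saturation curve is the term-frequency part of BM25 with length normalization left out, and the `title_boost` and `half_life_days` values, the document fields, and the sample documents are assumptions chosen for the example.

```python
def saturate(tf, k1=1.2):
    """Term frequency saturation: rises with tf but levels off below k1 + 1."""
    return tf * (k1 + 1) / (tf + k1)


def score(doc, term, today, title_boost=3.0, half_life_days=365.0):
    """Combine saturation, field boosting, and time boosting for one search term."""
    body_tf = doc["body"].lower().split().count(term)
    title_tf = doc["title"].lower().split().count(term)
    # Field boosting: a match in the title counts title_boost times as much.
    text_score = saturate(body_tf) + title_boost * saturate(title_tf)
    # Time boosting: the score halves for every half_life_days of age.
    age_days = today - doc["published_day"]
    recency = 0.5 ** (age_days / half_life_days)
    return text_score * recency


# Days are plain integers here to keep the example free of date handling.
handbook = {"title": "industry handbook", "body": "pricing " * 200, "published_day": 0}
memo = {"title": "pricing update", "body": "new pricing", "published_day": 700}

handbook_score = score(handbook, "pricing", today=730)
memo_score = score(memo, "pricing", today=730)
```

With these settings the short, recent memo with "pricing" in its title outranks the old handbook that repeats the word 200 times.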

Modern search engines are much more tunable than indexing engines, and therefore often produce a much better search experience for the user. We recommend that when organizations adopt a search engine, they include relevancy tuning as part of the project to ensure a custom fit for their needs. However, once the low-hanging fruit has been picked (e.g. boosting the Title field), we strongly recommend against the desire some organizations have to continue tweaking relevance daily. As Elastic notes, “relevancy tuning is a rabbit hole that you can easily fall into and never emerge.” We advise organizations to revisit tuning regularly but infrequently, and only when they have the proper instrumentation and monitoring in place to know whether they are increasing or decreasing relevance. Again, according to Elastic, you should monitor relevance by keeping track of items such as “how often your users click the top result, the top 10, and the first page; how often they execute a secondary query without selecting a result first; how often they click a result and immediately go back to the search results, and so forth.” With these objective measures in hand, you can clearly understand how relevance tuning is affecting users’ search experience.
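As a sketch of what that instrumentation might compute, the function below summarizes a query log. The log format is an assumption: one record per query, with `clicked_rank` (the 1-based rank of the first result clicked, or None) and `bounced` (True if the user returned to the results immediately). The no-click rate is only a rough proxy for users re-querying without selecting a result.

```python
def relevance_metrics(queries):
    """Summarize a query log into the click-based measures described above."""
    n = len(queries)
    if n == 0:
        return {}
    ranks = [q["clicked_rank"] for q in queries]
    return {
        "top1_rate": sum(r == 1 for r in ranks) / n,
        "top10_rate": sum(r is not None and r <= 10 for r in ranks) / n,
        "no_click_rate": sum(r is None for r in ranks) / n,
        "bounce_rate": sum(bool(q["bounced"]) for q in queries) / n,
    }


# Hypothetical log of four queries.
log = [
    {"clicked_rank": 1, "bounced": False},
    {"clicked_rank": 3, "bounced": True},
    {"clicked_rank": None, "bounced": False},
    {"clicked_rank": 12, "bounced": False},
]
metrics = relevance_metrics(log)
```

Comparing these numbers before and after a tuning change gives an objective read on whether relevance went up or down.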

The final post in this three-part series will discuss next generation search and why some organizations are already reaping big benefits from Insight engines.

Indexing to Search to Insight (& Artificial Intelligence)

What a Long Strange Trip it’s Been

We just completed a few assignments for clients seeking to improve the performance of search in key applications within their organizations. These assignments reminded me of the long and winding road search has been on for the past 30 years, together with some valuable lessons we’ve learned along the way.

I thought it would be useful to share these lessons in a series of blog posts where we will compare and examine the following:

  • Indexing engines (the first generation of search engines)
  • Modern search engines
  • Insight engines (the next generation search engine using artificial intelligence)


First Generation Search: Indexing

If you can believe it, back in the good old days before the internet (i.e. prior to the release of Netscape Navigator in 1994), full-text search capabilities were not pervasive. Search was available in online information systems (e.g. Dialog), but inside organizations, databases were still using string matching (e.g. a SQL LIKE statement), and this meant word search was very slow, especially when the database was very large.

This slowness gave rise to a series of products and capabilities that solved the problem by indexing words. Over time these indexing capabilities were embedded in database products, enabling fast keyword retrieval over databases of virtually any size.
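The difference between string matching and word indexing can be sketched as follows. The scan mimics what a LIKE query has to do (look at every record), while the index does that work once up front. The function names and sample documents are invented for the example, and real products add stemming, phrase handling, and much more.

```python
from collections import defaultdict


def scan_search(docs, word):
    """String matching: examine every document on every query, like SQL LIKE."""
    return {doc_id for doc_id, text in docs.items() if word in text.lower()}


def build_index(docs):
    """Indexing: read each document once and record which words appear where."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def index_search(index, word):
    """A query is now a single dictionary lookup, however many documents exist."""
    return index.get(word.lower(), set())


docs = {
    1: "annual industry handbook",
    2: "industry pricing memo",
    3: "cafeteria menu",
}
index = build_index(docs)
hits = index_search(index, "industry")
```

Both approaches find documents 1 and 2 here; the index simply avoids rereading every document each time someone searches.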

We did some recent work with a customer that was using an indexing product in which they had built a document repository of about 100,000 PDFs. The application had been doing its job for many years, but more recently users and the content administrators (information professionals in the research center) had noticed a few problems. Most importantly, they were frustrated that the relevance ranking of the search results seemed poor.

How is Relevance Determined

Indexing products (including advanced ones like MS SQL Server full text search) generally compute relevance using a combination of the following:

  • The frequency of the search terms within a record (e.g. how often do search terms occur in a document – more means more relevant).
  • The ‘density’ of the search terms within a record (e.g. the number of search terms divided by the total number of words in a record).
  • How common the search terms are in the database as a whole (matching on a rare word is more important than matching on a common word).
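The three signals above can be computed with a few lines of Python. This is a simplified sketch, not the formula of any specific product: how a real engine weights and combines the signals varies, and the logarithm used for rarity is a common convention that I am assuming here.

```python
import math


def relevance_signals(term, doc_words, all_docs):
    """Compute frequency, density, and rarity of a term for one document."""
    frequency = doc_words.count(term)
    density = frequency / len(doc_words) if doc_words else 0.0
    containing = sum(1 for words in all_docs if term in words)
    # Rarity: a term found in few documents scores higher than a common one.
    rarity = math.log(len(all_docs) / containing) if containing else 0.0
    return {"frequency": frequency, "density": density, "rarity": rarity}


# Hypothetical three-document database, already split into words.
all_docs = [
    ["solar", "panel", "solar"],
    ["wind", "turbine"],
    ["solar", "farm", "report", "summary"],
]
signals = relevance_signals("solar", all_docs[0], all_docs)
```

Note that raw frequency keeps growing with document length, which hints at why very large documents can crowd the top of the results.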

This type of relevance ranking algorithm works pretty well, and when combined with Boolean search operators and parametric/fielded search, can yield excellent results. But it is far from perfect.

Problems with Relevance

In fact, the customer cited three issues they were experiencing with relevance:

  • Problem #1 - Big documents (like the thousand-page annual industry handbook) always seemed to be at the top of the search results.
  • Problem #2 - Documents where the search term was in the title field were not deemed more relevant than documents where the search term was on page 24.
  • Problem #3 - More recent documents were not deemed more relevant than much older documents on the same topic.

A general problem with indexing products is that the built-in relevancy algorithm is generally a take-it-or-leave-it proposition. It cannot be tuned or adjusted. This is a major difference between indexing products and modern search engines.

When we discussed next steps with the client, we quickly moved into a discussion of search engines, which is the subject of the next blog post.