Suggesting search terms: how is it done? – Part1

Search term suggestions or  autosuggest or autocomplete is pretty common place these days. Google probably did it first with Google Suggest. Other search engines followed suit. You don’t see this feature on many STM publisher sites. PubMed recently introduced the PubMed Auto Suggest feature. That’s based on popular search terms on PubMed that match your search term.

This is a pretty basic feature that one would expect to see on all search engines. So how is this done?

You could always do a basic dictionary look up but I doubt users want to scroll through a list of terms to find what they want. You want the system to sort of guess what you are typing and quickly bubble up the relevant term to the top. That would be a bit more complex.

I chanced upon some information on the this site – http://ajaxpatterns.org/Suggestion and below are some excerpts from there

What algorithm will be used to produce the suggestions list?

Ultimately, the suggestion algorithm needs to find the most likely completions.

A few rules of thumb:

  • In general, historical data is the best predictor. Log what users are searching for, and use that as a basis for suggestions. So when the user types “A”, suggest the most frequent responses beginning with “A”.
  • Instead of completing “A” with the most common “A” query all users have entered, consider personalisation. Return the most common “A” queries for this particular user. The only problem here is lack of data, so as a more sophisticated approach, use a collaborative filtering algorithm to provide suggestions based on similar users.
  • Recent history is a more pertinent guide, whether or not you’re personalising the results. In some cases, it makes sense only to provide recent queries. In other cases, consider weighting recent results more heavily.

You might be assuming the results come from the server, but that doesn’t have to be the case. If you want to reduce server queries, you can have the browser script scour whatever information it has to produce a kind of Guesstimate suggestion list. The browser is perfectly capable of suggesting recent queries, and might also have enough business logic to guess at other reasonable suggestions.

The strategy for finding a suggestion is related to that of a search engine, so you might want to investigate the search algorithm theory. Personalisation and collaborative filtering also play a part.

Below is information from an article in AMIA Annual Symposium Proceedings Archive titled UMLSKS SUGGEST: An Auto-complete Feature for the UMLSKS interface using AJAX – Authors: Anantha Bangalore, Allen Browne, and Guy Divita

As part of the UMLSKS logging system we store the details of every query made by a user. These details include the query term, term matching method (exact match, normalized string index, word index etc.) and a flag indicating whether the query was successful or not. We extracted a list of all queries made to the UMLSKS from 2003 to 2005. From this list we eliminated all queries made by NLM users, as most of those queries were test queries. We further eliminated all queries where the query terms were CUI’s. E.g. query term=C0001175. We also eliminated all queries for which no results were found using any of the matching methods. We then lower cased all the query terms and sorted the list alphabetically. From this list we created a new list which contained for each query term, the term frequency count and the total number of unique users who have used that query term. Since the goal of this method is to create a list of most likely suggestions for all of the users of the UMLSKS, we threw away all queries with a term frequency count of one and a user count of one. After performing these steps we were able to reduce the size of our list by about 90%. In the remaining list we needed a metric to rank each term based on frequency of usage. We used the commonly used tf-idf weight (term frequency-inverse document frequency) to compute relevancy of each query term. We substituted users in place of documents. A high weight in tf-idf indicates a high term frequency and a low frequency of usage. Since we wanted to rank terms by high frequency of usage, a lower weight term got a higher ranking. We then sorted this list alphabetically on the query term and in descending order of ranking. In order to reduce the load on the backend server we also put a restriction that a minimum of four characters have to be typed in before any suggestions are displayed. Initial feedback from users on the list of suggestions generated by this method has been very encouraging.

I don’t know if these do much in the way of ‘guessing’ what you might be typing in but its a start I suppose.

This process, does however, seem simple enough for any search engine that maintains basic usage stats. It does also feel that a simplistic process such as this can always be open for abuse.

A subsequent next step would be looking for things like term replacements, synonym suggestion and so on. More on that in a future post.

[tweetmeme source=”shiv17674” http://www.URL.com%5D

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: