Suggesting search terms: how is it done? – Part 1

Search term suggestions (also called autosuggest or autocomplete) are pretty commonplace these days. Google probably did it first with Google Suggest, and other search engines followed suit. You don’t see this feature on many STM publisher sites, though PubMed recently introduced its Auto Suggest feature, which is based on popular PubMed search terms that match what you type.

This is a pretty basic feature that one would expect to see on all search engines. So how is this done?

You could always do a basic dictionary lookup, but I doubt users want to scroll through a list of terms to find what they want. You want the system to guess what you are typing and quickly bubble the relevant term up to the top. That is a bit more complex.

I chanced upon some information on this site, and below are some excerpts from it.

What algorithm will be used to produce the suggestions list?

Ultimately, the suggestion algorithm needs to find the most likely completions.

A few rules of thumb:

  • In general, historical data is the best predictor. Log what users are searching for, and use that as a basis for suggestions. So when the user types “A”, suggest the most frequent responses beginning with “A”.
  • Instead of completing “A” with the most common “A” query all users have entered, consider personalisation. Return the most common “A” queries for this particular user. The only problem here is lack of data, so as a more sophisticated approach, use a collaborative filtering algorithm to provide suggestions based on similar users.
  • Recent history is a more pertinent guide, whether or not you’re personalising the results. In some cases, it makes sense only to provide recent queries. In other cases, consider weighting recent results more heavily.
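The rules of thumb above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the log format (user id, query, timestamp in days), the function name `suggest`, and the exponential half-life weighting are all assumptions made for the example.

```python
from collections import defaultdict


def suggest(log, prefix, now, half_life_days=30.0, user=None, limit=5):
    """Rank logged queries that start with `prefix`.

    `log` is an iterable of (user_id, query, day) tuples (hypothetical format).
    Each occurrence is weighted by 0.5 ** (age / half_life_days), so recent
    queries count for more. Passing `user` restricts the log to that user,
    which is the simplest form of personalisation.
    """
    prefix = prefix.lower()
    scores = defaultdict(float)
    for user_id, query, day in log:
        if user is not None and user_id != user:
            continue
        q = query.lower()
        if q.startswith(prefix):
            scores[q] += 0.5 ** ((now - day) / half_life_days)
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [q for q, _ in ranked[:limit]]


# Made-up log: "asthma" is the most frequent query overall, but it is old.
log = [
    ("u1", "asthma", 0),
    ("u1", "asthma", 10),
    ("u3", "asthma", 20),
    ("u2", "aspirin", 95),
    ("u2", "aspirin", 98),
    ("u2", "apoptosis", 99),
]

print(suggest(log, "a", now=100))              # recency-weighted, all users
print(suggest(log, "as", now=100, user="u1"))  # personalised to user u1
```

With a very large `half_life_days` the weighting effectively disappears and the ranking falls back to plain frequency. Collaborative filtering based on similar users is not covered by this sketch.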

You might be assuming the results come from the server, but that doesn’t have to be the case. If you want to reduce server queries, you can have the browser script scour whatever information it has to produce a kind of Guesstimate suggestion list. The browser is perfectly capable of suggesting recent queries, and might also have enough business logic to guess at other reasonable suggestions.

The strategy for finding a suggestion is related to that of a search engine, so you might want to investigate the search algorithm theory. Personalisation and collaborative filtering also play a part.

Below is information from an article in AMIA Annual Symposium Proceedings Archive titled UMLSKS SUGGEST: An Auto-complete Feature for the UMLSKS interface using AJAX – Authors: Anantha Bangalore, Allen Browne, and Guy Divita

As part of the UMLSKS logging system we store the details of every query made by a user. These details include the query term, term matching method (exact match, normalized string index, word index etc.) and a flag indicating whether the query was successful or not. We extracted a list of all queries made to the UMLSKS from 2003 to 2005. From this list we eliminated all queries made by NLM users, as most of those queries were test queries. We further eliminated all queries where the query terms were CUI’s. E.g. query term=C0001175. We also eliminated all queries for which no results were found using any of the matching methods. We then lower cased all the query terms and sorted the list alphabetically. From this list we created a new list which contained for each query term, the term frequency count and the total number of unique users who have used that query term. Since the goal of this method is to create a list of most likely suggestions for all of the users of the UMLSKS, we threw away all queries with a term frequency count of one and a user count of one. After performing these steps we were able to reduce the size of our list by about 90%. In the remaining list we needed a metric to rank each term based on frequency of usage. We used the commonly used tf-idf weight (term frequency-inverse document frequency) to compute relevancy of each query term. We substituted users in place of documents. A high weight in tf-idf indicates a high term frequency and a low frequency of usage. Since we wanted to rank terms by high frequency of usage, a lower weight term got a higher ranking. We then sorted this list alphabetically on the query term and in descending order of ranking. In order to reduce the load on the backend server we also put a restriction that a minimum of four characters have to be typed in before any suggestions are displayed. Initial feedback from users on the list of suggestions generated by this method has been very encouraging.
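For my own understanding, here is a rough Python sketch of the steps described in that excerpt. It is not the authors' code. The record format, the function names and the sample data are made up, and the excerpt does not give the exact tf-idf formula, so the weight used here (term frequency times log of total users over users of the term) is an assumption.

```python
import math
import re
from collections import defaultdict

# Identifier-style queries such as C0001175 (pattern is an assumption).
CUI_PATTERN = re.compile(r"^C\d{7}$", re.IGNORECASE)


def build_suggestions(records, internal_users=frozenset()):
    """records: iterable of (user_id, query_term, succeeded) tuples."""
    freq = defaultdict(int)    # term -> number of times it was queried
    users = defaultdict(set)   # term -> distinct users who queried it
    all_users = set()
    for user_id, term, succeeded in records:
        term = term.strip()
        if user_id in internal_users:   # drop in-house test queries
            continue
        if CUI_PATTERN.match(term):     # drop identifier lookups
            continue
        if not succeeded:               # drop queries that found nothing
            continue
        term = term.lower()
        freq[term] += 1
        users[term].add(user_id)
        all_users.add(user_id)

    ranked = []
    for term, tf in freq.items():
        n_users = len(users[term])
        if tf == 1 and n_users == 1:    # drop one-off queries
            continue
        # Assumed weight: tf-idf with users standing in for documents.
        weight = tf * math.log(len(all_users) / n_users)
        ranked.append((weight, term))
    ranked.sort()                       # lower weight ranks higher
    return [term for _, term in ranked]


def suggest(ranked_terms, typed, min_chars=4, limit=10):
    """Return ranked terms starting with `typed`, once enough is typed."""
    typed = typed.strip().lower()
    if len(typed) < min_chars:
        return []
    return [t for t in ranked_terms if t.startswith(typed)][:limit]


records = [
    ("nlm1", "test query", True),      # internal user
    ("u1", "C0001175", True),          # identifier lookup
    ("u1", "Heart Attack", True),
    ("u2", "heart attack", True),
    ("u3", "heart attack", True),
    ("u1", "heart failure", True),
    ("u1", "heart failure", True),
    ("u1", "heart failure", True),
    ("u2", "heart murmur", True),      # one-off query
    ("u3", "hearrt attack", False),    # no results
]
ranked = build_suggestions(records, internal_users={"nlm1"})
print(suggest(ranked, "hea"))    # fewer than four characters typed
print(suggest(ranked, "hear"))
```

In this toy data "heart attack" is used by every user, so it gets the lowest weight and is suggested ahead of "heart failure", which one user queried three times.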

I don’t know if these do much in the way of ‘guessing’ what you might be typing, but it’s a start, I suppose.

This process does, however, seem simple enough for any search engine that maintains basic usage stats. It also feels like a process this simplistic will always be open to abuse.

A next step would be looking at things like term replacement, synonym suggestion and so on. More on that in a future post.



ScienceDirect – Search log analysis: What it is, what’s been done, how to do it

The use of data stored in transaction logs of Web search engines, Intranets, and Web sites can provide valuable insight into understanding the information-searching process of online searchers. This understanding can enlighten information system design, interface development, and devising the information architecture for content collections. This article presents a review and foundation for conducting Web search transaction log analysis. A methodology is outlined consisting of three stages, which are collection, preparation, and analysis. The three stages of the methodology are presented in detail with discussions of goals, metrics, and processes at each stage. Critical terms in transaction log analysis for Web searching are defined. The strengths and limitations of transaction log analysis as a research method are presented. An application to log client-side interactions that supplements transaction logs is reported on, and the application is made available for use by the research community. Suggestions are provided on ways to leverage the strengths of, while addressing the limitations of, transaction log analysis for Web-searching research. Finally, a complete flat text transaction log from a commercial search engine is available as supplementary material with this manuscript.

via ScienceDirect – Library & Information Science Research : Search log analysis: What it is, what’s been done, how to do it.


Challenges to finding relevant scientific literature

Merely searching for literature on Google or any other search engine does not cut it, especially for scientific literature. We have to help users get to the information they are really looking for. It’s challenging. How do you provide tools without getting too noisy and distracting?

The paper referenced below does not answer all the questions, but it does provide some insight into (or a review of) the kinds of tools available to researchers in biology today and what we need to do to improve them.

In the words of the authors:

This review shows the promise of literature data mining and the need for challenge evaluations. It shows how current language processing approaches can be successfully used to extract and organize information from the literature. It also illustrates the diversity of applications and evaluation metrics. By defining several biologically important challenge problems and by providing the associated infrastructure, we can accelerate progress in this field. This will allow us to compare approaches, to scale up the technology to tackle important problems, and to learn what works and what areas still need work.

This comment perhaps sums up the challenge information providers face and compromises that tend to be made –

…, it is unclear how to compare the different approaches; it is also unclear how well a system has to perform to be useful. To compare technical approaches, different systems must be applied to the same domain via common evaluations. To know how good a system has to be, prototypes must be given to biologists in user-centered evaluations. As learned from previous evaluations in the information retrieval community (Hersh et al., 2001), it is hard to extrapolate from results of batch experiments to predict complex issues of utility and user acceptance of interactive tools. However, even imperfect tools are useful, if they give improved functionality at low cost.

Accomplishments and challenges in literature data mining for biology. Hirschman et al., Bioinformatics 18(12): 1553.

Michael Nielsen » Is scientific publishing about to be disrupted?

This is a very insightful entry by Michael Nielsen. Due to my bias, I had to skip straight to Part II before coming back to read Part I. One thing I might question: automatic spelling correction, relevancy ranking, alerting services and so on are indeed offered on Scopus. Whether they are good (I believe they are competitive) is for the users to judge, and Michael would qualify as one. I haven’t heard from any of the users I talked to that any of these features are poor, but again, that could be my bias.

A great search engine for science: ISI’s Web of Knowledge, Elsevier’s Scopus and Google Scholar are remarkable tools, but there’s still huge scope to extend and improve scientific search engines [6]. With a few exceptions, they don’t do even basic things like automatic spelling correction, good relevancy ranking of papers (preferably personalized), automated translation, or decent alerting services. They certainly don’t do more advanced things, like providing social features, or strong automated tools for data mining. Why not have a public API [7] so people can build their own applications to extract value out of the scientific literature? Imagine using techniques from machine learning to automatically identify underappreciated papers, or to identify emerging areas of study.

via Michael Nielsen » Is scientific publishing about to be disrupted?.

Read the article in its entirety. It is very insightful, and as always there are several pointers to take away.

Google Books and the man

Many people love to read books online. I’m not much of an online book reader. I prefer my books to be made of paper. But maybe that’s because I haven’t really tried. If and when Kindle comes out with support for color, I’ll probably jump in. Anyway, I digress.

To date I have not paid much attention to Google Books. That is, until I got this article in my feed today.

Google Books Just Got Better: Better Search Within Books, Embedding, & More.

I feel like I’ve come late to the party, but probably just in time for the fun to begin. If your experience of reading books online has been getting a PDF version and scrolling through the pages, or perhaps downloading a chapter at a time, then prepare to be amazed.

Here’s what I found most intriguing –

First the left hand pane.

A book's left hand pane


Three very good and important features

1. An overview page: It’s not clear to me where they pulled all this from, but it looks like there is a brief abstract about the book, keywords and phrases (I don’t believe these are author supplied, so they must have pulled out key terms/topics), reviews (that’s ok) and a slew of other information.

2. Search in this book: It’s not just a simple search feature. The results, as stated in the commentary linked above,

appear in their context in a list of short snippets from the text

Good gracious almighty… how can you not be swept off your feet by that?

3. I also love the Related books feature, but I haven’t tested it enough to see if the suggestions are truly relevant.

There are other features, like page turners, that don’t necessarily turn me on, but it’s these simple things that add sweetness to the user experience. The fact that they care enough about the user to add that little feature will bring me back to Google Books.

And apparently you can embed the book in your blog. I tried but haven’t been able to get it to work. I’ll work on that.

Sure, Amazon does a mighty fine job as well. If you compare what Amazon and Google do for the same book, you’ll probably find both search-within-book features are nice. I prefer the Google version though, where we get small paragraph snippets within the results page instead of the entire page. That’s just me. Related books appeared to be similar, except that Amazon shows 5 instead of 3 by default (hardly a differentiation).

All said, they are both kind of similar. I feel like I prefer Google’s layout, maybe because I’m just familiar with it or maybe because it feels cleaner… I just can’t put my finger on it.

Both are definitely waaaaay better than some of the interfaces I have seen from more traditional publishers.

OK… I’m hooked. When’s Kindle color coming out?

[Update June 20, 2009: For a detailed summary of the latest Google Books feature – read this post from Brandon Badger on the Google Book Search blog]

At Google’s Searchology event, executives give search ‘state of the union’

This is brilliant. I can see this kind of feature being applied to scientific research, for example when browsing topics, or when viewing an article, to show it visually against its references and the articles that cite it.

Marissa Mayer and her team are introducing new features to Google’s search results.

Via a tool called “search options,” users can now quickly “slice and dice” their search results in a variety of new ways.

On the results page, you can click “show options,” for example, on a search of “solar ovens.” You can then quickly filter the results to see video entries, entries from discussion forums and even user reviews that have undergone “sentiment analysis” — that is, whether the reviewer liked the product (a solar oven) or not.

Also included on the search options is a feature called “wonder wheel,” where Google will draw a simple topic diagram that connects your search query to similar topics. For “solar oven,” you might be given the option to search “how solar ovens work,” or “homemade solar ovens.”

from –

Is Search Broken? Presentation from Endeca

This presentation is actually several months old but worth going over.