CloudCourse: An Enterprise Application in the Cloud – Love it

It would be silly to say i’m jealous but its like they know what’s on my mind. To paraphrase a known commercial… This was my idea 🙂

Provide educators a personalized way to prepare course material for their class and students. I think this is now possible with CloudCourse: An Enterprise Application in the Cloud

via CloudCourse: An Enterprise Application in the Cloud – Google Open Source Blog.

Congrats Google!!

What Surprising Number Will Change Your Business? Harvard Business Review

Business isn’t just a battle of products and services. It’s a battle of ideas about priorities, opportunities, values, and value. Ultimately, those competing ideas get reduced to competing numbers. So, if you can arrive at numbers that matter, you’ve got a better chance at winning the battle of ideas.

via What Surprising Number Will Change Your Business? – Bill Taylor – Harvard Business Review.

Suggesting search terms: how is it done? – Part1

Search term suggestions or  autosuggest or autocomplete is pretty common place these days. Google probably did it first with Google Suggest. Other search engines followed suit. You don’t see this feature on many STM publisher sites. PubMed recently introduced the PubMed Auto Suggest feature. That’s based on popular search terms on PubMed that match your search term.

This is a pretty basic feature that one would expect to see on all search engines. So how is this done?

You could always do a basic dictionary look up but I doubt users want to scroll through a list of terms to find what they want. You want the system to sort of guess what you are typing and quickly bubble up the relevant term to the top. That would be a bit more complex.

I chanced upon some information on the this site – and below are some excerpts from there

What algorithm will be used to produce the suggestions list?

Ultimately, the suggestion algorithm needs to find the most likely completions.

A few rules of thumb:

  • In general, historical data is the best predictor. Log what users are searching for, and use that as a basis for suggestions. So when the user types “A”, suggest the most frequent responses beginning with “A”.
  • Instead of completing “A” with the most common “A” query all users have entered, consider personalisation. Return the most common “A” queries for this particular user. The only problem here is lack of data, so as a more sophisticated approach, use a collaborative filtering algorithm to provide suggestions based on similar users.
  • Recent history is a more pertinent guide, whether or not you’re personalising the results. In some cases, it makes sense only to provide recent queries. In other cases, consider weighting recent results more heavily.

You might be assuming the results come from the server, but that doesn’t have to be the case. If you want to reduce server queries, you can have the browser script scour whatever information it has to produce a kind of Guesstimate suggestion list. The browser is perfectly capable of suggesting recent queries, and might also have enough business logic to guess at other reasonable suggestions.

The strategy for finding a suggestion is related to that of a search engine, so you might want to investigate the search algorithm theory. Personalisation and collaborative filtering also play a part.

Below is information from an article in AMIA Annual Symposium Proceedings Archive titled UMLSKS SUGGEST: An Auto-complete Feature for the UMLSKS interface using AJAX – Authors: Anantha Bangalore, Allen Browne, and Guy Divita

As part of the UMLSKS logging system we store the details of every query made by a user. These details include the query term, term matching method (exact match, normalized string index, word index etc.) and a flag indicating whether the query was successful or not. We extracted a list of all queries made to the UMLSKS from 2003 to 2005. From this list we eliminated all queries made by NLM users, as most of those queries were test queries. We further eliminated all queries where the query terms were CUI’s. E.g. query term=C0001175. We also eliminated all queries for which no results were found using any of the matching methods. We then lower cased all the query terms and sorted the list alphabetically. From this list we created a new list which contained for each query term, the term frequency count and the total number of unique users who have used that query term. Since the goal of this method is to create a list of most likely suggestions for all of the users of the UMLSKS, we threw away all queries with a term frequency count of one and a user count of one. After performing these steps we were able to reduce the size of our list by about 90%. In the remaining list we needed a metric to rank each term based on frequency of usage. We used the commonly used tf-idf weight (term frequency-inverse document frequency) to compute relevancy of each query term. We substituted users in place of documents. A high weight in tf-idf indicates a high term frequency and a low frequency of usage. Since we wanted to rank terms by high frequency of usage, a lower weight term got a higher ranking. We then sorted this list alphabetically on the query term and in descending order of ranking. In order to reduce the load on the backend server we also put a restriction that a minimum of four characters have to be typed in before any suggestions are displayed. Initial feedback from users on the list of suggestions generated by this method has been very encouraging.

I don’t know if these do much in the way of ‘guessing’ what you might be typing in but its a start I suppose.

This process, does however, seem simple enough for any search engine that maintains basic usage stats. It does also feel that a simplistic process such as this can always be open for abuse.

A subsequent next step would be looking for things like term replacements, synonym suggestion and so on. More on that in a future post.

[tweetmeme source=”shiv17674”

OCR – Reading into images – Google

In scientific publishing, customers have been asking us to provide features that will read images like graphs and derive data from them.

This is not a revolutionary new concept. Optical character recognition has been around for ages. We use it to read from images of historic documents but why can’t we do simple applications like what Google has done?

Google uses optical character recognition, or OCR, to turn the image into words, and then uses its translation services to turn those words into a language you recognize.

via Goggles turns Android into pocket translator – Google 24/7 – Fortune Tech.