metric matching


Imagine that a user is about to give the word "Stradivarius" to a computer program but accidentally gives it, say, "Stadivarius" instead. As far as most programs are concerned, "Stadivarius" is as close to "Stradivarius" as it is to "waffle iron." Most programmers apparently assume that their users are monkeys at the keyboard; everything the user says---unless it's exactly what they want the user to say at that particular moment---is equally meaningless.

If the program had a metric defined on the space of words, however, then when the user says "Stadivarius" it would have a chance of figuring out what the user probably meant. "Stadivarius" is only one character away from "Stradivarius" (a single dropped "r"), so under any reasonable metric on the set of all words, the two would be very close to each other. If one of those words meant something to the system and the other didn't, it would be a simple matter for the program to map the unknown word to its closest match.
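
To make the idea concrete, here is a minimal sketch in Python. It assumes the metric is Levenshtein edit distance, which is only one reasonable choice among several. The names edit_distance, closest_word, KNOWN_WORDS, and the max_distance cutoff are all invented for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


def closest_word(word, vocabulary, max_distance=2):
    """Return the known word nearest to `word`, or None if even the
    nearest one is farther away than max_distance (an assumed cutoff)."""
    best = min(vocabulary, key=lambda known: edit_distance(word, known))
    return best if edit_distance(word, best) <= max_distance else None


# Hypothetical vocabulary: the words that "mean something" to the system.
KNOWN_WORDS = ["Stradivarius", "waffle iron", "violin"]

print(edit_distance("Stadivarius", "Stradivarius"))
print(closest_word("Stadivarius", KNOWN_WORDS))
```

The cutoff matters: without it, every piece of nonsense would be mapped to something, which is no better than treating every input as equally meaningless.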

Such a system has other advantages. For example, it might even notice errors the user makes regularly by seeing how close they come to what the user actually intended. Perhaps it could even start noticing patterns in the kinds of errors the user makes. Even such a simple system as this has the potential to become quite smart about what the user wants.
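
One rough way to notice regular errors is simply to count them. The sketch below assumes the program learns somehow what the user actually intended (say, because the user accepted a suggested correction). The CorrectionLog class and its min_count threshold are hypothetical.

```python
from collections import Counter


class CorrectionLog:
    """Remembers (typed, intended) pairs so that habitual slips stand out."""

    def __init__(self):
        self.counts = Counter()

    def record(self, typed, intended):
        # Only genuine mistakes are worth remembering.
        if typed != intended:
            self.counts[(typed, intended)] += 1

    def habitual(self, min_count=2):
        """Pairs seen at least min_count times, most frequent first."""
        return [pair for pair, n in self.counts.most_common() if n >= min_count]


log = CorrectionLog()
for _ in range(3):
    log.record("teh", "the")
log.record("cpoy", "copy")
print(log.habitual())
```

A program with such a log could try the user's habitual corrections first, before falling back on a general nearest-word search.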

Suppose a user is searching for copy machines and asks for "cpoy machines" instead. A smart search engine should look through the space of all pages, not only for that exact string but also for its substrings. Some pages will have more occurrences of those substrings than others. That variation imposes a metric on the space of pages, relating the pages to each other by how close they are along this one dimension. So by mapping the pages onto a metric space that reflects the relative nearness of strings, the engine can produce a set of pages that either contain the search string or contain near matches to it.
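
A minimal sketch of that kind of ranking follows. It assumes the "substrings" are two-character fragments (bigrams) and scores each page by the fraction of the query's bigrams it contains. That gives a similarity score rather than a strict metric, and the page titles and texts are made up for the example.

```python
def ngrams(text, n=2):
    """The set of all n-character substrings of text, lowercased."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def substring_score(query, page_text, n=2):
    """Fraction of the query's n-grams that occur somewhere in the page."""
    wanted = ngrams(query, n)
    if not wanted:
        return 0.0
    return len(wanted & ngrams(page_text, n)) / len(wanted)


def rank_pages(query, pages, n=2):
    """(score, title) pairs, nearest page first."""
    scored = [(substring_score(query, text, n), title)
              for title, text in pages.items()]
    return sorted(scored, reverse=True)


# Hypothetical pages, keyed by title.
PAGES = {
    "copiers": "copy machines for sale",
    "kitchen": "waffle irons and recipes",
}

for score, title in rank_pages("cpoy machines", PAGES):
    print(title, round(score, 2))
```

Transposing two letters damages only the few fragments that touch them, so the page about copy machines still shares most of its fragments with the mistyped query while the page about waffle irons shares almost none.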

The system should also have a user-updatable dictionary of people's names, company names, and other proper names. Whenever it analyses any page (no matter its source) it should look for those names, or close variants. It isn't necessary to be able to parse every conceivable name ever invented---the user is unlikely to care. But if any of the people the user is interested in is mentioned anywhere, the user probably has some interest in knowing about it, and is certainly more likely to use such names in searches than arbitrary names.
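
Here is one way such a scan might look, sketched with Python's standard difflib module. The cutoff of 0.85 is a guess that would need tuning, and find_names, the name list, and the sample page are all hypothetical.

```python
import difflib
import re


def find_names(page_text, names, cutoff=0.85):
    """Map each dictionary name to the spellings of it found on the page.

    Slides a window of as many words as the name has across the page and
    keeps windows whose similarity ratio to the name reaches the cutoff.
    """
    words = re.findall(r"[A-Za-z']+", page_text)
    found = {}
    for name in names:
        size = len(name.split())
        for i in range(len(words) - size + 1):
            candidate = " ".join(words[i:i + size])
            ratio = difflib.SequenceMatcher(
                None, candidate.lower(), name.lower()).ratio()
            if ratio >= cutoff:
                found.setdefault(name, []).append(candidate)
    return found


# The user-updatable dictionary is just a list the user can append to.
NAMES = ["Tom Clancy", "Jack Ryan"]
PAGE = "In the new novel by Tom Clancey, Jack Ryan returns."

print(find_names(PAGE, NAMES))
```

Because the scan only ever compares against names the user has listed, it never needs to recognize names in general, which is the point of keeping the dictionary small and personal.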

For example, if a user has collected a lot of Tom Clancy pages and one day does a search for Jack Ryan, that user probably wants information about Jack Ryan, the hero of Clancy's novels, not Jack Ryan the historian. And, of course, if that same user later types "Jcak Ryan," it's likely that Jack Ryan was meant. It's simply rude to force the user to state all of that context when it is already clear from the user's space of pages.
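
A crude sketch of using that context: describe each candidate meaning in a few words, then pick the one whose description overlaps most with the words in the user's collected pages. The descriptions, the sample pages, and the function names are invented for illustration, and a real system would at least ignore common words like "of" and "the."

```python
from collections import Counter
import re


def word_counts(texts):
    """How often each word appears across the user's collected pages."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def pick_sense(senses, user_pages):
    """Return the label of the sense whose description best fits the pages."""
    profile = word_counts(user_pages)

    def overlap(description):
        words = set(re.findall(r"[a-z]+", description.lower()))
        return sum(profile[w] for w in words)

    return max(senses, key=lambda label: overlap(senses[label]))


# Hypothetical collection and hypothetical sense descriptions.
USER_PAGES = ["Tom Clancy novels ranked", "Review of a Clancy thriller"]
SENSES = {
    "Jack Ryan (Clancy hero)": "hero of the Tom Clancy novels",
    "Jack Ryan (historian)": "a historian and author of history books",
}

print(pick_sense(SENSES, USER_PAGES))
```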


