davv | Validity of stemming

Search engines often reduce words so that they match variant forms. For instance, a search engine might reduce both "computer" and "computers" to "computer" so that if you search for the latter word, you also get documents that contain the former. This is called stemming.

However, if the words don't map to the same thing, this can cause problems. I'm going to use an example here which those that know me will find amusing. A naive algorithm might reduce "jesses" (as in the thing one keeps on birds) to "jess" (i.e. Jessica). If it does so, then there's no way of searching for "jesses" proper -- unless the search engine permits the user to override the stemming for particular words.

Why are words reduced in the first place? It is to aid the user, so that the engine "does what he wants". So, ideally, we would want the words to be reduced (as "computer" and "computers") when they refer to the same thing, but not when they don't. That would mean that when people search for "computer", they get documents involving computers, whether or not these documents speak of computers in the singular or plural, but when they search for "jesses", they get only falconry results.

But here's a problem again. The search engine doesn't know "meaning". It knows only a bag of words. So how might we measure whether the terms refer to the same kind of thing? Fortunately for us, there has been lots of research on that subject. Simple methods just use the documents as sets of words and apply set distance measures to them, complex methods take meaning into account.

Thus we don't need to know what the distance measure is. All we have to know is that it's something that, when given some document x and y, determines how similar they are. So if we want to find out the similarity of "computers" to "computer", we just take the similarity of every document where "computer" appears to every document where "computers" appear. For something like computer/computers, the two terms talk about the same thing, so the documents will be, on average, quite similar. But for something like jess/jesses, one will have lots of Jessica-related words, and the second will have falconry words, and so the average distance will be large.

Thus, to make stemming that doesn't get in the way of the user, pick a threshold distance. If the mean distance between documents, as mentioned above, is below that threshold, then alias both to the same word. If it isn't, then don't!

Of course, if your engine is Google, this is going to be pretty hard to do because you have billions of documents. And there may be rock-paper-scissors oddities, too, where A is alike B and B is alike C, but A isn't alike C. And how to find the threshold? I think you'd just have to try different ones and see if it appears to make sense.

S	M	T	W	T	F	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Journal of Dw

Creatures, code, and corvids, oh my!

Validity of stemming

Validity of stemming

Profile

March 2018

Most Popular Tags

Style Credit

Expand Cut Tags