Search: Time vs Relevancy

Something that seems to be missing from searches is time. Search engines base their results on relevancy, which makes finding newer methods of doing something difficult.

For example, I will search for how to do something in Linux, like configuring a RAID array. There is a ton of information on this, but the most relevant hits you get are about configuring raidtools. Mdadm has replaced raidtools as the tool of choice, but since raidtools has been around so long, and there are so many old pages that link to it, it scores the highest. I’m sure there are millions of other examples of this on other topics too.

Google has an advanced search where you can specify pages modified in the last x months, but it doesn’t really help much. One of the pages returned when I limit the search to the last 3 months has a revision history typed out at the top of it, and it shows the last update in 2003. MSN has a “Search builder” function, where (among other options) you can specify how important it is for a page to be recently updated, popular, and a relevant match. This still doesn’t bring up really relevant results. Yahoo is the only one of the three that actually does return an mdadm-related result as #1 when you search within the last 3 months. (I should point out that both Google and Yahoo return this same page as #5 and #6, respectively, but my point here is that someone who knows nothing about it is probably going to pick #1 or #2, and implement RAID with the older raidtools method.)

MSN search-tuning functions

All three have a news search engine that returns date-based results for recent news items, but this is pretty limited in that it’s only searching news sites. Linux software RAID developments aren’t exactly breaking news on CNN, so the news search isn’t the place to find this stuff.

I think one problem with the date-based results as they are now is the way they are likely determining the date of the page. If they are using the Last-Modified header (part of the HTTP specification), then that would explain a lot of the problems. It’s quite possible for the Last-Modified header to change without the content changing: the content may be dynamically created, moved to another server with FTP, copied without preserving the date/time, or served by a misconfigured web server. What they should be doing is comparing the contents of the page to the contents the last time they indexed it. It wouldn’t be totally accurate (depending on how often they index the page), but it would at least give a real representation of when the contents were changed. They would have to ignore dynamic things like ads and current date displays (via pattern matching), but it wouldn’t be that complicated.
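To make that idea a bit more concrete, here is a rough Python sketch of how a crawler might track the real change date of a page. Everything in it is my own assumption rather than anything the search engines are known to do: the two volatile-content patterns, the `content_fingerprint` and `update_last_changed` names, and the shape of the stored record are all hypothetical.

```python
import hashlib
import re
from datetime import date

# Hypothetical patterns for content that changes on every fetch.
# A real crawler would need a much longer, smarter list.
VOLATILE_PATTERNS = [
    re.compile(r'<div class="ad">.*?</div>', re.DOTALL | re.IGNORECASE),
    re.compile(r"Current date: \d{4}-\d{2}-\d{2}"),
]


def content_fingerprint(html):
    """Hash the page after stripping volatile parts and extra whitespace."""
    for pattern in VOLATILE_PATTERNS:
        html = pattern.sub("", html)
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def update_last_changed(record, html, crawl_date):
    """Return the index record for a page after a new crawl.

    `record` is None on the first crawl, otherwise a dict holding the
    previous fingerprint and the date the content last changed.
    """
    fingerprint = content_fingerprint(html)
    if record is None or record["fingerprint"] != fingerprint:
        # Content really changed (or is new): stamp it with this crawl date.
        return {"fingerprint": fingerprint, "last_changed": crawl_date}
    # Only ads or date displays differ: keep the old change date.
    return record


if __name__ == "__main__":
    january = ('<p>Use raidtools.</p><div class="ad">Buy disks</div>'
               '<p>Current date: 2005-01-10</p>')
    march = ('<p>Use raidtools.</p><div class="ad">Buy RAM</div>'
             '<p>Current date: 2005-03-02</p>')
    record = update_last_changed(None, january, date(2005, 1, 10))
    record = update_last_changed(record, march, date(2005, 3, 2))
    print(record["last_changed"])
```

As noted above, the date is only as precise as the crawl interval: the page is stamped with the date of the first crawl that saw the new content, not the moment the author edited it.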

Hopefully it’s just a matter of time…

On the topic of search engines, I came across a few new Google features while researching for this entry that I didn’t know about: