Natural keywords and categories

Adam Kalsey has done some fine work on creating lists of related entries for Movable Type based on the contents of your blogs.

That work is valuable, but it still doesn’t go far enough towards discovering natural relations between entries. It won’t work unless we write in a restricted style with a restricted vocabulary, and that goes against the grain of blogging, which is personal and spontaneous. If I mention Donald Rumsfeld in one blog entry and the Secretary of Defense in another, the two entries are clearly related (although the person holding that title can change, which complicates the equation). How can this be made to work?

The first problem is separating potential keywords from “noise” words. A first-order effort would be to keep a canned list of noise words and filter those out; this would be a simple, fast process. A second-order effort would be to filter out any words that the blogger uses very frequently. This would be much slower, and perhaps should be handled asynchronously; its results could also be fed back into the first-order noise-word list to speed things up in general.
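
As a rough sketch of those two passes, assuming a tiny hand-made noise list and a made-up rule that a word is “very frequent” when it turns up in more than half of the blogger’s entries (the names `NOISE_WORDS`, `frequent_words`, and `max_share` are all illustrative, not part of any existing plugin):

```python
import re
from collections import Counter

# Hypothetical canned list; a real one would be far longer.
NOISE_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is",
               "on", "for", "that", "this", "it", "with"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def first_order(text, noise=NOISE_WORDS):
    """First pass: drop words on the noise list (cheap, done per entry)."""
    return [w for w in tokenize(text) if w not in noise]

def frequent_words(entries, max_share=0.5):
    """Second pass: words found in more than max_share of all entries.

    The 0.5 threshold is an assumption for illustration only.
    """
    entry_counts = Counter()
    for entry in entries:
        entry_counts.update(set(tokenize(entry)))  # count each word once per entry
    cutoff = max_share * len(entries)
    return {w for w, n in entry_counts.items() if n > cutoff}

def candidate_keywords(text, entries):
    """Apply both filters; frequent_words() could be computed offline."""
    noise = NOISE_WORDS | frequent_words(entries)
    return first_order(text, noise)
```

Since `frequent_words` has to read the whole blog, its result is the part worth caching or computing in the background, then merging into the canned list.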

The two Big Bens of Blogistan (Trott and Hammersley) have worked out the ingenious “more like this from others.” This has the germ of something interesting: using an outside reference.

Something like the Open Directory already represents a pretty extensive hierarchical library of keywords. To take my prior example, the first hit for a search on “Donald Rumsfeld” at dmoz is found in the category “Regional > North America > United States > Government > Executive Branch > Departments > Defense”. That gives you some excellent keywords to take home. The most specific terms are at the end of the path, and “Defense” is a very useful keyword to equate to Rumsfeld. (It also seems possible that if a candidate keyword generates scattered search results, it might not be a good keyword, and should be added to the noise-word list.) It gets better: that category contains subcategories with very useful terms (Armed Forces, Defense Agencies, Department of Defense Field Activities, Intelligence, Joint Chiefs of Staff, Office of the Secretary of Defense, Unified Combatant Commands) as well as related categories (Science > Technology > Military Science; Regional > North America > United States > Government > Military > Installations > Pentagon). These could be used to generate a high-quality list of “alternative keywords.”
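
To make the category idea concrete, here is a minimal sketch that treats a category as a plain string with “>” between levels, the way it is quoted above. It does not talk to dmoz at all; the function names and the `min_share` threshold for “scattered” results are assumptions for illustration.

```python
from collections import Counter

def keywords_from_category(path, separator=">"):
    """Split a category path into terms, most specific first."""
    parts = [p.strip() for p in path.split(separator) if p.strip()]
    return list(reversed(parts))

def alternative_keywords(category, subcategories=(), related=()):
    """Merge terms from the category, its subcategories, and related
    categories, keeping first-seen order and dropping duplicates."""
    terms = keywords_from_category(category) + list(subcategories)
    for path in related:
        terms += keywords_from_category(path)
    seen, result = set(), []
    for term in terms:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            result.append(term)
    return result

def looks_scattered(paths, min_share=0.5):
    """Guess that a keyword is noise when no single top-level category
    holds at least min_share of its hits (threshold is an assumption).
    A keyword with no hits at all is treated as noise too."""
    tops = Counter(p.split(">")[0].strip() for p in paths if p.strip())
    if not tops:
        return True
    return tops.most_common(1)[0][1] / sum(tops.values()) < min_share
```

Reversing the path puts “Defense” first, which matches the observation that the most specific terms sit at the end of the category.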

So the process of finding and using alternative keywords would go something like this:

  1. Create potential keyword list
    1. Winnow out noise words
    2. Winnow out other frequently-used words
  2. Search dmoz or other directory for keywords
  3. Collect categories for search results, as well as subcategories and related categories
  4. Assemble new list of alternative keywords
  5. Search blog corpus for alternative keywords, create links when found
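
The steps above could be strung together roughly as follows. This is only a sketch under some loud assumptions: the directory search is passed in as a `lookup` function (hypothetical; it should return a list of category paths for a keyword), candidates are single words, and a “link” is just the id of another entry that mentions one of the alternative keywords.

```python
import re

# Hypothetical canned noise list, kept short for the example.
NOISE_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "my"}

def candidate_keywords(text, frequent=frozenset()):
    """Step 1: single-word candidates minus noise and frequent words."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in NOISE_WORDS and w not in frequent}

def expand(keyword, lookup):
    """Steps 2-4: fetch category paths and flatten them into terms."""
    terms = set()
    for path in lookup(keyword):
        terms.update(p.strip().lower() for p in path.split(">") if p.strip())
    return terms

def related_entries(entry_id, entries, lookup, frequent=frozenset()):
    """Step 5: ids of other entries mentioning any alternative keyword."""
    alternatives = set()
    for keyword in candidate_keywords(entries[entry_id], frequent):
        alternatives |= expand(keyword, lookup)
    related = []
    for other_id, text in entries.items():
        if other_id == entry_id:
            continue
        lowered = text.lower()
        # Whole-word (or whole-phrase) match, so "defense" does not hit "defenseless".
        if any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in alternatives):
            related.append(other_id)
    return related
```

With a stand-in lookup that maps “rumsfeld” to the dmoz category quoted earlier, an entry mentioning Rumsfeld would link to one mentioning the Secretary of Defense, because “defense” is among the expanded terms. Broad terms such as “government” or “regional” would also match, so some weighting towards the specific end of the path is probably needed in practice.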

The process of constructing a list of alternative keywords clearly involves a fair amount of work–but that’s what we’ve got computers for. And it obviously won’t always be perfect–but that’s what we’ve got brains for.