Natural keywords and categories

Adam Kalsey has done some fine work on creating lists of related entries for Movable Type based on the contents of your blogs.

Not to undermine it, but this still doesn’t go far enough towards discovering natural relations between entries, and won’t work unless we write in a restricted style with a restricted vocabulary–that goes against the grain of blogging, which is personal and spontaneous. If I mention Donald Rumsfeld in one blog entry and the Secretary of Defense in another, clearly they’re related (although the person with that title can change, making that equation more complicated). How can this be made to work?

The first problem is separating potential keywords from “noise” words. A first-order effort would be to have a canned list of noise words, and filter those out–this would be a simple, fast process. A second-order effort would be to filter out any words the blogger uses very frequently–this would be much slower, and perhaps should be handled asynchronously (the results could be used to refine the first-order noise-word list to speed things up in general).

The two Big Bens of Blogistan (Trott and Hammersley) have worked out the ingenious “more like this from others” technique. This has the germ of something interesting: using an outside reference.

Something like the Open Directory already represents a pretty extensive hierarchical library of keywords. To take my prior example, the first hit for a search on “Donald Rumsfeld” at dmoz is found in the category “Regional > North America > United States > Government > Executive Branch > Departments > Defense”. That gives you some excellent keywords to take home. (It also seems possible that if a candidate keyword generates scattered search results, it might not be a good keyword, and should be added to the noise-word list.) The most specific are at the end, and “Defense” is a very useful keyword to equate to Rumsfeld. It gets better: that category contains subcategories with very useful terms (Armed Forces, Defense Agencies, Department of Defense Field Activities, Intelligence, Joint Chiefs of Staff, Office of the Secretary of Defense, Unified Combatant Commands) as well as related categories (Science > Technology > Military Science; Regional > North America > United States > Government > Military > Installations > Pentagon). These could be used to generate a high-quality list of “alternative keywords.”

So the process of finding and using alternate keywords would go something like this:

  1. Create potential keyword list
    1. Winnow out noise words
    2. Winnow out other frequently-used words
  2. Search dmoz or other directory for keywords
  3. Collect categories for search results, as well as subcategories and related categories
  4. Assemble new list of alternative keywords
  5. Search blog corpus for alternate keywords, create links when found
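The steps above could be sketched in code. This is a rough illustration, not a working system: `search_directory` stands in for a real dmoz-style lookup, and the noise list and frequency cutoff are placeholder values.

```python
import re
from collections import Counter

# Placeholder first-order noise-word list (step 1a).
NOISE_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "that"}

def candidate_keywords(entry, corpus_freq, freq_cutoff=0.01):
    """Steps 1a-1b: tokenize an entry, then winnow out noise words and
    words the blogger uses very frequently across the whole corpus."""
    words = re.findall(r"[a-z']+", entry.lower())
    total = sum(corpus_freq.values()) or 1
    return [w for w in set(words)
            if w not in NOISE_WORDS
            and corpus_freq[w] / total < freq_cutoff]

def alternative_keywords(keywords, search_directory):
    """Steps 2-4: look each keyword up in a dmoz-style directory and
    collect the category terms from its results.  `search_directory`
    is a stand-in for a real directory API and must be supplied."""
    alternatives = set()
    for kw in keywords:
        for category_path in search_directory(kw):
            alternatives.update(category_path.split(" > "))
    return alternatives - NOISE_WORDS

def related_entries(alternatives, blog_corpus):
    """Step 5: find other entries mentioning any alternative keyword."""
    return [title for title, text in blog_corpus.items()
            if any(kw.lower() in text.lower() for kw in alternatives)]
```

The interesting part is that the relation between “Rumsfeld” and “Defense” never appears in the blog itself; it comes entirely from the outside reference.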

The process of constructing a list of alternative keywords clearly involves a fair amount of work–but that’s what we’ve got computers for. And it obviously won’t always be perfect–but that’s what we’ve got brains for.

Trojan-horse spamming

In June, I speculated that this would happen. Now it actually has: spammers are distributing trojan horses that infect other computers to relay spam for them.

This is clearly illegal, of course. But spammers have long been exploiting “open relays”–unsecured mail servers–which should be considered illegal as well.

Perhaps my most recursive metablog post to date

Via the Movable Type support board, I learned of the blogideas site. “When you don’t know what to Blog about.”

Now, there are lots of different forms that blogs can take, and they’re all valid, I suppose, but if I don’t have anything to write about in my blog, I don’t write anything. I don’t feel some obligation to tap away, uninspired, for the dubious benefit of my adoring public. Some suggested topics from the site: “Experiment: how many fishsticks fit in your mouth?”; “An ode to your couch”; “Why do dogs sniff each other in the ass?”

If you’re reduced to writing about that, better take a day off.

Also, in the interest of completeness, that discussion exposed me to the memes list–which is really more a themes list (organizational conceits like “Friday Five”).

Macintouch gets with the program, sort of

Macintouch, the best Mac news site, is finally publishing an RSS feed. Excerpts only, which is fair: they need to get people to come to the site, since they’re advertising-supported.

A bigger problem is that Macintouch never had permalinks for individual stories, and yet an RSS feed requires a permalink, or something like it. The solution Macintouch is using appears to be quite ad hoc: there are indeed anchor tags for individual stories, but they don’t seem intended to scale: they read like <a name="itools7">. This is adequate for one day’s worth of news, but not for providing a permanent ID. I’m guessing that Macintouch has been running on a homebrewed content-management system that wasn’t designed with permalinks in mind; now Ric Ford is locked in, and retrofitting newfangled contraptions like permalinks is turning out to be hard.

How not to fix a problem

Blogshares seems like a fun idea that has attracted quite a bit of attention. Unfortunately, it includes a ticker–a mere frill–coded in JavaScript that causes all Mac browsers (at least, all that I’ve tried) to lock up. If you use a Mac, the only way to visit the site is to disable JavaScript first. This has been mentioned repeatedly on the discussion board there.

The brilliant solution? A “ticker on/ticker off” switch. Implemented in JavaScript. The only way to turn off the ticker is to turn JavaScript on. If you do that, the browser locks up, making it impossible to turn off the ticker.

I think Joseph Heller wrote a book about this type of situation.

CSS rant

CSS is great, but it’s too hard. When even the guy who wrote the books on CSS admits that it has “made the veins in my forehead throb”, you know there’s a problem.

All the stuff I’ve done on the web for the past year or so has been in CSS, and I’ve been gradually re-working older stuff to bring it into the new millennium. So I’ve drunk the kool-aid.

One thing that would make it at least a little easier would be a hierarchy to stylesheet documents. I’m not talking about the cascade effect (which is really neat). I’m talking about a way to organize a single stylesheet document.

As it is, there’s no enforced, preferred, suggested, inherent, or obvious way to structure a CSS document. You’ve just got “selector soup.” This makes a complicated stylesheet hard to read and hard to write. CSS offers contextual selectors, which would be an obvious candidate for hierarchical organization. Rather than having

div#main {margin:2px;}
div#main h2 {color:red;}
div#main h3 {color:blue;}

It would seem eminently sensible to have

div#main {margin:2px;} [
    h2 {color:red;}
    h3 {color:blue;}
    ]

or something like that.

With a structure like this, it would be possible to distill an HTML document down to its tags, and generate a structured list of selectors, to create a stylesheet skeleton.
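As a rough sketch of that skeleton-generation idea, using Python’s standard HTMLParser (real-world markup, with void tags and unclosed elements, would need more care):

```python
from html.parser import HTMLParser

class SkeletonBuilder(HTMLParser):
    """Walk an HTML document and emit an indented selector skeleton,
    one line per opening tag, mirroring the nesting proposed above."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        selector = tag
        if "id" in attr:
            selector += "#" + attr["id"]   # keep id selectors like div#main
        self.lines.append("    " * self.depth + selector + " {}")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

def skeleton(html):
    """Distill an HTML fragment down to a structured selector list."""
    b = SkeletonBuilder()
    b.feed(html)
    return "\n".join(b.lines)
```

Feeding it a fragment like `<div id="main"><h2>…</h2><h3>…</h3></div>` yields an indented `div#main` / `h2` / `h3` skeleton ready to be filled with rules.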

The other bear, for me, is the layout of major page elements. The whole box model is pretty powerful, but it is unintuitive and there are some things it just can’t do. Imagine a page laid out like this:

header:left    | header:right
main content   | navbar
footer:left    | footer:right

Near as I can tell, this is almost impossible using straight CSS. It might be possible if the header and footer areas are fixed height, probably meaning they contain mostly graphics. It would probably require a lot of extraneous DIV tags. DIV tags are fine up to a point, but nesting a bunch of DIV tags just to get a page to lay out correctly goes against the spirit of structured HTML. Might as well use a table-hack layout after all.

It seems like something that was cooked up to be elegant on a theoretical level without much regard for the kinds of layouts people might actually want to achieve. The layout of this page was achieved through some code that I consider inelegant and brittle.

More on fighting spam

I recently wrote about some thoughts to combat spam, that would involve cash micropayments for some e-mail. I wasn’t entirely happy with that idea, and I’ve refined it a little.

My idea, as it currently stands, is this:

Everyone who receives e-mail would maintain a blacklist (of pestilential senders) and a whitelist (of good senders). Anyone not on either of these lists is said to be on the “graylist.” Everyone who sends e-mail would be required to put a small amount of money–perhaps one dollar–into an escrow account.

When Alice sends Bob a piece of mail, before Alice’s mailserver actually delivers the message, it checks to see whether she is on Bob’s blacklist or whitelist.

If Alice is on neither (that is, she’s on the recipient’s graylist), her mailserver places a “hold” on one cent in her escrow account, and the message is delivered. When Bob receives her e-mail, he can judge whether it is legitimate or not. If it is, the hold on that penny is released and Alice is added to Bob’s whitelist. If not, the penny is deducted from her account (perhaps paid to Bob, his ISP, a charity, or some combination) and she goes on his blacklist. (This process could be simplified a bit so that responding to a message automatically whitelists the sender.)

If she is on the whitelist, the message is delivered without involving the escrow account at all.

If she is on the blacklist, the message is not delivered and one penny is automatically deducted from her escrow account.
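The routing rules above can be sketched in a few lines. Everything here is illustrative: the `Mailbox` class, the one-cent `HOLD`, and the in-memory escrow dict are stand-ins for real mail and payment infrastructure.

```python
from dataclasses import dataclass, field

HOLD = 1  # one cent, in cents

@dataclass
class Mailbox:
    """A recipient's state: the two lists plus an inbox."""
    whitelist: set = field(default_factory=set)
    blacklist: set = field(default_factory=set)
    inbox: list = field(default_factory=list)

def deliver(sender, recipient, escrow, message):
    """Route one message; `escrow` maps sender -> balance in cents."""
    if sender in recipient.blacklist:
        escrow[sender] -= HOLD           # automatic one-cent penalty
        return "rejected"
    if sender in recipient.whitelist:
        recipient.inbox.append(message)  # no escrow involvement at all
        return "delivered"
    # Graylist: place a hold, deliver, and let the recipient judge.
    escrow[sender] -= HOLD
    recipient.inbox.append(message)
    return "held"

def judge(sender, recipient, escrow, legit):
    """The recipient's verdict on a graylisted message."""
    if legit:
        escrow[sender] += HOLD           # release the hold
        recipient.whitelist.add(sender)
    else:
        recipient.blacklist.add(sender)  # the penny stays deducted
```

Note that the escrow account is only ever touched for graylist and blacklist mail; established correspondents cost nothing.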

The problem with this is that it adds quite a lot of overhead to graylist and blacklist correspondence, and some overhead to whitelist correspondence. And the Internet hasn’t had any successful micropayment systems yet.

What about using a non-cash system? In theory, this system could work using valueless certificates. One would apply to a “trusted certificate-issuing authority” for a bundle of, say, 100 certificates. The process could be designed to thwart scripts that would simulate human action, and one could be prevented from receiving more than 100 certificates/month (for example). The authority would deposit 100 signed and encrypted certificates in your “escrow account,” and in all other respects, the system would function similarly. When Bob’s mailserver (or perhaps Bob’s own mail software) receives Alice’s message, it checks the certificate against the certificate-issuing authority; if it is valid, the message is presented to Bob (who can still choose to blacklist it, if he wants). If not, the message is bounced.
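A minimal sketch of the certificate idea, assuming the authority signs random tokens with a secret HMAC key and mailservers check back with it. A real system would need expiry, per-account quotas, and a shared spent-certificate list; all names here are placeholders.

```python
import hmac, hashlib, secrets

AUTHORITY_KEY = secrets.token_bytes(32)  # held only by the issuing authority

def issue_certificates(account, n=100):
    """Authority side: mint n single-use certificates for an account
    (the account would be recorded against a monthly quota)."""
    certs = []
    for _ in range(n):
        token = secrets.token_hex(16)
        sig = hmac.new(AUTHORITY_KEY, token.encode(), hashlib.sha256).hexdigest()
        certs.append((token, sig))
    return certs

SPENT = set()  # stand-in for the authority's record of used certificates

def validate(token, sig):
    """Mailserver side: verify the signature with the authority and
    refuse a certificate that has already been spent."""
    expected = hmac.new(AUTHORITY_KEY, token.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(sig, expected) and token not in SPENT:
        SPENT.add(token)
        return True
    return False
```

The single-use check is what makes a certificate behave like a penny: once attached to a message that gets blacklisted, it is gone.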

Note that this system is pretty similar to the authenticated e-mail that some people would like us to use anyhow. This would also involve a similar amount of processing overhead. And in fact, the cash-based system would need to use pretty much the same system of signed and encrypted certificates.

There are some broader differences between the cash-based and cashless systems. If everyone can receive a bounty for identifying spam, even a tiny one-penny bounty, more people are likely to actually do it, making spam less tenable. (A system like this would also make it attractive for technically savvy users to create “honeypot” e-mail addresses to attract spam, automatically blacklist it, and collect lots of pennies.) And the idea of a certificate-issuing authority is problematic, as they would, in effect, be gatekeepers: if you can’t get your certificates, you cannot communicate by e-mail. The authority could charge money for issuing certificates, or otherwise abuse this power. If these certificates were issued automatically and solely as digital representations of pennies, the system should be less prone to abuse. There would need to be more than one authority.

So if I’m saying that the whole validation process should be added on without a surcharge being imposed (which I am), how would this be funded? All e-mail host operators should pitch in to fund the system. I have no idea what the numbers on this would look like, but I suspect that they would save more thanks to reduced traffic than the system would require them to contribute.

Side note: there’s now an official Anti-Spam Research Group. Maybe I should try to get this idea in front of them.

[Later] Interesting to note that Robert Cringely came up with a somewhat similar micropayment system for fighting spam.

[Later still] David Nunez pointed out this article on a “spam tax.”

Lazyweb: help me pick a CMS

I’m trying to find a content-management system, mostly for running a limited-access discussion forum. I’ve already looked at Drupal, which is pretty nice, but I have not yet figured out how to make it fully support Japanese, which is a make-or-break feature.

I can get limited success writing Japanese in Drupal, but (in the best-case scenario so far) the characters wind up being escaped to numeric Unicode entities, which are not editable after being posted, and apparently are not searchable.
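To illustrate what that escaping does, and why it breaks searching, here is a toy version of the entity conversion (a guess at what the CMS is doing, not Drupal’s actual code):

```python
def to_numeric_entities(text):
    """Escape non-ASCII characters to HTML numeric character
    references, the way a misconfigured CMS might store them."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

stored = to_numeric_entities("日本語")  # what lands in the database
query = "日本"                          # what the user types into search
```

A naive substring search for `query` against `stored` finds nothing, because the database no longer contains the characters themselves, only their code-point numbers.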

I also looked at the Slash engine, but A) the installation instructions assume you are root, assume you are more technically adept than I am, and are very sketchy all around, and B) it has the same problems with Japanese.

I suspect the Japanese-handling problems in both Drupal and Slash could be fixed with some minor coding changes, but I don’t know where to start. Messing with the template charsets doesn’t do it.

Other desirable features:
Threaded message structure
Moderation/karma points a la Slashdot
Flexible permissions setup
Straightforward templating for admin

None of the other CMSs or web-BBSs I’ve looked at so far have forum threading.

[Later] Seems that I can almost get Japanese working right in Drupal (why do I think of towering drag-queens every time I say that?) after all: there was a configuration issue that was causing the Japanese to be escaped. I turned that off, and now everything is close to hunky-dory, except for one thing: searching on Japanese terms always returns no results.

[Still later] Turns out a Japanese Drupal user has developed a patch. This fixes some of the problems I was having (and was already groping toward solving). It still doesn’t enable Japanese searching, though.

Drupal

Drupal is a general-purpose content-management system (CMS) for running news-and-discussion sites like (to use the most obvious example) Slashdot, though perhaps not such busy ones. I’d been considering toying with something like this for some time, and finally got around to installing it today.

The only hitch in something like this is that you need to set up a MySQL database, which can be intimidating for non-nerds, and configure a file to find that database (which took me a while to get right, mostly because of my inability to follow instructions). But for the most part, installation is a snap. After that comes configuration. Drupal, like many of its ilk, is endlessly configurable, and has numerous add-on modules available. It gets a little tricky because it is based on some rather abstract and non-obvious mental models, and the docs are not as clear as they should be. But after some messing around, I started getting it to do what I wanted it to.

I’ve been using Movable Type for some time now, and that’s become my point of reference. MT is a very sophisticated tool for one kind of task: blogging. Individual content management. MT is narrow but deep. Beyond that, it can be used for wider purposes thanks to its flexibility, but it becomes increasingly difficult to keep up the farther you get from straight blogging.

Drupal, by contrast, is relatively shallow but wide. Blogging is just one module in it, although its blogs are not as sophisticated as MT’s. And in some ways, the customization threshold is higher. MT has its own HTML-like language of tags, so if you can write HTML, you can create your own templates in MT. With Drupal, it seems that you need to know some PHP in order to do more than shuffle around pre-made modules.

Is Google too big?

Google’s recent buyout of Pyra set the whole blogosphere abuzz, but it also seems to have prodded some people to wonder whether we should worry about Google being too important, too big, too valuable, too secretive.

At Austin’s blogger meetup the other night, Prentiss asserted that private projects like Google and archive.org were too important to leave in private hands (archive.org is basically a hobby of Brewster Kahle’s). He suggested that the Library of Congress should be given funding to develop and maintain equivalent resources.

Citing privacy concerns, the BBC’s Bill Thompson suggests that Google is “a public utility that must be regulated in the public interest,” and that the British Government should establish an “Office of Search Engines” (or to use his Orwellian term, OfSearch).

Both points have some merit, although both have weaknesses. Regulating a search engine strikes me as potentially heavy-handed. And if privacy is an issue, I’d be especially unwilling to see the U.S. Government in its current form operating a popular and all-encompassing search engine–that could easily be a back door to Poindexter’s Total Information Awareness.

So what’s the solution? I’m not sure. But I think that if Google (or to be exact, the services it offers) is too important to leave to Google, it’s too important to leave to any one entity. Better to seed the technology widely. The open-source community might be able to come to the rescue, if it could develop and disseminate smart search-engine code and license it under strict terms–for example, permitting a nonprofit organization to audit licensees’ books to make sure they weren’t misusing the data they captured. Result-rigging could be caught by setting up a meta-search engine that compared results from different installations of the same engine.

[Later] So how do you come up with a good search engine? Obviously part of the problem is having the bandwidth to crawl the Net frequently and thoroughly. Part of it no doubt comes down to efficient indexing. But perhaps the trickiest is results ranking. I was speculating on ways to refine the matching algorithm, and perhaps a tournament approach would be the way to go.

Here’s what I mean: Develop a bunch of matching algorithms. By default, site users would just see whichever is the preferred algorithm du jour. But willing users could see a “tournament view” where results from two different engines were presented side-by-side. They could then express their preference as to which set of results seemed most useful. With N algorithms, there would be N²-N possible tournament combinations. With a large user base, it shouldn’t be hard to generate meaningful results. This could also be part of the feedback loop in a genetic-algorithm approach, although I don’t understand genetic algorithms well enough to really develop that angle any further.
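A sketch of the pairing and tallying (simple win counts; a real ranking would want something more robust, like an Elo-style rating):

```python
from itertools import permutations
from collections import Counter

def tournament_pairs(algorithms):
    """All ordered side-by-side pairings: N algorithms give N^2 - N."""
    return list(permutations(algorithms, 2))

def tally(votes):
    """Rank algorithms by raw win count from (winner, loser) votes.
    Algorithms that never win don't appear in the ranking."""
    wins = Counter(winner for winner, _ in votes)
    return [algo for algo, _ in wins.most_common()]
```

Each user vote is just a (winner, loser) pair from one side-by-side comparison; the aggregate picks the algorithm du jour.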
