net stuff

Correct run-in headings

I’ve recently noticed a couple of blogs that use an awkward “span” kludge to create “run-in” headings. These are both by smart guys who should know better. Instead of using structurally correct headings and paragraphs, the heading text is part of the paragraph, and is just bracketed with SPAN tags so that it can be styled differently.

CSS-2 does include a “run-in” display style that achieves exactly what these guys want, but it is not universally supported. There are a couple of possible workarounds, both of which I’ve documented. One is to float the header; the other is to style the header and the paragraph immediately following as “inline.”
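
For instance, given a heading with a (hypothetical) class of “runin” followed by a paragraph, the three approaches look roughly like this. An untested sketch; adjust the selectors to your own markup, and note that the adjacent-sibling selector in the last rule has support problems of its own:

/* The CSS-2 way, where the browser supports it */
h3.runin { display: run-in; }

/* Workaround 1: float the heading so the paragraph wraps beside it */
h3.runin { float: left; margin-right: 0.5em; }

/* Workaround 2: style the heading and the paragraph immediately following as inline */
h3.runin { display: inline; }
h3.runin + p { display: inline; }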

Nifty browser trick

I’ve only tried this in Safari, but I imagine it would work in some other browsers.

Safari allows you to set a custom base CSS stylesheet. In fact, this is the only way to turn off link underlining in Safari. Since I prefer my links un-underlined, I had already set one up. Simply create a text file, call it “mystyles.css” (or whatever) and drop it in ~/Library/Safari. Put the appropriate CSS in the file, quit Safari, and restart. For example, to turn off underlined links, I used the following:

a:link { text-decoration: none; }
a:active { text-decoration: none; }
a:visited { text-decoration: none; }
a:hover { text-decoration: underline; }

It occurred to me that I could use the often-ignored attribute-matching selector capability of CSS to create a primitive ad blocker. Banner ads are normally 468 x 60 pixels. Using CSS, it is possible to select images that have declared height and width values, and style them as invisible. Here’s how:

img[width="468"][height="60"] {visibility: hidden;}

You can add variations on this with different dimensions for the tall sidebar ads one occasionally sees, use it with different tags, and so on. For example, the New York Times hides gigantic sidebar ads inside an IFRAME, and uses javascript to indirectly load an image (actually, I think it’s a flash animation). This makes the image hard to block if you have javascript turned on, but you can just block the IFRAME instead:

iframe[width="352"][height="852"] {visibility: hidden;}

I’d be eager to hear any other uses for this trick. It would be nice if CSS allowed partial matches, so that we could match on, say, *[href="*doubleclick.net*"]. As far as I know, this isn’t possible.
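
For what it’s worth, the CSS3 Selectors draft does propose substring matching on attribute values, which would allow something like the rule below; whether any browser honors it yet is another question.

/* Per the CSS3 Selectors draft: hide images inside links off to doubleclick */
a[href*="doubleclick.net"] img { visibility: hidden; }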

Tribe: another social network

Tribe.net is yet another social network, still in beta. Unlike Friendster or Ryze, this one seems to be an all-purpose site, for helping people find each other based on interest or proximity, for whatever purpose they want. Nice interface. I’m signed up, just for fun. Still too soon to say how it will evolve.

Spam in my name

I previously hypothesized that we’d eventually see viruses/trojan horses used to relay spam, and later reported that it was, in fact, happening. Now it is happening in my name.

There are plenty of Outlook viruses that infect computer A, mine the address book, and then send out infectious e-mail to parties B, C, D, and E, while pretending to be someone else from the address book, making it much more difficult to trace the problem back to the infected computer and fix it. It would be simple for one of these virus writers to substitute spam for infectious e-mail (and probably add in hooks for updating the spam message remotely).

I have just received a bounce message for a piece of spam that purports to come from me, and was apparently sent to an invalid address. It is also interesting to note that the entire message text is base64-encoded, which no doubt helps it slip past spam filters.
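
Decoding it to have a look is trivial, of course. A couple of lines of Python will do it, assuming the encoded body has been saved off to a file:

import base64

# b64decode quietly skips the line breaks mailers wrap the encoding with
with open("body.txt") as f:
    print(base64.b64decode(f.read()).decode("utf-8", errors="replace"))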

Needless to say, I am chagrined. For those who care, I have posted the raw text of the bounce message (e-mail addresses changed to protect the innocent).

Meanwhile, in related news, Spamotomy looks like a good clearinghouse of information on spam.

Transfer

The crossroads.net domain-name registration was about to expire, so on a tip from Prentiss, I transferred the registration from the spooks at Network Solutions to Go Daddy. A stupid name, but for $7.75 instead of $35, I’ll put up with it.

Anything involving domain-name changes is always fraught with the potential for problems, and I was concerned when I saw what appeared to be a glitch in the making, so I called the Go Daddy domain-support line. A human being answered the phone on the first ring. Go back and re-read that sentence, and think about it for a moment. The guy was helpful and gave me relevant information. I’m sold.

Mr Popular

Friendster seems to be growing explosively. I can almost sit there and hit the reload button in my browser, and watch the number of people in my network increase. It’s grown by about 250 in about 5 hours today. Apparently I am connected through 4 degrees to Gwen Stefani (if we can take the entry at face value), along with nine other Gwens, none of whom are the Gwen I’m seeing. It would be interesting to see a map of the connections between people. I suspect that some people are major nodal points (one guy lists 676 friends, which is kind of unbelievable). I am apparently 4 degrees removed from Jack Black, by way of some guy who lists 881 friends. Zoinks!

No surprise that it’s already spawned Fiendster and Enemyster as parodies.

Spam report 2

Over the past 7 days, I’ve received 506 pieces of spam. Make that 511; a few more arrived while I was writing this. Of those, SpamAssassin correctly tagged about 450, roughly a 90% hit rate, with no false positives that I could see. Interestingly, Mail.app’s internal junk-filtering rules gave me three false positives: one was mailing-list mail with a SpamAssassin score of -9, and two were PayPal notices, one of which had a SpamAssassin score of -98! Interesting to note how disparate the two filters’ judgments are.

Spam report

Over the past eight days, I have received 397 pieces of spam. 328 (about 83%) were flagged by SpamAssassin and dropped in my spam-box before I ever saw them; one of these was arguably not spam (it was bulk commercial e-mail that I didn’t particularly want, but I have bought stuff from the sender before, so they had obtained my e-mail address legitimately). Only about ten messages had subject lines that might fool me into thinking they weren’t spam.

I don’t have exact numbers, but spam accounted for well over half the total e-mail I received in this period–possibly over three-quarters.

Social networks

There’s been a lot of interest lately in social software. A related phenomenon is the way the Internet can make social networks explicit.

I like playing around with this. I recently created a FOAF file (see my badge-zone). And there’s a brilliant “FOAF explorer” (where you can see I really need to flesh mine out).

One problem with FOAF is that it’s nerdy, and while I think it’s a good approach, not everyone will bother putting FOAF files on their websites (oh wait–not everyone even has a website). Friendster answers that–it approximates FOAF’s functionality, but lets the user sign in and point to friends rather than post a file with arcane formatting. It would be nifty if Friendster could read FOAF files, and conversely, if Friendster had an interface for feeding information into FOAF files.

None of this is particularly new. Six degrees did roughly the same thing as Friendster back in 1995, I think. But the Internet is big enough that network effects make the idea more viable. It’s also interesting trolling through Friendster–so far, the only friends I’ve found in there are part of my fire-freak circle of friends, so all the same faces keep popping up. It would be interesting to find someone from a different circle there and be the point of intersection between circles.

Later: Seems that Ben Hammersley had the same idea.

Natural keywords and categories

Adam Kalsey has done some fine work on creating lists of related entries for Movable Type based on the contents of your blogs.

Not to undermine it, but this still doesn’t go far enough towards discovering natural relations between entries, and won’t work unless we write in a restricted style with a restricted vocabulary–that goes against the grain of blogging, which is personal and spontaneous. If I mention Donald Rumsfeld in one blog entry and the Secretary of Defense in another, clearly they’re related (although the person with that title can change, making that equation more complicated). How can this be made to work?

The first problem is separating potential keywords from “noise” words. A first-order effort would be to have a canned list of noise words, and filter those out–this would be a simple, fast process. A second-order effort would be to filter out any words that are used very frequently by the blogger–this would be much slower, and perhaps should be handled asynchronously (the results of this could be used to refine the first-order noise-word list to speed things up in general).
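
A rough sketch of both passes in Python; the stop-word list and the frequency cutoff are placeholders, of course:

import re
from collections import Counter

# First-order: a canned noise-word list (a real one would be much longer)
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def candidate_keywords(entry_text, all_entries, cutoff=0.005):
    # Second-order: filter out anything this blogger uses too frequently
    counts = Counter(w for e in all_entries for w in words(e))
    total = sum(counts.values()) or 1
    return {w for w in words(entry_text)
            if w not in STOP_WORDS and counts[w] / total < cutoff}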

The two Big Bens of Blogistan (Trott and Hammersley) have worked out the ingenious “more like this from others” feature. This has the germ of something interesting: using an outside reference.

Something like the Open Directory already represents a pretty extensive hierarchical library of keywords. To take my prior example, the first hit for a search on “Donald Rumsfeld” at dmoz is found in the category “Regional > North America > United States > Government > Executive Branch > Departments > Defense”. That gives you some excellent keywords to take home: the most specific are at the end, and “Defense” is a very useful keyword to equate to Rumsfeld. (It also seems possible that if a candidate keyword generates scattered search results, it might not be a good keyword, and should be added to the noise-word list.) It gets better: that category contains subcategories with very useful terms (Armed Forces, Defense Agencies, Department of Defense Field Activities, Intelligence, Joint Chiefs of Staff, Office of the Secretary of Defense, Unified Combatant Commands) as well as related categories (Science > Technology > Military Science; Regional > North America > United States > Government > Military > Installations > Pentagon). These could be used to generate a high-quality list of “alternative keywords.”

So the process of finding and using alternative keywords would go something like this (a rough code sketch follows the list):

  1. Create potential keyword list
    1. Winnow out noise words
    2. Winnow out other frequently-used words
  2. Search dmoz or other directory for keywords
  3. Collect categories for search results, as well as subcategories and related categories
  4. Assemble new list of alternative keywords
  5. Search blog corpus for alternative keywords, create links when found
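
Building on the sketch above, the whole pipeline might look like this in Python. The directory interface is entirely invented: dmoz has no search API that I know of, so directory.search and the category methods are stand-ins for whatever scraping or RDF-dump munging would really be required.

def alternative_keywords(keywords, directory):
    # Steps 2-4: expand each candidate through the directory's category tree
    expanded = set(keywords)
    for kw in keywords:
        for cat in directory.search(kw):                   # hypothetical call
            expanded.update(cat.path_terms())              # e.g. "Defense"
            for neighbor in cat.subcategories() + cat.related_categories():
                expanded.update(neighbor.path_terms())     # e.g. "Joint Chiefs of Staff"
    return expanded

def related_entries(entry, all_entries, directory):
    # Step 1: winnow; steps 2-4: expand; step 5: link entries sharing a keyword
    mine = alternative_keywords(candidate_keywords(entry, all_entries), directory)
    # Naive and slow, but that's what we've got computers for
    return [e for e in all_entries
            if e is not entry and mine & candidate_keywords(e, all_entries)]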

The process of constructing a list of alternative keywords clearly involves a fair amount of work–but that’s what we’ve got computers for. And it obviously won’t always be perfect–but that’s what we’ve got brains for.

Trojan-horse spamming

In June, I speculated that this would happen. Now it actually has: spammers are distributing trojan horses that infect other computers to relay spam for them.

This is clearly illegal, of course. But spammers have long been exploiting “open relays”–unsecured mail-servers–a practice that should be considered illegal as well.

Perhaps my most recursive metablog post to date

Via the Movable Type support board, I learned of the blogideas site. “When you don’t know what to Blog about.”

Now, there are lots of different forms that blogs can take, and they’re all valid, I suppose, but if I don’t have anything to write about in my blog, I don’t write anything. I don’t feel some obligation to tap away, uninspired, for the dubious benefit of my adoring public. Some suggested topics from the site: “Experiment: how many fishsticks fit in your mouth?”; “An ode to your couch”; “Why do dogs sniff each other in the ass?”

If you’re reduced to writing about that, better take a day off.

Also, in the interest of completeness, that discussion exposed me to the memes list–which is really more a themes list (organizational conceits like “Friday Five”).

Macintouch gets with the program, sort of

Macintouch, the best Mac news site, is finally publishing an RSS feed. Excerpts only, which is fair: they need to get people to come to the site, since they’re advertising-supported.

A bigger problem is that Macintouch never had permalinks for individual stories, and yet an RSS feed requires a permalink, or something like it. The solution Macintouch is using appears to be very ad-hoc: there are indeed anchor tags for individual stories, but they don’t seem intended to scale–they read like <a name="itools7">. This is adequate for one day’s worth of news, but not for providing a permanent ID. I’m guessing that Macintouch has been running on a homebrewed content-management system that wasn’t designed with permalinks in mind; now Ric Ford is locked in, and retrofitting newfangled contraptions like permalinks is turning out to be hard.

How not to fix a problem

Blogshares seems like a fun idea that has attracted quite a bit of attention. Unfortunately, it includes a ticker–a mere frill–coded in Javascript that causes every Mac browser I’ve tried to lock up. If you use a Mac, the only way to visit the site is to disable Javascript first. This has been mentioned repeatedly on the discussion board there.

The brilliant solution? A “ticker on/ticker off” switch. Implemented in Javascript. The only way to turn off the ticker is to turn Javascript on. If you do that, the browser locks up, making it impossible to turn off the ticker.

I think Joseph Heller wrote a book about this type of situation.

CSS rant

CSS is great, but it’s too hard. When even the guy who wrote the books on CSS admits that it has “made the veins in my forehead throb”, you know there’s a problem.

All the stuff I’ve done on the web for the past year or so has been in CSS, and I’ve been gradually re-working older stuff to bring it into the new millennium. So I’ve drunk the kool-aid.

One thing that would make it at least a little easier would be some notion of hierarchy within stylesheet documents. I’m not talking about the cascade effect (which is really neat). I’m talking about a way to organize a single stylesheet document.

As it is, there’s no enforced, preferred, suggested, inherent, or obvious way to structure a CSS document. You’ve just got “selector soup.” This makes a complicated stylesheet hard to read and hard to write. CSS offers contextual selectors, which are an obvious candidate for hierarchical organization. Rather than having

div#main {margin:2px;}
div#main h2 {color:red;}
div#main h3 {color:blue;}

It would seem eminently sensible to have

div#main {margin:2px;} [
    h2 {color:red;}
    h3 {color:blue;}
    ]

or something like that.

With a structure like this, it would be possible to distill an HTML document down to its tags, and generate a structured list of selectors, to create a stylesheet skeleton.
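
Here’s a quick sketch of that distillation in Python, using the standard library’s HTML parser. Indentation stands in for whatever nesting syntax CSS might adopt, and it naively assumes every tag gets closed:

from html.parser import HTMLParser

class SkeletonBuilder(HTMLParser):
    # Print each newly-seen tag path as an indented stylesheet skeleton
    def __init__(self):
        super().__init__()
        self.stack = []
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        selector = tag + ("#" + attrs["id"] if attrs.get("id") else "")
        path = tuple(self.stack) + (selector,)
        if path not in self.seen:
            self.seen.add(path)
            print("    " * len(self.stack) + selector + " {}")
        self.stack.append(selector)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

SkeletonBuilder().feed("<div id='main'><h2>A</h2><h3>B</h3><h3>C</h3></div>")
# prints:
# div#main {}
#     h2 {}
#     h3 {}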

The other bear, for me, is the layout of major page elements. The whole box model is pretty powerful, but it is unintuitive and there are some things it just can’t do. Imagine a page laid out like this:

header:left   | header:right
main content  | navbar
footer:left   | footer:right

Near as I can tell, this is almost impossible using straight CSS. It might be possible if the header and footer areas are fixed height, probably meaning they contain mostly graphics. It would probably require a lot of extraneous DIV tags. DIV tags are fine up to a point, but nesting a bunch of DIV tags just to get a page to lay out correctly goes against the spirit of structured HTML. Might as well use a table-hack layout after all.
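
For the record, here’s roughly what the fixed-height version might look like. Untested, the div ids are my own, and every one of these rules is doing something fragile:

/* Assumes fixed-height header and footer bands, each split into two divs */
#header-left  { float: left;  width: 50%; height: 60px; }
#header-right { float: right; width: 50%; height: 60px; }
#navbar       { float: right; clear: both; width: 10em; }
#main         { margin-right: 11em; }
#footer-left  { float: left;  clear: both; width: 50%; height: 40px; }
#footer-right { float: right; width: 50%; height: 40px; }
/* ...and if the navbar outgrows the main column, things get ugly */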

It seems like something that was cooked up to be elegant on a theoretical level without much regard for the kinds of layouts people might actually want to achieve. The layout of this page was achieved through some code that I consider inelegant and brittle.

More on fighting spam

I recently wrote about some thoughts on combating spam that would involve cash micropayments for some e-mail. I wasn’t entirely happy with that idea, and I’ve refined it a little.

My idea, as it currently stands, is this:

Everyone who receives e-mail would maintain a blacklist (of pestilential senders) and a whitelist (of good senders). Anyone not on either of these lists is said to be on the “graylist.” Everyone who sends e-mail would be required to put a small amount of money–perhaps one dollar–into an escrow account.

When Alice sends Bob a piece of mail, before Alice’s mailserver actually delivers the message, it checks to see whether she is on Bob’s blacklist or whitelist.

If Alice is on neither–that is, she’s on the recipient’s graylist–a “hold” is placed on one cent in her escrow account by her mailserver, and the message is delivered. When Bob receives her e-mail, he can judge whether it is legitimate or not. If it is legit, the hold on that penny is released and Alice is added to Bob’s whitelist. If not, the penny is deducted from her account (perhaps paid to Bob, his ISP, a charity, or some combination) and she goes on his blacklist. (This process could be simplified a bit so that responding to a message automatically whitelists the sender.)

If she is on the whitelist, the message is delivered without involving the escrow account at all.

If she is on the blacklist, the message is not delivered and one penny is automatically deducted from her escrow account.
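
In Python, the delivery decision on the receiving end looks something like this. It’s a toy: the escrow “account” is a bare number, and the hold/release bookkeeping and all the mailserver plumbing are waved away.

PENNY = 1  # one cent at stake per graylisted or blacklisted message

def handle_incoming(msg, sender, recipient, escrow):
    if sender in recipient["whitelist"]:
        recipient["inbox"].append(msg)          # delivered; escrow never touched
    elif sender in recipient["blacklist"]:
        escrow[sender] -= PENNY                 # not delivered; automatic deduction
    else:                                       # graylist: deliver, then settle
        recipient["inbox"].append(msg)
        if recipient["judge"](msg):
            recipient["whitelist"].add(sender)  # legit: the held penny is released
        else:
            escrow[sender] -= PENNY             # spam: the penny is forfeited
            recipient["blacklist"].add(sender)

bob = {"whitelist": set(), "blacklist": set(), "inbox": [],
       "judge": lambda m: "MAKE MONEY FAST" not in m}
escrow = {"alice": 100}  # Alice's dollar, in cents
handle_incoming("Lunch tomorrow?", "alice", bob, escrow)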

The problem with this is that it adds quite a lot of overhead to graylist and blacklist correspondences, and some overhead to whitelist ones. The Internet hasn’t had any successful micropayment systems yet.

What about using a non-cash system? In theory, this system could work using valueless certificates. One would apply to a “trusted certificate-issuing authority” for a bundle of, say, 100 certificates. The process could be designed to thwart scripts that would simulate human action, and one could be prevented from receiving more than 100 certificates/month (for example). The authority would deposit 100 signed and encrypted certificates in your “escrow account,” and in all other respects, the system would function similarly. When Bob’s mailserver (or perhaps Bob’s own mail software) receives Alice’s message, it checks the certificate against the certificate-issuing authority; if it is valid, the message is presented to Bob (who can still choose to blacklist it, if he wants). If not, the message is bounced.

Note that this system is pretty similar to the authenticated e-mail that some people would like us to use anyhow. This would also involve a similar amount of processing overhead. And in fact, the cash-based system would need to use pretty much the same system of signed and encrypted certificates.

There are some broader differences between the cash-based and cashless systems. If everyone can receive a bounty for identifying spam, even a tiny one-penny bounty, more people are likely to actually do it, making spam less tenable. (A system like this would also make it attractive for technically savvy users to create “honeypot” e-mail addresses to attract spam, automatically blacklist it, and collect lots of pennies.) And the idea of a certificate-issuing authority is problematic, as they would, in effect, be gatekeepers: if you can’t get your certificates, you cannot communicate by e-mail. The authority could charge money for issuing certificates, or otherwise abuse this power. If these certificates were issued automatically and solely as digital representations of pennies, the system should be less prone to abuse. There would need to be more than one authority.

So if I’m saying that the whole validation process should be added on without a surcharge being imposed (which I am), how would this be funded? All e-mail host operators should pitch in to fund the system. I have no idea what the numbers on this would look like, but I suspect that they would save more thanks to reduced traffic than the system would require them to contribute.

Side note: there’s now an official Anti-Spam Research Group. Maybe I should try to get this idea in front of them.

[Later] Interesting to note that Robert Cringely came up with a somewhat similar micropayment system for fighting spam.

[Later still] David Nunez pointed out this article on a “spam tax.”

Lazyweb: help me pick a CMS

I’m trying to find a content-management system, mostly for running a limited-access discussion forum. I’ve already looked at Drupal, which is pretty nice, but I have not yet figured out how to make it fully support Japanese, which is a make-or-break feature.

I can get limited success writing Japanese in Drupal, but (in the best-case scenario so far) the characters wind up being escaped to numeric Unicode entities (日 becomes &#26085;, for example), which are not editable after being posted, and apparently are not searchable.

I also looked at the Slash engine, but A) the installation instructions assume you are root, assume you are more technically adept than I am, and are very sketchy all around, and B) it has the same problems with Japanese.

I suspect the Japanese-handling problems in both Drupal and Slash could be fixed with some minor coding changes, but I don’t know where to start. Messing with the template charsets doesn’t do it.

Other desirable features:
Threaded message structure
Moderation/karma points a la Slashdot
Flexible permissions setup
Straightforward templating for admin

None of the other CMSs or web-BBSs I’ve looked at so far have forum threading.

[Later] Seems that I can almost get Japanese working right in Drupal (why do I think of towering drag-queens every time I say that?) after all: there was a configuration issue that was causing the Japanese to be escaped. I turned that off, and now everything is close to hunky-dory, except for one thing: searching on Japanese terms always returns no results.

[Still later] Turns out a Japanese Drupal user has developed a patch. This fixes some of the problems I was having (and had already been groping towards solving). It still doesn’t enable Japanese searching, though.

Drupal

Drupal is a general-purpose content-management system (CMS) for running news-and-discussion sites like (to use the most obvious example) Slashdot, though perhaps not such busy ones. I’d been considering toying with something like this for some time, and finally got around to installing it today.

The only hitch in something like this is that you need to set up a MySQL database, which can be intimidating for non-nerds, and configure a file so the software can find that database (which took me a while to get right, mostly because of my inability to follow instructions). But for the most part, installation is a snap. After that comes configuration. Drupal, like many of its ilk, is endlessly configurable, and has numerous add-on modules available. It gets a little tricky because it is based on some rather abstract and non-obvious mental models, and the docs are not as clear as they should be. But after some messing around, I started getting it to do what I wanted it to.

I’ve been using Movable Type for some time now, and it has become my point of reference. MT is a very sophisticated tool for one kind of task: blogging. Individual content management. MT is narrow but deep. Beyond that, it can be used for wider purposes thanks to its flexibility, but it becomes increasingly difficult the farther you get from straight blogging.

Drupal, by contrast, is relatively shallow but wide. Blogging is just one module in it, although its blogs are not as sophisticated as MT’s. And in some ways, the customization threshold is higher. MT has its own HTML-like language of tags, so if you can write HTML, you can create your own templates in MT. With Drupal, it seems that you need to know some PHP in order to do more than shuffle around pre-made modules.

Is Google too big?

Google’s recent buyout of Pyra set the whole blogosphere abuzz, but it also seems to have prodded some people to wonder whether we should worry about Google being too important, too big, too valuable, too secretive.

At Austin’s blogger meetup the other night, Prentiss asserted that private projects like Google and archive.org were too important to leave in private hands (archive.org is basically a hobby of Brewster Kahle’s). He suggested that the Library of Congress should be given funding to develop and maintain resources equivalent to these.

Citing privacy concerns, the BBC’s Bill Thompson suggests that Google is “a public utility that must be regulated in the public interest,” and that the British Government should establish an “Office of Search Engines” (or to use his Orwellian term, OfSearch).

Both points have some merit, although both have weaknesses. Regulating a search engine strikes me as potentially heavy-handed. And if privacy is an issue, I’d be especially unwilling to see the U.S. Government in its current form operating a popular and all-encompassing search engine–that could easily be a back door to Poindexter’s Total Information Awareness.

So what’s the solution? I’m not sure. But I think that if Google (or to be exact, the services it offers) is too important to leave to Google, it’s too important to leave to any one entity. Better to seed the technology widely. The open-source community might be able to come to the rescue, if it could develop and disseminate smart search-engine code, and license it under strict terms that permitted a nonprofit organization to inspect licensees’ books to make sure they weren’t misusing data they captured, etc. Result-rigging could be caught by setting up a meta-search engine that compared results from different installations of the same engine.

[Later] So how do you come up with a good search engine? Obviously part of the problem is having the bandwidth to crawl the Net frequently and thoroughly. Part of it no doubt comes down to efficient indexing. But perhaps the trickiest is results ranking. I was speculating on ways to refine the matching algorithm, and perhaps a tournament approach would be the way to go.

Here’s what I mean: Develop a bunch of matching algorithms. By default, site users would just see whichever is the preferred algorithm du jour. But willing users could see a “tournament view” where results from two different engines were presented side-by-side. They could then express their preference as to which set of results seemed most useful. With N algorithms, there would be N²−N possible tournament combinations (ordered pairs, since the same two algorithms can swap sides). With a large user base, it shouldn’t be hard to generate meaningful results. This could also be part of the feedback loop in a genetic-algorithm approach, although I don’t understand genetic algorithms well enough to really develop that angle any further.
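
A toy version of the bookkeeping, with random stand-ins for both the algorithms and the users’ votes; a real system would want a proper paired-comparison model, but even raw win counts would be informative:

import itertools
import random
from collections import Counter

algorithms = ["algo_a", "algo_b", "algo_c"]   # stand-ins for real matching algorithms
wins = Counter()

# Ordered pairs, so left/right placement counts: N^2 - N matchups in all
for left, right in itertools.permutations(algorithms, 2):
    preferred = random.choice([left, right])  # stand-in for a willing user's vote
    wins[preferred] += 1

print(wins.most_common())                     # crude leaderboard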