April 08, 2003
RSS feeds, for real
Like I said before, polling sucks rocks. Can't stand it, and I consider it an indication of a bad or badly thought-out design. In this month alone (and it's only 7.5 days old, more or less) there've been 6273 requests for the index.rdf file, of which 4631 got 304'd, from 305 unique IP addresses. While that's flattering, it's also insane--there's no reason for all those queries. I just don't write that much stuff. And who knows how many of the page requests are from folks checking to see if anything's changed. And I can't imagine how many bits get flung across the wires just to find that Boing Boing hasn't been updated. Though, given how obsessively Cory seems to update the thing, perhaps most of them do get new stuff. Still, how much of the feed data is duplicated?
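For concreteness, here's a minimal sketch of the server-side decision behind all those 304s, assuming the client sends a conditional GET with an If-Modified-Since header (function name and feed body are illustrative, not any particular server's code):

```python
from email.utils import parsedate_to_datetime, format_datetime
from datetime import datetime, timezone

def respond(feed_mtime, if_modified_since):
    """Decide between a full 200 response and a 304, the way a
    well-behaved server answers a conditional GET on a feed."""
    if if_modified_since is not None:
        client_time = parsedate_to_datetime(if_modified_since)
        if feed_mtime <= client_time:
            return 304, None  # nothing new; headers only, no body
    return 200, "<rss>...full feed body...</rss>"

mtime = datetime(2003, 4, 8, 17, 39, tzinfo=timezone.utc)
status, body = respond(mtime, format_datetime(mtime))
# status is 304: the client's cached copy is still current
```

Even in the 304 case the client still has to ask; the server just gets to skip resending the body.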
The system as it stands also doesn't serve readers or users of aggregators that well. I know it sucks for me. I read a few blogs and comment occasionally on others. What I want, as a user, is to be pinged when what I care about changes--a posted trackback, or comment, or blog entry. Sometimes all three, but often just one. I can't do that without regularly polling (and in some cases not really at all), which is annoying too.
So, on the provider end, it sucks. On the consumer end, it sucks. Arguably sucking less than not existing at all, but still... there's much suckiness to go around, and that bugs me.
What'd be better? I think an INN/newsfeed sort of system to transport the ping information. I'm not going to deal with the actual data, since there are a whole host of legal and technical issues involved there. Going with a feed system, though, adds in intermediate transport hosts and initial upload hosts, plus potential distribution and authentication issues, which I think I have solutions for. More complex than the current system, but robust and lower-overhead, too.
I shall dump out the details in a bit, give or take some.
Posted by Dan at April 8, 2003 05:39 PM
I've heard this song and dance before. It sounds vaguely 1995ish, when it was called "push media".
Maybe there's finally a need for push. :-)
Have you taken a look at blo.gs' services? It's still polling, but at least you're only polling one source.
If you already have a server, you can use the cloud interface: blo.gs/cloud.php
There are a few alternative approaches floating around, like sending updates via the Jabber protocol.
I've written up some thoughts on the matter, and put together a proposal of one approach that could be used to solve the problem. If you are interested check out: http://www.owlfish.com/thoughts/cnws-2003-04-08.html
Aw, c'mon, ziggy, your memory is longer than that! :) Before push media we called it Usenet News, and before that we called it e-mail. Both of which work very nicely as push systems. Granted, both have problems these days, but the problems can be dealt with by setting up a good system architecture--their problems are mostly because of assumptions of trustworthiness and maintenance built into them. Alas, these days we can't count on either.
As for blo.gs, this weblog does ping it, but unfortunately (unless I missed something) it doesn't pick up comment postings or trackback additions. Dunno how well the cloud stuff scales, as nobody seems to have hooked into it.
Finally, Colin, I think you underestimate the bandwidth taken by a conditional HTTP GET by about a factor of three, if not more. In addition to the actual data passed back and forth (the GET request and the response) you need to account for TCP and IP overhead. There's the three-way handshake, ACK packets, and connection teardown. I should dig up the RFCs, but I think the total data exchanged is more in the range of 1K than 200 bytes. Still not much for a 1000 client/hour poll, but what if you're the New York Times, CNN, or BBC? Yeah, they have the bandwidth, but do they want to, and do they want to pass on their whole RSS file each time a new article gets posted? (Which is a separate issue for sites with frequent updates--their RSS files will get snagged with some frequency, but most of the data in it is redundant with each fetch, on top of the redundancy that XML imposes. But XML is a rant for another day)
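That back-of-the-envelope figure can be made explicit. A rough tally of a polling round trip that 304s, counting TCP/IP overhead along with the HTTP bytes (packet counts and header sizes here are illustrative assumptions, not measurements from the RFCs):

```python
# Rough per-poll cost of a conditional GET that gets a 304 back,
# counting TCP/IP overhead as well as the HTTP bytes themselves.
IP_TCP_HEADERS = 40   # 20-byte IP + 20-byte TCP header, no options
HANDSHAKE_PKTS = 3    # SYN, SYN-ACK, ACK
TEARDOWN_PKTS = 4     # FIN/ACK in both directions
DATA_ACK_PKTS = 2     # ACKs for the request and the response

http_request = 250    # GET + Host + If-Modified-Since + misc headers
http_response = 150   # "304 Not Modified" status line + headers

empty_pkts = HANDSHAKE_PKTS + TEARDOWN_PKTS + DATA_ACK_PKTS
overhead = empty_pkts * IP_TCP_HEADERS   # 9 packets of pure overhead
payload = (http_request + IP_TCP_HEADERS) + (http_response + IP_TCP_HEADERS)
total = overhead + payload
# total comes to 840 bytes -- closer to 1K per poll than 200 bytes
```

Under these assumptions a "free" poll costs roughly four times the HTTP response it carries, which is the factor-of-three-or-more underestimate in question.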
Well, I'm not talking about the cloud being directly integrated, but rather that it would be relatively easy to leverage a second layer of services that took the blo.gs information and managed updates.
It's still a many-to-one (or many-to-few) scenario, but many-to-many seems unfeasible with the given requirements. As for comments & trackbacks, if one wished to track such information, it's all a matter of a little programming. :)
call me crazy, but wouldn't a tuned mailing list (one that allowed only follow-ups to be posted, except by the list owner) serve this purpose perfectly well?
New stories could be broadcast with a trivial load on the weblog server, there is a rich body of software for working with e-mail and mailing lists, the format is familiar and it maps almost perfectly to the current weblog "post and comment" format.
The trick then would be to bridge comments /from/ the weblog back /to/ the mailing list. Maintaining the "reply-to:" threading wouldn't be hard, but allowing anonymous posts to the site - which would roughly correspond to letting non-members post to the list - could present awkward problems:
* how would list spam be handled? Could only authorized users post to the list, whereas anyone could post to the post-story "comments" sections?
* how would comment spam be handled? Would some form of login (which could then allow centralized banning) be required?
* if the mailing lists were open-subscription, valid e-mail addresses could easily and automatically be extracted for spam purposes (or worse, harassment and the like).
This last issue brings up a point: ideally, the identity of a poster could be preserved across sites. This can be accomplished non-authoritatively by tracking the Name and URL comment fields: if someone has the same name and is pointing towards the same site (and, usually, is in the same subnet as the origin of the bulk of his or her comments), it's reasonable to infer that the identity is consistent.
Not everyone who comments on weblogs has a personal site, though, so this isn't a satisfactory solution. E-mail addresses are far more common and useful, but they are also vulnerable to spam if they are sent out unprotected.
Here's the final question: where's the division of responsibility? Is it up to the weblog to provide secure e-mail mangling? Is it up to the user? If the user does not authenticate with the site, should said site allow the user to comment?
It sounds like, as is usually the case, the most pressing issues are social and political rather than technical.
Oh, and as for conditional gets of the RSS file, why not use additional HTTP headers? ("because that's an awful solution", of course) The last dozen or so entries could be identified in the header by their byte ranges, and then only the necessary sections could be retrieved by a polite client. Hackish but effective.
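A sketch of what that hackish scheme might look like on the client side, assuming a hypothetical header that lists each entry's id and byte range in the RSS file (the header name and index format are invented here for illustration):

```python
def range_for_new_entries(entry_index, last_seen_id):
    """Given a hypothetical X-Entry-Ranges index mapping entry ids to
    byte ranges in the RSS file, build an HTTP Range header that
    fetches only the entries this client hasn't seen yet."""
    new = [(start, end) for eid, start, end in entry_index
           if eid > last_seen_id]
    if not new:
        return None  # nothing new; a polite client skips the fetch
    spans = ",".join(f"{start}-{end}" for start, end in new)
    return f"bytes={spans}"

# Index as it might appear in a response header: (id, first byte, last byte)
index = [(101, 0, 1023), (102, 1024, 2047), (103, 2048, 4095)]
range_for_new_entries(index, last_seen_id=101)
# -> "bytes=1024-2047,2048-4095"
```

Multipart Range responses are standard HTTP, so the hackish part is only the entry-to-byte-range index, which the server would have to maintain as entries are edited.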
While e-mail has some of the characteristics I want, it lacks others, and I don't think it's a good fit, any more than I think the current situation is good, or that NNTP is a good fit. Ideas from parts of them are definitely usable, but for the whole... nah, doesn't work well.
This, I think, is going to take some thought, a bit of hashing out, and a chunk of code on my part (I am, after all, going to have to write a server node, feed library, and client library, otherwise this is just more useless blog wanking) but it seems a small and tractable problem. I'm not sure if folks'll be up for a new protocol and potentially new port in use, but...
I've been thinking about this problem for a little while, and have a kind of different take on the whole thing from anything I've read elsewhere. I planned on writing a small article about it, but haven't really found the time.
What we need is some effective method of pulling a piece of data that could be updated at any time at arbitrary intervals, preferably with a system that handles this in a scalable way. Broadcast isn't the solution here, I don't think - it works for small sites, but not for the biggest (could Boing Boing ping everyone interested quickly enough to keep the interval between pings under an hour?). So, something distributed is probably the way to go.
Why not DNS? You get to set caching information. You can set a TTL, it gets distributed by caching servers, and it's bandwidth-frugal anyway. Someone could run a piece of software that polls every second, and they will just hit their local resolver until the record expires.
All you really need is the time-last-updated, which could then trigger an RSS pull. Some simple scheme, such as a TXT record:
last-update.rss.example.com. IN TXT "2003-04-09T19:00Z"
Okay, so what are the problems here? Well, you'd need DDNS support. Plus, you need a form of URL->FQDN map for last-update information on multiple pages (or servers with many blogs). But none of these problems are intractable.
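The economics of the resolver-cache argument can be sketched with a toy TTL cache (the class and the clock plumbing here are illustrative, not any resolver's actual code):

```python
import time

class TTLCache:
    """Toy resolver cache: clients can poll every second, but the
    authoritative lookup only happens once per TTL window."""
    def __init__(self, ttl, lookup, clock=time.monotonic):
        self.ttl, self.lookup, self.clock = ttl, lookup, clock
        self.value, self.expires = None, -1.0
        self.authoritative_queries = 0

    def get(self, name):
        now = self.clock()
        if now >= self.expires:
            self.value = self.lookup(name)   # hits the real server
            self.authoritative_queries += 1
            self.expires = now + self.ttl
        return self.value                    # otherwise served locally

fake_now = [0.0]
cache = TTLCache(ttl=60, lookup=lambda name: "2003-04-09T19:00Z",
                 clock=lambda: fake_now[0])
for _ in range(120):              # two minutes of once-a-second polling
    cache.get("last-update.rss.example.com")
    fake_now[0] += 1.0
# authoritative_queries is 2: one lookup per 60-second TTL window
```

The publisher tunes the TTL to trade freshness against load: 120 polls collapse into 2 authoritative queries, and everyone else behind the same resolver shares those.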
DNS is still pull. The user won't know that there's an update until a query for it is made. Local query, sure, but a query nonetheless.
I'm definitely not proposing a single broadcast host for each blog or set of blogs, as there's no way that'll scale. That's one of the things I want to take from Usenet News--the multiple loosely connected server concept. If done right, and it is done pretty right now, you get batching, caching, and propagation across a complex and messy graph of nodes, which is what we really want. Well, what I really want, at least... :)
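The Usenet-style propagation can be sketched in a few lines: each node forwards a ping to its peers exactly once, deduplicating on message id, so updates flood a messy, loosely connected graph without looping (the class and message format here are a toy illustration, not the eventual protocol):

```python
class RelayNode:
    """Sketch of a Usenet-style flooding relay: nodes forward a ping
    to their peers once, deduplicating on message id so pings
    propagate across a loosely connected graph without loops."""
    def __init__(self, name):
        self.name = name
        self.peers = []
        self.seen = set()

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return                # already relayed; drop to kill loops
        self.seen.add(msg_id)
        for peer in self.peers:
            peer.receive(msg_id, payload)

a, b, c = RelayNode("a"), RelayNode("b"), RelayNode("c")
a.peers = [b, c]
b.peers = [a, c]
c.peers = [a, b]                  # fully meshed, cycles included
a.receive("blog-post-42", "entry posted at example.org")
# every node ends up seeing the ping exactly once despite the cycles
```

The seen-set is the same trick NNTP uses with Message-IDs; batching and aging-out would sit on top of this skeleton.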
What about something like an SMS gateway relay? A blog could send an update message to an SMS gateway which would then broadcast out to a list of phones subscribed to the updates. Trying to architect a system that would get around the limitations of the current set-up would take a bit of time...why not use infrastructure that is already in place and is convenient for almost anyone with a cell phone, especially if they have one equipped with a browser?
I'm not much for RSS aggregators either as most of them are cumbersome and clumsy. A blog newsfeed could be interesting to try though.
An SMS gateway relay would work just fine, as would an e-mail gateway relay or pager gateway relay. (Though I'd really hope that pagers subscribed to feeds would be for really, really important feeds...) A leaf node on the relay network could bridge to any sort of system.
I know that I, for one, would use an e-mail bridge, as it'd be nice to get mail I could deal with at a later date if a post I was watching on another site got a comment or trackback posted, the way I've got things set here to mail me on comment posting.
I don't mean this to sound like I'm doggedly adhering to my first idea, but I'm curious: which aspects of the functionality that you want are not provided by a mailing-list back-end? Which specific issues would your solution address?
I agree that a loosely-connected and messy node graph is probably the best way for this kind of system to be organized. The focus should be on the data, not on the transmission - moving the bits is best left to back-ends that could bridge to HTML+HTTP, SMS, SMTP, NNTP or the format du jour.
So I guess the next question is: are you more interested in the data or in the transports?
There are a couple of issues.
First, it's a major load on a server that has a lot of folks monitoring it. Sending e-mail is a relatively high-overhead operation, with 2-4K per message, at least. Probably more, if you want reasonable bounce handling.
Second, mailing lists are resource-intensive relative to the data (well, metadata) being sent. You're going to generate a mail message per blog or feed entity per update (comment, post, edit, trackback).
Third, sending mail doesn't help the end-user, because most of the folks who're going to want this are going to use it as an aggregator of sorts, not to have a half-zillion mail messages dropped in their inbox. Setting up automagic mail parsing is a non-trivial task, especially for folks living on systems where they aren't running an MTA locally but instead grab mail via POP/IMAP to a local machine. Parsing that mail to find the feed sources that have changed is also an expensive task.
Fourth, this stuff just isn't important enough to warrant clogging my, or most any other, system's mail spool. I'm happy to have the server churn when mail pegs one of the lists I host, but I don't particularly think it's worth it just to note a blog entry was posted/edited/commented/tracked.
Fifth, it makes end-user management harder. If I want to unsubscribe from a feed, I just stop fetching it from the node I'm sucking down, and at some point the feed may age out if nobody else is grabbing it and it's not acting as an intermediate transport for another node in the graph.
Finally, I'm interested in the data and the transports. Concentrating on one but not the other will get a system that sucks around as much as what is in place now. Replacing one lousy system with another isn't much of a win.
Someone mentioned Jabber up near the top of this thread, but it didn't seem to get much notice. Personally, I think that there's potentially a great fit between weblogs and the sort of lightweight identity management, pub/sub infrastructure, and "messy graph" services you're talking about.
What you need to do is get one of the major "web services" catering to the weblog crowd (and here I'm thinking of the weblogs.com ping, or something similar) to start accepting Jabber pub/sub clients, and passing messages over the wire from there.
There's nothing like an actual stream of useful data to get people working on novel front-end views and tools to work with it.