April 21, 2003

Better blog distribution, part 1 -- the assumptions

Okay, I've been thinking about this some, and I might as well get this down so folks can go rip it to shreds as they want. (Yes, I know, I should write a continuations thing. Maybe tomorrow, but probably not)

The problem, if you'll remember, is that I loathe the current "poll for RSS" scheme of seeing when blogs update, along with an utter lack of notification for things like trackback links and comment pings. What I'm proposing is an NNTP-style system with a set of loosely connected server systems that take notifications from the end-blogs and pass them around to whoever's watching, with the notifications ultimately hitting clients connected to those servers waiting for notifications of changes. Nothing fancy, just your standard store-and-forward multicast system. We've been doing this with news for decades, with some success. And some failure, too, of course. Forgetting the failure would be a bad thing.

I think the project is too big to go in one blog entry, so this time it's just the general assumptions.

The first assumption is that, for any sufficiently large group of people, some of them will be scum and do abusive things. This is the single most important assumption. I don't particularly like it, but pretending otherwise leads to the current state of email and Usenet News, where trust and sense (or at least an available admin with a clue-by-four) are assumed and definitely not present. So, people will try to spoof, abuse, and hijack the system, either to get spoofed data out in the wild or to do abusive things to some data source's system.

The second assumption is that we are not distributing content. We are distributing notifications of content changes. We may, at some point, talk about content change distribution, but not this time. This means that we aren't sending article contents, excerpts, titles, or whatever around, just notifications that action X happened to URL Y with rider data Z. Maybe. We might toss the rider data. (Though sending around trackback counts and comment counts would be useful)

The third assumption is that the protocols should be efficient. That means the message for a URL change might be something like "\0\23POST\n/archive/0005.html" rather than whatever monstrosity would result if we XML encoded it. (Yes, there are two bytes of binary data marking a length prepended, though I can see dropping it to one byte, and I can see forcing the message type to exactly N bytes with no terminator, for some value of N) Remember--if I were keen on XML, the last thing I'd be doing is grumbling about bandwidth usage.
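
A minimal Python sketch of that framing, under a few assumptions that the post doesn't pin down: the two-byte prefix is read as a big-endian length covering everything after it, the message type and path are plain ASCII, and the function names `encode` and `decode` are made up for illustration. The "\23" in the example is taken to mean a length of 23 decimal, which is what the body comes to (read as a C-style octal escape it would be 19).

```python
import struct
from typing import Tuple


def encode(msg_type: str, path: str) -> bytes:
    """Frame one notification as: 2-byte big-endian length, type, newline, path."""
    body = (msg_type + "\n" + path).encode("ascii")
    if len(body) > 0xFFFF:
        raise ValueError("body does not fit in a two-byte length")
    return struct.pack(">H", len(body)) + body


def decode(frame: bytes) -> Tuple[str, str]:
    """Split one complete frame back into (type, path)."""
    if len(frame) < 2:
        raise ValueError("frame is shorter than the length prefix")
    (length,) = struct.unpack(">H", frame[:2])
    body = frame[2:2 + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    msg_type, sep, path = body.decode("ascii").partition("\n")
    if not sep:
        raise ValueError("no newline between type and path")
    return msg_type, path


frame = encode("POST", "/archive/0005.html")
print(frame)          # 25 bytes in total: 2 of length plus 23 of body
print(decode(frame))
```

Dropping to a one-byte length would only mean swapping `">H"` for `"B"` and capping the body at 255 bytes.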

The fourth assumption is that each data source will be in a channel, which you can subscribe to so that your immediate upstream data provider can get the feed somehow. Hopefully not from the ultimate provider, though.
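
As a rough illustration of the channel idea, here is an in-memory sketch in Python. Everything in it is an assumption for the sake of the example: the `Hub` class, the callback signature, and the channel name are all invented, and a real server would be talking to upstream and downstream peers over the network rather than calling functions.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str, str], None]


class Hub:
    """Toy relay: maps channel names to the callbacks subscribed to them."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Callback) -> None:
        self._subs[channel].append(callback)

    def publish(self, channel: str, message: str) -> int:
        """Deliver message to every subscriber; return how many got it."""
        subscribers = self._subs.get(channel, ())
        for callback in subscribers:
            callback(channel, message)
        return len(subscribers)


hub = Hub()
seen = []
hub.subscribe("example.org/dan", lambda ch, msg: seen.append((ch, msg)))
hub.publish("example.org/dan", "POST /archive/0005.html")
print(seen)
```

A downstream server would subscribe to its upstream once per channel and fan out to its own clients, which is what keeps the load off the ultimate provider.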

The fifth assumption is that, while the messages aren't that important, neither are they entirely meaningless (otherwise why bother in the first place?) so we need some form of store-and-forward system.
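
One way the store-and-forward part might look, sketched in Python under stated assumptions: each relay keeps a bounded backlog with sequence numbers, and a reconnecting client asks for everything after the last sequence number it saw. The `Spool` class and its methods are hypothetical, and a real relay would presumably keep the backlog on disk rather than in memory.

```python
import itertools
from collections import deque
from typing import Deque, List, Tuple


class Spool:
    """Bounded backlog of notifications; the oldest fall off when it is full."""

    def __init__(self, max_items: int = 1000) -> None:
        self._items: Deque[Tuple[int, str]] = deque(maxlen=max_items)
        self._seq = itertools.count(1)

    def store(self, message: str) -> int:
        """Keep a message and return the sequence number assigned to it."""
        seq = next(self._seq)
        self._items.append((seq, message))
        return seq

    def since(self, last_seen: int) -> List[Tuple[int, str]]:
        """Everything still held that is newer than last_seen."""
        return [(seq, msg) for seq, msg in self._items if seq > last_seen]


spool = Spool(max_items=2)
for msg in ("POST /a", "POST /b", "POST /c"):
    spool.store(msg)
print(spool.since(0))  # only the two newest survive the size cap
```

The size cap reflects the "not that important" half of the assumption: a client that stays away too long simply misses the oldest notifications.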

The final assumption (that I'm admitting to, at least) is that the protocols should be simple. Someone should be able to bodge together a client or feed submitter in a reasonably simple perl/python/ruby/scheme/unlambda module. Well, maybe not unlambda. Still, no ties to one language, and a simple enough protocol that one could probably do it by hand with a telnet client. Which argues against my proposal for message length. Damn.
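
For comparison, a telnet-friendly variant could drop the length prefix and use one line per notification. This Python sketch assumes a made-up text form, an action word, a single space, the path, and a line ending, which the post does not specify; `format_line` and `parse_line` are illustrative names only.

```python
from typing import Tuple


def format_line(action: str, path: str) -> bytes:
    """Build one notification line, e.g. b'POST /archive/0005.html\\r\\n'."""
    if not action or any(c in action for c in " \r\n"):
        raise ValueError("action must be a single non-empty word")
    if not path or any(c in path for c in "\r\n"):
        raise ValueError("path must be non-empty and fit on one line")
    return (action + " " + path + "\r\n").encode("ascii")


def parse_line(line: bytes) -> Tuple[str, str]:
    """Split one received line into (action, path); accepts LF or CRLF."""
    text = line.decode("ascii").rstrip("\r\n")
    action, sep, path = text.partition(" ")
    if not sep or not action or not path:
        raise ValueError("expected '<action> <path>'")
    return action, path


print(parse_line(b"POST /archive/0005.html\r\n"))
```

This costs a byte or two more than the length-prefixed form, but it can be typed by hand and parsed with a line read and a split in pretty much any language.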

I think that covers the assumptions. It's possible that something like Jabber or IRC can handle the middle server stuff, which would be just fine, though the store-and-forward thing may shoot that down. We'll see.

Next time is the feed end and channel stuff, I expect.

Posted by Dan at April 21, 2003 06:06 PM | TrackBack (5)

this really shouldn't be too hard to do. you're talking a minimal binary protocol, a message routing system, and some means of 'secure' data entry.

the data entry side of things could be test implemented as an MT trackback ping url. have your blog ping blogs.perl.org or wherever when you post an entry and things propagate out to clients.

i'd love to help implement this if you're looking for helpers. a prototype in POE should be fairly easy to do if i'm reading this right.

i'm sungo on rhizomatic if you want to chat about this at some point :)

Posted by: sungo at April 21, 2003 06:47 PM

hmms.. Although I share your view that XML encoding would be a wasteful use of resources, why not use NNTP and a simple rfc822 message?

We could even just use the headers for now, leaving the body for content at a later stage.

The advantage, in my view, is that NNTP is everywhere. Every ISP out there has an NNTP server or gives access to one to their subscribers.

Regarding detecting changes, it's easy to set up a ping server, and make your blog "ping" it whenever the content changes. That ping server would then post to an NNTP group.

The only problem with all this is that a single usenet group for all the blogs in the world will not scale. Some sort of hashing should be available.

Best regards,

Posted by: Pedro Melo at April 21, 2003 07:59 PM

Um, isn't jabber like pretty much 100% XML centric?

Won't the costs of establishing connections pretty much outweigh the parsing costs? Remember, we are not talking about something like BitTorrent where we have long running connections.

If you want large numbers of people to participate, you probably can't presume much more than CGI.

Posted by: Sam Ruby at April 21, 2003 10:42 PM

Don't bother using a binary format - the idea should be that you send notification to those that care, so the size of the delivery isn't that critical. Parsing XML is easy and everyone can do it, so it's better than inventing yet another encoding.

As far as abuse goes - a simple Jabber like server callback system can probably solve that...

Posted by: Colin Stewart at April 21, 2003 11:10 PM

(without fully understanding your whole proposed weblog change-notification dissemination system)
isn't that already how blo.gs works with weblogs.com?

1) bloggers send a ping to weblogs.com after they update their content
2) weblogs.com provides regular batch updates of changed sites -- think incremental DNS zone xfers -- which blo.gs (or other news aggregators, presumably) can operate on
3) those aggregators update upon receipt

this setup completely reverses the "polling paradigm" you dislike; people don't busy-wait polling your site, YOU ping a change-logging site and it notifies subscribers of batched changes to sites at regular intervals. your original update to the change-logger site (weblogs) basically "echoes" out, from weblogs, to the aggregators, to end users, and finally back to you, in the form of them visiting your site, if they choose -- not your rss/rdf feed.

does the above system fulfill your criteria?

Posted by: jm3 at April 22, 2003 01:53 AM

Nb. http://blo.gs/cloud.php

Posted by: jm3 at April 22, 2003 01:55 AM

Newswire does exactly this, and more. See also today's entry on All Things Distributed.

Posted by: Werner Vogels at April 22, 2003 09:43 AM

If newswire does it, and the protocols are open so other folks can write things that publish, subscribe, and transit things, I'm all for it.

I want a solution. I'll write one if I can't find something, but I'm all for working things that I don't have to do. :)

Posted by: Dan at April 22, 2003 12:04 PM

This isn't quite what blo.gs does, though it isn't too far off. blo.gs serves as a central notification point (though I still want comments and trackback posts) but what it doesn't do is serve as a way to notify the end user. It's a partial solution for part of the problem space, and no solution for the other. (Though I'm not knocking it, as it does what it says it does, and that's a good thing)

The current system falls down in authentication, attack survivability, and scalability, so I don't think it'll do what I want.

Posted by: Dan at April 22, 2003 12:08 PM

authentication? attack survivability? you're talking about DOS attempts, here? i'm curious as to what your threat model is.

Posted by: jm3 at April 23, 2003 04:31 PM

I'm talking about in-system attacks. I'm not concerned with lower-level protocol attacks (ping floods, SYN attacks, stuff like that) as there are already mechanisms available to thwart those. Rather, I'm thinking of the case where someone hijacks a popular channel to point to their own stuff, or issues many bogus notifications to induce a flood of client requests on someone else's server. (Or their own, in some attempt to jigger ranking somewhere).

If this gets used, I fully expect someone will try to use it to take down someone's data source, redirect that data source (to discredit the original source, for commercial purposes (probably porn spam), or as a way to try and plant trojans on people's machines), disable that data source, or just flat out try and screw with the system because they can.

There's no true technical solution, of course, as there are no technical solutions to social problems (and juvenile and/or slimy behaviour is definitely a social problem) but we can try and make it tough enough that they go bother someone else.

Posted by: Dan at April 24, 2003 10:22 AM

I posted a few thoughts to my weblog:



Posted by: Simon Carstensen at April 25, 2003 07:02 AM

Simon, I read your blog entry, but I don't feel like having yet another account somewhere, so I'm responding here.

You're fundamentally missing the point--nothing that I'm proposing doing (well, besides dumping XML, as there's just no need for it with this) has anything to do with what you're talking about, nor does it forbid it. Clever aggregators, discussion trackers, and suchlike things will still work, and are still doable; the only difference is that if they use the notification scheme they'll only go get RSS data when it changes, not at their polling interval. So, besides being more timely and less demanding of data provider resources, there won't be a difference.

Posted by: Dan at April 25, 2003 09:18 AM


Posted by: Dan at October 2, 2004 08:30 PM