April 24, 2003

Data sources, for blog distribution

Okay, let's talk for a moment about providing data for this proposed blog notification system. What is there, and what does it look like?

Data streams, I'm proposing, are divided up into channels. Each channel has four things associated with it:

  1. Title
  2. Base URL
  3. Channel ID
  4. Public key

The title is, as you might expect, the blog title. No big deal, other than being text, so there are all those pesky character set issues to deal with. (Yes, I know, Unicode is the answer and will save us all! I think not.) Blog titles are restricted to no more than 1023 octets. How many characters that is depends on the encoding, but worst case you're in full UTF-8 at six octets per character, with room for 170, which ought to be more than enough.

The base URL is the URL that all content vectors off of. The channel should present something meaningful here, if queried. The URL is limited to 255 characters. I think. Should be enough.

The channel ID is a base64 MD5 checksum of the original title and URL for a channel. If a channel changes title or URL, the MD5 checksum used for that channel doesn't change. The title and URL should be slammed together with no extra characters.
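A minimal sketch of how that ID could be computed, in Python. Two things here are assumptions rather than part of the proposal: the text is encoded as UTF-8 before hashing, and the trailing "=" padding is dropped from the base64 output (which is what gives a 22-character key like the sample one in this entry). The function name is made up for illustration.

```python
import base64
import hashlib


def channel_id(original_title: str, original_url: str) -> str:
    """Base64 MD5 of the original title and URL, slammed together."""
    # Assumption: the concatenated text is encoded as UTF-8 before hashing.
    raw = (original_title + original_url).encode("utf-8")
    digest = hashlib.md5(raw).digest()  # always 16 bytes
    # Assumption: base64 "=" padding is stripped, leaving 22 characters.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


cid = channel_id("Squawks of the Parrot", "http://www.sidhe.org/~dan/blog/")
print(cid, len(cid))
```

Since the ID is computed once from the original title and URL, it would be stored with the channel rather than recomputed after a rename.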

The public key is the public key of the channel. Everything that comes from this channel, or reports itself as from this channel, can be validated off this key. All outgoing messages are signed, so that clients and transit servers can run their signatures against this key and see if they're real messages. Or so is the plan, at least.

So, when you look at a channel, you may see:

Title: Squawks of the Parrot
Base URL: http://www.sidhe.org/~dan/blog/
Channel key: H0OQJSfvne/3yQ2lkISmvg
public key: SOMERANDOMSTRINGOFDIGITSANDLETTERS

(This isn't how it goes over the wire, just the data itself. We'll touch on wire format messages later. Maybe in this entry, maybe not. Dunno yet, and who edits blog text anyway? :)
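As an in-memory record (again, not a wire format), the channel data and its length limits might be sketched like this in Python. The class and constant names are hypothetical, and measuring the title as UTF-8 octets is an assumption; the 255-character URL limit is the tentative one from above.

```python
from dataclasses import dataclass

MAX_TITLE_OCTETS = 1023  # limit stated in the entry
MAX_URL_CHARS = 255      # tentative limit from the entry


@dataclass(frozen=True)
class Channel:
    title: str
    base_url: str
    channel_id: str
    public_key: str

    def __post_init__(self) -> None:
        # Assumption: the title limit is measured on the UTF-8 encoding.
        if len(self.title.encode("utf-8")) > MAX_TITLE_OCTETS:
            raise ValueError("title exceeds 1023 octets")
        if len(self.base_url) > MAX_URL_CHARS:
            raise ValueError("base URL exceeds 255 characters")


parrot = Channel(
    title="Squawks of the Parrot",
    base_url="http://www.sidhe.org/~dan/blog/",
    channel_id="H0OQJSfvne/3yQ2lkISmvg",
    public_key="SOMERANDOMSTRINGOFDIGITSANDLETTERS",
)
```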

Now, when you send a message across noting a change, it's one of:


  • New entry
  • Changed entry
  • New comment
  • Changed comment
  • New trackback
  • Changed trackback

Note that change messages include deletions.

Each message has exactly four things in it:

  1. Message type
  2. Channel key
  3. Relative URL
  4. Signature

The message type is one of the above things--new entry, new comment, new trackback.

The channel key is the channel ID described above: the MD5 checksum of the base channel's original data (title and URL).

The relative URL is the URL tacked onto the base channel URL (just a straight slam together) to get the full path to the data. Note that, since comments and trackbacks are considered modifications of the base data element, the URL for a comment would be the same as the URL for the thing being commented on.
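The first three fields and the "straight slam together" rule can be sketched as follows in Python. The type and function names are hypothetical, and the relative URL in the usage lines is only an illustrative value.

```python
from dataclasses import dataclass
from enum import Enum


class MessageType(Enum):
    NEW_ENTRY = "new entry"
    CHANGED_ENTRY = "changed entry"
    NEW_COMMENT = "new comment"
    CHANGED_COMMENT = "changed comment"
    NEW_TRACKBACK = "new trackback"
    CHANGED_TRACKBACK = "changed trackback"


@dataclass(frozen=True)
class Message:
    message_type: MessageType
    channel_key: str
    relative_url: str
    signature: str


def full_url(base_url: str, relative_url: str) -> str:
    # Plain concatenation, deliberately not RFC-style URL resolution.
    return base_url + relative_url


msg = Message(MessageType.NEW_COMMENT, "H0OQJSfvne/3yQ2lkISmvg",
              "archives/000157.html", "SIGNATURE-PLACEHOLDER")
print(full_url("http://www.sidhe.org/~dan/blog/", msg.relative_url))
# prints http://www.sidhe.org/~dan/blog/archives/000157.html
```

Note that a new comment carries the URL of the entry being commented on, so the message type is the only thing distinguishing it from a changed entry.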

The signature is the public key signature of the message. When the message itself is checked against the channel's public key, this signature should verify. (I'm a bit fuzzy on the mechanics of asymmetrical public key crypto systems, so we'll put off the decision of what PK system is used, and just assume that something is.)
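Since the PK system is undecided, the most that can be sketched is the shape of the check, with the actual verification left pluggable. Everything below is an assumption: that the signature covers the other three fields joined by newlines, the function names, and above all the stand-in verifier, which uses HMAC from the Python standard library only so the example runs. HMAC is symmetric and is NOT a public key signature; a real deployment would swap in whatever PK scheme gets chosen.

```python
import hashlib
import hmac
from typing import Callable

# (key, payload, signature) -> True if the signature checks out
Verifier = Callable[[bytes, bytes, bytes], bool]


def signed_payload(message_type: str, channel_key: str, relative_url: str) -> bytes:
    # Assumption: the signature covers the other three fields, newline-joined.
    return "\n".join((message_type, channel_key, relative_url)).encode("utf-8")


def is_authentic(verify: Verifier, key: bytes, payload: bytes, signature: bytes) -> bool:
    return verify(key, payload, signature)


def hmac_stand_in(key: bytes, payload: bytes, signature: bytes) -> bool:
    # Stand-in only: symmetric HMAC, not a public key signature.
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)


key = b"demo-key"
payload = signed_payload("NEWCMNT", "H0OQJSfvne/3yQ2lkISmvg", "archives/000157.html")
good = hmac.new(key, payload, hashlib.sha256).digest()
print(is_authentic(hmac_stand_in, key, payload, good))       # True
print(is_authentic(hmac_stand_in, key, payload, b"forged"))  # False
```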

If a system is something like the Lambda weblog or an Everything engine system, where each comment is a node in its own right, a new comment generates a new post, and things get odd. I'm not sure what to do in that case, other than perhaps have a "response to" message with the URL being responded to and the URL of the response.

Limits of the system

There are some limits here, of course.

All data must hang off the base URL. I don't think this is an issue for anyone, though I can see it being a problem if there are multiple data sources sharing a base URL, or if the base URL needs things removed from it. (Chopping off the index.php, for example, to get the base for the relative URLs.) Don't do that, at least not for now.

There's required PK crypto, or at least secure verifiable digesting. I expect this may run afoul of a number of laws in various countries. I'm up for alternate validation methods if anyone has one, but I don't know of any.

There's no way to sub-divide a channel. For a blog this may not be a problem, but I can see someone like the New York Times wanting one big "times channel" with a bunch of sub-channels for each section. (NYC Metro, Tech, Science, Sports, whatever) Too bad, we don't do that right now.

There's the issue of changing channel data as well, which could be... interesting. Punting for now, but that may involve verifying based on out-of-band data. (Key files on the original URL/machine or something)

I think, though, that what's here is sufficient for what I want, at least from a source end. Have at it, though, as I don't want to do any technical details until I'm sure that what's being proposed is semantically sufficient.

Posted by Dan at April 24, 2003 07:36 PM
Comments

Food for thought: some data sources have multiple authors. Some data sources have multiple subjects. Should timestamps also be included?

Posted by: Sam Ruby at April 24, 2003 08:08 PM

Irritatingly naïve question:

Why not drop the length limits and use YAML or XML as an on-wire format?

Unless I've got my quoting rules wrong...
"""
Title: Squawks of the Parrot
Base URL: http://www.sidhe.org/~dan/blog/
Channel key: H0OQJSfvne/3yQ2lkISmvg
public key: SOMERANDOMSTRINGOFDIGITSANDLETTERS
"""
(excluding the pythonesque triple quotes) is valid YAML.

Posted by: Nathan Sharfi at April 25, 2003 03:41 AM

Timestamps are definitely needed--I forgot that one. D'oh! (I suppose mandating a 64 bit timestamp of 100 nanosecond ticks since November 17, 1858 would be pushing it just a bit, though)
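For a sense of scale, here is what that tick count would look like, sketched in Python. Treating times as UTC is an assumption, as is using this entry's posting time as the sample value; the function name is made up.

```python
from datetime import datetime, timezone

EPOCH_1858 = datetime(1858, 11, 17, tzinfo=timezone.utc)
TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def to_ticks(moment: datetime) -> int:
    delta = moment - EPOCH_1858
    # Integer arithmetic throughout, so no float rounding creeps in.
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10


# Assumption: the posting time is read as UTC purely for illustration.
posted = datetime(2003, 4, 24, 19, 36, tzinfo=timezone.utc)
ticks = to_ticks(posted)
print(ticks, ticks < 2**63)
```

The value fits comfortably in a signed 64-bit integer, with plenty of headroom left.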

As for going with YAML and XML (which I like and dislike, respectively) I'm shooting for as dense a representation as possible. XML is many things, but dense isn't one of them. YAML is generally better, but still, a good chunk of what would get transferred is redundant. (The labels in the log entry were purely for explanatory purposes, and I wasn't proposing actually using them for the data)

Posted by: Dan at April 25, 2003 09:15 AM

Another round of thoughts posted to my weblog :)

http://bcuni.net/blog/one-entry?entry%5fid=1112

/Simon

Posted by: Simon Carstensen at April 26, 2003 06:31 PM

You bobbled the URL there, Simon. :P (And I see you don't do trackbacks. Tch...)

Digging through to the entry I presume you mean, I've become even more convinced that XML is a profoundly bad way to go about this. Bletch--your example XML there about doubles (possibly triples) the size of the message without adding any benefit, and it requires that all the servers and clients involved have a full working XML parser. Plus it obscures the data, so it's harder for a person to read (and, more importantly, to poke into a server via a telnet to the server port) than something simpler like:


NEWCMNT
H0OQJSfvne/3yQ2lkISmvg
archives/000157.html
BIGLONGPOSSIBLYOPTIONALFORDEBUGGINGSIGNATURE

Yeah, I'd need to cut-n-paste the channel key, and have software generate the signature if the server I was talking to required it, so it's not entirely bereft of code in the request generation, but still...
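A line-oriented message like that needs very little code to read or write. A minimal sketch in Python, assuming one field per line, a newline after each field, and a signature line that may be left off (the sample signature text hints at that, but it isn't settled). The names are hypothetical.

```python
from typing import NamedTuple, Optional


class Notification(NamedTuple):
    message_type: str
    channel_key: str
    relative_url: str
    signature: Optional[str]


def parse_notification(text: str) -> Notification:
    # Assumption: one field per line; blank lines are ignored.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) not in (3, 4):
        raise ValueError(f"expected 3 or 4 lines, got {len(lines)}")
    signature = lines[3] if len(lines) == 4 else None
    return Notification(lines[0], lines[1], lines[2], signature)


def format_notification(n: Notification) -> str:
    fields = [n.message_type, n.channel_key, n.relative_url]
    if n.signature is not None:
        fields.append(n.signature)
    return "\n".join(fields) + "\n"


sample = (
    "NEWCMNT\n"
    "H0OQJSfvne/3yQ2lkISmvg\n"
    "archives/000157.html\n"
    "BIGLONGPOSSIBLYOPTIONALFORDEBUGGINGSIGNATURE\n"
)
print(parse_notification(sample))
```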

Also, more important than the new data post end is the query for new data and info that the clients will end up using, which may look something like:

CHANINFO
H0OQJSfvne/3yQ2lkISmvg

or

POSTSINCE
H0OQJSfvne/3yQ2lkISmvg
1233100243

With that last number being the timestamp we're querying against.
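Building those two queries is a couple of lines each. A sketch in Python, with hypothetical function names; the timestamp is passed through as an opaque integer, since its epoch and units haven't been pinned down, and the trailing newline after the last field is an assumption.

```python
def chaninfo_query(channel_key: str) -> str:
    # Command word on the first line, channel key on the second.
    return f"CHANINFO\n{channel_key}\n"


def postsince_query(channel_key: str, since: int) -> str:
    # The timestamp is treated as an opaque integer here.
    return f"POSTSINCE\n{channel_key}\n{since}\n"


print(chaninfo_query("H0OQJSfvne/3yQ2lkISmvg"), end="")
print(postsince_query("H0OQJSfvne/3yQ2lkISmvg", 1233100243), end="")
```

Either string could be typed by hand into a telnet session, which is the point of keeping the format this plain.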

It's possible that somewhere along the line there'll be a data payload, and I suppose XML might be worth it there, but since everything we're talking about is metadata....

Maybe I'm just getting old, but I have had to telnet to SMTP, FTP, HTTP, and NNTP servers to do manual remote debugging. Their simple (well, simple-ish, in some cases) protocols made it possible. XML, well, that doesn't help make it simple.

Posted by: Dan at April 26, 2003 07:29 PM

Just some comments.

Use SHA1 not MD5... You can avoid using a Channel ID which is based on metainfo and instead SHA1 hash the public key of the channel and then base32 it. This will allow you to have a URI for the channel based on the public key which has a lot more entropy to base off of. You can then use this for the ID.

This is essentially what I plan on doing in NewsMonster....
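A minimal sketch of that suggestion in Python, assuming the public key is available as raw bytes (how it is serialized is left open, and the sample key below is just the placeholder string from the entry). The function name is made up.

```python
import base64
import hashlib


def key_based_channel_id(public_key: bytes) -> str:
    digest = hashlib.sha1(public_key).digest()  # always 20 bytes
    # 20 bytes is 160 bits, which base32 encodes as exactly 32 characters
    # with no "=" padding.
    return base64.b32encode(digest).decode("ascii")


cid = key_based_channel_id(b"SOMERANDOMSTRINGOFDIGITSANDLETTERS")
print(cid, len(cid))
```

An ID derived this way changes whenever the key does, which is the trade-off against an ID fixed at channel creation time.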

Posted by: Kevin Burton at April 27, 2003 11:08 PM

I don't much care whether it's SHA1 or MD5, as long as the results are of fixed length and the implementation's freely available. I chose MD5 since that's what I use on a daily basis, but as long as things are the same everywhere it's fine with me. The use of MD5 in this case isn't to get any sort of security so much as a fixed length short key that we have a reasonable chance of being universally unique.

The reason for the channel ID is to give each channel an identifier that was fixed at channel creation time. I fully expect that a channel's public and private keys may change over time--machines do get compromised and keys broken, after all, so being able to change the key may be necessary. I'm thinking that it's possible that the base channel URL itself might change over time as well and, while this scheme would prevent anyone from ever starting another data feed in the same spot as an older one that moved, I don't think it's that big a problem. (Worst case you can start it elsewhere on the server, get the channel key for that, and then move it.)

I may be misunderstanding--if you're talking about basing the channel ID off a hash from the original channel URL and channel key and still using the channel ID, then what I said above is meaningless and I think it's a good idea. :)

Posted by: Dan at April 28, 2003 10:42 AM