February 18, 2004

mod_gzip is your friend. Really

As is whatever the module for Apache 2.0 is.

I finally broke down and snagged an RSS content aggregator thingie (which, in addition to confirming my feelings that polling for RSS feeds sucks more than Cygnus X-1, now makes me want RSS feeds on places that don't have them. EurekAlert and The New Scientist spring to mind) since I've got enough places I go infrequently that I was starting to lose track of which ones I'd been to lately and which I hadn't. 'Tis keen, though I'd like to be able to twiddle more stuff than the tool allows. No joy there, but as it's freeware I can't rightly complain.

Anyway, the tool (NetNewsWire Lite) has a keen little statistics window you can pull up to take a look at bandwidth stats, 304 counts, and whatnot. Included in this little gizmo is a count of how many times you got gzipped data back from the site in question.

Now, I'd forgotten that I'd enabled mod_gzip on the server ages back. I did it for response-time reasons--I've only got a 128K upstream link, and a friend's got a pretty image-heavy and markup-heavy website hanging off this box, and the full-text feeds for this blog (which some annoying folks are getting every time. Sheesh, people, welcome to the 21st century! Either HEAD the feed or introduce yourself to the If-Modified-Since header!) get big. Smushing the data, even for a few people, helps response time for everyone if you've got CPU time to burn, and I generally do. I didn't realize how much of a difference it makes, though. According to the stats, it cuts down the feed size by about 75% or so.
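
For the folks who haven't met If-Modified-Since yet, here's roughly what the server does with it, as a minimal Python sketch. The function name conditional_status and the sample date are made up for illustration; only the email.utils date helpers are real standard-library calls, and a real server does a good deal more than this.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the client's copy is still current, else 200.

    last_modified: timezone-aware datetime of the feed's last change.
    if_modified_since: raw header value sent by the client, or None.
    """
    if not if_modified_since:
        return 200
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: just send the full feed
    if client_time.tzinfo is None:
        # Dates without a usable zone are treated as UTC here (an assumption).
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop the microseconds.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200


# Hypothetical feed timestamp, and the header a polite client would echo back.
feed_changed = datetime(2004, 2, 18, 12, 45, tzinfo=timezone.utc)
header = format_datetime(feed_changed, usegmt=True)
```

A client that sends back the date it got last time gets a bodyless 304 instead of the whole feed, which is the entire point.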

What surprised me is the number of sites that don't have compression enabled.

Now, I can understand not doing it for a number of reasons. It does put an extra load on your server, so if you're CPU bound it may not be a win. Still... you'd think more folks would do it. Setup's dead-simple, so far as I can tell there aren't any side-effects for folks that can't handle it, and it's a one shot set-and-forget deal. It definitely makes a difference in the amount of data transferred, both out of the server and into the client. Less load for you and snappier response for your readers, 'specially if they're on slow links.
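
The reason clients that can't handle it aren't bothered is that compression only kicks in when the client asks for it. A rough sketch of that negotiation in Python follows; this is not how mod_gzip is implemented, the function names are invented for the example, and it ignores wildcard entries like "*".

```python
import gzip


def client_accepts_gzip(accept_encoding):
    """True if the Accept-Encoding value lists gzip with a nonzero q-value."""
    for item in accept_encoding.split(","):
        name, _, params = item.strip().partition(";")
        if name.strip().lower() not in ("gzip", "x-gzip"):
            continue
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def build_response(body, accept_encoding=""):
    """Return (headers, payload), compressing only for clients that ask."""
    headers = {"Content-Type": "text/html"}
    if client_accepts_gzip(accept_encoding):
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body
```

A client that sends no Accept-Encoding header at all gets the plain bytes back, same as before compression was switched on.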

If you've not done it, and your provider lets you, try it. You may well like it, and if you pay by the megabyte transferred your bank balance will definitely like it. Won't shrink the images, but every bit helps. XML, XHTML, CSS, and HTML all smush down quite nicely, FWIW.
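
If you want to see how well your own files would smush before touching the server config, something like this Python sketch will tell you. The sample data is synthetic and very repetitive, so it compresses far better than real pages will; swap in the bytes of your own files for honest numbers.

```python
import gzip

# Made-up stand-ins for a feed, a page, and a stylesheet.
samples = {
    "feed.xml": b"<item><title>Post</title><link>http://example.com/</link></item>\n" * 300,
    "page.html": b'<div class="entry"><p>Some text in a paragraph.</p></div>\n' * 300,
    "style.css": b".entry p { margin: 0 0 1em 0; font-family: sans-serif; }\n" * 300,
}


def savings(data):
    """Fraction of bytes saved by gzipping data (0.75 means 75% smaller)."""
    return 1 - len(gzip.compress(data)) / len(data)


for name, data in samples.items():
    print(f"{name}: {len(data)} bytes, {savings(data):.0%} saved")
```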

(And a late addition--if you're writing an RSS syndication tool and it doesn't accept compressed streams... consider making it so it does. There's plenty of code around to uncompress the compressed data)
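
To give an idea of how little code the client side takes, here's a sketch in Python using only the standard gzip and zlib modules. The decode_body name and the dict-of-headers shape are assumptions for the example, not any particular aggregator's API. The tool sends the request header shown at the bottom, then runs whatever comes back through the function.

```python
import gzip
import zlib


def decode_body(headers, body):
    """Undo a gzip or deflate Content-Encoding; pass other bodies through."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped stream
        except zlib.error:
            # Some servers send a raw deflate stream with no zlib wrapper.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body


# What the aggregator would add to its outgoing request.
REQUEST_HEADERS = {"Accept-Encoding": "gzip, deflate"}
```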

Posted by Dan at February 18, 2004 12:45 PM
Comments

Most desktop aggregators do. Here's an up-to-date list: http://www.sauria.com/blog/articles/aggregator-gzip.html

The Apache 2.0 module is called mod_deflate.

Good server-side tutorials here: http://www.webcompression.org/

Posted by: Mark at February 18, 2004 01:13 PM

I didn't realize that so many of the desktop aggregators already do gzip compression. Cool. I should dig and see if LWP does compression, since I see a lot of hand-rolled LWP-based aggregators pinging me as well.

Posted by: Dan at February 18, 2004 01:32 PM

According to the headers I get back from LWP::UserAgent, it should accept gzipped content:


GET / HTTP/1.1
TE: deflate,gzip;q=0.3
Connection: TE, close
Host: xxxxx
User-Agent: libwww-perl/5.68
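
One hedge worth adding: in HTTP/1.1, TE advertises transfer codings, while content compression of the mod_gzip sort is negotiated through Accept-Encoding, which doesn't appear in the request above. So the TE line on its own may not be enough to get a compressed feed back. The Python sketch below just pulls the request apart to show which of the two headers is present; the helper names are invented for the example.

```python
def parse_headers(raw_request):
    """Split a raw HTTP request into its request line and a header dict."""
    lines = [ln for ln in raw_request.strip().splitlines() if ln.strip()]
    request_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers


def parse_codings(value):
    """Map each coding in a TE or Accept-Encoding value to its q-value."""
    codings = {}
    for item in value.split(","):
        name, _, param = item.strip().partition(";")
        q = 1.0  # codings listed without a q-value default to 1
        if param.strip().lower().startswith("q="):
            q = float(param.strip()[2:])
        if name.strip():
            codings[name.strip().lower()] = q
    return codings


request = """GET / HTTP/1.1
TE: deflate,gzip;q=0.3
Connection: TE, close
Host: xxxxx
User-Agent: libwww-perl/5.68
"""

_, headers = parse_headers(request)
te = parse_codings(headers.get("te", ""))
ae = parse_codings(headers.get("accept-encoding", ""))  # empty: header absent
```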

Posted by: Timm Murray at February 18, 2004 02:03 PM

Liferea is missing from that list, and I suspect it doesn't support compression (only guessing here though).

That aside, what I find surprising is that no one seems to have gone the other way around. The Apache modules compress on the fly -- why not store (semi) static content in compressed form (after generation) and only decompress on delivery if necessary for the client in question? Zlib inflation is basically just a glorified memcpy() -- decompression is blazing fast. That would allow even heavily loaded servers to serve compressed content and hardly take any hit.

And while it's also beneficial to configure the server so it doesn't try to compress media files on the fly, with on-the-fly decompression that would fall out naturally from not having compressed image files around. Or, since you're only paying the penalty once, you could decide to just compress them for the heck of it. 74 bytes saved on a file can still add up.

Of course it's not quite as low-maintenance as compressing on the fly, but that could be alleviated by moving the responsibility up the publishing chain — like having the FTP server compress any files uploaded to the webspace or something.
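
As a rough illustration of the store-compressed, inflate-on-demand idea, here is a Python sketch. The serve_precompressed function is hypothetical, and it takes a shortcut by treating any mention of gzip in Accept-Encoding as acceptance, ignoring q-values.

```python
import gzip


def serve_precompressed(stored_gz, accept_encoding=""):
    """Serve a body that is stored gzipped, inflating only when needed."""
    tokens = [t.split(";")[0].strip().lower() for t in accept_encoding.split(",")]
    if "gzip" in tokens:
        # Savvy client: hand over the stored bytes untouched, no CPU spent.
        return {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}, stored_gz
    # Everyone else gets the inflated original; inflating is the cheap direction.
    return {"Vary": "Accept-Encoding"}, gzip.decompress(stored_gz)
```

The expensive step, compression, happens once when the file is published rather than on every request.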

Posted by: Aristotle Pagaltzis at February 18, 2004 03:33 PM

New Scientist RSS feed: http://myrss.com/f/n/e/newscientistU137ps0.rss

Posted by: Simon Brunning at February 19, 2004 08:02 AM

Looks like thttpd actually will send already-compressed versions of files if it detects that the client can handle it:

http://rekl.yi.org/thttpd/

I might use this to send static content and then send requests for dynamic content on to an Apache mod_perl server running on another port.

Kinda wish khttpd would do the same, but that might be running dangerously close to feeping creaturism.

Posted by: Timm Murray at February 19, 2004 10:28 AM

mod_gzip does handle local gzipped versions of the data it sends, as I recall from its setup docs. Still, I'm not sure it's truly worth the hassle, as you then run into the issue of having to compress the files before uploading them and making sure the server knows that these are compressed versions of real files rather than actual archives.

You could, if you wanted, have the compressor cache the compressed version automagically, which has less room for "whoops" issues, but then you have the issue of needing a cache, or allowing the webserver to write into the directories with the uncompressed files.

On the fly's not great, but it's distinctly possible it's actually faster, in real-time/latency terms, than using a cache, since you don't need two fetches of metadata from the filesystem. (Though the metadata is likely in cache anyway, so it's not as much of a hit in that case)
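
For what it's worth, a cache that sidesteps the write-into-the-docroot problem could live in memory instead. The Python sketch below is one such arrangement, assuming a single long-running process; the names are invented, and it still costs one stat() per request to notice changed files.

```python
import gzip
import os
import tempfile

_cache = {}  # path -> ((mtime_ns, size), compressed bytes)


def compressed_copy(path):
    """Return gzip bytes for path, recompressing only when the file changes."""
    st = os.stat(path)  # the one metadata fetch per request
    key = (st.st_mtime_ns, st.st_size)
    hit = _cache.get(path)
    if hit and hit[0] == key:
        return hit[1]
    with open(path, "rb") as fh:
        data = gzip.compress(fh.read())
    _cache[path] = (key, data)
    return data


# Small demonstration against a throwaway file.
with tempfile.TemporaryDirectory() as d:
    page = os.path.join(d, "index.html")
    with open(page, "wb") as fh:
        fh.write(b"<p>hello</p>" * 100)
    first = compressed_copy(page)
    second = compressed_copy(page)  # same file, so this comes from the cache
    cached = first is second
```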

Posted by: Dan at February 19, 2004 01:18 PM

you then run into the issue of [...] making sure the server knows that these are compressed versions of real files rather than actual archives

I imagine (and that'd be part of the system's beauty) you'd simply use the regular content negotiation mechanism: just like you'd tack .en, .de, or other stuff in order to have the server pick an appropriate language version of a file, you could just tell the server that .gz can be served to compression-savvy clients. What you'd then have to tell the server is when not to decompress (.tar.gz etc).

In contrast, caching adds a lot of complication (in the code that handles serving compressed data to savvy clients), although for dynamic but infrequently changing content it can't be avoided.

As for the hassle: if you are intent on doing this, it's easy to come up with simple automation, like a cron job running every hour which compresses any file in the webspace with an extension other than .gz, .png, .jpg etc.
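
A sketch of what such a job might run, in Python. The skip list and the compress_tree name are assumptions, and unlike the scheme described above it leaves the original file in place and writes a .gz copy next to it, which seemed the safer default for an example.

```python
import gzip
import os
import shutil
import tempfile

# Extensions that are already compressed; extend to taste.
SKIP = {".gz", ".png", ".jpg", ".jpeg", ".gif", ".zip"}


def compress_tree(root):
    """Write a .gz sibling for each compressible file; return the new paths."""
    written = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            src = os.path.join(dirpath, name)
            if os.path.splitext(name)[1].lower() in SKIP:
                continue
            dst = src + ".gz"
            # Skip files whose compressed copy is already up to date.
            if os.path.exists(dst) and os.path.getmtime(dst) >= os.path.getmtime(src):
                continue
            with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
                shutil.copyfileobj(fin, fout)
            written.append(dst)
    return written


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    for name in ("index.html", "logo.png"):
        with open(os.path.join(root, name), "wb") as fh:
            fh.write(b"x" * 1000)
    first_pass = [os.path.basename(p) for p in compress_tree(root)]
    second_pass = compress_tree(root)  # nothing has changed since the first pass
```

Because of the timestamp check, running it every hour only does work for files that have changed since the last pass.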

Didn't know mod_gzip provides for this approach, though; that's cool. I need to take another look.

Posted by: Aristotle Pagaltzis at February 19, 2004 06:07 PM

Major bugs to watch out for: does it apply the Vary: header to outgoing non-compressed content? Certain installations of mod_gzip can result in a situation where IE refuses to cache content. Also, be aware that compressed CSS and JS files can be corrupted on the client-side, if using IE earlier than the latest stuff.
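
To see why the Vary header matters on every response, compressed or not, here is a toy shared cache in Python. Everything in it (origin, ToyCache, the honor_vary switch) is invented for illustration and is far simpler than a real proxy.

```python
import gzip

PAGE = b"<html><body>hello</body></html>"


def origin(accept_encoding):
    """Toy origin server: gzips on request and always sends Vary."""
    headers = {"Vary": "Accept-Encoding"}
    if "gzip" in accept_encoding:
        headers["Content-Encoding"] = "gzip"
        return headers, gzip.compress(PAGE)
    return headers, PAGE


class ToyCache:
    """Shared cache that can key entries on Accept-Encoding, or ignore it."""

    def __init__(self, honor_vary=True):
        self.honor_vary = honor_vary
        self.store = {}

    def fetch(self, url, accept_encoding=""):
        # Ignoring Vary means one stored entry per URL, whoever asked first.
        key = (url, accept_encoding if self.honor_vary else None)
        if key not in self.store:
            self.store[key] = origin(accept_encoding)
        return self.store[key]
```

With honor_vary turned off, whichever client asks first decides what everyone else gets, so a client that never asked for gzip can be handed gzipped bytes.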

Posted by: Richard Soderberg at February 22, 2004 05:31 AM

I'm actually OK with breaking if folks look at the site with an older version of IE. Not, in this case, because I have any particular dislike of IE, but because all the non-current versions of IE in circulation do, so far as I know, have active exploits against them. Consider it an extra incentive to upgrade to a less-busted version of the browser. :)

Posted by: Dan at February 23, 2004 10:01 AM

Richard,

You wrote:

"Major bugs to watch out for: does it apply the Vary: header to outgoing non-compressed content? Certain installations of mod_gzip can result in a situation where IE refuses to cache content. Also, be aware that compressed CSS and JS files can be corrupted on the client-side, if using IE earlier than the latest stuff."

I guess you are referring to Q823386 (http://support.microsoft.com/default.aspx?scid=kb;en-us;823386&Product=ie600) which describes some problems I've been having with a site that runs over SSL and mod_deflate. Look at the technical description at the bottom and you can see why it breaks.

We find that occasionally our largest js file corrupts on login (with a clean cache). We find that when working locally this almost never occurs. But when accessing the website from the US, the problem occurs one in 10 times!

Unfortunately installing the new urlmon.dll (which is also available from critical security update MS03-048) didn't fix the problem. It just occurs less often.

To compound the problem, despite the following Apache settings, IE decides not to send the If-Modified-Since header when logging in a day after a successful login.


Header add Cache-Control "private, post-check=22800"

AddOutputFilterByType DEFLATE text/css text/xml text/html application/x-javascript
ExpiresActive On
ExpiresByType image/gif "access plus 8 hours"
ExpiresByType image/jpeg "access plus 8 hours"
ExpiresByType text/html "access plus 8 hours"
ExpiresByType text/plain "access plus 8 hours"
ExpiresByType text/css "access plus 8 hours"
ExpiresByType text/xml "access plus 8 hours"
ExpiresByType application/x-javascript "access plus 8 hours"

I did have
Header append Vary User-Agent

But decided to take it out.

Any other ideas how to get IE to cache properly, or perhaps not corrupt files?

Thanks

David Roussel

Posted by: David Roussel at March 11, 2004 09:40 AM