Improve HTML compression ratio

This is my first post on web performance. I have been wanting to write it for a while, and preparing the data for it has taken quite some time. I hope you like it – please give me feedback.

Squeezing the bytes

We can do all kinds of stuff to our page source to improve its performance. But if the page source were a given, what could we do to minimize its payload? Just a few examples to get things warmed up:

  • Remove whitespace
    An easy fix is to remove whitespace, since it doesn't affect browser rendering. Whitespace is highly redundant in itself (so it compresses well), but removing it will still have an impact.
  • Remove comments
    Also an easy fix – and if you are not already doing this, you should be.
  • Externalize/Combine JavaScript/CSS
    This not only reduces the number of requests, but also makes it possible to reduce the page source itself a bit.
  • Remove quotes in some cases
    While it may be a bit extreme, and leaves the result invalid, there are certainly cases where all those quotes are not needed – at least from the browser's perspective. mod_pagespeed comes with a filter, Elide Attributes, that does this.
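As an illustration, the first two items can be sketched with a couple of regular expressions. This is a naive sketch of my own – a real minifier must leave `<pre>`, `<textarea>`, `<script>` and conditional comments alone:

```python
import re

def strip_comments_and_whitespace(html: str) -> str:
    # Remove HTML comments, then collapse whitespace between tags.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    html = re.sub(r">\s+<", "><", html)
    return html

print(strip_comments_and_whitespace('<p>a</p>  <!-- note -->  <p>b</p>'))
# → <p>a</p><p>b</p>
```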

While these kinds of optimizations may not mean a lot – typically only 1-2% – they still add up, especially on mobile devices with low bandwidth and high latency.

What else can we do?

I think there is more to be done than what is outlined above. The data inside an HTML document can be categorized as either:

  • elements, like <html>
  • attributes, like id="value"
  • whitespace
  • comments
  • content

Reducing the number of elements is possible to a certain degree, but it is pretty hard to do. Whitespace and comments, on the other hand, are a no-brainer – they should simply go. Focusing on the content is also a bit out of scope, so let's assume that the page contains exactly the content it should present. No more, no less.

This leaves us with attributes, and their importance and weight haven't really received much attention. I did a quick count on some pages to check how large a share attributes take up, and the results surprised me a bit.

Share of attributes on various sites

Website   Page size   Attributes (1+)    Attributes (2+)
–         251.972     79.948 (31,73%)    58.967 (23,40%)
–         41.682      17.348 (41,62%)    13.593 (32,61%)
–         119.297      9.662  (8,10%)     7.704  (6,46%)
–         47.687      29.230 (61,30%)    27.159 (56,95%)
–         121.751     64.748 (53,18%)    42.034 (34,52%)
–         103.188     69.396 (67,25%)    55.439 (53,73%)
–         94.297      56.881 (60,32%)    42.450 (45,02%)
–         13.789       8.364 (60,66%)     6.394 (46,37%)
Fø        56.027      26.007 (46,42%)    21.054 (37,58%)
–         80.078      36.340 (45,38%)    25.268 (31,55%)
–         62.369      37.472 (60,08%)    24.590 (39,43%)

When calculating this, I took the length of the entire attribute – key, equals sign, quotes (when present) and value. The sum was made twice: once for attributes on all elements, and once for attributes on elements carrying two or more of them. I will explain why in a while.
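A rough version of that count can be sketched like this. The regex-based tag matching is my own simplification (it assumes double-quoted attribute values); it is an illustration of the method, not the tooling used for the table above:

```python
import re

def attribute_share(html: str):
    """Share of the document taken up by attributes: once for all
    elements, once for elements with two or more attributes."""
    total = len(html)
    attr_bytes = attr_bytes_2plus = 0
    for m in re.finditer(r'<\w[\w-]*((?:\s+[\w-]+(?:="[^"]*")?)*)\s*/?>', html):
        attrs = re.findall(r'[\w-]+(?:="[^"]*")?', m.group(1))
        size = sum(len(a) for a in attrs)  # key, '=', quotes and value
        attr_bytes += size
        if len(attrs) >= 2:
            attr_bytes_2plus += size
    return attr_bytes / total, attr_bytes_2plus / total

print(attribute_share('<p id="a">x</p>'))  # → (0.4, 0.0)
```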

The sum of it all is that attributes on average make up approximately 35% of a document! And I don't think we give them enough attention. So let's do that.

Compression: Another aspect to this

By now everyone knows that compression should be enabled to minimize the payload. And that goes for about 76% of the resources delivered. I am no expert on compression, but I do know it from a high-level perspective: the algorithm looks for redundancies in the text and “reuses” them in the output.

Compression is good for our payload, and we should try to get the most benefit from it. If we want to maximize the compression ratio, we should aim to maximize the number and length of patterns found by the compression algorithm.
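That high-level view is easy to demonstrate with Python's zlib module, which implements the same DEFLATE algorithm that gzip uses. A repeated pattern compresses to almost nothing, while random bytes – which contain no reusable redundancies – barely compress at all:

```python
import os
import zlib

redundant = b'<li class="item">' * 1000   # one pattern, repeated
noise = os.urandom(len(redundant))        # same size, zero redundancy

print(len(redundant), len(zlib.compress(redundant)))  # output is tiny
print(len(noise), len(zlib.compress(noise)))          # roughly input size
```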

Most of our document is locked, but luckily for us the W3C specifies in the XML recommendation that the order of attributes is not significant. This means that we can move around about 35% of the page size without too many constraints. The individual elements must maintain their semantic meaning, which essentially leads to a few constraints:

  • Two attributes with the same key must remain in the order in which they were specified
  • Obviously, we can’t remove any attributes

But does attribute order matter in a compressed document?

To test if the order of attributes matters, I set up an experiment. Based on the same websites as above, I shuffled all attributes on all elements – making the order totally random – and compressed the result. As a control, I used the compressed, unaltered original.
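The experiment can be reproduced in miniature with zlib. The tag and attribute values here are made up, and the real test of course shuffled full documents, but the effect is the same:

```python
import random
import zlib

random.seed(1)
attrs = ['id="a"', 'class="box"', 'data-x="1"']

def tag(order):
    return '<div %s>text</div>' % ' '.join(order)

# 300 identical tags with a fixed attribute order...
fixed = ''.join(tag(attrs) for _ in range(300))

# ...versus the same tags with the attributes shuffled per element.
def shuffled():
    a = attrs[:]
    random.shuffle(a)
    return a

randomized = ''.join(tag(shuffled()) for _ in range(300))

print(len(zlib.compress(fixed.encode())))       # smaller
print(len(zlib.compress(randomized.encode())))  # larger: fewer long matches
```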

Experimenting with order of attributes (compressed bytes)

Website   Control   Experiment   Gain/Loss
–         56.637    57.290         653 (1,15%)
–         11.905    12.201         296 (2,49%)
–         32.350    32.415          65 (0,20%)
–          9.446    10.282         836 (8,85%)
–         27.327    27.789         462 (1,69%)
–         15.026    16.348       1.322 (8,80%)
–          3.673     3.758          85 (2,31%)
–         20.041    21.123       1.082 (5,40%)
Fø        15.425    15.684         259 (1,68%)
–         18.578    18.997         419 (2,26%)
–         11.049    11.889         840 (7,60%)

As seen above, the order of attributes is important, and the increase in compressed page size seems to correlate – at least to some degree – with the total size of the attributes. All in all, attributes are very important if we want to squeeze the number of bytes as low as possible.

Can ordering of attributes improve compression ratio?

Since we now know the order of attributes can hurt a compressed page, it should also be possible to turn this to our advantage. What if there were some dominant way of ordering attributes? That would allow us to use output mechanisms like mod_pagespeed or Servlet Filters to apply the ordering for us and improve the compression ratio.

So I sat down and thought about possible strategies for ordering the attributes. This is the list I came up with:

  • byName
    Sorts attributes by name
  • byValueLength
    Sorts an element's attributes by the length of their value. The idea here is to make attributes with short values appear first
  • analytic-x
    Analyses the document for elements with more than X attributes, and sorts their attributes based on the number of unique values. If two attributes share the same number of unique values, they are sorted by name. X can be either 0 or 2.
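The simplest of these, byName, can be sketched for a single opening tag as follows. The regex is my own simplification (it assumes double-quoted values); note that a stable sort keeps attributes sharing a key in their original order, satisfying the constraint above:

```python
import re

def sort_attrs_by_name(tag: str) -> str:
    """byName strategy sketch for one opening tag."""
    m = re.match(r'<(\w+)((?:\s+[\w-]+(?:="[^"]*")?)*)\s*(/?)>$', tag)
    if not m:
        return tag
    name, attrs, slash = m.groups()
    parts = re.findall(r'[\w-]+(?:="[^"]*")?', attrs)
    # list.sort is stable, so duplicate keys keep their relative order.
    parts.sort(key=lambda a: a.split('=', 1)[0])
    return '<%s%s%s>' % (name, ''.join(' ' + p for p in parts), slash)

print(sort_attrs_by_name('<input type="text" name="q" id="search">'))
# → <input id="search" name="q" type="text">
```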

And the results are in…

Effect of strategies for arranging attributes (compressed bytes gained/lost vs. the control; negative is better)

Website   byName         byValueLength   analytic-0     analytic-2
–          -89 (-0,16%)    40 (0,07%)     -34 (-0,06%)   -70 (-0,12%)
–           -9 (-0,08%)    62 (0,52%)      -7 (-0,06%)   -15 (-0,13%)
–          -11 (-0,03%)    17 (0,05%)       9 (0,03%)      8 (0,02%)
–           45 (0,48%)    220 (2,33%)     294 (3,11%)    296 (3,13%)
–           -1 (0,00%)     48 (0,18%)      39 (0,14%)     45 (0,16%)
–         -164 (-1,09%)   155 (1,03%)       0 (0,00%)      2 (0,01%)
–          -20 (-0,10%)    84 (0,42%)      58 (0,29%)     41 (0,20%)
–          -12 (-0,33%)     1 (0,03%)      -2 (-0,05%)    -1 (-0,03%)
Fø         -62 (-0,40%)   -35 (-0,23%)    -52 (-0,34%)   -39 (-0,25%)
–          -58 (-0,31%)    -7 (-0,04%)    -61 (-0,33%)   -90 (-0,48%)
–          -76 (-0,69%)   -20 (-0,18%)      5 (0,05%)     -2 (-0,02%)

Sorting by length of value is – maybe not surprisingly – the worst of the strategies. analytic-2 does really well, and the big joker is byName, which provides pretty solid results with a much less complex algorithm.

The results also show that there is no dominant strategy, since re-ordering does not improve the compression ratio in all cases. As for the effect of doing it in the first place, keep in mind that some of the websites listed above are among the most optimized worldwide.

The lack of a dominant strategy disappoints me a bit, but I have a few ideas that I will try out to see how the results look. Please leave a comment if you liked my post.

This entry was posted in Web Performance.
