Improve HTML compression ratio

This is my first post on web performance. I have been wanting to write it for a while, and preparing the data for it has taken quite some time. I hope you like it, and please give me feedback.

Squeezing the bytes

We can do all kinds of stuff to our page source to improve its performance. But if the page source is given, what can we do to minimize its payload? Just a few examples to get things warmed up:

Remove whitespace: An easy fix is to remove whitespace, since it doesn't affect browser rendering. Whitespace is highly redundant and so compresses well, but removing it still has an impact.
Remove comments: Also an easy fix, and if you are not already doing it, you should be.
Externalize/Combine JavaScript/CSS: This not only reduces the number of requests, but also makes it possible to shrink the page source itself a bit.
Remove quotes in some cases: While it may be a bit extreme, and leaves the result non-valid, there are certainly cases where all those quotes are not needed - at least from the browser's perspective. mod_pagespeed comes with a filter, Remove Quotes, that does this.

While these kinds of optimizations may not mean a lot individually - typically only 1-2% - they still add up, especially on mobile devices with low bandwidth and high latency.
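To make the first two items concrete, here is a minimal sketch of comment and whitespace removal. The function name `minify` is my own, and the regular expressions assume the page has no `<pre>`, `<textarea>`, `<script>` or `<style>` content where whitespace or comment-like text is significant. A real minifier has to handle those cases.

```python
import re

# Matches HTML comments, but leaves IE conditional comments (<!--[if ...]>) alone.
COMMENT = re.compile(r"<!--(?!\[if).*?-->", re.DOTALL)
# Any run of whitespace, including newlines and tabs.
WHITESPACE = re.compile(r"\s+")


def minify(html):
    """Strip comments, then collapse every whitespace run to a single space.

    Assumption: whitespace is never significant in the input, which is
    not true for <pre>, <textarea>, <script> or <style> content.
    """
    without_comments = COMMENT.sub("", html)
    return WHITESPACE.sub(" ", without_comments)


print(minify("<p>  a  </p> <!-- x -->\n<p>b</p>"))
```

Collapsing to a single space rather than removing whitespace entirely is the cautious choice here, because a space between inline elements can affect rendering.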

What else can we do?

I think there is more to be done than outlined above. The data inside an HTML document can be categorized as either:

elements, like <html>
attributes, like id="value"
whitespace
comments
content

Reducing the number of elements can only be done to a certain degree, and is pretty hard to do. Whitespace and comments, on the other hand, are a no-brainer - they should simply go. Focusing on the content is also a bit out of scope, so let's assume that the page contains exactly the content that it should present. No more, no less.

This leaves us with attributes, and their importance and weight haven’t really had that much attention. I did a quick count on some pages to check the share taken up by attributes, and the results surprised me a bit.

Share of attributes on various sites
Website	Page Size	Attributes (1+)	Attributes (2+)
Amazon.com	251.972	79.948 (31,73%)	58.967 (23,40%)
Facebook.com	41.682	17.348 (41,62%)	13.593 (32,61%)
Google.com	119.297	9.662 (8,10%)	7.704 (6,46%)
Wikipedia.org	47.687	29.230 (61,30%)	27.159 (56,95%)
Yahoo.com	121.751	64.748 (53,18%)	42.034 (34,52%)
YouTube.com	103.188	69.396 (67,25%)	55.439 (53,73%)
Bilka.dk	94.297	56.881 (60,32%)	42.450 (45,02%)
DanskSupermarked.dk	13.789	8.364 (60,66%)	6.394 (46,37%)
Føtex.dk	56.027	26.007 (46,42%)	21.054 (37,58%)
Netto.dk	80.078	36.340 (45,38%)	25.268 (31,55%)
Salling.dk	62.369	37.472 (60,08%)	24.590 (39,43%)

When calculating this, I took the length of the entire attribute: key, value, equals sign, and quotes if they were present. The sum is made twice, once for attributes on all elements, and once for attributes on elements with two or more attributes. I will explain why in a while.
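A count like this could be reproduced roughly as follows. This is only a sketch, not the tool I used for the table: the class and function names are made up for illustration, and the attribute regular expression is a simplification that will miscount on malformed markup.

```python
import re
from html.parser import HTMLParser

# An attribute as it appears in the source: a name, optionally followed
# by "=" and a double-quoted, single-quoted or unquoted value.
ATTR = re.compile(r'''[^\s"'<>/=]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?''')


class AttributeCounter(HTMLParser):
    """Sums the raw source length of attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.all_attrs = 0    # attributes on all elements (1+)
        self.multi_attrs = 0  # attributes on elements with two or more (2+)

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text() or ""
        body = raw[1 + len(tag):]  # drop "<" and the tag name
        lengths = [len(m.group(0)) for m in ATTR.finditer(body)]
        self.all_attrs += sum(lengths)
        if len(lengths) >= 2:
            self.multi_attrs += sum(lengths)


def attribute_share(html):
    """Return (length of all attributes, length of attributes on 2+ elements)."""
    counter = AttributeCounter()
    counter.feed(html)
    counter.close()
    return counter.all_attrs, counter.multi_attrs


page = '<a href="x" id=y>hi</a><p class="z">t</p>'
one_plus, two_plus = attribute_share(page)
print(one_plus, two_plus, len(page))
```

Dividing each of the two sums by `len(page)` gives the percentages in the style of the table.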

The sum of it all is that attributes on average make up approx. 35% of a document! And I don't think we give them enough attention. So let's do that.

Compression: Another aspect to this

By now everyone knows that compression should be enabled to minimize the payload. And that goes for about 76% of the resources delivered. I am no expert on compression, but I know it from a high-level perspective: it looks for redundancies in the text and "reuses" them in the output.

Compression is good for our payload, and we should try to get the most benefit from it. If we want to maximize the compression ratio, we should aim to maximize the number and length of patterns found by the compression algorithm.
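A small synthetic illustration of this idea: the same attributes written in a consistent order versus a random order per element. The markup, attribute values and helper names below are invented for the example, and I use zlib's DEFLATE as a stand-in for what a web server would do with gzip.

```python
import random
import zlib


def deflate_size(text):
    """Size in bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def render(i, order):
    """Render one made-up link element with its attributes in the given order."""
    values = {
        "class": "item",
        "href": f"/p/{i}",
        "title": f"Product {i}",
        "rel": "nofollow",
    }
    attrs = " ".join(f'{name}="{values[name]}"' for name in order)
    return f"<a {attrs}>x</a>"


names = ["class", "href", "title", "rel"]
rng = random.Random(42)  # fixed seed so the run is repeatable

consistent = "".join(render(i, names) for i in range(500))
shuffled = "".join(render(i, rng.sample(names, len(names))) for i in range(500))

# Both documents contain exactly the same bytes per element, only reordered.
print(len(consistent), len(shuffled))
print(deflate_size(consistent), deflate_size(shuffled))
```

The uncompressed sizes are identical. Only the compressed sizes differ, because the consistent version gives the compressor longer repeated patterns to reuse.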

We are locked on most of our document, but luckily for us the W3C recommendation for XML specifies that the order of attributes is not significant. Reordering only makes sense on elements with two or more attributes, which is why I counted those separately above. This means that we can move around about 35% of the page size without too many constraints. The individual elements must keep their semantic meaning, which essentially leads to a few constraints:

Two attributes with the same key would have to remain in the order they were specified
Obviously, we can’t remove any attributes

But do attributes matter in a compressed document?

To test if the order of attributes matters, I set up an experiment. Based on the same websites as above, I tried shuffling all attributes on all elements - making the order totally random - and compressing it all. As a control, I used the unaltered compressed original.
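The shuffling step could be sketched like this. It is a simplified stand-in for the experiment, not the code behind the numbers below: `shuffle_attributes` is a hypothetical name, the regular expressions assume reasonably clean markup, and text inside `<script>` that happens to look like a tag would be rewritten too.

```python
import random
import re

# One attribute: name, optionally "=" and a quoted or unquoted value.
ATTR = r'''[^\s"'<>/=]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?'''
ATTR_RE = re.compile(ATTR)
# A start tag with at least one attribute: name, attributes, optional "/".
START_TAG = re.compile(r'<([a-zA-Z][^\s/>]*)((?:\s+' + ATTR + r')+)(\s*/?)>')


def shuffle_attributes(html, rng):
    """Return html with the attributes of every start tag in random order.

    Assumption: attributes are re-joined with single spaces, so the
    length only stays the same if the input used single spaces too.
    """
    def rewrite(match):
        name, attr_text, tail = match.groups()
        attrs = ATTR_RE.findall(attr_text)
        rng.shuffle(attrs)
        return f"<{name} {' '.join(attrs)}{tail}>"

    return START_TAG.sub(rewrite, html)


page = '<a href="/x" class="k" id="i">t</a><br>'
print(shuffle_attributes(page, random.Random(1)))
```

Compressing both the original and the shuffled output, for instance with `gzip.compress`, and comparing the lengths gives one row of the table. Note that a pure shuffle ignores the constraint about repeated keys listed above.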

Experimenting with order of attributes
Website	Control	Experiment (size)	Experiment (gain/loss)
Amazon.com	56.637	57.290	653 (1,15%)
Facebook.com	11.905	12.201	296 (2,49%)
Google.com	32.350	32.415	65 (0,20%)
Wikipedia.org	9.446	10.282	836 (8,85%)
Yahoo.com	27.327	27.789	462 (1,69%)
YouTube.com	15.026	16.348	1.322 (8,80%)
DanskSupermarked.dk	3.673	3.758	85 (2,31%)
Bilka.dk	20.041	21.123	1.082 (5,40%)
Føtex.dk	15.425	15.684	259 (1,68%)
Netto.dk	18.578	18.997	419 (2,26%)
Salling.dk	11.049	11.889	840 (7,60%)

As seen above, the order of attributes is important, and the increase in compressed page size seems to correlate with the total size of attributes - at least to some degree. All in all, attributes are very important if we want to squeeze the number of bytes as low as possible.

Can ordering of attributes improve compression ratio?

Since we now know the order of attributes can hurt a compressed page, it should also be possible to use this to our advantage. What if there were some dominant way of ordering attributes? This would allow us to use output mechanisms like mod_pagespeed or Servlet Filters to apply this ordering for us, and improve compression ratio.

So I sat down, and thought about possible strategies for ordering the attributes, and this is the list I came up with:

byName: Sorts attributes by name.
byValueLength: Sorts an element's attributes by the length of their value. The idea here is to get attributes with short values to appear first.
analytic-x: Analyses the document for elements with more than X attributes, and sorts them based on the number of unique values. If two attributes share the same number of unique values, they are sorted by name. X can be either 0 or 2.
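The three strategies can be sketched on already-parsed attributes, where each element is a list of (name, value) pairs. This is my reading of the descriptions above rather than the exact code used for the results. In particular, putting the attribute with the fewest unique values first in `analytic` is an assumption, since the sort direction is not spelled out.

```python
from collections import defaultdict


def by_name(attrs):
    """Sort one element's attributes by name.

    sorted() is stable, so two attributes with the same key keep
    their original relative order.
    """
    return sorted(attrs, key=lambda attr: attr[0])


def by_value_length(attrs):
    """Sort one element's attributes so that short values come first."""
    return sorted(attrs, key=lambda attr: len(attr[1]))


def analytic(elements, x):
    """Sort attributes by how many unique values each name has in the document.

    Only elements with more than x attributes are analysed and sorted.
    Assumption: fewest unique values first, ties broken by name.
    """
    unique_values = defaultdict(set)
    for attrs in elements:
        if len(attrs) > x:
            for name, value in attrs:
                unique_values[name].add(value)

    def sort_key(attr):
        return (len(unique_values[attr[0]]), attr[0])

    return [
        sorted(attrs, key=sort_key) if len(attrs) > x else list(attrs)
        for attrs in elements
    ]


elements = [
    [("href", "/a"), ("class", "k")],
    [("href", "/b"), ("class", "k")],
    [("id", "solo")],
]
print(by_name(elements[0]))
print(analytic(elements, 0))
```

In this toy document `class` has one unique value and `href` has two, so `analytic` with X = 0 places `class` first on both links.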

And the results are in…

Effect of strategies for arranging attributes (change in compressed size relative to the unaltered original; negative means smaller)
Website	byName	byValueLength	analytic-0	analytic-2
Amazon.com	-89 (-0,16%)	40 (0,07%)	-34 (-0,06%)	-70 (-0,12%)
Facebook.com	-9 (-0,08%)	62 (0,52%)	-7 (-0,06%)	-15 (-0,13%)
Google.com	-11 (-0,03%)	17 (0,05%)	9 (0,03%)	8 (0,02%)
Wikipedia.org	45 (0,48%)	220 (2,33%)	294 (3,11%)	296 (3,13%)
Yahoo.com	-1 (0,00%)	48 (0,18%)	39 (0,14%)	45 (0,16%)
YouTube.com	-164 (-1,09%)	155 (1,03%)	0 (0,00%)	2 (0,01%)
Bilka.dk	-20 (-0,10%)	84 (0,42%)	58 (0,29%)	41 (0,20%)
DanskSupermarked.dk	-12 (-0,33%)	1 (0,03%)	-2 (-0,05%)	-1 (-0,03%)
Føtex.dk	-62 (-0,40%)	-35 (-0,23%)	-52 (-0,34%)	-39 (-0,25%)
Netto.dk	-58 (-0,31%)	-7 (-0,04%)	-61 (-0,33%)	-90 (-0,48%)
Salling.dk	-76 (-0,69%)	-20 (-0,18%)	5 (0,05%)	-2 (-0,02%)

Sorting by length of value is - maybe not surprisingly - the worst of the strategies. analytic-2 does really well, and the wild card is byName, which provides pretty solid results with a less complex algorithm.

The results also show that there is no dominant strategy, since re-ordering does not improve compression ratio in all cases. As for the size of the effect, keep in mind that some of the websites listed above are among the most optimized worldwide.

The lack of a dominant strategy kind of disappoints me, but I have a few ideas that I will try out and see how the results look. Please leave a comment if you liked my post.