Vincent Gable’s Blog

January 26, 2009

Compressibility of English Text

Filed under: Research
― Vincent Gable on January 26, 2009

Theory:

Some early experiments by Shannon67 and Cover and King68 attempted to find a lower bound for the compressibility of English text by having human subjects make the predictions for a compression system. These authors concluded that English has an inherent entropy of around 1 bit per character, and that we are unlikely ever to be able to compress it at a better rate than this.

67 C. E. Shannon, “Prediction and entropy of printed English”, Bell System Technical Journal 30 (1951) 50–64. (Here’s a bad PDF scan)

68 T. M. Cover and R. C. King, “A convergent gambling estimate of the entropy of English”, IEEE Trans. on Information Theory IT-24 (1978) 413–421.

Signal Compression: Coding of Speech, Audio, Text, Image and Video
By N. Jayant

Shannon estimated 0.6–1.3 bits per character for English; 0.6 bits is the lowest value I have seen anyone claim.

Practice:

Just as a data point, I tried gzip --best on a plain-text file of The Adventures of Sherlock Holmes, weighing in at 105,471 words and 578,798 bytes. The compressed file was 220,417 bytes.

If we assume the uncompressed version used one byte (8 bits) per character, then gzip --best used about 3 bits per character.
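That arithmetic is easy to check with a few lines of Python, using the byte counts reported above (nothing here is measured fresh; the numbers come straight from the post):

```python
# Byte counts from the Sherlock Holmes experiment above.
uncompressed_bytes = 578798  # plain-text file
compressed_bytes = 220417    # after `gzip --best`

# Assume one byte (8 bits) per character in the uncompressed text.
bits_per_char = 8 * compressed_bytes / uncompressed_bytes
print(f"{bits_per_char:.2f} bits per character")  # prints "3.05 bits per character"
```

The same formula applied to any compressor's output size gives a comparable bits-per-character figure, as long as the one-byte-per-character assumption holds for the input.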

Best so Far

The state-of-the-art in the Hutter Prize, a challenge to compress 150 MB of Wikipedia content, is 1.319 bits per character. But that’s with a program tuned just for that data set, and it took 9 hours to run.

2 Comments

  1. Tried 7zip? I’m curious if it’s any better.

    Comment by Jason Petersen — January 26, 2009 @ 5:42 am

  2. 7za -mx5 reduced the file to 184040 bytes. Again assuming each byte of the uncompressed file is a character of English, that would be roughly 2.5 bits per character.

    Comment by Vincent Gable — January 26, 2009 @ 9:15 am
