
Compressibility of English Text

Filed under: Research
― Vincent Gable on January 26, 2009

Theory:

Some early experiments by Shannon [67] and Cover and King [68] attempted to find a lower bound for the compressibility of English text by having human subjects make the predictions for a compression system. These authors concluded that English has an inherent entropy of around 1 bit per character, and that we are unlikely ever to be able to compress it at a better rate than this.

[67] C. E. Shannon, "Prediction and entropy of printed English", Bell System Technical Journal 30 (1951) 55. (Here's a bad PDF scan.)

[68] T. M. Cover and R. C. King, "A convergent gambling estimate of the entropy of English", IEEE Transactions on Information Theory IT-24 (1978) 413-421.

Signal Compression: Coding of Speech, Audio, Text, Image and Video
By N. Jayant

Shannon's own estimate is 0.6–1.3 bits per character for English; 0.6 bits is the lowest figure I have seen anyone claim.
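As a sanity check on those numbers, here is a minimal sketch (my own, not Shannon's experiment) that computes the zeroth-order entropy of a text, i.e. the entropy of its raw character frequencies with all context between characters ignored. On English prose this typically comes out near 4 bits per character; the gap between that and Shannon's 0.6–1.3 figure is exactly the context that human predictors (and good compressors) exploit. The filename holmes.txt is a stand-in.

    import math
    from collections import Counter

    def zeroth_order_entropy(text):
        # Entropy in bits per character from raw symbol frequencies,
        # ignoring all context between characters. This is only a crude
        # upper bound on the true entropy of the text.
        counts = Counter(text)
        total = len(text)
        return -sum(n / total * math.log2(n / total) for n in counts.values())

    with open("holmes.txt", encoding="utf-8") as f:  # stand-in filename
        print("%.2f bits per character" % zeroth_order_entropy(f.read()))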

Practice:

Just as a data point, I tried gzip --best on a plain-text file of The Adventures of Sherlock Holmes: 105,471 words in 578,798 bytes. The compressed file was 220,417 bytes.

If we assume the uncompressed version used one byte (8 bits) per character, then gzip --best achieved about 3 bits per character: 220,417 × 8 / 578,798 ≈ 3.05.
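To reproduce the measurement without the gzip binary on hand, here is a rough equivalent using Python's standard zlib module, which implements the same DEFLATE algorithm gzip uses; holmes.txt is again a stand-in filename.

    import zlib

    with open("holmes.txt", "rb") as f:  # stand-in filename
        raw = f.read()

    # Level 9 is zlib's maximum effort, comparable to `gzip --best`
    # (gzip adds a small header, so its files are a few bytes larger).
    packed = zlib.compress(raw, 9)
    print("%d -> %d bytes" % (len(raw), len(packed)))
    print("%.2f bits per character" % (8.0 * len(packed) / len(raw)))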

Best So Far

The state of the art in the Hutter Prize, a challenge to compress 100 MB of Wikipedia content, is 1.319 bits per character. But that's with a program tuned specifically for that data set, and it took 9 hours to run.
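For scale, assuming one byte per character as above: 1.319 bits per character over the 10^8-byte input works out to about 1.319 × 10^8 / 8 ≈ 16.5 MB of compressed output, versus roughly 38 MB at gzip's ~3 bits per character.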
