Theory:
Some early experiments by Shannon [67] and Cover and King [68] attempted to find a lower bound for the compressibility of English text by having human subjects make the predictions for a compression system. These authors concluded that English has an inherent entropy of around 1 bit per character, and that we are unlikely ever to be able to compress it at a better rate than this.
[67] C. E. Shannon, “Prediction and entropy of printed English”, Bell System Technical Journal 30 (1951) 55. (Here’s a bad PDF scan.)
[68] T. M. Cover and R. C. King, “A convergent gambling estimate of the entropy of English”, IEEE Trans. on Information Theory IT-24 (1978) 413–421.
—Signal Compression: Coding of Speech, Audio, Text, Image and Video, by N. Jayant
Shannon says 0.6–1.3 bits per character of English; 0.6 bits is the lowest value I have seen anyone claim.
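For a sense of scale: if you only model single-character frequencies (a zeroth-order model), English measures roughly 4–4.5 bits per character, depending on how you count case and punctuation. Everything below that comes from exploiting context, which is what Shannon's human predictors were doing. Here is a minimal Python sketch of the zeroth-order calculation (the function name and the holmes.txt filename are mine, purely for illustration):

    import math
    from collections import Counter

    def order0_entropy(text):
        # Entropy of the single-character distribution, in bits per character.
        counts = Counter(text)
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # holmes.txt is a hypothetical filename; any English plain text will do.
    print(order0_entropy(open("holmes.txt", encoding="utf-8").read()))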
Practice:
Just as a data point, I tried gzip --best on a plain-text file of The Adventures of Sherlock Holmes, weighing in at 105,471 words and 578,798 bytes. The compressed file was 220,417 bytes.

If we assume the uncompressed version used one byte (8 bits) per character, then gzip --best used about 3 bits per character (220,417 / 578,798 × 8 ≈ 3.05).
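If you want to reproduce that measurement, here is a minimal Python sketch; holmes.txt is a stand-in for whatever plain-text copy of the book you have, and compresslevel=9 is the library equivalent of gzip --best:

    import gzip

    # holmes.txt is a placeholder; point it at any plain-text copy of the book.
    with open("holmes.txt", "rb") as f:
        raw = f.read()

    # compresslevel=9 matches gzip --best.
    compressed = gzip.compress(raw, compresslevel=9)

    # Assumes one byte per character, as in the figures above.
    print(f"{8 * len(compressed) / len(raw):.2f} bits per character")

With the file sizes quoted above, this works out to about 3.05.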
Best So Far:
The state of the art in the Hutter Prize, a challenge to compress 100 MB of Wikipedia content, is 1.319 bits per character. But that’s with a program tuned just for that data set, and it took 9 hours to run.
Tried 7zip? I’m curious if it’s any better.
Comment by Jason Petersen — January 26, 2009 @ 5:42 am
7za -mx5 reduced the file to 184,040 bytes. Again assuming each byte of the uncompressed file is a character of English, that is roughly 2.5 bits per character (184,040 / 578,798 × 8 ≈ 2.54).

Comment by Vincent Gable — January 26, 2009 @ 9:15 am