{"id":224,"date":"2009-01-26T00:05:38","date_gmt":"2009-01-26T05:05:38","guid":{"rendered":"http:\/\/vgable.com\/blog\/2009\/01\/26\/compressibility-of-english-text\/"},"modified":"2009-01-29T11:06:10","modified_gmt":"2009-01-29T16:06:10","slug":"compressibility-of-english-text","status":"publish","type":"post","link":"https:\/\/vgable.com\/blog\/2009\/01\/26\/compressibility-of-english-text\/","title":{"rendered":"Compressibility of English Text"},"content":{"rendered":"<h3>Theory:<\/h3>\n<blockquote><p>Some early experiments by Shannon<sup>67<\/sup> and Cover and King<sup>68<\/sup> attempted to find a lower bound for the compressibility of English text by having human subjects make the predictions for a compression system.  These authors concluded that <strong>English has an inherent entropy of around 1 bit per character<\/strong>, and that we are unlikely ever to be able to compress it at a better rate than this.<\/p>\n<p><sup>67<\/sup> C. E. Shannon, &#8220;Prediction and entropy of printed English&#8221;, <em>Bell Systems Technical J<\/em>. 30 (1951) 55. (<a href=\"http:\/\/www.princeton.edu\/~wbialek\/rome\/refs\/shannon_51.pdf\">Here&#8217;s a bad PDF scan<\/a>)<\/p>\n<p><sup>68<\/sup> T.M. Cover and R. C. King, &#8220;A convergent gambling estimate of the entropy of English&#8221;, <em>IEEE Trans. on Information Theory<\/em> <strong>IT-24<\/strong> (1978) 413-421\n<\/p><\/blockquote>\n<p>&#8212;<a href=\"http:\/\/books.google.com\/books?id=KySFiuq-ZykC&#038;pg=PA208&#038;lpg=PA208&#038;dq=compressibility+of+english+text&#038;source=web&#038;ots=BPZ-ubtaeU&#038;sig=S5Jey_lZWd2JlBs0hqVVRGKlz1o&#038;hl=en&#038;sa=X&#038;oi=book_result&#038;resnum=2&#038;ct=result\"><em>Signal Compression: Coding of Speech, Audio, Text, Image and Video<\/em><br \/>\nBy N. Jayant<\/a><\/p>\n<p>Shannon says 0.6-1.3 bits per character of English &#8212; 0.6 bits is the lowest value I have seen anyone claim.<\/p>\n<h3>Practice:<\/h3>\n<p>Just as a datapoint I tried <code>gzip --best<\/code> on plain-text file of <em>The Adventures of Sherlock Holmes<\/em>, weighing in at 105471 words, and using 578798 bytes.  The compressed file was 220417 bytes.<\/p>\n<p>If we assume the uncompressed version used one byte (8 bits) per character, then <strong><code>gzip --best<\/code> used about 3 bits per character.<\/strong><\/p>\n<h3>Best so Far<\/h3>\n<p>The state-of-the-art in the <a href=\"http:\/\/prize.hutter1.net\/\">Hutter Prize<\/a>, a challenge to compress 150 MB of Wikipedia content, is 1.319 bits per character.  But that&#8217;s with a program tuned <em>just<\/em> for that data set, and it took 9 hours to run.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Theory: Some early experiments by Shannon67 and Cover and King68 attempted to find a lower bound for the compressibility of English text by having human subjects make the predictions for a compression system. These authors concluded that English has an inherent entropy of around 1 bit per character, and that we are unlikely ever to [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[187,318,317,188],"class_list":["post-224","post","type-post","status-publish","format-standard","hentry","category-research","tag-english","tag-entropy","tag-information-theory","tag-linguistics"],"_links":{"self":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/posts\/224","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/comments?post=224"}],"version-history":[{"count":0,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/posts\/224\/revisions"}],"wp:attachment":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/media?parent=224"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/categories?post=224"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/tags?post=224"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}