Vincent Gable’s Blog

March 2, 2009

Initial Findings: How Long is an (English) Word?

Filed under: Research
― Vincent Gable on March 2, 2009

My brief research into the English language revealed the average character count of a word is eight. Throw together a bunch of a smaller and bigger words, some single spaces and punctuation and you roughly end up with the average 140-character tweet being somewhere between 14 and 20 words. Let’s call it 15.

Rands in Repose

That contradicts the common wisdom I’ve heard: the average word is 5 letters, so divide your character count by 6 to get a word count.

But that was a rule of thumb from the days of typewriters. Hypertext and formatting change things. For example, every time you see something in boldface on my blog, there are an extra 17 characters for the HTML code, <strong></strong>, that makes the text bold.

Just to poke at the problem, I used wc to find the number of characters per word in a few documents. What I found supports the 6 characters per word rule of thumb for content, but not for HTML code. The number of characters per word in HTML was higher than 6, and varied greatly.
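To make the measurement concrete, here is a minimal Python sketch of the same ratio. It assumes the counts came from something like `wc -m` (characters, including spaces and punctuation) and `wc -w` (whitespace-separated words); the post does not say which flags were used, and `chars_per_word` is a name invented for this example.

```python
def chars_per_word(text: str) -> float:
    """Return total characters divided by whitespace-separated words.

    Assumption: this mirrors `wc -m` / `wc -w`, so spaces and punctuation
    count as characters. That is why "5 letters per word" turns into
    roughly 6 characters per word.
    """
    words = text.split()  # split on any run of whitespace, like wc -w
    if not words:
        raise ValueError("text contains no words")
    return len(text) / len(words)


if __name__ == "__main__":
    sample = "the quick brown fox"
    print(chars_per_word(sample))  # 19 characters / 4 words = 4.75
```

Run on a whole file, this should give the same kind of figure as dividing the two `wc` counts by hand.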

The text of the front page article on today’s New York Times was 5880 characters, 960 words: 6 characters per word.

The plain text of Rands's webpage claiming 8 characters per word was 6794 characters, 1175 words: 6 characters per word. By plain text, I mean just the words of the HTML after it was rendered, so formatting, images, links, etc. were ignored. The HTML source for the page, however, was 15952 characters, meaning 14 characters per word.
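The gap between the rendered text and the HTML source can be reproduced roughly with Python's standard `html.parser`. This is only a sketch: the post does not say how the rendered text was obtained, and `TextExtractor`, `visible_text`, and `chars_per_visible_word` are hypothetical names. A real browser renders whitespace and hidden elements differently, so the numbers will not match exactly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes only, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html: str) -> str:
    """Return an approximation of the text a reader would see."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def chars_per_visible_word(html: str) -> tuple[float, float]:
    """Return (source chars per word, plain-text chars per word).

    Both ratios divide by the number of words in the visible text,
    which is how the figures in the post were computed.
    """
    text = visible_text(html)
    words = len(text.split())
    if words == 0:
        raise ValueError("document contains no visible words")
    return len(html) / words, len(text) / words


if __name__ == "__main__":
    page = "<p>Some <strong>bold</strong> text</p>"
    print(visible_text(page))            # Some bold text
    print(chars_per_visible_word(page))  # source ratio is the larger one
```

In the tiny example, the markup adds 24 characters to 14 characters of text, which is the same effect the page above shows on a larger scale.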

What about technical stuff? The best paper I read last year was Some thoughts on security after ten years of qmail 1.0 (PDF). It has no pictures, just 9517 formatted words. A PDF represents it with 161496 bytes (17 bytes per word), but ignoring formatting it is 62567 characters (7 characters per word).

I’m still looking into how long English words are in practice. If you have research or an opinion, please share it.

January 26, 2009

Compressibility of English Text

Filed under: Research
― Vincent Gable on January 26, 2009

Theory:

Some early experiments by Shannon [67] and Cover and King [68] attempted to find a lower bound for the compressibility of English text by having human subjects make the predictions for a compression system. These authors concluded that English has an inherent entropy of around 1 bit per character, and that we are unlikely ever to be able to compress it at a better rate than this.

[67] C. E. Shannon, “Prediction and entropy of printed English”, Bell Systems Technical J. 30 (1951) 55. (Here’s a bad PDF scan)

[68] T. M. Cover and R. C. King, “A convergent gambling estimate of the entropy of English”, IEEE Trans. on Information Theory IT-24 (1978) 413-421

Signal Compression: Coding of Speech, Audio, Text, Image and Video
By N. Jayant

Shannon says 0.6-1.3 bits per character of English — 0.6 bits is the lowest value I have seen anyone claim.
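For contrast with those figures, here is a small Python sketch of the simplest possible entropy estimate: the entropy of single-character frequencies, H = -Σ p·log2(p). This is not Shannon's prediction experiment. It ignores all context between characters, so on English text it should come out well above the roughly 1 bit per character quoted above. The function name `unigram_entropy_bits` is made up for this example.

```python
import math
from collections import Counter


def unigram_entropy_bits(text: str) -> float:
    """Entropy in bits per character, using single-character frequencies only.

    Assumption: characters are treated as independent draws, which
    overstates the entropy of real English because context is ignored.
    """
    if not text:
        raise ValueError("text is empty")
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print(unigram_entropy_bits("abab"))  # two equally likely symbols: 1.0
    print(unigram_entropy_bits("abcd"))  # four equally likely symbols: 2.0
```

The distance between this context-free number and the human-prediction estimates is a rough measure of how much of English is predictable from what came before.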

Practice:

Just as a data point, I tried gzip --best on a plain-text file of The Adventures of Sherlock Holmes, weighing in at 105471 words and 578798 bytes. The compressed file was 220417 bytes.

If we assume the uncompressed version used one byte (8 bits) per character, then gzip --best used about 3 bits per character.
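The arithmetic behind that figure, plus a way to repeat the experiment, can be sketched in Python. The standard `gzip` module's `compresslevel=9` corresponds to `gzip --best`; the helper names are invented for this example, and the one byte per character assumption is the same one made above.

```python
import gzip


def bits_per_char(original_bytes: int, compressed_bytes: int) -> float:
    """Compressed bits spent per original character (1 byte = 1 character)."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return compressed_bytes * 8 / original_bytes


def gzip_bits_per_char(data: bytes) -> float:
    """Compress data at gzip's highest level and report bits per character."""
    compressed = gzip.compress(data, compresslevel=9)  # same level as --best
    return bits_per_char(len(data), len(compressed))


if __name__ == "__main__":
    # Sizes reported in the post for The Adventures of Sherlock Holmes.
    print(round(bits_per_char(578798, 220417), 2))  # 3.05
```

To try it on another text, read the file in binary mode and pass the bytes to `gzip_bits_per_char`. Output sizes may differ by a few bytes from the command-line tool because of header fields such as the file name.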

Best so Far

The state of the art in the Hutter Prize, a challenge to compress 100 MB of Wikipedia content, is 1.319 bits per character. But that’s with a program tuned just for that data set, and it took 9 hours to run.

September 26, 2008

Simple English

Filed under: Quotes, Usability
― Vincent Gable on September 26, 2008

There are 400 million native English speakers, but over a billion people who speak English as a second language. … At any given instant on this planet, most people who are speaking English are not native speakers.

Perhaps we should take a good look at common forms of incorrect grammar and see if they actually make our language easier to learn. Maybe we should give a loose leash to those who are trying to make English more accessible.

I am going to try to use simple language and limited slang in my writing. When one considers the population of the world, it seems rather rude to address only the native English speakers.

Aaron Hillegass
