{"id":269,"date":"2009-03-02T14:30:56","date_gmt":"2009-03-02T19:30:56","guid":{"rendered":"http:\/\/vgable.com\/blog\/2009\/03\/02\/initial-findings-how-long-is-an-english-word\/"},"modified":"2009-03-02T14:30:59","modified_gmt":"2009-03-02T19:30:59","slug":"initial-findings-how-long-is-an-english-word","status":"publish","type":"post","link":"https:\/\/vgable.com\/blog\/2009\/03\/02\/initial-findings-how-long-is-an-english-word\/","title":{"rendered":"Initial Findings: How Long is an (English) Word?"},"content":{"rendered":"<blockquote><p>My brief research into the English language revealed <strong>the average character count of a word is eight<\/strong>. Throw together a bunch of a smaller and bigger words, some single spaces and punctuation and you roughly end up with the average 140-character tweet being somewhere between 14 and 20 words. Let&#8217;s call it 15.<\/p><\/blockquote>\n<p>&#8212;<a href=\"http:\/\/www.randsinrepose.com\/archives\/2009\/03\/02\/the_art_of_the_tweet.html\">Rands in Repose<\/a><\/p>\n<p>That contradicts the common wisdom I&#8217;ve heard: <strong>the average word is 5 letters, so divide your character count by 6 to get a word count<\/strong>.<\/p>\n<p>But that was a rule of thumb from the days of typewriters. Hypertext and formatting change things.  For example, every time you see something in <strong>boldface<\/strong> on my blog, there are an extra 17 characters for the HTML code, <code>&lt;strong&gt;&lt;\/strong&gt;<\/code>, that makes the text bold.<\/p>\n<p>Just to poke at the problem, I used <code><a href=\"http:\/\/developer.apple.com\/DOCUMENTATION\/Darwin\/Reference\/ManPages\/man1\/wc.1.html\">wc<\/a><\/code> to find the number of characters per word in a few documents.  What I found supports the <strong>6 characters per word<\/strong> rule of thumb for content, but not for HTML code.  
The number of characters per word in HTML was higher than 6, and varied greatly.<\/p>\n<p><a href=\"http:\/\/www.nytimes.com\/2009\/03\/03\/business\/worldbusiness\/03markets.html?_r=1&#038;hp=&#038;pagewanted=print\">The text of the front page article on today&#8217;s <em>New York Times<\/em><\/a> was 5880 characters, 960 words: <strong>6 characters per word<\/strong>.<\/p>\n<p>The <em>plain text<\/em> of <a href=\"http:\/\/www.randsinrepose.com\/archives\/2009\/03\/02\/the_art_of_the_tweet.html\">Rands&#8217;s webpage claiming 8 chars per word<\/a> was 6794 characters, 1175 words: <strong>6 characters per word<\/strong>.  By plain text, I mean just the <em>words<\/em> of the HTML after it was rendered, so formatting, images, links, etc. were ignored.  The HTML source for the page, however, was 15952 characters, meaning <strong>14 characters per word<\/strong>.<\/p>\n<p>What about technical stuff? The best paper I read last year was <a href=\"http:\/\/cr.yp.to\/qmail\/qmailsec-20071101.pdf\"><cite>Some thoughts on security after ten years of qmail 1.0<\/cite> (PDF)<\/a>.  It has no pictures, just 9517 formatted words. A PDF represents it with 161496 bytes (<strong>17 bytes per word<\/strong>), but ignoring formatting it is 62567 characters (<strong>7 characters per word<\/strong>).<\/p>\n<p>I&#8217;m still looking into how long English words are in practice. Please share your research, if you have an opinion.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>My brief research into the English language revealed the average character count of a word is eight. Throw together a bunch of a smaller and bigger words, some single spaces and punctuation and you roughly end up with the average 140-character tweet being somewhere between 14 and 20 words. Let&#8217;s call it 15. 
&#8212;Rands in [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[187,188,91,359],"class_list":["post-269","post","type-post","status-publish","format-standard","hentry","category-research","tag-english","tag-linguistics","tag-twitter","tag-writing"],"_links":{"self":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/posts\/269","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/comments?post=269"}],"version-history":[{"count":0,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/posts\/269\/revisions"}],"wp:attachment":[{"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/media?parent=269"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/categories?post=269"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vgable.com\/blog\/wp-json\/wp\/v2\/tags?post=269"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}