Rural Newsprint Vocabulary Size

Jan 04, 2023

I saw a recent tweet by Richard Dawkins repeating a silly anecdote that 19th century peasants had an average vocabulary of only 250 words. There are many ways to disprove this obvious falsehood.¹ One way is to look at the linguistic materials that a particular demographic used. Newspapers were a key source of information and communication in the 19th century: they included a wide variety of content – news, ads, entertainment, correspondence, domestic tips, etc. – and were written to reach audiences across all classes (often in different papers). Even non-readers often heard papers read aloud in the household or community. By counting the unique words printed in a newspaper during a set time window, we can estimate the size of the vocabulary familiar to its readers.

I picked two newspapers with substantial working class and rural readership in order to focus on the size of their vocabulary: the Belmont Chronicle of St. Clairsville, near the eastern border of Ohio, and the Perrysburg Journal, one county south of Toledo. Rowell’s Newspaper Directory indicates that both papers had a largely agricultural readership, while the Journal also had manufacturing. In the 1870s, the average number of individuals per household was slightly more than 5. The Journal had a circulation of 775 out of a town population of 1,835 and a county population of 24,596 – 1 in every 6.3 county households, an impressive figure considering that the county had at least 6 papers. The Chronicle had a circulation of 960 out of a town population of 1,056 and a county population of 39,714 – 1 in every 8.3 county households, where there were also at least 6 papers. With their target demographics and high market penetration, both papers clearly reached lower class rural readers.²
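For anyone who wants to check the penetration figures, here is a quick sketch of the arithmetic. It assumes a household size of exactly 5.0 (the text only says "slightly more than 5"), and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the market-penetration figures quoted above.
# HOUSEHOLD_SIZE is an assumption: the post only says "slightly more than 5".
HOUSEHOLD_SIZE = 5.0


def households_per_copy(county_population, circulation, household_size=HOUSEHOLD_SIZE):
    """Return how many county households there were per circulated copy."""
    households = county_population / household_size
    return households / circulation


journal = households_per_copy(24_596, 775)
chronicle = households_per_copy(39_714, 960)

print(f"Journal: 1 copy per {journal:.1f} households")      # 6.3
print(f"Chronicle: 1 copy per {chronicle:.1f} households")  # 8.3
```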

For starters, it’s worth considering what a vocabulary of 250 words would actually look like. In English, we use a small number of words with disproportionate frequency: basic verbs, articles, prepositions. As it turns out, these words add up pretty quickly. This first figure shows the 500 most frequent words in the Journal’s vocabulary and the number of occurrences for each in a one-year period. It’s hard to imagine getting on without “father.”

I grouped text data from these newspapers by issue to identify a familiar vocabulary. Grouping by month or year would produce a larger vocabulary, but working issue by issue gives a more conservative estimate and reflects the fact that readers often skip issues or read selectively. I filtered out all words that do not occur in a basic 25,000-word dictionary so as to remove Optical Character Recognition (OCR) errors, numbers, and proper nouns.³ I then used a singularizing function so that plural forms of words would not count as distinct from their singular forms.
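The per-issue counting described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the figures: the tiny `toy_dictionary` stands in for the 25,000-word dictionary, the `singularize` helper is a naive suffix stripper rather than whatever singularizing function was actually used, and lowercasing everything is a simplification that would let some proper nouns through.

```python
import re


def singularize(word, dictionary):
    """Naively strip a plural suffix when the result is itself a dictionary word."""
    if len(word) > 3:
        for suffix, replacement in (("ies", "y"), ("es", ""), ("s", "")):
            if word.endswith(suffix):
                candidate = word[: -len(suffix)] + replacement
                if candidate in dictionary:
                    return candidate
    return word


def unique_words(issue_text, dictionary):
    """Return the set of singularized dictionary words found in one issue."""
    # Keeping only alphabetic runs drops numbers and punctuation.
    tokens = re.findall(r"[a-z]+", issue_text.lower())
    # Anything not in the dictionary (OCR debris, most proper nouns) is dropped.
    known = (t for t in tokens if t in dictionary)
    return {singularize(t, dictionary) for t in known}


# Toy stand-ins for a real dictionary and a real OCR'd issue (both hypothetical).
toy_dictionary = {"a", "the", "farmer", "farmers", "horse", "horses",
                  "pony", "ponies", "sold"}
issue = "The farmers sold 2 horses; the farmer sold a pony. Toledo xqzv"

print(sorted(unique_words(issue, toy_dictionary)))
# ['a', 'farmer', 'horse', 'pony', 'sold', 'the']
```

The size of the returned set is the "unique words per issue" number; a word hyphenated across a line break would come out as two non-dictionary fragments and be dropped.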

The average number of unique words per issue for both papers remained consistently in the 2,800 to 3,800 range. The sudden spikes up or down are the result of OCR errors. I purposefully picked two comparatively clean papers, but some level of scanning and preservation error is inevitable. The Belmont Chronicle reached its lowest number of unique words between 1860 and 1865, during the Civil War – surely no accident given that many newspapers struggled to maintain staff and produce copy during these years. In 1890, the Perrysburg Journal doubled its length from four pages to eight pages but still had roughly the same number of unique words, which puts the trend of page count expansion in another light.

I was struck by how similar and consistent these two newspapers were, so I tried random samples of 3,940 issues of any newspaper from 1835 to 1860 and 3,912 issues from 1865 to 1890 (25-year periods with five years of separation for the Civil War, which was not kind to newsprint).⁴ The number of unique words per issue is very close to normally distributed for the postbellum period, with a mean of 2,903 and a median of 2,900 words. The number of unique words in antebellum papers is a bit more skewed, with a mean of 2,816 and a median of 2,970, but still roughly normally distributed. Both are slightly thrown off by an irregular cluster of issues with fewer than 750 unique words: these are issues packed with OCR errors, which reduce the number of dictionary-recognizable words.⁵

A two-sample t-test corroborates that the mean number of unique words differs slightly between the antebellum and postbellum periods: we can say with 95% confidence that the true mean was between 16 and 96 words lower in the later period.⁶ The first takeaway reaffirms a longstanding narrative that newsprint style grew more succinct and snappy over the 19th century in the effort to draw in ever more subscribers. Yet the second takeaway is that, while the true difference in means is non-zero, it is very small: newspaper vocabularies did not shrink by very much.
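For readers curious what such an interval involves, here is a minimal sketch using only the standard library. It is an approximation, not the test reported above: it uses Welch's standard error with a normal critical value in place of Student's t (reasonable only for samples in the thousands), the function name is hypothetical, and the demo numbers are invented. The 750-word cutoff follows the post's footnote on the t-test.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

OCR_FLOOR = 750  # issues with fewer unique words are treated as OCR failures


def mean_difference_ci(before, after, confidence=0.95, floor=OCR_FLOOR):
    """Approximate confidence interval for mean(before) - mean(after).

    Assumption: a normal critical value stands in for Student's t, which is
    only sensible when both samples are large.
    """
    a = [x for x in before if x >= floor]
    b = [x for x in after if x >= floor]
    diff = mean(a) - mean(b)
    # Welch's standard error: the two variances are not pooled.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se


# Invented per-issue counts, purely to show the call; not real data.
antebellum = [3000, 3100, 2900, 3050, 100]
postbellum = [2900, 3000, 2800, 2950, 50]
low, high = mean_difference_ci(antebellum, postbellum)
print(f"difference in means: between {low:.0f} and {high:.0f} words")
```

An interval that excludes zero is what "the true difference in means is non-zero" amounts to; how far it sits from zero is the more interesting part.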

The difference between rural and urban papers is more pronounced. One way to test this without putting in much work (this is a blog post after all) is to define “rural” simply as a town population of less than 2,000 and “urban” simply as a town population of more than 20,000, leaving out the large grey area between.⁷ Doing so gives us a rural average of 2,669 unique words per issue and an urban average of 3,278. A t-test reports with 95% confidence that the true difference in the average is between 545 and 674 fewer unique words per issue in rural newspapers than in their urban counterparts, so defined.
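The rural/urban split is simple enough to write down. A rough sketch, with hypothetical function names and invented sample rows; only the two population thresholds come from the text:

```python
from statistics import mean

RURAL_MAX = 2_000    # town population below this counts as "rural"
URBAN_MIN = 20_000   # town population above this counts as "urban"


def classify_town(population):
    """Return 'rural', 'urban', or None for the grey area in between."""
    if population < RURAL_MAX:
        return "rural"
    if population > URBAN_MIN:
        return "urban"
    return None


def mean_unique_words_by_class(issues):
    """issues: iterable of (town_population, unique_word_count) pairs."""
    groups = {"rural": [], "urban": []}
    for population, count in issues:
        label = classify_town(population)
        if label is not None:  # grey-area towns are left out entirely
            groups[label].append(count)
    return {label: mean(counts) for label, counts in groups.items() if counts}


# Invented rows for illustration; the populations of the two towns discussed
# earlier (1,835 and 1,056) both fall on the rural side.
sample = [(1_835, 3000), (1_056, 2000), (50_000, 3300), (5_000, 9999)]
print(mean_unique_words_by_class(sample))
```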

All of these numbers are way (ten times!) more than that elusive “250 words.” But concessions can be made for the benefit of the doubt. Let’s imagine that the average newspaper reader/listener only cared about half of the newspaper and didn’t know any of the words in the other half; let’s also imagine that they knew half of the words printed in what they did read/hear, inferring or skipping over the rest. The first assumption is semi-reasonable, since people read selectively even though they probably weren’t utterly ignorant of every unique word in, say, the politics column they skipped every week. The second assumption is even more wild, since it would be excruciating to read/listen that way. Yet even halving the rural average and then halving it again – which I consider impossibly draconian – would still leave us with 667 unique words, not counting any plurals, proper nouns, or delightful words not “fit to print.”
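Spelled out as arithmetic, with variable names that are just labels for the two concessions:

```python
RURAL_AVERAGE = 2_669  # rural mean of unique words per issue, from the post
CLAIMED = 250          # the vocabulary size in the anecdote

share_of_paper_read = 0.5   # concession 1: half the paper is ignored entirely
share_of_words_known = 0.5  # concession 2: half the words read are unknown

pessimistic = RURAL_AVERAGE * share_of_paper_read * share_of_words_known

print(round(pessimistic))               # 667
print(round(pessimistic / CLAIMED, 1))  # 2.7 times the claimed figure
print(round(RURAL_AVERAGE / CLAIMED, 1))  # 10.7 times, before any concessions
```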

Let me come back to Dawkins. On the basis of the invented 250-word vocabulary, he muses “does that figure of 250 make origin of language seem less mysterious?” I don’t really have a horse in the origin of language race. I do feel confident, however, that you can’t make a useful argument in this area on the basis of supposedly-simple people using supposedly-simple vocabulary.

¹ Another way would be to consult actual 19th century lexicography, which became a popular science precisely on the basis of the discovery that the language use of peasant, rural, regional, and ethnic groups outside urban centers was quite rich. A famous example – this isn’t my specialty but it’s an exercise of shooting fish in a barrel – is John Jamieson’s influential 1818 Etymological Dictionary of the Scottish Language.
² “Peasant” is used in several different contexts, which leaves me open to quibbles; I’m treating it broadly as the rural lower classes. Its primary use refers to the large class under feudalism that farmed land that it did not own. As Europe capitalized and industrialized, the economic conditions and rights of these subsistence farmers became more varied, but tenantry predominated. The U.S. never had an era of feudalist production, and because Euro-American settlers seized land from Native tribes they had much higher rates of land ownership. Yet many American subsistence farmers were tenants, especially along advantageous railroad lines, and many more relied on loans and mortgages that put them in a position of dependency. Of course, the nations (and regions within them) did vary in public education, literacy, and the press. All this is to say that while my categorization is a bit broad it is a parallel; and besides, narrowing the definition too far in the 19th century ends up begging the question (e.g., “peasants are only the dumb ones”).
³ This last point could be debated – proper nouns are part of an individual’s learned vocabulary after all – but I leave them out for the sake of making a more conservative estimate.
⁴ My source for newspaper data as always is Chronicling America.
⁵ The Chronicle and Journal have unusually high quality OCR, so this is part of why they tend to have more unique words than average (though still well within the curve). It’s also worth noting that any words hyphenated over a line break won’t count either, yet another conservative influence on my estimates.
⁶ For the t-test, I excluded all issues with fewer than 750 unique words on the basis of OCR error.
⁷ I really ought to come back and tweak a few things to make this tighter if I find the time. I didn’t correct for page length, which is a debatable point; the sampling process should exclude dailies to make a broader urban sample; and a more thorough definition of rural papers would be to take papers from counties that don’t have an urban center in them.