MED Magazine - Issue 57 - April 2010

Feature
How many words do you need to know?
by Michael Rundell

Vocabulary lists have been popular since Michael West’s General Service List was published almost 60 years ago. This is not surprising: it’s comforting for students (and their teachers) to have an idea of how much vocabulary they need to learn in order to perform well at a given level. In any language, there is a lot of peripheral vocabulary which most people don’t need to know (or at least, don’t need to spend time learning), so a carefully selected wordlist offers efficiency gains for learners with limited time. But what is the optimum size for such a list, and what criteria should we use to identify its contents?

Core and sublanguages

It’s useful to think of vocabulary as belonging either to the ‘core’ or to one of many ‘sublanguages’. Core vocabulary refers to words, meanings, and phrases which are common to – and necessary for – all forms of communication (from academic monograph to tweet). Surrounding this central core are numerous sublanguages. Whether you are a beekeeper, a neurosurgeon or a language teacher, there will be a range of vocabulary items specific to your field. To you and your colleagues, these will be frequent and familiar usages; to everyone else, they will be largely unknown. Think of this in the context of our own community: in the discourse of language teachers and linguists, there are ‘terms’ (deictic, colligation, fricative) and specialized meanings of common words (drill, aspect, mood, aspiration). A word like collocation is a high-frequency item for us, yet would probably be unknown to most English speakers.

Why is core language so important? It has been known since at least the 1930s that the vocabulary in a language is distributed unevenly. In simple terms, there are a small number of very frequent words, and a large number of very infrequent ones. For English, the consequence is that, in most non-technical texts, nearly 50% of all the words belong to the 100 most-frequent items (like go, from, out, the), while about 83% belong to the top 3000 in a frequency list (including words like break, search, clear and experience). Furthermore, the more common words tend to have multiple uses and appear in all sorts of recurrent combinations (phrasal verbs, compounds, collocations – ‘chunks’ of every type). The latter point fits with Sinclair’s well-known ‘idiom principle’: the notion that words (or at least, core words) are best seen not as autonomous bearers of meaning, but as participants in a range of semi-preconstructed phrases. Thus the project for identifying a core vocabulary rests on the hypothesis that a learner who ‘knows’ the core vocabulary of a language will be well-placed to understand, and produce, a wide range of mainstream texts. (In this context, ‘knowing’ means knowing core meanings, and core uses and combinations.)
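For readers who want to see how a coverage figure of this kind is arrived at, here is a minimal sketch in Python. It is an illustration only: the function names `tokenize` and `coverage` are hypothetical, the tokenizer is deliberately crude, and the frequency ranking is taken from the sample text itself, whereas the percentages quoted above come from frequency lists built on large corpora.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(tokens, top_n):
    """Share of running words accounted for by the top_n most frequent word forms."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Add up the occurrences of the top_n most frequent word forms.
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)


# Toy sample: 13 running words, of which 'the' accounts for 4.
sample = "the cat sat on the mat and the dog sat on the log"
tokens = tokenize(sample)
print(f"top 1 word form covers {coverage(tokens, 1):.0%} of the sample")
print(f"top 3 word forms cover {coverage(tokens, 3):.0%} of the sample")
```

Even in this tiny sample, a handful of very frequent word forms accounts for most of the running words, which is the same uneven distribution described above, on a much smaller scale.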

What dictionaries do

What is the optimum size for a core vocabulary? Almost all the advanced learners’ dictionaries (ALDs) assemble for their users a set of core headwords, using typography to identify them. But there are striking disparities in the numbers. The ALDs of Oxford, Longman, Cobuild, and (most recently) Merriam-Webster all highlight around 3000 words which (to quote the Oxford blurb) ‘should receive priority in vocabulary study because of their importance and usefulness’; the Macmillan English Dictionary, uniquely, identifies a core vocabulary of around 7500 words (shown in red in the dictionary). The selection criteria are broadly the same for all (apart from the corpus-averse Merriam-Webster): some combination of frequency, range (occurrence in a wide range of text-types), and a more subjective notion of usefulness in a language-learning context.

But 3000 words seems too low a target for an advanced learner, arguably representing not so much the words users need to know, as the words they know already. Among words ranked between 3000th and 6000th most frequent, there is a huge amount of vocabulary that is pretty much essential to anyone working in an EAP or ESP context (as so many advanced learners are), such as:

abnormal, admiration, allegation, ambitious, ambiguous, arbitrary, bargain, bias, boom, bureaucracy, compromise, comparable, compatible, complexity, condemn, cooperate, corrupt

. . . and of course thousands more. To ensure comprehension of an unseen text, learners need to know a high percentage of the words in it; estimates vary between 95% and 98% (Schmitt 2008). However, the top 3000 words give a ‘coverage’ of only 84% at most, whereas the top 7500 words make up 92% or more. Armed, additionally, with a basic grasp of word formation rules, a learner with a core vocabulary of 7500 words will usually get pretty close to the ‘comprehension threshold’. Our contention at Macmillan is that 7500 words represents a more appropriate target vocabulary for an advanced learner.
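One way to get a feel for these percentages is to turn them into the average distance between one unfamiliar word and the next. The sketch below does this simple arithmetic. It assumes, as a simplification, that every word outside the list is unknown to the reader, and the function name `words_per_unknown` is hypothetical. The 84% and 92% figures are the coverage figures given above; 95% and 98% are the threshold estimates cited from Schmitt (2008).

```python
def words_per_unknown(coverage_pct):
    """Average number of running words per unknown word at a given coverage percentage."""
    unknown_pct = 100 - coverage_pct
    if unknown_pct <= 0:
        raise ValueError("coverage must be below 100%")
    return 100 / unknown_pct


# Coverage figures from the article (84, 92) and the cited threshold range (95, 98).
for pct in (84, 92, 95, 98):
    print(f"{pct}% coverage: about one unknown word in every {words_per_unknown(pct):g}")
```

On this rough reckoning, 84% coverage means an unfamiliar word about every six words, 92% about every twelve or thirteen, and the 95% to 98% threshold range corresponds to one in twenty to one in fifty.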

Notes

Schmitt, N. (2008) ‘Review article: Instructed second language vocabulary learning’, Language Teaching Research, Vol. 12: 329–363

This article was first published in Cardiff Conference Selections, IATEFL 2009. We would like to thank IATEFL for permission to reprint this article.

Copyright © 2010 Macmillan Publishers Limited
This webzine is brought to you by Macmillan Education