Literature review
Nation (2001) classified words into four levels: high-frequency words, academic words, technical words, and low-frequency words. The classic high frequency word list is the General Service List of English Words (West, 1953). It is agreed that high-frequency words are important for all learners of English because of their wide and consistent coverage of texts of most genres. Curriculum designers and writers of graded readers have used the 2,000 high frequency words as a basis for writing textbooks and setting learning goals.
Moving to higher learning needs, Coxhead (2000) developed the Academic Word List, which contains 570 word families from a corpus covering a wide range of academic topics and materials. The AWL covers around 10% of all the words in the academic corpus, and thus, is important for all learners studying English for academic purposes. For words found in specialised contexts, Chung and Nation (2003) used two different texts, one on anatomy and the other on applied linguistics, to estimate the size of technical vocabulary. They found that technical words account for a larger coverage of the language than the 5% suggested by Nation (2001). The low frequency words comprise the vast majority of vocabulary in English that does not fit into any of the previous categories. Nation (2001) suggested that these words are far too numerous for classroom learning, and that learners should use strategies to cope with these words.
This literature review has two aims. The first is to determine the goals and purposes of existing word lists. The goal and purpose directly influence the type of corpus that the words are chosen from, and the criteria that were used to select the target words. The second aim is to look at these criteria, and their implications for the present word list.
1.1 High frequency words
High frequency words are words that are used very often in most uses of the language. Nation and Hwang (1995) used the criteria of frequency, coverage and range to look for high frequency words in the LOB (Lancaster-Oslo-Bergen) and Brown corpora, each containing 1,000,000 running words. They then compared the words chosen from the two corpora with the GSL, and provided evidence that 2,000 words is a reasonable cut-off point for high frequency words. The cumulative coverage of this group of words in the corpora is 83.4%. This suggests that a general service vocabulary of around 2,000 words is appropriate for learners for general language use purposes, but that a more specialised vocabulary is needed for specific learning purposes (Nation & Hwang, 1995).
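Cumulative coverage, as used above, is the share of running words in a corpus that is accounted for by the top-ranked words of a list. The following is a minimal sketch of that calculation; the function name, the toy sentence and the cut-offs are invented for illustration and are not taken from Nation and Hwang (1995).

```python
from collections import Counter


def cumulative_coverage(tokens, ranked_words, cutoffs):
    """Return {cutoff: share of running words covered by the top `cutoff` words}.

    tokens       -- running words of the corpus
    ranked_words -- distinct words of the list, most frequent first
    cutoffs      -- list sizes to evaluate, e.g. [1000, 2000]
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    coverage = {}
    for cutoff in cutoffs:
        covered = sum(counts[w] for w in ranked_words[:cutoff])
        coverage[cutoff] = covered / total
    return coverage


# Toy data, invented for illustration only.
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
toy_ranked = [w for w, _ in Counter(toy_tokens).most_common()]
toy_coverage = cumulative_coverage(toy_tokens, toy_ranked, [1, 3])
```

On a real corpus the same function would be called with cut-offs such as 1,000 and 2,000 to see where the coverage curve flattens, which is the kind of evidence a cut-off point rests on.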
1.1.1 The General Service List of English Words
The classic word list for high frequency words is the General Service List of English Words (West, 1953). West developed the GSL based on the 1936 Interim Report on Vocabulary Selection, and incorporated semantic counts. The GSL consists of around 2,000 word families, and covers around 80% of general texts (Nation, 2001). The GSL has considerable overlap with other high-frequency word lists built from other corpora (Nation & Hwang, 1995).
Because the purpose of the GSL was to select words for teaching English to foreigners, the selection of words took the learning aspects of language into consideration. The words in the list were selected from a corpus of 5,000,000 running words, based on written materials from various sources, such as encyclopedias, magazines, textbooks, novels, essays and biographies (West, 1953, p. xi). The word list includes semantic counts for each meaning of a given word to estimate the frequency of usage of that meaning. The classification of meaning was based on the thirteen-volume Oxford English Dictionary. One notable point about the GSL is that it gives both the frequency of occurrence of each word and the percentage of occurrence of each meaning of the word. In this way, the list attempted to help teachers recognise the necessity of learning different words and to reduce the learning burden of the multiple meanings of polysemous words. The selection of words was based on several criteria in addition to frequency of occurrence. They include:
1. Ease of learning: Easy words are selected over the difficult ones.
2. Necessity: A low frequency word may be selected if it covers a certain range of necessary ideas.
3. Cover: A frequent word may not be selected if it closely resembles another word in meaning.
4. Stylistic level: Common words are selected over their literary or colloquial counterparts.
5. Intensive and emotional words: Words to express emotions were of secondary importance, and thus were not selected. (West, 1953, p. x)
These criteria accentuate the purpose of the word list; that is, to help learners at the early stage of learning English. The list is also innovative, in the sense that it took into consideration the learning burden of polysemous words. The necessity criterion gave preference to words with multiple meanings that cover several necessary ideas. Recent studies on cognitive aspects of vocabulary acquisition suggest that the learning burden of a polysemous word can be greatly reduced by introducing the mental link between the core and peripheral senses (see e.g. Verspoor & Lowie, 2003; Tyler & Evans, 2004). When learners encounter new words, it is easier for them to learn one word’s multiple meanings than to learn several words each with one meaning. This is because, in most cases, there are links between the core or conceptual sense and the other meanings.
However, the criteria also present some problems. Firstly, the stylistic level criterion means that learners keep to the “middle path”, using words with little or no stylistic colouring. In real usage of English, however, people are more likely, for example, to use “fellow” instead of “person” to refer to someone in conversation. The exclusion of colloquial words may make learners sound unnatural in real use of the language. Secondly, intensive and emotional words were not included in the list. West applied this criterion on the grounds that learners study the language in order to express ideas, not emotions. This may be true for learners at the survival level; however, as they advance, they will inevitably need more subtle words to express their emotions and ideas. Despite these issues, the GSL is still a useful guide for learners at the beginners’ level. For those who are more advanced, however, a word list with more vivid and precise words is required.
1.1.2 The BNC 2000
A more recent effort to develop a word list is the written and spoken word frequency list based on the British National Corpus (Leech, Rayson & Wilson, 2001). This word frequency list has two purposes. Firstly, it aims to address criticism of older word lists, like the GSL. The GSL has been the subject of much criticism since the 1970s (e.g. Richards, 1974; Hwang, 1989). This is partially due to the age of the list, and in particular the age of the materials used to build it (Leech, Rayson & Wilson, 2001). Most of the materials that the GSL used were collected in the early 1900s. There is a need for a more up-to-date word list that reflects current usage of the language. Secondly, the word list attempts to use modern computer technology to develop a word frequency list for research on style, register and psychological processing of the language. These purposes lead to two major differences between the two word lists.
The first difference is the size of the corpus and the sources of language samples. The BNC word frequency list is based on the British National Corpus (hereafter the BNC). At 100,000,000 running words, the BNC is twenty times the size of the 5,000,000-word corpus used to develop the GSL, and uses samples from more varied sources. The corpus is said to be “a finite, balanced, sampled corpus” (Leech, Rayson & Wilson, 2001). The BNC consists of materials from both written and spoken sources. The written component includes both informative texts and imaginative texts. The informative texts make up 80% of the written texts, and are classified into 8 domains: world affairs, social science, leisure, applied science, arts, commerce, natural science, and belief and thought. The samples in the corpus attempt to reflect the books published in these fields. The imaginative texts, on the other hand, were sorted into the genres of poetry, prose and drama.
The spoken sources of the corpus consist of conversational data and task-oriented data. For the conversational data, over 1,000 speakers were asked to record their daily conversational interactions with other interlocutors. The task-oriented data, in contrast, record more formal activities of daily life, for example, lectures, consultations, sermons and broadcasts. The majority of texts date from 1985 to 1994, with spoken data dating no earlier than 1991. Together, the written and spoken sources make up a corpus of 100,000,000 running words, which could be regarded as the most comprehensive representation of present-day usage of English.
Another difference between the BNC word frequency list and the GSL is the criteria used to develop the list. The BNC word frequency list uses frequency, coverage and dispersion to reflect the usage of words in present-day English. Leech et al. (2001) sorted the 100-million-word corpus into 100 one-million-word sub-corpora, with similar types of texts in each one. All the words in their list have been lemmatized. A lemma includes the base form of a word and its inflections. For example, the words run, ran, runs and running belong to one lemma, run. Each lemma and its inflections in the list are accompanied by three figures. These are frequency, which shows how many times the lemma and its inflections occur in the whole corpus; coverage (i.e. range), which shows how many sub-corpora the lemma and its inflections appear in; and dispersion, which shows how evenly the lemma and its inflections are used across all the sub-corpora.
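The three figures can be illustrated with a short sketch that takes the number of occurrences of one lemma in each equal-sized sub-corpus. The dispersion formula shown is Juilland's D, a common measure for equal-sized parts; it is used here as an assumption, since the passage above does not state the formula. The function name and the toy counts are invented for illustration.

```python
import math


def frequency_range_dispersion(sub_counts):
    """Summarise one lemma from its counts in equal-sized sub-corpora.

    Returns (frequency, range, dispersion), where dispersion is
    Juilland's D: 1.0 for a perfectly even spread, 0.0 when every
    occurrence falls in a single sub-corpus.
    """
    n = len(sub_counts)
    frequency = sum(sub_counts)
    range_ = sum(1 for c in sub_counts if c > 0)
    if frequency == 0 or n < 2:
        return frequency, range_, 0.0
    mean = frequency / n
    # Population standard deviation of the sub-corpus counts.
    sd = math.sqrt(sum((c - mean) ** 2 for c in sub_counts) / n)
    dispersion = 1 - (sd / mean) / math.sqrt(n - 1)
    return frequency, range_, dispersion


# Toy counts over four sub-corpora, invented for illustration only.
even_word = frequency_range_dispersion([5, 5, 5, 5])
clumped_word = frequency_range_dispersion([20, 0, 0, 0])
```

Both toy words have the same frequency, but the second occurs in only one sub-corpus, so its range and dispersion are at the minimum. This is why frequency alone is a poor guide to how widely a word is used.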
Using frequency, coverage and dispersion of the headwords, Nation (2004) identified the 3,000 most frequent words in the BNC. One difference between Leech et al.’s word frequency list and Nation’s word lists is the unit of counting for an entry. Instead of using lemmas, Nation (2004) used word families. A word family is a word plus its related inflections and derivations. For example, access, accesses, accessible, inaccessible, and accessibility are all included in the word family under the headword access. Nation took over 6,500 headwords with a frequency equal to, or higher than, 10,000 in the whole 100-million-token corpus. This step ensures that the words chosen are of the highest frequency in the whole corpus. He then used the range criterion to remove those words which appeared in fewer than 98 of the 100 sub-corpora. This ensures that the words chosen are widely used in all types of texts. Finally, dispersion was used to remove those words which fell below the dispersion threshold. This step ensures that all the word families in the list are evenly distributed across the 100 one-million-word sub-corpora. Nation also developed the second and third 1,000 words in the same way from the remaining headwords.
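The three filtering steps described for Nation (2004) can be sketched as a simple pipeline. The frequency and range thresholds below come from the text; the dispersion threshold of 0.8 is a placeholder assumption, because the text does not give the value. The class, the function name and the toy headwords with their figures are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Headword:
    word: str
    frequency: int     # occurrences in the whole corpus
    range_: int        # number of sub-corpora (out of 100) containing the word
    dispersion: float  # evenness of spread, 0.0 to 1.0


def select_candidates(headwords, min_frequency=10_000, min_range=98,
                      min_dispersion=0.8):  # 0.8 is an assumed placeholder
    """Apply the frequency, range and dispersion filters in that order."""
    kept = [h for h in headwords if h.frequency >= min_frequency]
    kept = [h for h in kept if h.range_ >= min_range]
    kept = [h for h in kept if h.dispersion >= min_dispersion]
    # Highest-frequency headwords first, ready to be cut into 1,000-word bands.
    return sorted(kept, key=lambda h: h.frequency, reverse=True)


# Toy figures, invented for illustration only.
toy_headwords = [
    Headword("time", 150_000, 100, 0.95),
    Headword("lecture", 12_000, 90, 0.85),  # fails the range filter
    Headword("budget", 15_000, 99, 0.60),   # fails the dispersion filter
    Headword("rare", 900, 100, 0.90),       # fails the frequency filter
    Headword("place", 50_000, 100, 0.93),
]
toy_selected = [h.word for h in select_candidates(toy_headwords)]
```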
1.1.3 Conclusion
This literature review has examined previous studies on high frequency words, with a specific focus on the word lists developed for the learning needs of non-native speakers of English. Special attention was also given to the size and sources of the corpus, and the criteria used to select potential words for the word lists. The two word lists examined, the General Service List (West, 1953) and the BNC 2000 (Nation, 2004), have been developed for somewhat different purposes, which has some implications for the present study.
Firstly, both lists use word families as the unit of counting, and the target audience of both lists is learners of English. The unit of counting therefore needs to take into consideration how learners organise words in their mental lexicon. There is research evidence that learners psychologically group members of a word family together (Nagy et al., 1989; Schmitt & Zimmerman, 2002). Knowing one word in a word family can also greatly assist the learning of other words in the family. In light of this, the present study will also use word families (further definitions in the methodology section).
Secondly, the GSL used stylistic level as a criterion to select potential words. Stylistic level is important because it helps distinguish the type of words selected for the list. The GSL preferentially includes words that can be used in common settings and avoids any clear stylistic preference, because common words are versatile and ease the learning burden of beginning L2 learners. The present study attempts to identify words that are used in more casual settings for advanced learners, and will therefore use stylistic level as a criterion to select potential words.
Finally, both the GSL and the BNC 2000 are reliable word lists for high frequency words. They are used in the present study as an exclusion criterion: any candidate word that appears in either list is removed, because the words in these two lists serve general purposes of English and should already be known to all advanced learners of English.
1.2 Technical vocabulary
Once learners of English have mastered the first 2,000 high frequency words, they need vocabulary for more specialised purposes. An example is academic research in an English context, where a specific subject area requires a more specialised range of vocabulary. Nation (2001) introduced two types of specialised vocabulary: technical words and academic words. This section covers the size and types of words in the technical vocabulary, and the methods used to identify this group of words. The methodology and purposes of the Academic Word List (Coxhead, 2000) are discussed in the next section.
Technical vocabulary includes subject-related words that are important for learners of that particular subject. There is much dispute about what type of words should be defined as technical words. Moreover, categorising technical vocabulary requires subject knowledge to distinguish words that carry special meanings in that subject. Despite this, there is a wide range of technical vocabulary, and it is indispensable for students of the relevant subject (Chung & Nation, 2003).
The coverage of technical words in a text varies by subject. Chung and Nation (2003) compared two texts on different subjects, anatomy and applied linguistics. The coverage of technical vocabulary in the anatomy text was more than twice that in the applied linguistics text (37.6% and 16.3% respectively). Their study used a four-step rating scale to identify technical words. The scale distinguished two types of technical words: those used only in the specific type of text, and those also used in general texts. Step one of the rating scale includes function words that are not related to the subject area, while step two includes words that are minimally related to the subject area. Step three includes words closely related to the subject area, and step four includes words unlikely to be known in general use of the language. This scale was later examined in a study comparing the reliability of four methods used to identify technical words (Chung & Nation, 2004). These four methods were the rating scale, using a technical dictionary, using clues provided in the text, and using a computer-based approach with term extraction software. The rating scale was found to have the highest reliability in identifying technical terms.
1.3 The Academic Word List
The most widely used academic word list in the past decade is the Academic Word List (AWL) developed by Coxhead (1998, 2000). The AWL consists of 570 word families. The list is not simply compiled from earlier word lists, as the University Word List was (Xue & Nation, 1984). Rather, it is built from a large, specially compiled academic corpus of 3,513,330 running words, which covers written academic texts from four academic disciplines and 28 subject areas.
The AWL was developed for two reasons. Firstly, Coxhead (1998) reviewed past research on academic word lists, and found that they were developed from small corpora of scientific English. These corpora were not large enough to be representative of academic use of the language. Further, those word lists used only a limited range of criteria to select words. Secondly, the AWL is designed to help learners of English cope with tertiary study in an English context. These dual purposes of the AWL are reflected in the size of the corpus and the criteria for word selection.
1.3.1 Academic Corpus
The AWL is based on the Academic Corpus of Written English (hereafter the Academic Corpus), which was specifically compiled for the research. The corpus was built according to three criteria: size, range and representation (Coxhead, 1998). With a total of 3,513,330 words, the corpus is almost ten times the size of previous academic corpora (Campion & Elley, 1971; Praninskas, 1972). This size ensures that the corpus includes sufficient samples of academic texts. The corpus incorporates four major academic disciplines of similar size: the arts, commerce, law, and science. Within each of the four disciplines, there are seven subject areas. Thus, in total, there are 28 subject areas. The corpus also includes a wide range of academic texts of different genres to approximate authentic academic uses of the language. The corpus collected samples from textbooks, as other academic corpora did, and also from other academic readings that learners are likely to encounter in their studies (e.g. laboratory manuals, book chapters from set readings, work books, lecture notes, journal articles and technical reports). This ample variety of academic texts ensures that the Academic Corpus achieves a high representation of words encountered in academic study.
1.3.2 Four criteria: specialised occurrence, frequency, range, uniformity
The AWL used four criteria to choose words: specialised occurrence, range, frequency, and uniformity. The first criterion, that words included in the GSL are excluded, ensures that the AWL serves the special purposes of academic study, instead of being a general word list needed by all learners of English. Additionally, it can be assumed that all learners studying at tertiary level in an English-speaking country should already have good receptive knowledge of the GSL words. The second criterion is that words included in the AWL should appear in all four disciplines and also in at least 15 of the 28 subject areas. This ensures that the words are useful to academic study in most of the subject areas and disciplines.
The third criterion is that the words included in the AWL occur at least 100 times in the whole corpus, in other words, roughly once in every 35,000 running words. This cut-off point applies to word families rather than lemmas. The AWL gives special consideration to word families with only one member, like forthcoming, because the frequency of these words could be underestimated compared to word families with many members. As such, the cut-off point for word families with only a single member is set at 80. A notable point is that frequency is secondary to range, which guards against the bias that may be caused by relying too much on frequency. This consideration enhances the validity of the AWL, because frequency is heavily context-dependent. A word that appears frequently in one subject may not be useful in another. For example, collage may be frequently used in a text about the arts; however, the same word can hardly be expected to occur with high frequency in an economics text.
The fourth and final criterion is that the words included in the AWL should appear at least ten times in each of the sub-corpora. This resembles the dispersion criterion in the BNC 2000, which also aims to find the words that are used widely across the corpus. This is to ensure that the words chosen are useful to all learners studying for academic purposes, regardless of their specific subject areas.
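The four criteria can be combined into a single check on one word family, sketched below. The thresholds (four disciplines, 15 subject areas, 100 or 80 occurrences, ten occurrences) are those given in the text. The uniformity criterion is read here as at least ten occurrences in every disciplinary sub-corpus, which is an interpretation. The function name, the data layout and the toy counts are assumptions made for illustration and are not Coxhead's procedure.

```python
def meets_awl_style_criteria(headword, family_size, area_counts,
                             area_discipline, gsl_headwords, n_disciplines=4):
    """Check one word family against the four selection criteria.

    area_counts     -- {subject_area: occurrences of the family}
    area_discipline -- {subject_area: discipline it belongs to}
    """
    # 1. Specialised occurrence: the family must be outside the GSL.
    if headword in gsl_headwords:
        return False
    discipline_totals = {}
    for area, count in area_counts.items():
        discipline = area_discipline[area]
        discipline_totals[discipline] = discipline_totals.get(discipline, 0) + count
    # 2. Range: all four disciplines and at least 15 of the 28 subject areas.
    areas_present = sum(1 for c in area_counts.values() if c > 0)
    disciplines_present = sum(1 for c in discipline_totals.values() if c > 0)
    if disciplines_present < n_disciplines or areas_present < 15:
        return False
    # 3. Frequency: 100 occurrences, or 80 for single-member families.
    min_total = 80 if family_size == 1 else 100
    if sum(area_counts.values()) < min_total:
        return False
    # 4. Uniformity (interpretation): at least ten in every discipline.
    return all(c >= 10 for c in discipline_totals.values())


# Toy layout of 4 disciplines x 7 subject areas, invented for illustration.
toy_disciplines = ["arts", "commerce", "law", "science"]
toy_area_discipline = {f"{d}-{i}": d for d in toy_disciplines for i in range(7)}
wide_counts = {area: 5 for area in toy_area_discipline}  # 140 in total
narrow_counts = {area: (150 if area == "arts-0" else 0)
                 for area in toy_area_discipline}        # one area only
```

With these toy counts, a family spread across all 28 areas passes, while a family with more occurrences concentrated in a single arts area fails on range. This mirrors the point that frequency is secondary to range.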
1.3.3 Conclusion
The Academic Word List is important for learners studying at tertiary level in most subject areas, and especially for those looking for the best return on their efforts in learning words for academic purposes. This section examined the corpus and the criteria used to select the potential words for the list. The development of the Academic Word List has several implications for the present study.
The first implication is that the size of the Academic Corpus ensures that the language samples collected could reflect the real life use of the language. The size of the corpus is important, in that it should be large enough to include substantial language samples consisting of ample frequently used words. The Academic Corpus is almost ten times the size of the corpora used to build other previous academic word lists. The present study uses dictionaries compiled from the two largest corpora to ensure that the words selected will be representative of real life use of English.
Secondly, the AWL used specialised occurrence to remove those words that are part of general language usage. The AWL excludes any words that appear in the GSL to ensure that all the words on the list are used for academic purposes. Nation and Hwang (1995) suggest that learners with special language needs should study the vocabulary beyond the most frequent 2,000 words. The present study removes all the words that appear in the AWL to ensure that the words selected are not used specifically for academic purposes, but for communication in daily life scenarios.
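Exclusion by existing word lists amounts to a set difference between the candidate words and the union of the lists. The sketch below assumes that candidates and list entries have already been reduced to the same unit (for example, word-family headwords). The function name and the toy lists are invented for illustration and do not reproduce the actual lists.

```python
def exclude_known(candidates, *known_lists):
    """Keep, in their original order, the candidates found in none of the lists."""
    known = set().union(*known_lists)
    return [w for w in candidates if w not in known]


# Toy lists, invented for illustration only.
toy_general = {"bread", "person"}
toy_academic = {"analyse", "concept"}
toy_candidates = ["bread", "ciabatta", "analyse", "sty"]
toy_remaining = exclude_known(toy_candidates, toy_general, toy_academic)
```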
1.4 Low frequency words
This group of words is by far the largest group of English words, and contains many thousands of words. Until now, this category has been understudied. Practically speaking, as English teaching time is limited, it is impossible to go through the low frequency words one by one. It is therefore advisable that learners be equipped with learning strategies to handle this enormous group of words (Nation, 2001). However, given the size of this category, and the needs of advanced learners, there should be studies into this level of vocabulary. McCarthy (2001) calls for study of the “long-tail vocabulary”, which refers to all the infrequent words, stating that this group of vocabulary is too massive in size to be left for learners to handle on their own.
1.4.1 The size of low frequency words
The exact size of the low frequency vocabulary has not been clearly stated, because it depends on several factors. One factor is the definition of a word itself (Nation, 2001). Counting words as lemmas or as word families will clearly affect the size of the vocabulary. For example, using the lemma definition, enjoyment would be a different word from enjoy, because they belong to different word classes. However, using the word family definition, enjoy and enjoyment are the same word. Further, the number of low frequency words may depend on the group of people it refers to. Low frequency words for native speakers and non-native speakers are likely to be dramatically different due to the difference in the vocabulary size of the two groups. Nagy and Anderson (1984) estimated that there were 88,500 word families, based on the American Heritage Word Frequency Book compiled by Carroll et al. (1971). This is the number that native speakers might be exposed to in school materials. It is estimated that five-year-old children may have a vocabulary size of 4,000 to 5,000 word families at the start of schooling. The average vocabulary size of a university graduate is around 20,000 word families (Waring & Nation, 1997). However, non-native speakers studying in an EFL context would be highly unlikely to gain such an enormous vocabulary. The concept of low frequency words here is for non-native speakers learning English as a foreign language. In SLA, it is now agreed that low frequency words are those words which are not among the 2,000 most frequent words (Nation, 2001).
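The effect of the unit of counting on vocabulary size can be shown with a small sketch. The two lookup tables are hand-made toy mappings for the enjoy example, not the output of a real lemmatiser or word-family list, and the function name is invented for illustration.

```python
def count_units(tokens, unit_of):
    """Count distinct units after mapping each token to its lemma or family.

    Tokens missing from the mapping are counted as their own unit.
    """
    return len({unit_of.get(t, t) for t in tokens})


# Toy mappings, invented for illustration only.
toy_lemma_of = {
    "enjoy": "enjoy", "enjoys": "enjoy", "enjoyed": "enjoy",
    "enjoyment": "enjoyment",   # a noun, so a separate lemma
    "enjoyable": "enjoyable",   # an adjective, so a separate lemma
}
toy_family_of = {word: "enjoy" for word in toy_lemma_of}

toy_tokens = ["enjoy", "enjoys", "enjoyed", "enjoyment", "enjoyable"]
n_types = len(set(toy_tokens))
n_lemmas = count_units(toy_tokens, toy_lemma_of)
n_families = count_units(toy_tokens, toy_family_of)
```

The same five word forms count as five types, three lemmas, or one word family, so any estimate of vocabulary size has to state which unit it uses.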
1.4.2 Type of words at the low frequency level
The category of low frequency words includes many different kinds of words (Nation, 2001); for example, middle frequency words (not frequent enough to be listed among the first 2,000 most frequent words), technical words for different fields of interest, proper nouns, and words that are genuinely low frequency and seldom used. There are two things to note about the words in this category. First, the compiling of any frequency list is based on the language samples in the given corpus, and therefore the frequency varies from corpus to corpus. If the language samples collected for the corpus are from spoken sources, words like er, yeah, and got would be among the most frequent words. These words would be much less frequent in a written corpus (Leech et al., 2001). The fact that a word is not included in the high frequency list of one corpus does not mean that its significance for learners should be underestimated. Second, as the difference in vocabulary size between native speakers and language learners is huge, words infrequently used by language learners may be frequently used by native speakers, and vice versa. If the purpose of learning the language is to use it like a native speaker, then it is important to be cautious about the boundary between high and low frequency words.
1.4.3 Existing study into low frequency words
The role of low frequency words has been played down pedagogically; as such, past research has produced few findings compared to research on other word categories. Also, the sheer size and diversity of the low frequency vocabulary makes studying this group a daunting and demanding task.
Until now, the most widely used low frequency word list in the relevant research has been the BNC-20. This list is based on the British National Corpus, and is made up of twenty lists of one thousand words each, ordered from the highest to the lowest frequency. The word list was developed in a similar way to the BNC 2000 by Nation (2004). As such, frequency, range and dispersion were used as the criteria to choose words for each frequency level. This word list has been used in developing vocabulary size tests, in making learning plans for students, and as a research tool.
1.4.4 Conclusion
The number of low frequency words is too large to be learnt in any program. However, this does not mean that these words are not necessary for learners. Advanced learners, who have mastered the most frequent 2,000 words and academic words, need to study low frequency words to expand their vocabulary size and improve their proficiency. However, existing studies of low frequency words do not yield many findings to help learners learn such words efficiently. Without a valid word list that groups words for special needs together, learners must pick up such words through incidental learning, which is ineffective. Current low frequency word lists, like the BNC-20, have their drawbacks. Such a list is much bigger than the high frequency word lists or the AWL, and is too large to be learnt realistically. Moreover, for a given group of learners, some words are more useful than others. For example, for a learner entering a bakery, the high frequency word bread is clearly not specific enough. Words like éclair and ciabatta are probably more useful in this case. However, these two words are dispersed across the frequency list, and are therefore unlikely to be learned together if the learning materials are developed based only on the frequency level of words.
1.5 Implication for the present study
This literature review has evaluated research on word lists developed for three levels of vocabulary. Both the basic and the academic needs of learners have been analysed, and word lists have been developed accordingly. For beginner learners, the GSL and the BNC 2000 are useful guides to mastering essential words for general English use. For learners with academic needs, the AWL is an excellent source of words that are useful in a wide range of academic fields and topics. There is, however, a need for a new word list that caters to the needs of advanced learners, who require additional words for daily use in English-speaking contexts.
1.5.1 Survival Learning Syllabus
Previous research has focused on the lexical items necessary for the communicative needs of foreign visitors to English-speaking countries. The survival learning syllabus (Crabbe & Nation, 1991) was developed for visitors with little or no knowledge of English, who need only the most basic words and expressions to help them accomplish basic conversations during short-term visits. The focus of this syllabus was on survival, travel, and social needs (Crabbe & Nation, 1991, p. 192). The 120 expressions are classified into eight sections: greeting, bargaining, reading signs, getting to places, finding accommodation, ordering food, talking about yourself and to children, and controlling and learning language. However useful for survival needs, the items on the list would not help learners to engage in and maintain conversations with locals. As such, a new word list is needed to satisfy the higher social and communicative needs of advanced learners.
1.5.2 The need for a practical word list
The difference between the size and composition of L1 and L2 learners’ lexicons is largely due to the quality and quantity of input. The major source of input for L2 learners is written material in textbooks, much of which is abridged to suit the proficiency levels of the learners. This results in a gap in the learners’ lexicon, i.e. daily life words. Daily life words are used in everyday communication, and cover words used in various social settings, such as conversations between doctors and patients, grocery shopping, television programs and movies, leisure events, and talks between friends. Without these words, the ease and pleasure of natural language use will be compromised. Learners who do not know these words will have difficulty with both language input and output. For input, they may understand only part of the information in messages, or constantly use discourse strategies, like confirmation checks and clarification requests, which undermine the fluency and naturalness of communication. For output, learners may have to resort to communicative strategies like paraphrasing, which compromise the clarity of messages.
Surprisingly, there is no existing research in the SLA field that documents these words. Advanced learners of English already have a vocabulary greater than the 2000 most frequent words level, and need practical words that can help them achieve ease and pleasure in communication, while avoiding frustration and embarrassment. These words, though highly practical, may not be frequently used (e.g., snot), and may therefore not be easily picked up in natural use situations. These could be words that have been acquired by native speakers in childhood or through schooling, but have become less often used in adulthood. They could also be words from specialised domains of use that have entered into daily use, but have remained low frequency (e.g., sty). Singling out these words would have tremendous practical implications for advanced learners, and users of English living in a native speaker environment.
1.6 Research Question
The main task of this study is to develop and validate a word list for practical use in daily life by advanced learners/users of English. Advanced learners are operationally defined in this study as those learners from various L1 backgrounds, who are currently enrolled in an MA or PhD program in New Zealand, Australian, or British universities. The major research question is:
Which lexical items are perceived as practical in daily life and therefore known to all native speakers and yet unknown to most advanced non-native speakers living in an English-speaking context?