Wordle with Teeth: U of Quebec’s Vocab Profiler
Contents:
1) Dry but necessary (and interesting, really) background to computer-assisted “corpus linguistics”;
2) Application of Vocab Profiler to a little “Scribe 2.0” medieval satire I had fun writing way back when;
3) Tutorial on some uses of Vocab Profiler to aid in scaffolding classroom reading comprehension;
4) Caveats and Take-Aways
Dan Meyer just posted a refreshing little jeremiad against Wordle and Animoto, with which I largely agree - they’re fun, sure, but require all the mental and creative activity of slipping some coins into a Coke machine and then pulling the can from the tray.
Anyway, that post spurred me to share an old tool I’ve used since teaching ESL in Shanghai a few years ago. It’s called VocabProfiler, and it’s one of a muscular suite of tools made freely available at the Université du Québec à Montréal’s “Compleat Lexical Tutor.”
To appreciate what the VP does, it helps to have a bit of background in corpus linguistics, the computer-assisted analysis of word frequency in the English language.
OVERVIEW OF CORPUS LINGUISTICS FOR VOCABULARY TEACHING:
By scanning millions of words from a broad representative sample of English texts - newspapers, textbooks, novels, and more - and then counting the frequency of both words and word families (like Latinate words and their derivations), linguists could identify which words in the English language are necessary to know in order to comprehend an average adult text.
ESL research posits that for a text to be suitable for a reader (at the reader’s “instructional level”), that reader should know 95% of the words in the text. In other words, if the reader is unfamiliar with more than 1 out of 20 words in the text on average, comprehension breaks down, and the reader is cursed with a text at “frustration level.”
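For the code-minded, here is a minimal sketch of that 95% rule of thumb in Python. The function names, the tokenizing pattern, and the tiny "known words" set are all my own illustrative assumptions, not anything taken from the Vocab Profiler itself.

```python
import re

def known_word_ratio(text, known_words):
    """Share of running words in `text` that appear in `known_words`."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z'’]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)

def reading_level(text, known_words, threshold=0.95):
    """'instructional' at or above the threshold, otherwise 'frustration'."""
    ratio = known_word_ratio(text, known_words)
    return "instructional" if ratio >= threshold else "frustration"

# Hypothetical reader who knows these words only.
reader_vocab = {"the", "monk", "sat", "in", "hall"}
print(reading_level("The monk sat in the hall", reader_vocab))
print(reading_level("The neophyte sat in the scriptorium", reader_vocab))
```

The first sentence is fully known, so it lands at instructional level; the second has two unknown words out of six, which is far more than 1 in 20.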
Corpus linguistics helps us figure out which words and word families are necessary to know to reach that golden 95% of lexical knowledge that enables reading comprehension. What it shows is that learning the 1,000 most frequent words and word families in the English language allows a reader to comprehend around 74% of words in an academic text (and the percentage is higher for newspapers, fiction, and conversational English).
While 74% lexical comprehension is a powerful payoff for learning those first 1,000 words (the “1k list”), you’ll note that it still leaves us 21% short of the magic 95%.
Learning the second most frequent 1,000 words and word families, though (the “2k list”), covers an average of 5% more (in academic texts). So knowing the first 2,000 most frequent word families in the language brings us up to 79% lexical familiarity with academic texts. That’s powerful and efficient vocab work, but it still leaves us 16% short, still doomed to frustration-level reading - “barking at text” without comprehending the meaning.
And here’s where things get interesting. Logically, you’d think learning the 2,001st-3,000th most frequent words would be the sensible thing to do after learning the first 2k, but that’s not true. You only gain a percentage point or two. (Think of the Long Tail applied to English vocabulary: we all use the top 2k, but after that, the hundreds of thousands of other words in the language flatten out.) Instead, your next big vocab-learning investment should be in a group of 570 word families that, once learned, add another whopping 8.5% of lexical comprehension to academic texts. This group is called the Academic Word List (AWL).
The AWL is that group of words that show up frequently across the entire range of academic disciplines - words like this list from a text of my own I just crunched: “accurate, achieve, conclude, conduct, create, deny, error, final, focus, ignorant, instance, labour, normal, plus, task, tradition.” [Update: An excellent set of online quizzes for all 570 of these AWL word families, arranged in sublists from easier to harder, can be found at this University of Victoria site.]
So let’s recap the math:
Lexical Coverage of Average Academic Text from Learning the 2,000 Most Frequent Word Families + the 570 Academic Word List:
1k list ≈ 73.5%
+2k list ≈ 4.6%
+AWL ≈ 8.5%
total comp: 86.6%
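If you'd rather let a machine do the adding, this small Python sketch runs the same recap and also shows how far each step leaves us from 95%. The coverage figures are the ones in the recap; the variable and function names are just my own labels.

```python
# Coverage gains for an average academic text, as listed in the recap.
coverage_steps = [
    ("1k list", 73.5),
    ("2k list", 4.6),
    ("AWL", 8.5),
]
TARGET = 95.0  # the "instructional level" threshold

def cumulative_coverage(steps, target=TARGET):
    """Return (list name, running total %, remaining gap to target %)."""
    rows = []
    total = 0.0
    for name, gain in steps:
        total += gain
        rows.append((name, round(total, 1), round(target - total, 1)))
    return rows

for name, total, gap in cumulative_coverage(coverage_steps):
    print(f"{name:8s} cumulative {total:5.1f}%   gap to 95%: {gap:4.1f}%")
```

The final row gives 86.6% covered and 8.4 points still to find, which is the gap rounded in the prose to "87% to 95%."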
Here’s a nice graphic (source) showing the coverage of those word/word families for conversational English, fiction, and newspapers, respectively:
Before moving on to the tool I promise to show you, let’s deal with the question: How do we know what words to teach to bridge the gap from 87% to 95%?
The short answer: we don’t (or I don’t, anyway). One helpful approach is to focus on the specialized academic language specific to any discipline. Words like metaphor, simile, paradox, hyperbole, and so forth, for example, are not on the AWL because they’re specialized literary vocabulary, so if you’re a literature teacher, you know the specialized literary words that are high-frequency to your content area. Teachers in math, science, social studies, art, music, and so forth similarly know the high-frequency vocabulary of their disciplines. Those specialized subject-area lists are an obvious first step to bridging that gap. BUT HERE’S THE BEAUTY (cue segue music): you can use the vocabulary profiler to identify any words not in the 2k and AWL that are in your course readings, and pre-teach them to enable 95% comprehension of your specific class texts. (You can use it for much more, but one thing at a time.)
So let’s move on to the Vocabulary Profiler.
Let’s pretend that you are assigning my post “Adventures of Scribe 2.0” - a bit of satiric fiction I abandoned after that post in the second month of this space, when only Diane Cordell, Patrick Higgins, and Christopher Watson were reading me.
Here’s the text:
Monk Expelled for Creating “Devil’s Workshop”
29 December, Anno Domini 1527
Wittenberg, Saxony
The Catholic Press
Visionius Neocogitus, a 21-year-old neophyte in the hallowed Benedictine Monastery in Wittenberg, was expelled from the Order yesterday for disobeying the Abbot, dishonoring the time-honored traditions of the ancient Order, and “making pacts with the Devil.”
The young neophyte was charged by his Abbot, Father Orthodoxius Paleologus, with shirking his sacred duties in the scriptorium, malingering, and spreading heretical ideas.
“Young Neocogitus is not suited to holy work,” said Paleologus. “From the moment he entered the Brotherhood, he was a force of discord and disobedience. Not an hour passed without Neocogitus doing something to disrupt the solemn traditions of the Order. After much soul-searching, fasting, and praying over the problems caused by this wayward youth, the Holy Spirit finally spoke to me, and said, ‘For the good of the Order, Neocogitus must go’.”
A Promising Beginning
The trouble began on the first day the young neophyte was brought for training in the scriptorium, the vast chamber in which monks of the Order have been hand-copying the Holy Writ for the last 600 years.
According to Friar Heironymous Tuck, the monk charged with training young Neocogitus in the science of holy transcription, the neophyte was arrogant, sarcastic, and insubordinate within a minute of entering the hall.
“Neocogitus beheld the glorious sight of these dozens of God’s servants, backs bowed and heads down in pious, meditative labor, dutifully performing God’s work,” said Tuck. “And he had the audacity to snicker. I smelt a whiff of sulfur, and knew we had a heretic in our ranks.”
But Neocogitus mastered his spleen, said Tuck, and went on to prove himself an unusually able apprentice.
“There’s no denying the young man was uncommonly quick to learn,” Tuck continued. “All of the finer points of the book-copying arts–copying in neat script, in straight lines, spelling correctly, and above all, staying awake and alert–Neocogitus mastered within a day.”
Indeed, by the end of his first eight-hour duty, the young man had produced a flawless reproduction of the Book of Genesis–a task that took three times as long for far more experienced scribes.
More remarkable still, by the end of his first month as an apprentice scribe, Neocogitus achieved what had never been done before: he had produced an entire copy of the Bible–all 66 books, plus the Apocrypha.
“It was a miracle,” said Tuck. “It had never been done before. And it was perfect, flawless: I personally checked each and every line for errors–and there were none.”
Abbe Paleologus heard the news with joy. “It seemed a sign from heaven,” Paleologus said. “There were so many heathens living in the darkness, helpless to see the light without God’s word. Yet, because there were so many more heathens than there were monks to copy the Holy Writ–and because it normally took six months for a scribe to produce one correct copy–it seemed we would never be able to rescue all the heathens from their ignorance.”
“But this young monk, Neocogitus,” the Abbot continued, “seemed sent to improve our chances. I had heard that his conduct was often troublesome, irreverent, and lacking in humility. But at the time, I thought this was one more instance of God’s mysterious ways. This whelp would help us spread God’s word with godspeed.”
“Little did I know,” he concluded, “that this was not God’s work at all–but the Devil’s.”
The Devil’s Work
Neocogitus’ first Bible aroused curiosity throughout the monastery. The Order was abuzz with talk about its inerrancy, legibility, elegant script. Above all, however, the talk focused on this question: how had the young man produced it so fast? Was it really possible to produce an accurate copy of the entirety of the scripture in one short month?
To get to the bottom of this mystery, Father Paleologus summoned Neocogitus to his chambers for a private interview.
[WHAT is the secret to Neocogitus’ miraculous powers? LEARN THIS, and more, in the NEXT EPISODE! ON SALE AT BLOGSTANDS SOON!!!]
–There’s a lot of specialized vocabulary here from Medieval history, religious studies, and more, so you know your students won’t be familiar with much of it. Solution: crunch it through the Vocabulary Profiler. After copy/paste/submit, this is what you get on top of the screen: an overall breakdown of the percentages of words from the lexical bands relative to the entire text, as you see here (click this and all further images for larger view):
How is this helpful? Several ways, primary among which is the simple breakdown it gives you of the percentage of words not in the high frequency bands. If that percentage is high, then maybe it’s simply not a text suitable for the readability level of your class’s age group. This is a common problem in high school, where content teachers untrained in literacy research go overboard by assigning college-level texts to students incapable of comprehending their lexical (and syntactic, but that’s a different beast) complexity.
Scrolling down that same screen, the next thing you see is a color-coded breakdown of where each word in the text falls in the word frequency range from corpus linguistics. (Blue = 1k, Green = 2k, Yellow = AWL, Red = Lower-frequency “off-list” words) (click image for larger view):
While it ain’t as pretty as Wordle, it’s much more useful for teaching (and learning, as I’ll show later). You can simply have students scan the red words themselves to look them up before reading, or you can prep the pre-teaching vocab lesson yourself.
“But wait a minute,” you say. “What if it’s a long text - say, a 20 page chapter from a novel?” Good question, and VP helps you with the following breakdown as you scroll down the screen (click image for larger view):
–what you see in this “type list” (“types” are the individual words in the text) is an alphabetized list of each word, grouped in the four frequency bands, and followed by the number of appearances each word makes in the text. This is key. In the above example, we can see that the words neophyte, monk, heretic, and abbot in the red “off-list” (low-frequency) range show up several times in this short text, and thus decide to give them more emphasis in pre-teaching the text’s vocabulary. (If this were a longer text, this frequency count of off-list words would serve the same purpose.) You’ll also note it gives you a handy list of the general purpose academic vocabulary from the AWL that will benefit all students across the curriculum.
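To make the idea of a type list concrete, here is a toy Python sketch of the same kind of breakdown. It is emphatically not how the Vocab Profiler is implemented: the function name is mine, the word sets are tiny made-up stand-ins for the real 1k, 2k, and AWL lists (membership here is for illustration only), and it matches exact word forms rather than word families.

```python
import re
from collections import Counter

def profile(text, bands):
    """Group each word type by frequency band, with its count in the text.

    `bands` maps a band name to a set of words; bands are checked in order,
    and anything found in none of them lands in "off-list".
    """
    tokens = re.findall(r"[a-z'’]+", text.lower())
    counts = Counter(tokens)
    result = {name: {} for name in bands}
    result["off-list"] = {}
    for word, n in sorted(counts.items()):  # alphabetical, like the type list
        for name, words in bands.items():
            if word in words:
                result[name][word] = n
                break
        else:
            result["off-list"][word] = n
    return result

# Made-up sample bands: stand-ins only, NOT the real word lists.
sample_bands = {
    "1k": {"the", "was", "by", "said", "had", "made", "an", "in", "his"},
    "2k": {"charged"},
    "AWL": {"error", "task"},
}
sample_text = ("The neophyte was charged by the abbot. "
               "The abbot said the neophyte had made an error in his task.")

for band, words in profile(sample_text, sample_bands).items():
    print(band, words)
```

Sorting the off-list words by count instead of alphabetically would put the most repeated ones first, which is roughly the question you care about when deciding what to pre-teach.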
TAKE-AWAYS, CAVEATS, and REQUEST FOR CAVILS:
What are some other take-aways from this little tutorial?
Students can use this tool to analyze (learn about) their own lexical sophistication. Have them drop their latest essay or story into the profiler, and let them see how many of their word choices are sophisticated enough to fall outside of the top 2000 words. The colors won’t lie.
Teachers (and students) can create vocab quizzes on Quizlet (better than Mystudiyo, I’ve decided - check it out) or elsewhere using this to help them identify which words to prioritize in vocabulary study. If you don’t know any words in the blue list, by George you should. Ditto the green and yellow. Beyond that, we’re talking less frequent, thus less crucial words to know by heart.
Corpus linguistics ain’t perfect. Some words - homonyms, for example, but also words with varying meanings depending on context - skew the results in the analysis. But that’s life for now.
Syntax is still important. Knowing vocab isn’t enough. Sentence structure and grammatical functions need attention too (duh - but you’d be amazed how many people think language learning is all about vocab).
Fancy SAT words are well-and-all, but many students don’t know many words in the 2k + AWL, and they’re more important. Teachers can help them fill the gaps in their knowledge of these words with the VP. “Utilize” is less important (and to my demotic tastes, less elegant) than “use.” English teachers take note.
This tool is especially helpful for online texts. Copy-paste novel chapters, textbook chapters, etc. into the profiler, and your pre-reading activities are instantly enhanced and guided by literacy research.
The Compleat Lexical Tutor is great for making cloze exercises and all sorts of other things.
This post is way too long. Sorry. But it should be a spell more useful than Wordle.
What about you - anything to add?
References:
Nation, P. (2001). Learning vocabulary in another language. New York: Cambridge University Press.
14 Comments
-
At July 15, 2008, Arthus Erea wrote:
Wow! Thanks for sharing this... it looks incredibly useful and far more statistically justifiable than Wordle. There's no point in having a pretty visualization if the underlying statistics aren't useful.
Also, I'm with you there on Quizlet: I've been using it for a couple of years for all of my low-level-subject quiz testing. However, the one thing it doesn't support is multiple choice questions (which Mystudiyo does).
Finally, just for kicks I plugged in a few blog posts into the profiler (both by me and others): over 15-20% of words were on the Off-list, but this is skewed by the pervasiveness of buzzwords (Twitter, digital footprint) which the average person doesn't know, but anyone worth their hosting in the Edublogosphere does.
-
At July 15, 2008, Clay Burell wrote:
Arthus,
Glad you see the value. And good point about buzzwords and proper nouns. You can exclude them by entering them in the "exclude" box on the site before submitting to get more valid results, and for purposes of readability level evaluation, maybe should.
Or you could just count the number of buzzwords, subtract them from the breakdown, and mentally calculate the new ratio of offlist words, I guess. I'd have to ask somebody who can add, subtract, and figure percentages to do that for me, though.
The really cool thing about Quizlet, as I'm sure you know, is that it was created by a high schooler, not a British Ed. D. candidate. ;-)
-
At July 15, 2008, Arthus Erea wrote:
Clay,
Yup, I've actually talked with him on quite a few occasions. He might possibly be the only high schooler on the 'net who is smarter than me. :P (just kidding: we all know that is Lindsea)
-
At July 15, 2008, Tod Baker wrote:
For English language learners and native speakers, this looks like a tool that can help teachers differentiate effectively. I'll look into it further. Thanks.
Tod
-
At July 16, 2008, Robb McCollum wrote:
Thanks for the explanation and link (my wife passed on your blog from a twitter feed she got). I've been trying to use corpus research more in my teaching, including Mark Davies's corpus.byu.edu.
However, it's nice to have a fast and easy-to-use tool so that graduate ESL students can start taking responsibility for their own vocabulary development. WordCruncher is powerful, but not web-accessible and perhaps too complex for casual users. Besides, the connection to the AWL and other word lists is very helpful. Great tool!
-
At July 16, 2008, Corrie Bergeron wrote:
Excellent! Sending it along to my English and medical terminology faculty (big nursing/allied health programs here), as well as the manager of the ESL tutors. Good stuff!
-
At July 16, 2008, diane wrote:
A little gift from a long-time admirer ;-)
http://animoto.com/play/YrgMD4CVkDohywg3QPhV9g?autostart=false
-
At July 16, 2008, Nate Stearns wrote:
Interesting. I plopped in Act I, Scenes 1-2 of Macbeth and got a cool little red list (alarum_[1] anon_[1] assault_[1] attendants_[1] bade_, etc.) of vocab words to preteach. Vocabulary has never been my bag, baby, as I was traumatized by stupid lists as a child. Still, I do like the idea of using the Vocab profile to have kids check out vocab list beforehand. Many of the major works we teach are often online in full text form and--theoretically-you could have kids drop in the chapter, do a post on the vocab, do the reading, and then revisit the vocab...Doesn't sound like a total blast, but worth trying.
-
At July 16, 2008, OLDaily ~ by Stephen Downes wrote:
[...] to develop quizzes, among many possible uses. -HJ Clay Burell, Beyond School, July 15, 2008 [Link] [Tags: none] [...]
-
At July 16, 2008, Clay Burell wrote:
Nate,
Shakespeare is so close to a foreign language in comparison with contemporary English, it's no wonder you're going to have a massive list of offlist (red) words.
Couple that with the research on readability, and you have the main factor in kids not loving Shakespeare like we do: far less than 95% lexical familiarity.
I _always_ pre-teach words like "anon, e'er, soft, aye," and a million other high-frequency Elizabethan words - let's not forget thee, thou, thine, the royal we and our, on and on - before even starting any play. I put them on a word wall and revisit them regularly.
But think about what would happen if you input the ENTIRE play into the profiler: you'd get a list of the most frequent words in the play, and that should be a valuable guide in helping you identify which words in that sea deserve most emphasis.
Gotta run.
-
At July 16, 2008, Biblical Studies and Technological Tools: Vocabulary Profiling wrote:
[...] To understand what this tool does, it helps to understand a bit of corpus linguistics, but this blog posting will give you a quick background. I've talked about this kind of stuff before with reference to [...]
-
At July 19, 2008, A World History Book for All Ages and Reading Levels: Gombrich’s “A Little History of the World” | Beyond School wrote:
[...] Like many English language learners, they’re wonderfully bright, but challenged by the readability level of their assigned high school texts. And like many students generally, all their years of [...]
-
At July 21, 2008, PaulV8 wrote:
Thanks for posting this, Clay. I've enjoyed your blogging. You are an inspiring teacher. I'll have to try this out since I may have 11th and 12th grade remediation English classes this year. Heck, this will help any classes.
-
At July 21, 2008, Clay Burell wrote:
@Paul,
Such a nice gesture to receive in a rough week. Thanks for that.
Clay



