Wednesday, June 28, 2006

To get some sense of my progress, I've been trying to estimate how many words I know in Thai. I have approached this in a couple of different ways. First, I've thought through related sets of words. I know twelve months, ten colors, seven days of the week, forty-four consonants, etc. Second, I've glanced over word lists, such as the vocabulary index of a Thai textbook and the entries in a learner's dictionary. By using the size of a list and the percentage of words I recognize, I can produce an estimate. My best estimate is that I know about a thousand words.
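The word-list method can be sketched in a few lines of Python. The list size and sample counts below are invented for illustration; they are not the actual figures from the textbook index or the learner's dictionary.

```python
def estimate_known_words(list_size, sample_checked, sample_recognized):
    """Scale the recognition rate seen in a sample up to the whole word list."""
    if sample_checked <= 0:
        raise ValueError("sample_checked must be positive")
    recognition_rate = sample_recognized / sample_checked
    return round(list_size * recognition_rate)


# Hypothetical numbers: a 2,500-entry index, 200 entries checked, 80 recognized.
estimate = estimate_known_words(list_size=2500, sample_checked=200, sample_recognized=80)
print(estimate)
```

The estimate is only as good as the sample, so checking entries spread across the whole list should work better than checking only the first few pages.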

In doing this exercise, I've realized that it's actually not clear to me what it means to "know" a word. While very frequent words are clearly "known", and words I've never heard are "unknown", there is a whole range of other possibilities. There are words that I understand when listening but cannot correctly use. There are words that I recognize and understand only in context, and there are words for which my sense is still emerging and incomplete. Even taking into account this ambiguity, I think 1000 words is reasonably accurate.

Now that I know what I know, I'd like to find statistics for Thai showing that the most frequent 1000 words cover x% of spoken language, the most frequent 2000 words cover y%, etc. This lexical coverage information is easy to find for English, but I have been unable to find anything for Thai. So I've resigned myself to trying to estimate for Thai by using what is known for English.

One consideration in trying to apply English lexical coverage to Thai is that Thai morphology is not as productive as that of English. An ESL learner who acquires a word like "create" also acquires a whole family of words, including "creates", "created", "creative", "creation", and "recreate". In Thai, there are no such families of words. Other words function in place of morphology. For example, to say "created", a Thai speaker would say "create already". Word families in Thai are families of one.

I found some research on Marlise Horst's website showing that, with a vocabulary of the thousand most frequent word families in English, students understand about 85% of spoken language. To increase that comprehension to 98%, a vocabulary of 6000-7000 word families is needed. Due to the difference in morphology, statistics for word families in English might give a rough approximation of statistics for individual words in Thai. This jibes with my experience. With my thousand-word vocabulary, I think it's accurate to say that I understand about 85% of spoken Thai. This assumes an idealization where the only impediments to following a dialogue are vocabulary and grammar. The ability to follow spoken dialogue at a normal rate of speed in a variety of regional accents is a separate issue.

The Linguist, an interesting ESL website, has another way to measure proficiency in a second language using the number of known words.

Beginner: a) 2,000 b) 3,500
Intermediate: a) 5,000 b) 7,500
Advanced: a) 10,000 b) 12,500
(source: The Linguist blog)

This system is for English, and every word in a word family is counted, so an attempt to apply it to Thai would again require taking into account the difference in morphology. Playing with the numeric data from Horst's site, it appears that there is an average of about two words in an English word family, with the most frequent families being the largest. Since Thai has word families of one word each, it seems reasonable to multiply the number of words in my vocabulary by a little more than 2 to get a rough estimate of an equivalent ESL vocabulary. With my thousand-word vocabulary, I'm the equivalent of an ESL student a little past "Beginner a". This seems about right.
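As a rough sketch of that conversion in Python: the thresholds come from The Linguist's scale quoted above, but the multiplier of 2.1 is only a stand-in for "a little more than 2", and the function names are made up for this example.

```python
# Thresholds from The Linguist's scale (number of known English words).
LEVELS = [
    ("Beginner a", 2000),
    ("Beginner b", 3500),
    ("Intermediate a", 5000),
    ("Intermediate b", 7500),
    ("Advanced a", 10000),
    ("Advanced b", 12500),
]


def esl_equivalent(thai_words, words_per_family=2.1):
    """Convert a Thai word count to a rough English word count.

    words_per_family is an assumed value standing in for
    "a little more than 2"; it is not a measured figure.
    """
    return thai_words * words_per_family


def linguist_level(english_words):
    """Return the highest level whose threshold has been reached."""
    level = "below Beginner a"
    for name, threshold in LEVELS:
        if english_words >= threshold:
            level = name
    return level


print(linguist_level(esl_equivalent(1000)))
```

With these assumptions, a thousand Thai words lands just past the first threshold, which matches the reasoning above.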


Paul Davidson said...

Since you seem to be conversational in Thai, I'd be surprised if your vocabulary wasn't larger — especially given the language's foreignness. I know about 6,000 words in Japanese and that's really just enough to get along.

A good (and fairly obvious) way to estimate it is to find your largest Thai dictionary, pick five pages at random, count the number of words you know on those pages, divide by five to get an average per page, and multiply by the number of pages in the dictionary.
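A minimal Python sketch of this page-sampling idea, scaling the average number of known words per sampled page by the page count. The page count and per-page tallies are invented for illustration, and the function names are hypothetical.

```python
import random


def pick_pages(total_pages, sample_size=5, seed=None):
    """Choose distinct page numbers at random from 1..total_pages."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_pages + 1), sample_size))


def estimate_from_dictionary(known_per_page, total_pages):
    """Average the known-word counts per sampled page, then scale to all pages."""
    if not known_per_page:
        raise ValueError("need at least one sampled page")
    average_known = sum(known_per_page) / len(known_per_page)
    return round(average_known * total_pages)


# Hypothetical example: a 600-page dictionary and five sampled pages.
pages = pick_pages(total_pages=600, sample_size=5, seed=1)
known_counts = [3, 1, 4, 2, 2]  # made-up tallies of known words on each page
vocabulary = estimate_from_dictionary(known_counts, total_pages=600)
print(pages, vocabulary)
```

Five pages is a small sample, so repeating the exercise with a different set of pages gives some sense of how much the estimate varies.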

Scott Imig said...

That's a great idea. I'll try it. I have a couple of Thai dictionaries I can use for this.

This post is actually from June, but the RSS feed updated the time stamp when I added the label for morphology. I've made progress since June, so my vocabulary is definitely larger now.

I actually have a long way to go before I'd call myself conversational. I often get stuck by not knowing or remembering common words when I hold a conversation. (The other person can often help me out when this happens.)

Out of curiosity, when you say 6000 words, are you counting every form of the word you know, or just the base form? For example, in English, would you count "walk", "walks", "walked", "walking" as four words or one?

Paul Davidson said...

For a meaningful word count, conjugations of a word are all just one word. However, most languages have some way of making short word phrases that have a separate meaning, and each of those should count as a separate word.

For example, in English, "show", "show up", and "show off" deserve to be counted separately. A big dictionary will likely give separate definitions for all of these.

The tendency Japanese has is to make compound verbs; even if I know the individual verbs, their meaning together will be listed separately in the dictionary, and I count it separately.

I'm sure there are instances in Thai you can think of. I don't know Thai well enough myself to say.

Scott Imig said...

Those are good thoughts.

Thai has compounding as well, with verbs and other parts of speech. Like your Japanese dictionary, the Thai dictionary lists compounds separately, and it seems right to count them separately. E.g. "to seek" is one word, but "to seek fame" means to campaign in an election, a totally different meaning.

Thanks for your ideas. I'll give it a try.

Scott Imig said...

Actually, I have to amend the translations I gave for the components of the compound words above.

I was thinking of หา ("seek") and หาเสียง ("campaign") above, but a better componentized translation for หาเสียง is "to seek votes". When I translated the compound word, I was thinking of ชื่อเสียง ("fame", or "name voice"), which is another compound word made from ชื่อ ("name") and เสียง ("voice" or "vote").

The counting principle you suggested still holds, though. Clearly หา, เสียง, and หาเสียง are three distinct words.