Wednesday, April 25, 2007

Word List Generation

I wonder whether there is any software that can generate a frequency-ordered word list from a piece of Thai text. That is, I'd like a tool that would let me paste in a few pages of text and get back a list of the words in decreasing order of frequency.

There are plenty of such tools for English and other languages that use spaces to delimit words, but a quick web search didn't turn up anything that can be used for Thai.

I would like to use such a tool for vocabulary practice, so that I can study the most frequent words in a transcript first. That seems more efficient than studying unknown words in the order they appear.

I do have a way to obtain a frequency-ordered word list from Thai text, but it involves several different applications and copy-paste operations, so it's not very easy or efficient. I'd like to be able to do it with the press of a button.
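For what it's worth, the counting step on its own is simple once the words are separated. Here is a minimal Python sketch of that step only, assuming the Thai text has already been segmented so that words are separated by spaces (or some other delimiter). The function name `frequency_list` and the sample sentence are made up for illustration; the segmentation itself, which is the hard part for Thai, is not handled here.

```python
from collections import Counter


def frequency_list(segmented_text, delimiter=None):
    """Return (word, count) pairs in decreasing order of frequency.

    Assumption: the words in segmented_text are already separated by
    `delimiter` (any run of whitespace when delimiter is None).
    This function does NOT segment Thai text itself.
    """
    # Split on the delimiter and discard empty or whitespace-only pieces.
    words = [w.strip() for w in segmented_text.split(delimiter) if w.strip()]
    # most_common() sorts by count, highest first; words with equal
    # counts stay in the order they were first seen.
    return Counter(words).most_common()


if __name__ == "__main__":
    # Hypothetical pre-segmented sample, words separated by spaces.
    sample = "ผม ชอบ กิน ข้าว ผม ชอบ อ่าน หนังสือ ผม"
    for word, count in frequency_list(sample):
        print(count, word)
```

If the segmenter marks word boundaries with another character, such as `|`, it can be passed as `delimiter`.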

2 comments:

rikker said...

Scott, there are some less-than-perfect tools out there.

Here is one, or here is another version of the same tool.

Both accept only Unicode text, and both require you to pre-segment the text. This, of course, is a problem. Here is a (not very good) segmenter.

I'm interested in a tool like this, too, and when I know of a better one, I'll let you know.

Scott Imig said...

Thanks! That's the sort of thing I'm looking for.

By the way, thai2english.com also segments Thai text if you paste it into the big text box on the home page.