Adso
The big thing I’ve been working on is getting Adso support into the zdt. Adso is the engine behind the cool newsinchinese website. I’ve been working off one of their flat-file MySQL databases. It has over 135,000 entries, compared to maybe 27,000 in CEDICT. That’s not a totally fair comparison, though, since their dictionary is laid out a little differently. For example, a CEDICT entry can hold multiple definitions, while Adso only has one definition per entry. Adso also captures data like parts of speech, which results in some duplicate entries, since some words act differently in different contexts. The only limitation I see with Adso so far is that it does not have traditional characters.
To get Adso to work I have to convert their MySQL flat file into my schema. Luckily my schema is basically just a simple subset of theirs: traditional characters, simplified characters, pinyin, and definition. The duplicates I mentioned above caused me the most trouble: first getting rid of them correctly, and second doing it in a reasonable time. On my first stab, the conversion algorithm took about 45 minutes to get through about 65,000 entries before I just stopped it. My second try got me almost there: about 10-15 minutes and maybe 90,000 entries before it ran out of memory. Finally, on my third try I got it down to 30 seconds and it finished the whole file successfully. You definitely need to choose the right data structures and algorithms if you’re parsing such a huge file.
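To give a feel for why the data structure matters here, below is a minimal sketch of one way to collapse duplicates in a single pass. It is not the actual zdt conversion code. The tuple layout, the merge_duplicates name, the sample rows, and the choice to join definitions with "/" are all assumptions made up for illustration. The idea is to key a hash map on (simplified, pinyin) so each incoming row costs one lookup instead of a scan over everything seen so far.

```python
def merge_duplicates(rows):
    """Collapse rows that share the same (simplified, pinyin) key.

    rows: iterable of (simplified, pinyin, definition) tuples.
    This layout is hypothetical, not the real Adso or zdt schema.
    """
    # (simplified, pinyin) -> dict used as an insertion-ordered set of definitions
    merged = {}
    for simplified, pinyin, definition in rows:
        definitions = merged.setdefault((simplified, pinyin), {})
        definitions[definition] = None  # a repeated definition is stored only once
    # Join the unique definitions into one entry per key.
    return [(s, p, "/".join(defs)) for (s, p), defs in merged.items()]


# Made-up sample rows, including an exact duplicate.
rows = [
    ("好", "hao3", "good"),
    ("好", "hao3", "well"),
    ("好", "hao4", "to be fond of"),
    ("好", "hao3", "good"),
]
entries = merge_duplicates(rows)
```

Comparing every new row against a growing list is quadratic in the number of entries, which gets painful at 135,000 rows. With a hash map the whole thing is roughly linear, and only one small set of definitions is kept per word.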
The resulting chinese.script file is a little over 14 MB, compared to 2.7 MB for the CEDICT version. I think the results are pretty good so far. I still have to do some more testing, though.
August 24th, 2005