open dictionaries (i.e. the lack thereof)
Jun. 28th, 2004 06:52 pm
One of the "should exist" things that continues to amaze me is the lack of free, downloadable, open-source translation dictionaries.
I just spent 10 minutes searching and I haven't found anything worth linking to. WordNet is apparently far behind in other languages, and an interlingual WordNet seems FAR, FAR AWAY. But WHY??
I find this strange because the net benefit (benefit minus cost) of such an enterprise is enormous.
This project would cost very little because:
* We already have more than enough data in parallel corpora (for example, Canadian government or European Union data) to automatically extract translations quite reliably. (This was my project in January.)
* Anyone who moves to a new language community learns, within a few years, a very significant part of what anyone could expect from a "complete dictionary" (no such thing can actually exist; look up "Zipf distribution"). And people do not learn that fast. Therefore, only a few thousand words are needed per language. And another few thousand items to distinguish. Assuming a person can write 30 items / hour, this is only about 1000 man-hours / language (when not using any tools).
* There is already a ton of data around in the form of explanations about which words to use when, etc. See go_dutch and similar communities. Those people could simply formalize their contributions a little more.
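As a rough illustration of the first point (extracting translations from parallel corpora), here is a minimal sketch. It assumes a sentence-aligned corpus and scores word pairs with the Dice coefficient over sentence-level co-occurrence, which is just one simple technique and not necessarily what any real extraction system does. The function name `extract_translations`, the `min_score` threshold, and the toy English-French sentences are all made up for the example.

```python
from collections import Counter
from itertools import product


def extract_translations(pairs, min_score=0.5):
    """Return {source_word: (best_target_word, dice_score)}.

    `pairs` is a list of (source_sentence, target_sentence) tuples that
    are assumed to be translations of each other (sentence-aligned).
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        # Count each word at most once per sentence.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word of the pair.
        joint.update(product(src_words, tgt_words))

    best = {}
    for (s, t), c in joint.items():
        # Dice coefficient: 2 * |both| / (|source| + |target|)
        score = 2 * c / (src_count[s] + tgt_count[t])
        if score >= min_score and score > best.get(s, (None, 0.0))[1]:
            best[s] = (t, score)
    return best


# Hypothetical toy corpus, only for illustration.
toy_corpus = [
    ("the house is red", "la maison est rouge"),
    ("the car is red", "la voiture est rouge"),
    ("the house is big", "la maison est grande"),
    ("the car is big", "la voiture est grande"),
]

if __name__ == "__main__":
    result = extract_translations(toy_corpus)
    for word in ("house", "car", "red", "big"):
        print(word, "->", result[word][0])
```

On a corpus this small only the content words come out unambiguously: "the" and "is" appear in every sentence, so they tie between "la" and "est". Real parallel corpora are large enough to break most such ties, and a proper alignment model would do better still.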
The benefits would be:
* accessing dictionaries through the power of computing
* ordinary people never having to buy dictionaries, thesauri, or language tools again, or be limited by their non-digital or proprietary form.
* never having to ask humans to help you when you simply want to translate a word or to know which word to use in which context.
I believe that this net benefit is so big that I wouldn't mind seeing some government money used to finance such a valuable public good. But it's actually not really necessary! Someone with the leadership and time, please step up.