Shannon invented a way of measuring the entropy of English: have human subjects try to guess the next letter in a sentence, letter by letter, until the end of the sentence. The idea is that a more fluent speaker will be able to guess more accurately: learning is compression.
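As a rough sketch of how the guess numbers turn into an entropy estimate: Shannon's upper bound is just the entropy of the distribution of guess numbers. The frequency-order "guesser" below is a toy stand-in for a human subject, and the function names are mine, not Shannon's.

```python
import math
from collections import Counter

# Toy model of a subject who always guesses letters in English
# frequency order; a real human exploits context and does far better.
GUESS_ORDER = " etaoinshrdlcumwfgypbvkjxqz"

def guess_rank(letter):
    """Number of guesses the model needs to hit `letter` (1-based)."""
    return GUESS_ORDER.index(letter) + 1

def entropy_upper_bound(text):
    """Shannon's upper bound: the entropy of the guess-number
    distribution bounds the per-letter entropy of the text."""
    ranks = [guess_rank(c) for c in text.lower() if c in GUESS_ORDER]
    counts = Counter(ranks)
    n = len(ranks)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

print(round(entropy_upper_bound("learning is compression"), 2))
```

A better guesser concentrates the rank distribution near 1, which drives the bound down — which is exactly the "learning is compression" point.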
I can imagine different skills involved in this:
* at the character level: knowledge of spelling patterns in the language (which can be acquired rather quickly by a smart outsider)
* knowledge of the vocabulary and grammar
* knowledge of the semantics and pragmatics of the text
* knowledge of the domain of discourse
I propose instead a word-guessing experiment, to somewhat filter out distortions that might appear at the character-level.
I'd like to see how I score in all my languages, and even in languages I don't know at all, like German. A time constraint seems sensible too. Another design factor would be the number of tries allowed. Another possibility is to offer multiple-choice answers, especially for the word-guessing part: an adaptive program could learn to adjust to the user's ability and offer reasonable choices (not too easy, not too hard) for someone at that level of knowledge (which, btw, extracts maximum information about the reader :-) ).
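One way the adaptive idea could be sketched (everything below — the logistic skill model, the update rule — is my own assumption, not something proposed in the post): pick the difficulty at which the predicted success probability is closest to 1/2, since that is where each right/wrong outcome carries the most information about the user.

```python
import math

def p_correct(skill, difficulty):
    """Logistic model of success probability (an assumption)."""
    return 1.0 / (1.0 + math.exp(difficulty - skill))

def expected_bits(p):
    """Information in one right/wrong outcome at success prob p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def pick_difficulty(skill, candidates):
    """Choose the difficulty that maximizes expected bits per answer."""
    return max(candidates, key=lambda d: expected_bits(p_correct(skill, d)))

skill = 0.0
for answer_correct in [True, True, False, True]:
    d = pick_difficulty(skill, [-2, -1, 0, 1, 2])
    # nudge the skill estimate toward the observed outcome (Elo-style)
    skill += 0.5 * ((1 if answer_correct else 0) - p_correct(skill, d))
print(round(skill, 2))
```

At p = 1/2 each answer yields a full bit, which is one way to make precise the parenthetical about extracting maximum information about the reader.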
A comment, dated 2004-09-24: Anyway, this probability model is then pretty good; in fact, you can even generate random sentences that follow the model of the language. What you get is grammatically correct sentences that look OK, but with no meaning to them — correct language, with the meaning removed.
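The commenter's point about sampling from the model can be sketched with a tiny word-bigram chain (the corpus here is made up for illustration): locally the output follows the statistics of the language, but nothing constrains the overall meaning.

```python
import random
from collections import defaultdict

# Fit a word-bigram model and sample a "sentence" from it.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat saw the dog .").split()

model = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    model[prev].append(nxt)

random.seed(0)
word, out = "the", ["the"]
for _ in range(12):
    word = random.choice(model[word])  # sample the next word
    out.append(word)
print(" ".join(out))
```

With a richer model (longer context, more data) the output looks increasingly grammatical while staying just as meaningless — the commenter's "correct language with the meaning removed."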
So, by taking the compression rate of a good PPM compressor and subtracting the rate a human achieves, what you are actually measuring is the amount of "meaning information" in the text. Which is kind of cool.
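The subtraction the comment proposes is easy to sketch. Python's standard library has no PPM compressor, so bz2 stands in here (an assumption; a real PPM model on book-length text would compress much better, and bz2's fixed header overhead dominates on a snippet this short), and the human figure is Shannon's classic estimate of roughly 1 bit per character.

```python
import bz2

text = ("Shannon asked human subjects to guess English text letter "
        "by letter and used their guesses to bound its entropy.")

# Machine rate: compressed size in bits divided by character count.
compressed = bz2.compress(text.encode("utf-8"), 9)
machine_bpc = 8 * len(compressed) / len(text)

# Assumed human performance (Shannon's experiments put it around
# 0.6-1.3 bits/char; 1.0 is a convenient round figure, my choice).
human_bpc = 1.0

meaning_bpc = machine_bpc - human_bpc
print(round(machine_bpc, 2), round(meaning_bpc, 2))
```

The residual is (on this reading of the comment) the part of the text's structure that statistical modeling misses but human understanding captures.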