[personal profile] gusl
Shannon invented a way of measuring the entropy of English: have human subjects try to guess the next letter in a sentence, letter by letter, until the end of the sentence. The idea is that a more fluent speaker will be able to guess more accurately: learning is compression.
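As a rough sketch of how the numbers come out (this is a simplification of Shannon's actual procedure, and the guess data below are made up for illustration): record, for each letter, how many guesses the subject needed, and the entropy of that guess-number distribution gives an upper bound on the per-letter entropy of the text.

```python
from collections import Counter
from math import log2

def entropy_upper_bound(guess_counts):
    """Upper bound on per-letter entropy from a guessing experiment:
    the entropy of the distribution of guess numbers, where q_i is
    the fraction of letters guessed correctly on the i-th try."""
    n = len(guess_counts)
    freqs = Counter(guess_counts)
    return -sum((c / n) * log2(c / n) for c in freqs.values())

# Hypothetical data: number of guesses needed for each letter of a
# sentence. A fluent reader gets most letters on the first try.
guesses = [1, 1, 2, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 5, 1, 1, 1]
print(round(entropy_upper_bound(guesses), 3), "bits/letter")
```

A less fluent reader needs more guesses, the distribution spreads out, and the bound goes up.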

I can imagine different skills involved in this:

* at the character level: knowledge of spelling patterns in the language (which can be acquired rather quickly by a smart outsider)
* knowledge of the vocabulary and grammar
* knowledge of the semantics and pragmatics of the text
* knowledge of the domain of discourse


I propose instead a word-guessing experiment, to somewhat filter out distortions that might appear at the character-level.

I'd like to see how I'd score in all my languages, and even in languages I don't know at all, like German. A time constraint seems sensible too, and the number of tries allowed is another design factor. Another possibility is to offer multiple-choice answers, especially for the word-guessing part: an adaptive program would learn to adjust to the user's ability and offer reasonable choices (not too easy, not too hard) for someone at his/her level of knowledge (which, btw, extracts maximum information about the reader :-) ).
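One simple way the adaptive part could work (a minimal sketch; the function name and the difficulty rule are my own invention, not a worked-out design): track the user's recent success rate and vary how many answer options they see, so a fluent reader faces more distractors than a struggling one.

```python
import random

def adaptive_choices(candidates, success_rate, lo=2, hi=8):
    """Offer more answer options (harder) when the user's recent
    success rate is high, fewer (easier) when it is low.
    Maps success_rate in [0, 1] linearly to a count in [lo, hi]."""
    k = lo + round(success_rate * (hi - lo))
    return random.sample(candidates, min(k, len(candidates)))

words = ["the", "cat", "sat", "on", "mat", "a", "dog", "ran"]
easy = adaptive_choices(words, 0.1)   # struggling reader: few options
hard = adaptive_choices(words, 0.9)   # fluent reader: many options
```

A real system would also want the distractors themselves to be plausible at the user's level, e.g. drawn from a language model's top predictions, rather than sampled uniformly as here.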

(no subject)

Date: 2004-09-24 06:26 am (UTC)
From: [identity profile] mathemajician.livejournal.com
I've seen things related to this in PPM compression. Basically, PPM just predicts the next character in a string from the last n characters. It builds a simple probability table/tree and keeps statistics on how often each character follows each context (i.e. the n previous characters). In practice it works really well, and most data compression records are set by variants of this method.

Anyway, the resulting probability model is pretty good; in fact you can even generate random sentences that follow the model of the language. What you get is grammatically correct sentences that look OK but carry no meaning. Sort of correct language with the meaning removed.
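The context-counting and the random generation can both be sketched in a few lines (a single fixed-order model for illustration; real PPM blends several context lengths with escape probabilities, which this deliberately omits):

```python
import random
from collections import defaultdict, Counter

def build_model(text, n=3):
    """Count which character follows each n-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - n):
        model[text[i:i + n]][text[i + n]] += 1
    return model

def generate(model, seed, length=80):
    """Sample characters one at a time from the context statistics."""
    n = len(seed)
    out = seed
    for _ in range(length):
        counts = model.get(out[-n:])
        if not counts:  # unseen context: stop rather than guess
            break
        chars, weights = zip(*counts.items())
        out += random.choices(chars, weights=weights)[0]
    return out

corpus = "the cat sat on the mat and the dog sat on the log " * 20
model = build_model(corpus, n=3)
print(generate(model, "the"))
```

Text sampled this way tends to be locally well-formed (the spelling and short-range word patterns come from the counts) while being globally meaningless, which is exactly the effect described above.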

So, by taking the compression rate of a good PPM compressor and subtracting what a human can do, what you are actually measuring is just the amount of "meaning information" in the text. Which is kind of cool.
