May. 3rd, 2005

gusl: (Default)
It shouldn't be too hard to create an algorithm to identify an author's native language in English text.

At least for those with a lower level of English, it should be very easy to spot signature mistakes. For example, if a question begins with "what for [noun] ...", then the author's native language is very probably Dutch or German. (It's a literal translation of Dutch "wat voor [noun]" or German "was für [noun]", the usual way of saying "what kind of [noun] ...")
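A hand-written rule for this one signature could look something like the sketch below. The function name and the regular expression are made up for illustration, and the rule is deliberately crude: it only checks whether a sentence opens with "what for" followed by another word, so it would also fire on some legitimate English sentences.

```python
import re

# Hypothetical rule: a sentence that opens with "what for" followed by
# (optionally "a"/"an" and) another word. A bare "What for?" has no word
# after it and is not flagged.
WHAT_FOR = re.compile(r"^\s*what\s+for\s+(?:an?\s+)?\w+", re.IGNORECASE)


def has_what_for_signature(text):
    """Return True if any sentence in text opens with 'what for <word>'."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return any(WHAT_FOR.match(sentence) for sentence in sentences)


print(has_what_for_signature("What for car do you drive?"))
print(has_what_for_signature("What kind of car do you drive?"))
```

One rule like this says nothing about authors who don't make the mistake, so at best it is a single piece of evidence among many.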

I wonder if a generic machine-learning technique would discover this pattern when fed with a corpus of texts labelled with the author's native language.
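As a rough sketch of what "generic" could mean here, the code below trains a naive Bayes classifier on word bigrams, with a start-of-text marker so that "begins with" is visible to the model. Everything in it is an assumption for illustration: the four-sentence corpus is invented, the labels "nl" and "en" are arbitrary, and naive Bayes is just one simple choice of technique, not a claim about what would work best on real data.

```python
import math
import re
from collections import Counter, defaultdict


def bigrams(text):
    """Lowercased word bigrams, with a start marker before the first word."""
    words = ["<s>"] + re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def train(labelled_texts):
    """Count bigrams and documents per label."""
    counts = defaultdict(Counter)
    docs = Counter()
    for label, text in labelled_texts:
        counts[label].update(bigrams(text))
        docs[label] += 1
    return counts, docs


def classify(model, text):
    """Return the label with the highest add-one-smoothed log probability."""
    counts, docs = model
    vocab = {bg for label_counts in counts.values() for bg in label_counts}
    total_docs = sum(docs.values())
    best_label, best_score = None, -math.inf
    for label in counts:
        total = sum(counts[label].values())
        score = math.log(docs[label] / total_docs)
        for bg in bigrams(text):
            # The extra +1 in the denominator reserves mass for unseen bigrams.
            score += math.log((counts[label][bg] + 1) / (total + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy corpus; a real one would need many texts per language.
corpus = [
    ("nl", "What for car do you drive?"),
    ("nl", "What for music do you like?"),
    ("en", "What kind of car do you drive?"),
    ("en", "What kind of music do you like?"),
]
model = train(corpus)
print(classify(model, "What for book is that?"))
```

On this toy data the bigram ("what", "for") occurs only under the "nl" label, so the classifier picks up the pattern without being told about it. Whether that carries over to a real corpus is exactly the open question.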

It should be easier than identifying the author's gender, in any case. Apparently, no one claims to guess the author's gender with more than 80% accuracy. I find this unsatisfactory.
