[personal profile] gusl
It shouldn't be too hard to create an algorithm to identify an author's native language in English text.

At least for those with a lower English level, it should be very easy to spot signature mistakes. For example, if a question begins with "what for [noun] ...", then the author's native language is very probably Dutch or German. (It's a literal translation of the Dutch "wat voor [noun] ...", i.e. "what kind of [noun] ..."; German has the same construction with "was für".)
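A rule like this can be sketched by hand as a small table of regular expressions. This is only an illustration: the `SIGNATURES` table and the `guess_native_languages` function are made-up names, and the pattern does not check that the word after "what for" is really a noun (that would need a part-of-speech tagger), so it will also fire on some perfectly native sentences.

```python
import re

# Hypothetical signature table: each regex maps to the native languages it hints at.
# Only the "what for [noun]" rule from the text is included.
SIGNATURES = [
    (re.compile(r"^\s*what for (?:a |an )?\w+", re.IGNORECASE), {"Dutch", "German"}),
]

def guess_native_languages(sentence):
    """Return the set of languages hinted at by any matching signature."""
    hints = set()
    for pattern, languages in SIGNATURES:
        if pattern.search(sentence):
            hints |= languages
    return hints
```

With this table, "What for car do you drive?" is flagged as Dutch or German, while "What kind of car do you drive?" and a bare "What for?" are not.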

I wonder if a generic machine-learning technique would discover this pattern when fed with a corpus of texts labelled with the author's native language.
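As a rough sketch of what "generic" could mean here, a naive Bayes classifier over word bigrams is about the simplest thing that could pick up "what for" without being told about it. Everything below is an assumption for illustration: the function names are invented, the four-sentence `TOY_CORPUS` is made up rather than real data, and a real experiment would need a large labelled corpus and a proper evaluation.

```python
import math
from collections import Counter, defaultdict

def bigrams(text):
    """Split on whitespace and return adjacent word pairs."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def train(labelled_texts):
    """labelled_texts: iterable of (text, native_language) pairs."""
    counts = defaultdict(Counter)  # language -> bigram counts
    docs = Counter()               # language -> number of texts seen
    for text, lang in labelled_texts:
        counts[lang].update(bigrams(text))
        docs[lang] += 1
    vocab = {bg for c in counts.values() for bg in c}
    return counts, docs, vocab

def classify(text, model):
    """Return the language with the highest naive Bayes log-score."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_lang, best_score = None, -math.inf
    for lang in docs:
        total = sum(counts[lang].values())
        score = math.log(docs[lang] / total_docs)  # class prior
        for bg in bigrams(text):
            # Add-one smoothing, with one extra slot for unseen bigrams.
            score += math.log((counts[lang][bg] + 1) / (total + len(vocab) + 1))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

# Invented toy data, only to show the mechanics.
TOY_CORPUS = [
    ("what for car do you drive", "Dutch"),
    ("what for music do you like", "Dutch"),
    ("what kind of car do you drive", "English"),
    ("what kind of music do you like", "English"),
]
model = train(TOY_CORPUS)
```

On this toy data, `classify("what for book is that", model)` comes out as "Dutch" purely because the bigram ("what", "for") only occurs in the Dutch-labelled texts, which is the kind of pattern discovery I have in mind.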

It should be easier than identifying the author's gender, in any case. Apparently, no one claims to guess the author's gender with more than 80% accuracy. I find this unsatisfactory.

(no subject)

Date: 2005-05-03 04:40 pm (UTC)
From: [identity profile] bondage-and-tea.livejournal.com
Here are some tests of your intuitions -- what country did the speaker of this utterance come from?

"This allows to improve the efficiency"

(no subject)

Date: 2005-05-03 04:45 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
"allows to improve"

A lot of languages, really. I think English is exceptional in requiring an explicit subject. AFAIK, it could be Portuguese, French or Dutch... i.e. I can't rule out any languages other than English.

(no subject)

Date: 2005-05-03 04:46 pm (UTC)

(no subject)

Date: 2005-05-03 08:02 pm (UTC)
From: [identity profile] xach.livejournal.com
I have a doubt...what means VAR NOT BOUND in LISP????

(no subject)

Date: 2005-05-03 08:15 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
"what means ... ?"

Again, could be many languages... English is also unique in using auxiliaries in virtually every question.

All I'll say is that most Dutch people speak better English than that.

Btw, I don't have the data or the knowledge to make these judgements in general... I'm only good enough in 4 languages.

(no subject)

Date: 2005-05-03 08:17 pm (UTC)
From: [identity profile] xach.livejournal.com
How about "I have a doubt"? I see that error quite often.

(no subject)

Date: 2005-05-03 08:24 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
hm... I wouldn't call it an error myself. [?] rather than [*].

I can tell you that it corresponds to a frequent expression in Portuguese. For some reason, it's used as often as, or more often than, "I have a question".

It should be possible to ask Google if a literal translation to French or Spanish occurs proportionately more frequently than in English.
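A minimal sketch of that comparison, assuming the hit counts have already been collected by hand (no search API is called here, and the function names are invented for illustration). "Proportionately" is taken to mean the share of the "doubt" phrasing relative to the "question" phrasing within each language.

```python
def share(hits_phrase, hits_alternative):
    """Fraction of the time the phrase is chosen over its alternative."""
    total = hits_phrase + hits_alternative
    if total == 0:
        raise ValueError("no hits for either phrase")
    return hits_phrase / total

def is_proportionately_more_frequent(foreign_doubt, foreign_question,
                                     english_doubt, english_question):
    """True if the 'doubt' phrasing has a larger share in the foreign
    language than 'I have a doubt' has in English."""
    return share(foreign_doubt, foreign_question) > share(english_doubt, english_question)
```

Comparing shares rather than raw counts matters because the amount of indexed text differs a lot between languages.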

(no subject)

Date: 2005-05-03 08:17 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Ok. "to be" isn't always an auxiliary, but virtually every question in English has either "to be", "to do" or "to have".

(no subject)

Date: 2005-05-04 05:21 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
or shall/should or will/would
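The auxiliary observation can be turned into a crude check. This is a sketch under stated assumptions: the word list contains only the verbs named in this thread (forms of "to be", "to do", "to have", plus shall/should and will/would), `lacks_auxiliary` is an invented name, and contractions such as "what's" or other modals such as "can" are not handled.

```python
import re

# Only the auxiliaries mentioned in the thread; deliberately incomplete.
AUXILIARIES = {
    "am", "is", "are", "was", "were", "be", "been", "being",
    "do", "does", "did",
    "have", "has", "had",
    "shall", "should", "will", "would",
}

def lacks_auxiliary(question):
    """True if none of the listed auxiliaries appears in the question."""
    words = re.findall(r"[a-z']+", question.lower())
    return not any(word in AUXILIARIES for word in words)
```

The earlier example, "what means VAR NOT BOUND in LISP????", is flagged by this check, while "What does VAR NOT BOUND mean in LISP?" is not.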

(no subject)

Date: 2005-05-04 05:22 pm (UTC)

(no subject)

Date: 2005-05-04 05:10 pm (UTC)
From: (Anonymous)
Non-nativity does not just make itself manifest in mistakes. Non-native authors often have a better or more stilted use of English grammar than native authors because the former learned the grammar by rules, and the latter by use.
--Sebastian

(no subject)

Date: 2005-05-04 05:20 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Sure.

Usage can also indicate a foreigner. For example, Dutch people often say "is it high time for ...", which is unusual but not incorrect in English (whereas it is usual and correct in Dutch). It's very hard to break away from such patterns... especially when there is no equivalent English expression.

I tend to cringe when I hear the colloquial "dit keer"... it's hard enough to tell de-words from het-words without exceptions to the rule.
