[personal profile] gusl
NLP is a pretty cool research area, but just brainstorming projects makes my head hurt. Seriously, what hasn't been done? All the interesting and novel ideas I can come up with involve an expensive process of collecting/annotating data.

Here's another, which has been done (though only recently):

I'd like to make a classifier to identify the native language of the author of an English text.

A quick googling produced: Oren Tsur, Ari Rappoport - Using Classifier Features for Studying the Effect of Native Language on the Choice of Written Second Language Words
We apply machine learning techniques to study language transfer, a major topic in the theory of Second Language Acquisition (SLA). Using an SVM for the problem of native language classification, we show that a careful analysis of the effects of various features can lead to scientific insights. In particular, we demonstrate that character bigrams alone allow classification levels of about 66% for a 5-class task, even when content and function word differences are accounted for. This may show that native language has a strong effect on the word choice of people writing in a second language.
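
To make the character-bigram idea concrete, here is a minimal sketch of what such features could look like. It is not the paper's method: the abstract says they used an SVM, while this swaps in a much simpler nearest-centroid classifier with cosine similarity so that it runs on the standard library alone. The function names (`char_bigrams`, `train`, `classify`) and the labels in the usage note are hypothetical, chosen only for illustration.

```python
from collections import Counter
import math


def char_bigrams(text):
    """Return the relative frequency of each character bigram in `text`."""
    s = text.lower()
    counts = Counter(s[i:i + 2] for i in range(len(s) - 1))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train(samples):
    """Average the bigram profiles per label.

    `samples` is a list of (text, label) pairs; the result maps each
    label to its centroid profile. (Assumption: a nearest-centroid
    model, NOT the SVM used in the paper.)
    """
    sums, n = {}, Counter()
    for text, label in samples:
        profile = sums.setdefault(label, Counter())
        for bg, freq in char_bigrams(text).items():
            profile[bg] += freq
        n[label] += 1
    return {
        label: {bg: v / n[label] for bg, v in profile.items()}
        for label, profile in sums.items()
    }


def classify(text, centroids):
    """Return the label whose centroid is most similar to `text`."""
    feats = char_bigrams(text)
    return max(centroids, key=lambda label: cosine(feats, centroids[label]))
```

Usage would be something like `centroids = train(labeled_essays)` followed by `classify(new_essay, centroids)`, where `labeled_essays` is a list of (text, native-language label) pairs that you would have to collect yourself. That collection step is exactly the expensive part.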


and to do it from audio: Bouselmi et al - Discriminative phoneme sequence extraction for non-native speaker’s origin classification
The existence of discriminative phone sequences in non-native speech is a significant result of this work. The system that we have developed achieved a significant correct classification rate of 96.3% and a significant error reduction compared to some other tested techniques.

(no subject)

Date: 2008-10-20 07:59 am (UTC)
From: [identity profile] the-locster.livejournal.com
Also consider using the Amazon Mechanical Turk (or similar) to build data sets of human language audio or to get transcripts of existing audio.

Mechanical Turk

Date: 2008-10-20 08:07 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
I just had an enthusiastic email exchange with someone who saw my post and recommends the AMT. It sounds awesome. I wonder what kind of data is supported in and out (text in/out yes, but what about audio? video? eye-tracking seems less feasible).

Are you saying audio out is supported? (Meaning: workers give you recordings of their voice)

Re: Mechanical Turk

Date: 2008-10-20 08:35 am (UTC)
From: [identity profile] the-locster.livejournal.com
To be honest, I'm not sure, but as I understand it, it's a fairly open framework, so I think pretty much anything is possible given a Java applet, plugin, Flash control, etc.
