[personal profile] gusl
NLP is a pretty cool research area, but just brainstorming projects makes my head hurt. Seriously, what hasn't been done? All the interesting and novel ideas I can come up with involve an expensive process of collecting/annotating data.

Here's another idea, which has been done (though only recently):

I'd like to make a classifier to identify the native language of the author of an English text.

A quick googling produced: Oren Tsur, Ari Rappoport - Using Classifier Features for Studying the Effect of Native Language on the Choice of Written Second Language Words
We apply machine learning techniques to study language transfer, a major topic in
the theory of Second Language Acquisition (SLA). Using an SVM for the problem of
native language classification, we show that a careful analysis of the effects of various
features can lead to scientific insights. In particular, we demonstrate that character bigrams
alone allow classification levels of about 66% for a 5-class task, even when content
and function word differences are accounted for. This may show that native language
has a strong effect on the word choice of people writing in a second language.
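As a toy sketch of the setup the abstract describes (an SVM over character bigrams for native-language classification), something like the following would work with scikit-learn. The example sentences and labels below are invented placeholders, not the paper's actual corpus or feature pipeline.

```python
# Hedged sketch: SVM over character bigrams, in the spirit of
# Tsur & Rappoport. Texts/labels are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical English sentences by L1-French vs. L1-Spanish writers.
texts = [
    "I am agree with the proposition of my colleague.",
    "We discussed during one hour about the planning.",
    "He has thirty years and works since five years here.",
    "I have very much hunger after the reunion.",
]
labels = ["FR", "FR", "ES", "ES"]

# Character bigrams only, mirroring the paper's single feature type.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 2)),
    LinearSVC(),
)
clf.fit(texts, labels)

pred = clf.predict(["She explained me the problem with many details."])[0]
print(pred)  # one of the trained classes
```

With a real corpus you would of course need many documents per native-language class, and (as the paper stresses) controls for content and function words so the classifier isn't just picking up topic differences.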


and to do it from audio: Bouselmi et al - Discriminative phoneme sequence extraction for non-native speaker’s origin classification
The existence of discriminative phone sequences in non-native speech is a significant result of this work. The system that we have developed achieved a significant correct classification rate of 96.3% and a significant error reduction compared to some other tested techniques.

(no subject)

Date: 2008-10-20 07:53 am (UTC)
From: [identity profile] the-locster.livejournal.com
When it comes to building large data sets, I tend to think that imaginatively sifting stuff from the internet is the way to go. E.g. when the Microsoft Photosynth team wanted to recreate a 3D model of St Mark's Square in Venice, they didn't send out a team of photographers; they just searched Flickr. Of course this means trimming some branches from your ideas tree, but also adding a lot.

Some existing machine learning data sets to consider:
http://archive.ics.uci.edu/ml/datasets.html
http://www.kdnuggets.com/datasets/

(no subject)

Date: 2008-10-20 07:59 am (UTC)
From: [identity profile] the-locster.livejournal.com
Also consider using the Amazon Mechanical Turk (or similar) to build data sets of human language audio or to get transcripts of existing audio.

Mechanical Turk

Date: 2008-10-20 08:07 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
I just had an enthusiastic email exchange with someone who saw my post and recommends the AMT. It sounds awesome. I wonder what kind of data is supported in and out (text in/out yes, but what about audio? video? eye-tracking seems less feasible).

Are you saying audio out is supported? (Meaning: workers give you recordings of their voice)

Re: Mechanical Turk

Date: 2008-10-20 08:35 am (UTC)
From: [identity profile] the-locster.livejournal.com
To be honest, I'm not sure, but as I understand it, it's a fairly open framework, so I think pretty much anything is possible given a Java applet, plugin, Flash control, etc.

(no subject)

Date: 2008-10-20 03:19 pm (UTC)
From: [identity profile] altamira16.livejournal.com
I believe that what you are asking requires spontaneous speech. If I recall correctly, the CU Accent Corpus had one minute segments of spontaneous speech by speakers from various backgrounds. You may want to email Dr. John H.L. Hansen at UT Dallas and ask him how one goes about getting access to this corpus. It may cost something, but I am not sure about the details.
