computational linguistics
Dec. 29th, 2009 04:02 am
The great thing about computational linguistics (in particular: text rather than speech) is that it's very easy to come up with research questions that can be answered by doing (often simple) statistics on large corpora, e.g.:
* in checklists, people don't always end their sentences with a period or other closing punctuation. What grammatical structures tend to be closed off with explicit punctuation? (a toy sketch follows below)
* when do bloggers complain the most about their partners / their bosses? What are the correlates with company earnings and unemployment rates?
* how does one's writing reflect one's linguistic (or cognitive) impairments (e.g. in aphasics or L2 speakers)? How much insight can you get into someone's mind from their writings?
* what can you predict about other data sources (e.g. stock prices, movie ratings) based on newspaper text?
* find correlates of font choice
(and if you're getting people to type for you, keylogger data can be cognitively much more interesting! Perhaps as interesting as eye-tracking data.)
The not-so-great thing is that shallow approaches don't work for everything (although they can be surprisingly good!) and annotations can be expensive (though Mechanical Turk is making this a lot cheaper).
Having said that, I'm simply more interested in statistics: theory, methodology, modeling and algorithmics. And although engineering can be lots of fun, it can also be a pain to use other people's tools (lemmatizers, parsers, POS taggers, etc.) or to hack up your own.
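To give a flavour of how simple such statistics can be, here is a toy sketch of the first question above (Python; the corpus file name is made up, and saying anything real about grammatical structure would need a tagger or parser on top):

import re
from collections import Counter

# Toy version of the checklist question: in a (hypothetical) file with one
# checklist item per line, how often do items end with explicit punctuation,
# and which final words show up in each group?
ends_punct = Counter()
final_words = {True: Counter(), False: Counter()}

with open("checklist_items.txt") as f:   # hypothetical corpus file
    for line in f:
        item = line.strip()
        if not item:
            continue
        closed = bool(re.search(r"[.!?]$", item))
        ends_punct[closed] += 1
        words = re.findall(r"[A-Za-z']+", item.lower())
        if words:
            final_words[closed][words[-1]] += 1

total = sum(ends_punct.values())
print(f"{ends_punct[True] / total:.1%} of items end with . ! or ?")
print("most common final words, punctuated:", final_words[True].most_common(5))
print("most common final words, unpunctuated:", final_words[False].most_common(5))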
(no subject)
Date: 2009-12-29 03:50 pm (UTC)
Nekkid, I've worked on this one professionally, that is, predicting stock prices based on news sources. It is an interesting idea, but it is not dealing with fundamental data, which would be the stock price. In some sense, the information contained in the news source is also contained in the stock price, except the stock price is quite a bit more reliable.
Where I'm currently at, we have a product that uses a data source roughly equivalent to news, because we don't have access to fundamental data. It is better than nothing, but as often as not the signal you're extracting is so small that it is easy to lose in the noise.
(no subject)
Date: 2009-12-29 04:19 pm (UTC)
Nekkid?
<< It is an interesting idea, but it is not dealing with fundamental data, which would be the stock price. In some sense, the information contained in the news source is also contained in the stock price, except the stock price is quite a bit more reliable.>>
You should ideally use both.
The interesting question is whether the text data contains additional information, i.e. is the newspaper worth reading for a computer?
if "current stock prices" d-separates "text data" from "future stock prices", then text data is useless, or to take another perspective: text data can't help you predict the residual of the autoregression. (To my knowledge, all statistical models for a single-time series can be viewed as some type of autoregression)
(no subject)
Date: 2009-12-29 04:56 pm (UTC)
(no subject)
Date: 2009-12-29 05:36 pm (UTC)
<< If they are correlated, you only need the stock price >>
There are lots of situations in which you've measured two variables that are correlated with each other, and yet together they are more effective at prediction than either one on its own.
The simplest example I can think of is multiple noisy measurements of X:
Y1 = X + N(0,1)
Y2 = X + N(0,1)
Goal: predict X from Y1, Y2.
Y1 and Y2 are correlated with each other and with X. In this particular case, it is sufficient to average the Ys (assuming the above measurement model is known to be correct).
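A quick numerical check of that example (a throwaway Python/numpy sketch; the sample size and seed are arbitrary): even though Y1 and Y2 are correlated with each other, using both roughly halves the squared error of using either one alone.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two noisy measurements of the same X, as in the example above.
x = rng.normal(size=n)
y1 = x + rng.normal(size=n)
y2 = x + rng.normal(size=n)

# Mean squared error of predicting X from one measurement vs. from their average.
mse_one = np.mean((y1 - x) ** 2)             # ~1.0
mse_avg = np.mean(((y1 + y2) / 2 - x) ** 2)  # ~0.5
print(f"MSE using Y1 alone: {mse_one:.2f}, using the average: {mse_avg:.2f}")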
Also, if I may be pedantic, please try to avoid "correlated". What you mean is "dependent". The two terms are only equivalent when the variables are jointly Gaussian.
(no subject)
Date: 2010-01-04 02:13 pm (UTC)
Your example helped, however.
(no subject)
Date: 2010-01-04 09:08 pm (UTC)
I didn't provide an explanation for that. Uncorrelated just means correlation = 0. Here's an example: imagine a cloud of points that is parabola-shaped, so that it starts with a positive trend and ends on a negative trend; the correlation will be 0. However, knowing X still tells you something about Y, so they are not independent.
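A quick numerical version of that cloud (a throwaway Python/numpy sketch; the exact shape and noise level are arbitrary choices, not from the comment):

import numpy as np

rng = np.random.default_rng(2)

# A parabola-shaped cloud: X symmetric about 0, Y rising then falling in X.
x = rng.uniform(-1, 1, size=100_000)
y = 1 - x ** 2 + 0.05 * rng.normal(size=100_000)

# The correlation is (essentially) zero ...
print(f"corr(X, Y) = {np.corrcoef(x, y)[0, 1]:.3f}")

# ... but X still tells you something about Y: Y's conditional mean depends
# on |X|, so X and Y are dependent.
print(f"mean of Y when |X| < 0.2: {y[np.abs(x) < 0.2].mean():.3f}")
print(f"mean of Y when |X| > 0.8: {y[np.abs(x) > 0.8].mean():.3f}")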
As for conditional independence, you would first need to understand the concepts of:
* conditional distribution
* independence
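For reference (standard notation, not from the original comment), once you have those two pieces the definition reads:

X \perp Y \mid Z \quad\Longleftrightarrow\quad p(x, y \mid z) = p(x \mid z)\, p(y \mid z) \quad \text{for all } z \text{ with } p(z) > 0

i.e. once Z is known, learning Y tells you nothing further about X.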