[personal profile] gusl
While [livejournal.com profile] marymcglo was driving us back from the Machine Learning picnic last Saturday, we somehow came up with some empirical questions that were difficult to answer objectively. For example:

"Are low-income people more likely to marry early?"

i.e. the kind of demographic questions that economists are interested in.

We have two kinds of data available:
* Census data
* Marriage records

Integrating these two to answer our question is not trivial.

For one thing, census data is anonymous. Also, if you don't have access to microdata (i.e. individual-level records), then all you get are distributions conditioned on a single variable such as "gender", "age group", "race" or "marital status". In particular, you can't condition on more than one variable at a time. In situations like this, one trick is to ask a different question:

"Are people in low-income counties more likely to marry early?"

whose answer can be used to answer our original question, but only if we buy an independence assumption, namely that people in low-income counties are representative of low-income people in general. In other words, we have to assume that the bias introduced by this substitution is small. Economists use such tricks all the time.
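To see why the assumption matters, here is a toy illustration in Python. All the counts are made up for the sake of the example (they are not census figures), and the county names `A` and `B` are hypothetical. The county-level comparison and the individual-level comparison are computed from the same records, and in this contrived case they point in opposite directions:

```python
# Made-up counts, NOT real census data: (county, low_income, married_early, count)
records = [
    ("A", True, True, 2), ("A", True, False, 4),
    ("A", False, True, 3), ("A", False, False, 1),
    ("B", True, True, 0), ("B", True, False, 2),
    ("B", False, True, 2), ("B", False, False, 6),
]


def early_marriage_rate(rows):
    """Fraction of people in `rows` who married early."""
    total = sum(count for _, _, _, count in rows)
    early = sum(count for _, _, married_early, count in rows if married_early)
    return early / total


# What aggregate data lets us see: one rate per county.
# County A is the "low-income county" (6 of 10 residents are low-income,
# versus 2 of 10 in county B).
county_rate = {
    county: early_marriage_rate([r for r in records if r[0] == county])
    for county in ("A", "B")
}

# What we actually want, and what only microdata would give us:
# one rate per individual income level.
income_rate = {
    low_income: early_marriage_rate([r for r in records if r[1] == low_income])
    for low_income in (True, False)
}

print("by county:", county_rate)
print("by individual income:", income_rate)
```

In this toy data the low-income county has the higher early-marriage rate, yet low-income individuals have the lower one. So the county-level answer is only as good as the representativeness assumption behind it.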

The methodologist in me wants to create a formal language for querying all this demographic data, while making these economists' tricks explicit. Once we have such a language, some logical questions are:
* what class of questions can be answered by our data?
* what questions need extra assumptions to be answered by our data?

Using this language, you would ask the reasoning engine a particular question, and it would come back offering you a choice of assumptions that could be used to answer the question. It is up to you to decide whether and how much you believe each of these assumptions. The more often an assumption gets accepted, the higher its prior gets: this way, the system formalizes what assumptions are considered "common-sense".
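A minimal sketch of that interaction loop, in Python. Everything here is hypothetical: the names `Assumption`, `Rewrite` and `offer` are invented for illustration, and the prior update rule (a Laplace-smoothed acceptance rate) is just one assumed way of making "accepted more often" translate into "higher prior".

```python
from dataclasses import dataclass


@dataclass
class Assumption:
    """A bridging assumption the user may accept or reject (hypothetical type)."""
    text: str
    accepted: int = 0
    offered: int = 0

    @property
    def prior(self) -> float:
        # Assumed update rule: Laplace-smoothed acceptance rate.
        # Starts at 0.5 and moves toward the observed acceptance frequency.
        return (self.accepted + 1) / (self.offered + 2)

    def record(self, was_accepted: bool) -> None:
        """Log one user decision about this assumption."""
        self.offered += 1
        if was_accepted:
            self.accepted += 1


@dataclass
class Rewrite:
    """An answerable question plus the assumption needed to substitute it."""
    answerable_question: str
    assumption: Assumption


def offer(question, rewrites):
    """Return candidate rewrites for a question, most-trusted assumption first."""
    candidates = rewrites.get(question, [])
    return sorted(candidates, key=lambda r: r.assumption.prior, reverse=True)


representative = Assumption(
    "People in low-income counties are representative of low-income people in general."
)
rewrites = {
    "Are low-income people more likely to marry early?": [
        Rewrite("Are people in low-income counties more likely to marry early?",
                representative),
    ],
}

options = offer("Are low-income people more likely to marry early?", rewrites)
prior_before = representative.prior

# Simulate four users: three accept the assumption, one rejects it.
for decision in (True, True, False, True):
    representative.record(decision)
prior_after = representative.prior
```

A real engine would have to derive the rewrites from the structure of the data rather than look them up in a hand-written table; the table only stands in for that step.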

This is also a semantic-web-ish idea. For example, your question might talk about concepts that are not explicitly talked about in the data, but only indirectly so (there is a gap between your question and the data). Or you might have semantic interoperability issues between your data sets (the gap is inside the data).

Finally, I would like to create a Library of Formalized Economic Arguments. I don't know if anyone else is interested in this. While many economists seem to be interested in methodological issues, I don't know any who would like to take this to a foundational level.

P.S.: I haven't even mentioned causal inference yet.

---

Census Microdata:

Uses of Microdata
Most population data - especially historical census data - have traditionally been available only in aggregated tabular form. The IPUMS is microdata, which means that it provides information about individual persons and households. This makes it possible for researchers to create tabulations tailored to their particular questions. Since the IPUMS includes nearly all the detail originally recorded by the census enumerations, users can construct a great variety of tabulations interrelating any desired set of variables. The flexibility offered by microdata is particularly important for historical research because the aggregate tabulations produced by the Census Bureau are often not comparable across time, and until recently the subject coverage of census publications was limited.
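As a rough illustration of what "tabulations tailored to their particular questions" buys you, here is a Python sketch that conditions on two variables at once. The records and field names (`age_group`, `income`, `married`) are invented for the example and do not reflect the actual IPUMS variable names or file format.

```python
# Made-up individual-level records; field names are hypothetical.
people = [
    {"age_group": "20-24", "income": "low", "married": True},
    {"age_group": "20-24", "income": "low", "married": True},
    {"age_group": "20-24", "income": "low", "married": False},
    {"age_group": "20-24", "income": "high", "married": False},
    {"age_group": "20-24", "income": "high", "married": False},
    {"age_group": "20-24", "income": "high", "married": True},
    {"age_group": "20-24", "income": "high", "married": False},
    {"age_group": "25-29", "income": "low", "married": True},
    {"age_group": "25-29", "income": "low", "married": False},
    {"age_group": "25-29", "income": "high", "married": True},
    {"age_group": "25-29", "income": "high", "married": True},
    {"age_group": "25-29", "income": "high", "married": False},
]


def p_married(rows, **conditions):
    """Estimate P(married | conditions) from individual records.

    Returns None when no record matches the conditions.
    """
    matching = [
        row for row in rows
        if all(row[key] == value for key, value in conditions.items())
    ]
    if not matching:
        return None
    return sum(row["married"] for row in matching) / len(matching)


# Condition on age group AND income together, which a published
# one-variable table would not allow.
young_high = p_married(people, age_group="20-24", income="high")
older_low = p_married(people, age_group="25-29", income="low")
```

With only aggregate tables, each of these cells would have to be guessed from the one-variable marginals.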


Date: 2006-09-14 05:53 pm (UTC)
From: [identity profile] radiantsun.livejournal.com
AHHH!!! So what's the answer? Are people in low income counties more likely to marry early????

Also I like your points about what data to bring up.


Date: 2006-09-14 08:44 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
If I knew the answer, I might not have gotten inspired to write this post.


Date: 2006-09-15 05:32 am (UTC)
From: [identity profile] combinator.livejournal.com
The microdata will let you ask questions such as: in a specific age group and at a specific income level, what is the probability that you are married? The fact that it's only 5% of the data doesn't matter much. I think the problem is that you need to trace people over time. When people get married, they may not make much money, but they will make more later (for example, if they get married while still in school). But I think the anonymity of public records won't let you do that.


Date: 2006-09-15 05:40 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
The fact that people get lumped into age groups would make our estimates of age-of-marriage pretty coarse.


Date: 2006-09-15 11:39 pm (UTC)
From: [identity profile] trufflesniffer.livejournal.com
I think this is the sort of question things like the BHPS are designed to help solve. (I'm not sure what the US version of it is.)
