gusl | Entries tagged with information

While marymcglo was driving us back from the Machine Learning picnic last Saturday, we somehow came up with some empirical questions that were difficult to answer objectively. For example:

"Are low-income people more likely to marry early?"

i.e. the kind of demographic questions that economists are interested in.

We have two kinds of data available:
* Census data
* Marriage records

Integrating these two to answer our question is not trivial.

For one thing, census data is anonymous. Also, if you don't have access to microdata (i.e. individual data points), then all you get are distributions conditioned on variables like "gender", "age group", "race" or "marital status". In particular, you can't condition on more than one variable at a time. In situations like this, one trick is to ask a different question:

"Are people in low-income counties more likely to marry early?"

whose answer can be used to answer our original question, but only if we buy an independence assumption, namely that people in low-income counties are representative of low-income people in general. In other words, we have to assume that the bias is small. Economists use such tricks all the time.
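To see what is at stake in that assumption, here is a minimal sketch with made-up numbers. The two counties, the income groups and every count are hypothetical, invented only for illustration. County "X" is mostly low-income and county "Y" is mostly high-income. The county-level comparison and the individual-level comparison point in opposite directions, which is exactly the bias we would be assuming away.

```python
from fractions import Fraction

# Hypothetical counts, invented for illustration:
# (county, income_group, people, married_early)
cells = [
    ("X", "low", 80, 32),   # county X is 80% low-income
    ("X", "high", 20, 18),
    ("Y", "low", 20, 2),    # county Y is 80% high-income
    ("Y", "high", 80, 20),
]

def early_rate_by(cells, field):
    """Early-marriage rate grouped by field 0 (county) or field 1 (income)."""
    people, early = {}, {}
    for cell in cells:
        key = cell[field]
        people[key] = people.get(key, 0) + cell[2]
        early[key] = early.get(key, 0) + cell[3]
    return {key: Fraction(early[key], people[key]) for key in people}

# The question we can answer from county-level aggregates.
by_county = early_rate_by(cells, 0)

# The question we actually care about, which needs the joint counts.
by_income = early_rate_by(cells, 1)

print(by_county)  # low-income county X has the higher rate
print(by_income)  # yet low-income people have the lower rate
```

In this toy data the low-income county has an early-marriage rate of 1/2 against 11/50, while low-income individuals have a rate of 17/50 against 19/50 for high-income individuals. Real data may well behave better, but nothing in the aggregates alone tells you so.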

The methodologist in me wants to create a formal language for querying all this demographic data, while making these economists' tricks explicit. Once we have such a language, some logical questions are:
* what class of questions can be answered by our data?
* what questions need extra assumptions to be answered by our data?

Using this language, you would ask the reasoning engine a particular question, and it would come back offering you a choice of assumptions that could be used to answer the question. It is up to you to decide whether and how much you believe each of these assumptions. The more often an assumption gets accepted, the higher its prior gets: this way, the system formalizes what assumptions are considered "common-sense".
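A very rough sketch of that interaction loop might look like the following. Everything here is hypothetical: the `Assumption` and `Engine` names, the idea of storing rewrites as a lookup table, and the Laplace-smoothed acceptance rate used as the "prior" are my own illustrative choices, not a worked-out design. A real engine would derive rewrites from a formal language rather than from a hand-written dictionary.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass
class Assumption:
    text: str
    offered: int = 0
    accepted: int = 0

    @property
    def prior(self):
        # Assumed update rule: Laplace-smoothed acceptance rate, starting at 1/2.
        return Fraction(self.accepted + 1, self.offered + 2)

    def record(self, was_accepted):
        """Record one user decision about this assumption."""
        self.offered += 1
        if was_accepted:
            self.accepted += 1

class Engine:
    def __init__(self, answerable, rewrites):
        self.answerable = set(answerable)  # questions the data answers directly
        self.rewrites = rewrites           # question -> [(proxy question, Assumption)]

    def ask(self, question):
        if question in self.answerable:
            return "direct", []
        options = [
            (proxy, assumption)
            for proxy, assumption in self.rewrites.get(question, [])
            if proxy in self.answerable
        ]
        if not options:
            return "unanswerable", []
        # Offer the most commonly accepted assumptions first.
        options.sort(key=lambda option: option[1].prior, reverse=True)
        return "needs_assumption", options

original = "Are low-income people more likely to marry early?"
proxy = "Are people in low-income counties more likely to marry early?"
representative = Assumption(
    "People in low-income counties are representative of low-income people in general."
)
engine = Engine(answerable={proxy}, rewrites={original: [(proxy, representative)]})

status, options = engine.ask(original)
prior_before = representative.prior
representative.record(True)  # the user decides to buy the assumption
prior_after = representative.prior
```

Here the engine cannot answer the original question directly, so it offers the county-level proxy together with the assumption that licenses it. Accepting the assumption once moves its prior from 1/2 to 2/3 under the smoothing rule assumed above.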

This is also a semantic-web-ish idea. For example, your question might talk about concepts that are not explicitly talked about in the data, but only indirectly so (there is a gap between your question and the data). Or you might have semantic interoperability issues between your data sets (the gap is inside the data).

Finally, I would like to create a Library of Formalized Economic Arguments. I don't know if anyone else is interested in this. While many economists seem to be interested in methodological issues, I don't know any who would like to take this to a foundational level.

P.S.: I didn't even mention causal inferences yet.

---

Census Microdata:

Uses of Microdata
Most population data - especially historical census data - have traditionally been available only in aggregated tabular form. The IPUMS is microdata, which means that it provides information about individual persons and households. This makes it possible for researchers to create tabulations tailored to their particular questions. Since the IPUMS includes nearly all the detail originally recorded by the census enumerations, users can construct a great variety of tabulations interrelating any desired set of variables. The flexibility offered by microdata is particularly important for historical research because the aggregate tabulations produced by the Census Bureau are often not comparable across time, and until recently the subject coverage of census publications was limited.
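As a small illustration of what "tabulations tailored to their particular questions" buys you, here is a sketch that cross-tabulates person-level records over any set of variables. The records and field names (`age_group`, `income`, `marital_status`) are invented for this example and are not real IPUMS variable names.

```python
from collections import Counter

# Hypothetical person-level records, invented for illustration.
records = [
    {"age_group": "18-24", "income": "low", "marital_status": "married"},
    {"age_group": "18-24", "income": "low", "marital_status": "single"},
    {"age_group": "18-24", "income": "high", "marital_status": "single"},
    {"age_group": "25-34", "income": "low", "marital_status": "married"},
    {"age_group": "25-34", "income": "high", "marital_status": "married"},
    {"age_group": "25-34", "income": "high", "marital_status": "single"},
]

def tabulate(records, *variables):
    """Count records by each combination of the given variables."""
    return Counter(tuple(record[v] for v in variables) for record in records)

# A one-variable table, the kind published aggregates already give you.
by_income = tabulate(records, "income")

# A two-variable table, which needs the individual records.
by_income_and_status = tabulate(records, "income", "marital_status")

# Restricting to a subgroup first conditions on a third variable.
young = [r for r in records if r["age_group"] == "18-24"]
young_by_income_and_status = tabulate(young, "income", "marital_status")

print(by_income_and_status)
```

With only the one-variable tables you could not recover the two-variable one, which is why the original question about income and early marriage is hard to answer from published aggregates.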


Gustavo Lacerda

demographic data: inductive methodologies, query languages & information retrieval
