gusl: (Default)
This is a paper draft that is probably more worthy of being a blog post, so I decided to blogize it. Here we go:

Abstract:
It is well-known that Independent Component Analysis (ICA) is not identifiable when more than one source is Gaussian (Comon 1993). However, the question of just how identifiable it remains is rarely asked. In this paper, we show a generalization of the identifiability result, namely that the number of degrees of freedom in the observational equivalence class of ICA is k-1, where k is the number of Gaussian sources.

However, since the sources are latent variables, k is unknown and needs to be estimated. We frame this as a problem of "hyper-equatorial regression" (i.e. finding the hyper-great-circle that best fits the resampled points on the hypersphere), which we solve by running PCA and discarding the components beyond the knee of the eigenspectrum.

Having estimated the Gaussian subspace, we can then project it out of the data (i.e. project the data onto its orthogonal complement), and now we have a fully identifiable ICA.

We conclude by presenting a simple modification of the FastICA algorithm that returns only the identifiable components, by exiting early once no further non-Gaussian components can be found. This is efficient because it avoids the bootstrap and clustering steps.


Independent component analysis is a statistical technique for estimating the mixing matrix A in equations of the form x = A s, where x is observed while s and A are not. The components of s are usually called the "sources".

The matrix A can be identified up to scaling and column-permutations as long as the observed distribution is a linear, invertible mixture of independent components, at most one of which is Gaussian.
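
A minimal sketch of this setup (my own toy example, assuming NumPy and scikit-learn's FastICA): two non-Gaussian sources are mixed by a random A, and ICA recovers them up to scaling and permutation.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 5000
# two non-Gaussian sources (Laplace and uniform), so A is identifiable
s = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = rng.normal(size=(2, 2))            # unknown mixing matrix
x = A @ s                              # observations, x = A s

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x.T).T       # estimated sources (up to scale/permutation)
A_hat = ica.mixing_                    # estimated mixing matrix
```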

This note is concerned with the unidentifiability that results when more than one source is Gaussian, in which case, rather than a single mixing matrix, there is a whole family of mixing matrices that lead to the observed joint distribution in the large-sample limit.

The observational equivalence class of ICA (modulo scaling and permutation) is the set of mixing matrices that can be obtained by applying an invertible linear transformation to the Gaussian components while keeping the non-Gaussian components fixed. If we order the components so that the Gaussian ones come first, the observational equivalence class of any mixing matrix A can be written as the set {AM | M is the identity matrix except for a k-by-k block on the top left, which is a general invertible matrix}.
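
In symbols (my notation, writing d for the total number of sources and ordering the k Gaussian ones first):

\[
A' = A M, \qquad
M = \begin{pmatrix} G & 0 \\ 0 & I_{d-k} \end{pmatrix}, \qquad
G \in \mathrm{GL}(k),
\]

so M mixes the k Gaussian columns of A among themselves and leaves the d-k non-Gaussian columns untouched.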

If we have 2 Gaussians among a large number of sources, ICA is far from useless: rather, all but 2 components can be identified, and the 2 unidentifiable ones correspond to the two Gaussian sources. The observational equivalence class has one degree of freedom, as can be seen in (b) below.

Now we illustrate the behavior of the ICA statistic on an example with 3 sources, by resampling and plotting the directions of the resulting components on the unit sphere:



As we can see in Fig. (b), the two components that lie on the circle can rotate around an axis. The next figure provides a 3D plot of this.
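
A rough sketch of how such a plot can be generated (hypothetical code, assuming scikit-learn's FastICA; the plotted points are the unit-normalized columns of each bootstrap run's estimated mixing matrix):

```python
import numpy as np
from sklearn.decomposition import FastICA

def resampled_directions(x, n_boot=200, seed=0):
    """Bootstrap the data, run ICA on each resample, and collect the
    unit-norm component directions (points on the unit sphere)."""
    rng = np.random.default_rng(seed)
    d, n = x.shape
    dirs = []
    for b in range(n_boot):
        idx = rng.integers(n, size=n)               # resample columns with replacement
        ica = FastICA(n_components=d, random_state=b).fit(x[:, idx].T)
        for col in ica.mixing_.T:                   # one direction per component
            dirs.append(col / np.linalg.norm(col))  # sign is arbitrary (antipodal points)
    return np.array(dirs)   # rows are points on the sphere; 3D-plot them for d = 3
```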



ToDo: implement proper clustering, modify FastICA, write a proper report
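
For the FastICA item, here is a rough sketch of the intended modification as I would implement it (a deflation-style FastICA with a tanh contrast that stops once the next extracted direction already looks Gaussian, here judged by a normality test; a hypothetical sketch, not the final algorithm):

```python
import numpy as np
from scipy import stats

def fastica_identifiable(X, alpha=0.01, max_iter=200, tol=1e-6, seed=0):
    """Deflation FastICA that exits early once the next extracted component
    already looks Gaussian -- so only identifiable components are returned."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=1, keepdims=True)
    evals, E = np.linalg.eigh(np.cov(Xc))
    K = E @ np.diag(1.0 / np.sqrt(np.clip(evals, 1e-12, None))) @ E.T  # whitening matrix
    Z = K @ Xc
    d = Z.shape[0]
    W = []                                          # unmixing rows found so far (whitened space)
    for _ in range(d):
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            wz = np.tanh(w @ Z)
            w_new = (Z * wz).mean(axis=1) - (1.0 - wz**2).mean() * w   # FastICA update, g = tanh
            for v in W:                             # deflate against already-found components
                w_new -= (w_new @ v) * v
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < tol
            w = w_new
            if converged:
                break
        s = w @ Z
        if stats.normaltest(s).pvalue > alpha:      # looks Gaussian: stop extracting
            break
        W.append(w)
    return np.array(W).reshape(len(W), d) @ K       # unmixing rows for the identifiable part
```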

Having run a clustering procedure to eliminate the tight clusters, we use PCA to find the k-plane around which the points on the ring lie. This gives us the Gaussian subspace.
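
A sketch of this step under my own assumptions (PCA without mean-centering, since a hyper-great-circle's plane passes through the origin, and taking the knee to be the largest relative drop in the eigenspectrum):

```python
import numpy as np

def gaussian_subspace(ring_points):
    """PCA on the resampled directions that survive the cluster-removal step;
    keep the leading components up to the knee of the eigenspectrum.
    ring_points: array of shape (n_points, d), rows on the unit sphere."""
    # no centering: the great circle's plane passes through the origin
    # (and ICA's sign ambiguity scatters points antipodally anyway)
    evals, evecs = np.linalg.eigh(ring_points.T @ ring_points)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    drops = evals[:-1] / np.maximum(evals[1:], 1e-12)
    k_hat = int(np.argmax(drops)) + 1              # number of components before the knee
    return evecs[:, :k_hat]                        # orthonormal basis of the Gaussian subspace

def remove_gaussian_part(x, B):
    """Project the data onto the orthogonal complement of the Gaussian subspace."""
    return x - B @ (B.T @ x)
```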

Thanks to Ives Macedo for discussions.
gusl: (Default)
I think it is well-known that news media copy stories from each other (presumably with enough changes not to seem too blatant*, though the typical lack of citations would nevertheless suffice to get one hanged for plagiarism in any academic field). This process can be described as a DAG in which the nodes are stories and the arrows represent copying, and only some nodes are observed (printed). If you abstract over the rewordings, a diagram displaying merges and splits might not look that different from a version control graph.
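
As a toy sketch of that representation (made-up story IDs, assuming networkx), tracing an item's history is just an ancestor query on the DAG:

```python
import networkx as nx

# Stories as nodes, copying as edges; only some nodes are "observed" (printed).
G = nx.DiGraph()
G.add_edges_from([("wire_report", "paper_A"), ("wire_report", "paper_B"),
                  ("paper_A", "blog_C"), ("paper_B", "blog_C")])   # a merge at blog_C
observed = {"paper_A", "paper_B", "blog_C"}                        # wire_report is latent

# The history of one item is its set of ancestors in the DAG.
print(nx.ancestors(G, "blog_C"))   # {'wire_report', 'paper_A', 'paper_B'}
```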

Query: What aspects of these networks are public knowledge, and to what extent can we trace the history of an individual news item?

Have people developed tools to support this kind of "news forensics"? My hope is that by figuring out who the original sources are and what they actually said, we can cancel out the effect of the replicator dynamics (what you see is what sells), and thus get more objective information.



* - If this were a single (rather than a developing) piece of news, and if everyone were honest, one would expect the non-blatant copying to make the news read more and more awkwardly with each retelling. In reality, I suspect they rephrase things without regard for the facts (and often towards sensationalism).
gusl: (Default)
The great thing about computational linguistics (in particular: text rather than speech) is that it's very easy to come up with research questions that can be answered by doing (often simple) statistics on large corpora, e.g.:

* in checklists, people don't always end their sentences with a period or other punctuation mark. What grammatical structures tend to be closed off with explicit punctuation? (A toy version of this count is sketched after the list below.)
* when do bloggers complain the most about their partners / their bosses? How does this correlate with company earnings or unemployment rates?
* how does one's writing reflect one's linguistic (or cognitive) impairments (e.g. aphasics or L2 speakers)? How much insight can you get into someone's mind from their writings?
* what can you predict about other data sources (e.g. stock prices, movie ratings) based on newspaper text?
* find correlates of font choice
(and if you're getting people to type for you, keylogger data can be cognitively much more interesting! Perhaps as interesting as eye-tracking data.)
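
Here is the toy sketch promised in the first bullet (my own assumptions: treat lines starting with a bullet or a number as checklist items, and count how many end with explicit punctuation):

```python
import re

def punctuation_rate(lines):
    """Fraction of checklist-style lines that end with explicit punctuation."""
    items = [l.rstrip() for l in lines
             if re.match(r"\s*([-*\u2022]|\d+[.)])\s+", l)]   # bullets or "1." / "1)"
    if not items:
        return 0.0
    closed = sum(1 for l in items if l.endswith((".", "!", "?", ";")))
    return closed / len(items)

# usage (hypothetical corpus file):
# punctuation_rate(open("some_corpus.txt", encoding="utf-8"))
```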

The not-so-great thing is that shallow approaches don't work for everything (although they can be surprisingly good!) and annotations can be expensive (though Mechanical Turk is making this a lot cheaper).

Having said that, I'm simply more interested in statistics: theory, methodology, modeling and algorithmics. And although engineering can be lots of fun, it can also be a pain to use other people's tools (lemmatizers, parsers, POS taggers, etc.) or to hack up your own.
gusl: (Default)
According to the sources I've read, Gibbs sampling doesn't really tell you how to sample... it's just divide-and-conquer: you still need to sample each variable from its full conditional.

In my problem, that would be more trouble than it's worth, since no denominators cancel. So I'm doing good old Metropolis-Hastings, with a proposal that makes the trajectory resemble Gibbs: it randomly chooses one variable to modify, and then randomly chooses a value for that variable. In other words, the proposal is uniform over the neighbors in the "generalized hypercube"*.
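
A minimal sketch of that sampler (hypothetical code; log_post stands for whatever unnormalized log-posterior the model defines). Since the proposal is symmetric, the acceptance ratio reduces to a ratio of posteriors:

```python
import numpy as np

def mh_gibbs_like(log_post, x0, k, n_iter=10_000, seed=0):
    """Metropolis-Hastings with a Gibbs-like proposal: pick one coordinate
    uniformly at random, then propose a uniformly chosen *different* value
    for it (i.e. a uniform neighbor in the generalized hypercube)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=int)
    lp = log_post(x)
    samples = []
    for _ in range(n_iter):
        i = rng.integers(len(x))
        y = x.copy()
        y[i] = (x[i] + 1 + rng.integers(k - 1)) % k     # uniform over the other k-1 values
        lp_y = log_post(y)
        # symmetric proposal => accept with probability min(1, posterior ratio)
        if np.log(rng.random()) < lp_y - lp:
            x, lp = y, lp_y
        samples.append(x.copy())
    return samples
```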

I can easily compute the posterior of any given model. In traditional MCMC, I think you would weight the samples by how often they appear. But doesn't it make more sense to directly compute the posterior of each sampled model?

Should I be using MCMC at all?
How else am I going to find the set of high-probability models? (Maybe what I want is a mode-oriented stochastic search, as my EA project did.)

Believe it or not, it's my first time actually implementing MCMC.

Also, I'd like to know what proportion of the posterior mass my sampled models account for... but this is probably VeryHard to do if I can't enumerate the models (otherwise we could know whether the chain has mixed).



* - what do you call a non-binary hypercube? I mean: let each node be a string in {0,...,k-1}^n, with the neighborhood relation defined by pairs of strings that differ in exactly 1 character. When k=2, we get the n-dimensional hypercube. What's the word for k>2?
gusl: (Default)
"Data Mining" and "Machine Learning" are separate communities, which means they go to different conferences, etc. (I didn't meet any data mining people at NIPS). I would like to understand why, since the central idea is the same: do statistics in order to answer queries.

Today I chatted with a PhD student in data mining about the topic of how his field differs from machine learning / statistics. What he said can be summarized as follows:
(1) data mining concerns huge data sets, for which issues like memory management, indexing, and data summaries are important
(2) data mining cares a great deal about "pre-processing" (which apparently can include problems such as named-entity recognition/co-reference resolution)
(3) data mining cares about structured objects
(4) data mining cares about catering to users who are not able/willing to type their queries in a formal language
(5) data mining only cares about making "systems that work" (in quotes because this is very ambiguous)
(6) data mining doesn't have a very high standard for reproducibility

Here are my thoughts:
(1) machine learning deals with such issues too, though perhaps it picked some of them up only recently. Online learning (the idea of keeping (near-)sufficient statistics that can be updated online) has surely been around for a good while.
(2) I really don't understand this comment.
(3) lots of machine learning is about structured objects! This is my favorite kind, and includes almost all of NLP and bioinformatics.
(4) I guess this means they are reaching towards HCI, specifically natural language-ish interfaces.
(5) no comment.
(6) I guess this is probably because the data they work with is largely proprietary, since one of their applications is business intelligence. Nevertheless, I wonder what it's like to work in a field where the research is not reproducible. If your results can't be reproduced, what do people cite you for?
gusl: (Default)
Why do we care about finding MLE or MAP parameter settings? The focus on the maximum reminds me of the mode, a statistic that one doesn't care about most of the time. (Instead, one tends to focus on the mean and the median.)
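
As a toy illustration of how far apart these summaries can be, take a skewed distribution such as a lognormal (my example, using SciPy):

```python
import numpy as np
from scipy import stats

# For a lognormal "posterior", the mode, median and mean are all different,
# so reporting only the maximizer can be quite misleading.
mu, sigma = 0.0, 1.0
dist = stats.lognorm(s=sigma, scale=np.exp(mu))
mode = np.exp(mu - sigma**2)       # closed form for the lognormal mode
median = dist.median()             # exp(mu) = 1.0
mean = dist.mean()                 # exp(mu + sigma**2 / 2) ~ 1.65
print(mode, median, mean)          # ~0.37, 1.0, ~1.65
```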

Take the graph below:


[Figure: likelihood graph]



If your posterior looks like this, is the red circle really your best guess about theta?

Why don't we work with the full uncertainty about theta? Because it tends to be computationally expensive to do so?

---

Suppose you have a very expensive Bayesian update. The standard answer is to use MCMC to sample from the posterior. But suppose you're not interested in the whole posterior, just a small region of it (or even a particular point, which is a region of measure zero). Are there ways to prune away samples that are going to end up outside your region of interest, or to estimate that particular point directly?
