gusl | Entries tagged with machine

My experience with applied statistics has been fun in large part because it involved making the Bayesian rubber meet the road of inherent computational complexity, and because it allowed me to define and explore theoretical concepts and tie them to our applied problem... I also had a good time formalizing and unifying ideas, making them more coherent and clearer. And it has been great to work with dedicated people. I'm still learning from my new experience in a mentoring role.

But perhaps the most formative thing is the periodic mindfuck that comes with realizing how wrong I am. It is the frustration of realizing that missing data reflects perfectly innocent biases (leaving me with little recourse, since I don't have a model of our colleagues who decide how to collect data, and what to report). It is questioning maximum-likelihood. Last year, I saw how one can expect to make better predictions by deliberately plugging in a parameter value that one knows is smaller than the truth (in simulated data, no misspecification). Now I am seeing how a better search algorithm leads to worse prediction. Maybe it is time to penalize the likelihood afterall... But it's not clear where overfitting may be coming from, or what makes one block structure more complex than another.

In supervised classification problems, we can define complexity families (e.g. linear separators are very simple; quadratic ones are less simple, etc) and define penalties proportional to the complexity measure... but what would you do if there was no geometry to the space, or if you were ignorant of geometry to begin with? And what makes one complexity measure (or one geometry) superior to another anyway? (It is not helpful to mention that Kolmogorov Complexity provides a universal measure of complexity; although I do enjoy this brand of optimism)

NeedleBase, by ITA Software^*, seems to be a practical way of building a database from unstructured sources. Besides the web scraping (Information Extraction), they have tools for data cleaning/merging, all while maintaining provenance information (i.e. every datum points to its original source). video tutorial

Dapper's Data Mapper is great for making RSS feeds out of mere URLs, e.g. this one I made for Charles Kemp's publications. Semantify might be useful for webmasters.

Freebase has never impressed me.

Tangentially, has anyone used a Web 2.0 application for socially annotating webpages as you visit them (e.g. leaving PostIt notes for your friends to see), or chatting with other people who are visiting them at the same time (social browsing)? I've never seen a good one.

Any thoughts on Flock?

^{* - soon to be merged into Google, as I found by reading the comments to this post.}

According to the sources I've read, Gibbs sampling doesn't really tell you how to sample... it's just divide-and-conquer: you still need to sample each variable from its full conditional.

In my problem, that would be more trouble than it's worth, since no denominators are canceling. So I'm doing good old Metropolis-Hastings, with a proposal that makes its trajectory resemble Gibbs: it randomly chooses one variable to modify, and then randomly chooses a value for that variable. In other words, the proposal is uniform over the neighbors in the "generalized hypercube"^*.

I can easily compute the posterior. In traditional MCMC, I think you would weight the samples by how often they appear. But doesn't it make more sense to directly compute the posterior in all sampled models?

Should I be using MCMC at all?
How else am I going to find the set of high-probability models? (Maybe what I want is a mode-oriented stochastic search, as my EA project did.

Believe it or not, it's my first time actually implementing MCMC.

Also, I'd like to know what proportion of the posterior mass my sampled models account for... but this is probably VeryHard to do if I can't enumerate the models (otherwise we could know whether the chain has mixed).

^{* - what do you call a non-binary hypercube? I mean: let each node be a string in {0,...,k-1}^n, and neighborhood relation is defined by set of strings that differ in exactly 1 character. When k=2, we get the n-dim hypercube. What's the word when n>2?}.

"Data Mining" and "Machine Learning" are separate communities, which means they go to different conferences, etc. (I didn't meet any data mining people at NIPS). I would like to understand why, since the central idea is the same: do statistics in order to answer queries.

Today I chatted with a PhD student in data mining about the topic of how his field differs from machine learning / statistics. What he said can be summarized as follows:
(1) data mining concerns huge data sets, in which issues like memory management, indexing, data summaries are important
(2) data mining cares a great deal about "pre-processing" (which apparently can include problems such as named-entity recognition/co-reference resolution)
(3) data mining cares about structured objects
(4) data mining cares about catering to users who are not able/willing to type their queries in a formal language
(5) data mining only cares about making "systems that work" (in quotes because this is very ambiguous)
(6) data mining doesn't have a very high standard for reproducibility

Here are my thoughts:
(1) machine learning deals with such issues too, though perhaps some only recently. Online learning (the idea of keeping (near)sufficient statistics that can be updated online) has surely been around for a good while.
(2) I really don't understand this comment.
(3) lots of machine learning is about structured objects! This is my favorite kind, and includes almost all of NLP and bioinformatics.
(4) I guess this means they are reaching towards HCI, specifically natural language-ish interfaces.
(5) no comment.
(6) I guess this is probably because the data they work with is largely proprietary, since one of their applications is business intelligence. Nevertheless, I wonder what it's like to work in a field where the research is not reproducible. If your results can't be reproduced, what do people cite you for?

I'm wondering whether there's a meaningful connection between MDL and VLMCs.

I decided look up the definition of VLMCs. So I did a Google search for "Variable-length Markov Chain" + "prefix code", which led me to this page:

To my surprise, this is citing Rissanen (the inventor of MDL!).

But going to the PDF, I cannot find Rissanen, or "prefix code".

What is wrong here?

UPDATE: Ok, "Rissanen" does appear in the references, but apparently the PDF is not searchable. A search for "the" returns empty.

L2 regularization is seen a way to avoid overfitting when doing regression, no more.

L1 regularization tends to give sparse results. If the truth is sparse, this is seen as a way to get to the truth (although this is not always consistent, which is why we have Bolasso).

Even if the truth is not sparse, L1 may be seen as an Occam's razor. Is this a valid view?

Even if the truth is not sparse, L1 is a way to select a small number of variables, which can be useful for those of us concerned with scarce computational resources (although it's not clear why you'd choose L1 over PCA or Partial Least Squares)

Why do we care about finding MLE, MAP parameter-settings? The focus on the maximum reminds me of the mode, a statistic that one doesn't care about most of the time. (Instead, one tends to focus on the mean and median).

Take the graph below:

likelihood graph

If your posterior looks like this, is the red circle really your best guess about theta?

Why don't we work with the full uncertainty about theta? Because it tends to be computationally expensive to do so?

---

Suppose you have a very expensive Bayesian update. The standard answer is to use MCMC to sample from the posterior. But suppose you're not interested in the whole posterior, just a small region of it (or a particular point even, which is a region of measure zero). Are there ways to prune away points that are going to end up outside my region of interest, or to estimate my point?

I sometimes read a paper, and think: how I wish I had someone around here to work on this stuff with.

If I could find someone at CMU who:
* cites this paper, or cites many of the same papers as this paper.
* has been a co-author or colleague of one of the authors
then this problem would be considerably simplified.

Some relevant resources:
* CiteSeer can be used to get the citation graph (although this can be problematic)
* citeUlike is a recommendation system. Might be useful.

Chatting with Andrew tonight, at the CS grad lounge, we talked about Hit Song Science and we eventually focused on the following tangent:

- How could you make a system to translate a tune into the style of another musician (or into a different musical idiom)?

The following is what we came up with:

Using MDL, the naive answer is: take the encoding of the song in the first style (its original style) and use this string to generate a song in the second style. But then the new song won't be recognizable: it will just be a random tune in the 2nd style. In order to do better, we would need to preserve the features that are essential to the tune's identity, i.e. we need a model of human perception.

We could describe styles by looking at certain features individually (e.g. tempo, pitch, etc), and setting the values/ranges characteristic of the style. Features that are continuous (e.g. tempo) could have their value adapted to the new style, thereby making the tune feel closer to something in Style2. But then there's the issue of maintaining internal consistency: if you e.g. increase the tempo (because the new style requires a faster tempo), you might need to change the way the drums are played (because a particular drum rhythm just sounds wrong at a higher tempo).

This makes it an unsolvable problem, strictly speaking, just like translation between natural languages (translations often sound like translations, and there is definitely information loss). But if the goal is to make a spoof (as is usually the case when people try to perform this task), then it's ok.

Herb Simon, and his student Pat Langley have long been interested in comparing human-learning and machine-learning side-to-side.

Much of our AI-based educational software is necessarily in domains where machines do better than human students (e.g. equation-solving). In domains where students are better, e.g. playing soccer, learning natural languages, we cannot create full-fledged cognitive tutors (as these require the system to judge correctness in novel problems). This doesn't mean that computer can't support the learning process, by creating tutors for skills that transfer (either vertically or horizontally) to the target skill, e.g. I can imagine soccer players benefitting from biofeedback.

We can create support for some subtasks essential for learning (e.g. vocabulary building), but the computer's knowledge of words will be "dead" knowledge, necessarily disembodied from its natural function (i.e. communicating with humans, with no domain restrictions).

Human and machines speak different languages. This is to be expected, since they evolved under different constraints:
* environment: real world vs. simulated world
* hardware architecture: wetware

OTOH, the way programs do things can't be that far from the way humans do things, since they are programmed by humans, afterall. I can imagine different programming styles, in which one implements more or less cognitively-plausible ways of solving the problem.

Much of the challenge of HCI and ITS is to formalize human concepts so that humans and machines can communicate more easily. By formalizing concepts that people actually think about, we make it possible to give the computer high-level instructions, instead of having to worry about annoying technicalities.

It has long bothered me that computers are generally rather unreflective, unaware of their output (even if you give the computer a screen capture, its perception of it is extremely raw and low-level, because it doesn't know how to interpret menus, widgets, etc, which is rather ironic: they know how to make these widgets, but not how to read them). It seems that they could a learn a lot from feedback: if they could put themselves in the users' place, they would know when they being a bad, bad computer.

Is anyone aware of user-simulators that try to figure out how to use a new interface? Is this a standard method of evaluating UI designs?

---

Another idea that I've become familar with in my recent experience has been that of a behavior recorder: it's basically like a key&mouse logger, but for higher-level steps.

By applying machine learning on behavior-recorder data, we can figure out how people behave. i.e. learn a cognitive model. Tagging these behaviors into higher-level chunks should improve this.

Now, take video games. There should be no shortage of behavior data for them. In fact, there are probably myriads of cognitive phenomena manifested in this data. To paraphrase Herb Simon, there is no shortage of data: there is only shortage of expert attention.

unified architectures for Machine Learning?

Our ontology is, as usual: observables (inputs) and unobservables we are interested in (output).

The purpose of much machine learning is to: given some data, induce a function that, when given a new data point with partial information, will let us complete it.

Analogy: reconstructing an image

Supervised learning is when we learn from complete images, and then perform the task of filling in the missing area.
Unsupervised learning is when we learn from incomplete images to begin with. The goal may be to complete the square, or merely to find a classification of the incomplete images.

But in a more general context, the different variables associated with the data will have different units and types (e.g. two-valued, multi-valued, discrete, continuous, etc). While such data points can still be encoded as images (everything can), we lose contraints that made learning feasible (e.g. continuity).

My impression is that many learning systems are hard-coded for a specific learning function (i.e. a given set of inputs and outputs), and aren't robust to changes, i.e. if you add an input the system won't improve, if you remove an input the system breaks. If we have a system that has learned an estimate of 1=>2, it should be easy to turn that learning into an estimate of 2=>1 (of course, if 1=>2 is very information-lossy, your standard for 2=>1 can't be very high).

-------------------------------

Learning argumentative structures

Anyway, I mention all this because in my current project on learning argumentative structures, our most ambitious goal (i.e. automating the process of argument-mapping) involves a multi-step process, and we may or may not want to add scaffolding (different levels of human-annotation) along the way.

Our "variables":

1 skeleton of the graph
2 text in nodes
3 raw source text
4 text segments ("quotes") to be used
5 links between text segments and nodes

The ambitious goal is to learn 3 => 1,2,5 (4 being a necessary intermediate step)

2 => 4 should be easy, as the text in nodes tends to be close paraphrases of the source, and the target space is small (there are only so many quotes you can take a short text).
1,3,4 => 5 should be easy to make perform reasonably well by using simple heuristics about ordering and textual cues (words like "therefore").
1,3 => 2 could benefit from this heuristic: if you see the text "we assume that" in 3, the sentence following that must be a leaf node (i.e. axiom node). Likewise, some cues may help us identify the root node: maybe "therefore" gets used in final conclusions whereas "thus" is used in intermediate nodes more often.
1,2,3,4 => 5 is easy: 1,3,4 => 5 is feasible already, and now we have 2, which makes the problem almost trivial: just use string matching.

---

Collecting data:

fixed text, different graphs: read & formalize assignments
fixed graph, different texts: read graph & write assigments (expose the author's point of view)

People are bad at being random, i.e. unpredictable. Randomness is a depletable resource. I'd like to look at Machine Learning algorithms that learn to exploit such patterns.

This one is intelligent. Still, I've been managing to maintain a small lead, which is close to the best I can expect if I beat their expectations about patterns the average person would exhibit. ("You have won 56, lost 48, and tied 46 games". What is the probability of such a favourable outcome for if this were random?). One known bias is that people think they are being random when they avoid repeating the same choice. The machine will try to take advantage of this.

This one did not learn to exploit some predictable players.

I am now missing the Structural Knowledge Transfer for Machine Learning workshop at ICML06.

Hit Song Science applies machine learning to predict whether a given song will be a hit. Rudi is not impressed. My take: evaluation, evaluation, evaluation!

Some Google searches give some unexpectedly interesting results

Applying SIENA: An illustrative analysis of the co-evolution of adolescents’ friendship networks, taste in music, and alcohol consumption

Jodi L. Pearson and Stephen J. Dollinger - Music preference correlates of Jungian types

Abstract
The purpose of the current study was to explore the relationship between personality and music preferences, using the Myers Briggs Type Indicator. It was hypothesized that the sensing–intuition dimension would correlate with overall musical enjoyment. Thus, as compared with those participants who scored toward the sensing end, we expected high scorers (intuition end) to endorse more musical styles, particularly classical music, as well as to have greater musical training and involvement. This hypothesis was tested and confirmed with a sample of 104 undergraduates. Moreover, extraversion also correlated with overall musical interest, particularly for popular/rock music. Finally, thinking–feeling correlated with liking for country and western music. Whereas past research has conceptualized music preferences in terms of approach to or avoidance of stimulation, these findings support the notion of cultural involvement as a personality dimension.
...
Individual weights indicated that intuition predicted liking of jazz/soul/folk (0.36, P<0.001) and classical music (0.27, P<0.05). Extraverts enjoyed popular/rock music more than introverts (introversion WEIGHT=âˆ’0.34, P<0.05). Those scoring toward the feeling end of TF were more likely to endorse country-western music than those scoring toward the thinking end (=0.27, P<0.01).

How meaningful are the MBTI dimensions?

Wouldn't we be better off doing data mining on personality questionnaires, in order to find an optimal set of personality dimensions?

If we have a questionnaire with k questions, the space of possible answers is A^k, where A is the set of admissible answers for an individual question. We could simplify this and say that A = {0,1} (yes-or-no questions). The consequence is that the possible answers form a (discrete) hypercube.

A personality dimension would be a linear combination of these answers.
A subspace of A^k is a collection of personality dimensions.

The interesting question is:

Given an integer n, and a sample of completed questionnaires, how do you find the optimal linear subspace in n dimensions? In general, you would define a set of variables that you want to predict. But let's say we want to predict all the variables equally, i.e. we want the subspace that provides the least-lossy compression of the data (measured by say, least-squares). Since we're maximizing meaningful information over all possible n-dimensional subspaces, I wonder if this corresponds to minimizing entropy. But it seems that maximizing entropy would just maximize noise.

Of course, the approach of linear combinations ignores complex interactions between the variables (e.g. given Q1, Q2 is positively correlated with Q3; but given ~Q1, Q2 are Q3 are negatively correlated). But we can always solve this problem by adding extra variables (e.g. a variable that equals "NOT (Q2 XOR Q3)", which measures their correlation): I wonder if all the logical relationships remain preserved when you do the linear regression (under the Boolean extension from {0,1} to [0,1] where "AND" becomes multiplication). Another interesting question is "which logical dependencies are expressible with a set of conjunctions?".

I'm now imagining that a good algorithm would be to create a graph with questions as nodes, and edges as strongly positive pairwise correlations. Good dimensions will show up as clusters (dense subgraphs), i.e. just let the dimension be the sum of all questions in the cluster. Good subspaces can be found by finding partitions that cut through the fewest edges (sort of similar to min-cut) while still being more or less balanced in the size of the clusters.

-

Conclusion: I need to take a machine learning class.

And by "class", I mean a good book.

And by "good", I mean "a book that answers my questions, without too much reading effort required".

S	M	T	W	T	F	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29

Gustavo Lacerda

my experience with applied statistics; my current brainmelt

semantifying the web, via Programming by Demonstration

Gibbs sampling

data mining vs machine learning

citeseer; PDF search broken

L1 regularization

why MLE, MAP?

idea: finding experts locally

MDL and pattern generation (algorithmic composition, etc)

some thoughts on HCI and Intelligent Tutoring Systems (ITS)

learning argumentative structures

unified architectures for Machine Learning?

Learning argumentative structures

Rock-Paper-Scissors

Structural Knowledge Transfer for Machine Learning

Hit Song Science

Google searches: PCA + taste in music

personality questionnaires & data mining

Profile

February 2020

Syndicate

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags