[personal profile] gusl

unified architectures for Machine Learning?


Our ontology is, as usual: observables (inputs) and unobservables we are interested in (outputs).

The purpose of much machine learning is this: given some data, induce a function that, when given a new data point with partial information, lets us complete it.



Analogy: reconstructing an image

Supervised learning is when we learn from complete images, and then perform the task of filling in the missing area.
Unsupervised learning is when we learn from incomplete images to begin with. The goal may be to fill in the missing area, or merely to find a classification of the incomplete images.
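To make the analogy concrete, here is a toy sketch (not any particular algorithm): treat an "image" as a 1-D row of grayscale values and fill a missing region with the mean of the observed entries, a crude stand-in for a learned completion model.

```python
# Toy version of the analogy: "images" are 1-D grayscale rows;
# we fill a masked region with the mean of the observed entries.
# No real learning here -- just the completion task itself.

def complete(row, missing):
    """Fill indices in `missing` (a set) with the mean of the
    observed entries -- a crude stand-in for a learned model."""
    observed = [v for i, v in enumerate(row) if i not in missing]
    fill = sum(observed) / len(observed)
    return [fill if i in missing else v for i, v in enumerate(row)]

completed = complete([1.0, 1.0, None, None, 1.0, 3.0], {2, 3})
```

A real system would exploit structure such as continuity between neighbors, which is exactly the constraint the next paragraph says we lose with heterogeneous data.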

But in a more general context, the different variables associated with the data will have different units and types (e.g. two-valued, multi-valued, discrete, continuous, etc.). While such data points can still be encoded as images (everything can), we lose constraints that made learning feasible (e.g. continuity).

My impression is that many learning systems are hard-coded for a specific learning function (i.e. a given set of inputs and outputs), and aren't robust to changes: add an input and the system won't improve; remove an input and the system breaks. If we have a system that has learned an estimate of 1=>2, it should be easy to turn that learning into an estimate of 2=>1 (of course, if 1=>2 is very information-lossy, your standard for 2=>1 can't be very high).
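For discrete variables, that reversal is just Bayes' rule. A minimal sketch, assuming we already have the learned conditional p(2|1) and a prior p(1) (the toy distributions below are made up):

```python
# Reversing a learned conditional estimate via Bayes' rule:
# given p(two | one) and a prior p(one), recover p(one | two).
# Named after the post's 1 => 2 example.

def invert(p_two_given_one, p_one):
    """Return p(one | two) as a nested dict {two: {one: prob}}."""
    posterior = {}
    for one, prior in p_one.items():
        for two, lik in p_two_given_one[one].items():
            posterior.setdefault(two, {})[one] = lik * prior
    # normalize each conditional distribution
    for two, dist in posterior.items():
        z = sum(dist.values())
        for one in dist:
            dist[one] /= z
    return posterior

# toy example: "one" is a binary cause, "two" a noisy binary effect
p_one = {0: 0.5, 1: 0.5}
p_two_given_one = {0: {0: 0.9, 1: 0.1},
                   1: {0: 0.2, 1: 0.8}}
p_one_given_two = invert(p_two_given_one, p_one)
```

If 1=>2 is information-lossy (the likelihoods barely differ across values of 1), the inverted distribution stays close to the prior, which is exactly the "low standard" caveat above.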

-------------------------------

Learning argumentative structures


Anyway, I mention all this because in my current project on learning argumentative structures, our most ambitious goal (i.e. automating the process of argument-mapping) involves a multi-step process, and we may or may not want to add scaffolding (different levels of human annotation) along the way.

Our "variables":

1 skeleton of the graph
2 text in nodes
3 raw source text
4 text segments ("quotes") to be used
5 links between text segments and nodes

The ambitious goal is to learn 3 => 1,2,5 (4 being a necessary intermediate step)

2 => 4 should be easy, as the text in nodes tends to be a close paraphrase of the source, and the target space is small (there are only so many quotes you can take from a short text).
1,3,4 => 5 should be easy to get performing reasonably well using simple heuristics about ordering and textual cues (words like "therefore").
1,3 => 2 could benefit from this heuristic: if you see the text "we assume that" in 3, the sentence following that must be a leaf node (i.e. axiom node). Likewise, some cues may help us identify the root node: maybe "therefore" gets used in final conclusions whereas "thus" is used in intermediate nodes more often.
1,2,3,4 => 5 is easy: 1,3,4 => 5 is feasible already, and now we have 2, which makes the problem almost trivial: just use string matching.
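As a sketch of why having 2 reduces the problem to string matching, here is a hypothetical word-overlap matcher (the node texts and quotes below are invented examples, not project data):

```python
# Hypothetical sketch of the 1,2,3,4 => 5 step: link each node's
# text (variable 2) to the quote (variable 4) with the highest
# word overlap -- crude string matching, nothing more.

def words(s):
    return {w.strip(".,;:").lower() for w in s.split()}

def word_overlap(a, b):
    """Jaccard overlap between the word sets of two strings."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / max(len(wa | wb), 1)

def link_nodes_to_quotes(nodes, quotes):
    """Return {node_id: index of the best-matching quote}."""
    return {nid: max(range(len(quotes)),
                     key=lambda i: word_overlap(text, quotes[i]))
            for nid, text in nodes.items()}

# invented toy example
nodes = {"axiom": "all men are mortal",
         "conclusion": "Socrates is mortal"}
quotes = ["We assume that all men are mortal.",
          "Therefore, Socrates is mortal."]
links = link_nodes_to_quotes(nodes, quotes)
```

Note the "we assume that" and "Therefore" cues in the quotes: the same matcher output could feed the leaf-node and root-node heuristics from 1,3 => 2.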

---

Collecting data:

fixed text, different graphs: read & formalize assignments
fixed graph, different texts: read graph & write assignments (expose the author's point of view)

(no subject)

Date: 2006-11-12 12:37 am (UTC)
From: [identity profile] jcreed.livejournal.com
My impression is that many learning systems are hard-coded for a specific learning function (i.e. a given set of inputs and outputs), and aren't robust to changes, i.e. if you add an input the system won't improve, if you remove an input the system breaks.

What makes you think this? If I give Naive Bayes more training data, it does better. If I take away training data, it doesn't fail catastrophically, but rather its performance degrades gradually.
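To illustrate the graceful degradation jcreed describes, here is a toy Bernoulli Naive Bayes with Laplace smoothing on synthetic binary data (the data generator, feature count, and class balance are all made up for the sketch):

```python
# Toy Bernoulli Naive Bayes with Laplace smoothing, trained on
# synthetic binary data. With fewer points the probability
# estimates are noisier, but nothing breaks.
import math
import random

def train(examples):
    """examples: list of (features, label); features: tuple of 0/1."""
    counts, totals = {}, {}
    for x, y in examples:
        totals[y] = totals.get(y, 0) + 1
        c = counts.setdefault(y, [0] * len(x))
        for i, v in enumerate(x):
            c[i] += v
    return counts, totals

def predict(model, x):
    counts, totals = model
    n = sum(totals.values())
    best, best_lp = None, -math.inf
    for y, t in totals.items():
        lp = math.log(t / n)  # class prior
        for i, v in enumerate(x):
            p1 = (counts[y][i] + 1) / (t + 2)  # Laplace smoothing
            lp += math.log(p1 if v else 1 - p1)
        if lp > best_lp:
            best, best_lp = y, lp
    return best

random.seed(0)
def sample(y):  # class 1 turns features on more often
    p = 0.8 if y else 0.2
    return tuple(int(random.random() < p) for _ in range(5)), y

data = [sample(i % 2) for i in range(200)]
small = train(data[:10])   # little data: noisy but functional
large = train(data)        # more data: estimates near the true 0.8/0.2
```

With 10 training points the estimates are rough; with 200 they approach the true rates. Neither case breaks, which is the graceful degradation in question.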

(no subject)

Date: 2006-11-12 12:41 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
It's my own very uninformed impression. What I have in mind is NLP systems.

(no subject)

Date: 2006-11-12 12:45 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
hm... Wikipedia says Naive Bayes is a classifier. The kinds of things I have in mind are reconstructing things over a big space, i.e. there is no predefined set of classes.

(no subject)

Date: 2006-11-12 12:54 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Oops, it seems I suggested that classifiers only work with a pre-defined set of classes, which is of course not true (just look at any clustering scheme).

(no subject)

Date: 2006-11-12 08:56 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Also, by "add training data", it sounds like you might mean more data points, e.g. in biometrics, this would mean more individuals.

When I talk about "adding/removing an input", I mean extra variables about each data point, e.g. in biometrics, information about the neck width.

Is this what you meant?

(no subject)

Date: 2006-11-12 07:15 am (UTC)
From: [identity profile] mdinitz.livejournal.com
I agree with jcreed that many (perhaps most) learning algorithms do much better when given more data (that seems to be the point of using machine learning as opposed to normal algorithms). For example, lately I've been thinking about algorithms for learning distributions, both online and offline. Here you're given a bunch of points drawn from the distribution and you try to use them to figure out the parameters of the distribution. Ryan O'Donnell at CMU has some interesting papers in this area, as does Avrim Blum. Here it's obvious that the more data you get, the better. And certainly when I took learning theory most of the bounds you derive are bounds on the number of sample points you need in order to guarantee (with high probability) that the function you output is close to the actual function. Of course, classical learning theory is mostly about various kinds of classifiers, so if you don't care about classifiers then this isn't all that convincing.

(no subject)

Date: 2006-11-12 08:52 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
I actually know very little about ML. When I said "many learning systems are hard-coded", I was mostly making a straw man in order to make my point.

(no subject)

Date: 2006-11-12 08:59 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
(that seems to be the point of using machine learning as opposed to normal algorithms)

Are you making the point that machine learning is more dynamic than "normal algorithms", i.e. statistical recipes?

(no subject)

Date: 2006-11-12 03:10 pm (UTC)
From: [identity profile] mdinitz.livejournal.com
In some vague sense. Normally in learning theory you make claims like "If I see some amount of data (usually somewhere around 1/epsilon * log(1/delta)) then with probability at least 1-delta I will output a function that is epsilon-close to the actual function". This is the famous PAC model. Normal online or approximation algorithms don't have this dependence on the number of points. In online algorithms especially you have to make a strong guarantee at every single time point, since the adversary can stop the input at any time, while in learning algorithms you're guaranteed to see some amount of input. In approximation algorithms you make claims based on the number of points all the time (e.g. an algorithm with an omega(1)-approximation ratio), but there you have the opposite trend: the more points you see, the worse your approximation is. Machine learning is the only kind of algorithm I can think of where you benefit from seeing more points, which is the point I was trying to make.
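In its simplest form that bound gives a concrete sample count; the sketch below deliberately omits the hypothesis-class term ln|H| and all constants, so it understates the real requirement:

```python
# Simplest-form PAC sample bound: m ~ (1/epsilon) * ln(1/delta).
# The hypothesis-class term ln|H| and constants are omitted,
# so this is a lower-end illustration, not the full theorem.
import math

def pac_samples(epsilon, delta):
    """Samples so the output is epsilon-close with prob >= 1 - delta."""
    return math.ceil((1 / epsilon) * math.log(1 / delta))

m = pac_samples(0.1, 0.05)  # epsilon = 0.1, delta = 0.05
```

Note the asymmetry: halving epsilon doubles the sample count, while delta enters only logarithmically.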

Also, I, like jcreed, misinterpreted you to mean adding more points. If you mean adding more variables about each data point, then you're taking the problem into a higher-dimensional space, so it makes sense that this would make the problem harder. Finding almost any kind of structure is much, much harder in higher-dimensional spaces than in lower-dimensional ones. I'm not sure that learning algorithms are hand-coded for a specific dimension, but they certainly benefit from having lower dimension. There's been some cool work lately, though, on learning over more general manifolds, so even if your training data appears to come in some incredibly high-dimensional space, you can try to find a low-dimensional manifold containing the training data and learn over that.

(no subject)

Date: 2006-11-12 03:27 pm (UTC)
From: [identity profile] jcreed.livejournal.com
Aha, so you (gustavo) are saying many machine learning algorithms are hard-coded to a particular function type, not a particular function. There I think I agree with you, but I don't know what can be done about it.

(no subject)

Date: 2006-11-12 05:20 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
oh, "type", what a beautifully appropriate word!

Does "add an input" sound more like "add a data point" than like "add a dimension"?

What would it mean for the learning to be hard-coded to a particular function? That you're just fitting parameters? If so, then you're still learning over a function "type" in a more restricted sense of the word (e.g. the type of linear functions touching (0,0)).

(no subject)

Date: 2006-11-12 05:35 pm (UTC)
From: [identity profile] jcreed.livejournal.com
Does "add an input" sound more like "add a data point" than like "add a dimension"?

Yes, the former.

What would it mean for the learning to be hard-coded to a particular function?

To be honest, the thing I had in mind doesn't make a lot of sense: that one has an exact, single function in mind, and cooks up a learning algorithm to perform well on learning that function from instances of it. It is a somewhat extreme straw-man, for in principle the learning algorithm could just be the function itself, without paying any attention to its training data, but there are slightly less extreme versions of it.

I want to say that the human brain itself has a fixed "type" for its inputs and outputs: we are born with only so many eyes and ears on the input side, and only so many arms and mouths on the output side. Nonetheless, when people sustain injuries, they "route around" missing inputs or outputs, and they seem to incorporate external objects into their sense of self as they learn to use tools.

I'm hesitant to accept the idea that algorithms now are "just inflexible" or "brittle" or something and that we need to figure out "the algorithm for flexibility", but there's something in what you're saying that is persuading me that there is some kind of generality that we might not have, and that we might need.

Let me try to rework your objection/claim/idea into a similar one:

Our current machine learning algorithms typically (but not always) learn functions whose type is something like R^n -> R^m. Although we can twiddle around the n's and m's, it's still rather unstructured vectors of reals. How can we learn functions of more interesting types? Specifically, how can we learn functions of types which, on both the input and output side, are structured with enough complexity to admit objects that start looking like programs and language-utterances?
