[personal profile] gusl
"Data Mining" and "Machine Learning" are separate communities, which means they go to different conferences, etc. (I didn't meet any data mining people at NIPS). I would like to understand why, since the central idea is the same: do statistics in order to answer queries.

Today I chatted with a PhD student in data mining about how his field differs from machine learning / statistics. What he said can be summarized as follows:
(1) data mining concerns huge data sets, in which issues like memory management, indexing, and data summaries are important
(2) data mining cares a great deal about "pre-processing" (which apparently can include problems such as named-entity recognition/co-reference resolution)
(3) data mining cares about structured objects
(4) data mining cares about catering to users who are not able/willing to type their queries in a formal language
(5) data mining only cares about making "systems that work" (in quotes because this is very ambiguous)
(6) data mining doesn't have a very high standard for reproducibility

Here are my thoughts:
(1) machine learning deals with such issues too, though perhaps some only recently. Online learning (the idea of keeping (near)sufficient statistics that can be updated online) has surely been around for a good while.
(2) I really don't understand this comment.
(3) lots of machine learning is about structured objects! This is my favorite kind, and includes almost all of NLP and bioinformatics.
(4) I guess this means they are reaching towards HCI, specifically natural language-ish interfaces.
(5) no comment.
(6) I guess this is probably because the data they work with is largely proprietary, since one of their applications is business intelligence. Nevertheless, I wonder what it's like to work in a field where the research is not reproducible. If your results can't be reproduced, what do people cite you for?
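
To make the online-learning idea in (1) concrete, here is a minimal sketch of keeping sufficient statistics that are updated one observation at a time, so the full data set never has to sit in memory. It uses Welford's well-known recurrence for the running mean and variance. The class name `RunningGaussian` and the sample stream are my own illustration, not taken from any particular library or from the conversation above.

```python
from dataclasses import dataclass


@dataclass
class RunningGaussian:
    """Sufficient statistics for a Gaussian, maintained online (hypothetical helper)."""
    n: int = 0          # number of observations seen so far
    mean: float = 0.0   # running mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's update: constant memory, one pass over the stream.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; defined as 0.0 until two points are seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningGaussian()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up stream
    stats.update(x)

print(stats.n, stats.mean, stats.variance)
```

The three numbers `(n, mean, m2)` are all that is stored, however long the stream gets, which is the sense in which they act as a data summary.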

(no subject)

Date: 2009-03-14 11:29 am (UTC)
From: [identity profile] the-locster.livejournal.com
The various methods in data mining and machine learning have different motivations and roots; each method stems from an idea in a limited context or domain. Only when you start to fill out the details of what you actually want to achieve does it become apparent that many of these methods are tackling the same problems from different perspectives. Perhaps they still have different priorities and primary goals, but as the methods develop I think people find that they converge on the same central problems, even if those problems weren't their original or primary objective.

I suspect computing power has an effect also. Many of these specialized viewpoints are in fact working in sub-domains of broader, more general domains. But you couldn't tackle the general problem because of CPU constraints, so specialized methods were developed that found good approximate solutions in sub-domains, but that worked poorly or were not applicable in the broader domain.

In summation, I think we'll see the merging and consolidation of ideas as hardware gets faster and the individual specializations are made defunct by broader methods. In the meantime it's frustrating to see such a degree of clustering around narrow ideas and specialization going on.

(no subject)

Date: 2009-03-14 03:18 pm (UTC)
From: [personal profile] infryq
> (4) I guess this means they are reaching towards HCI, specifically natural language-ish interfaces.
Do you consider Google to have a natural language interface? I mean, you type in words, you get documents back. Whereas some of the other top examples in information retrieval systems -- whose users are doctors and paralegals -- can get you exactly what you want, or operate at very high levels of recall, but you have to spend five minutes formulating a query using the right search operators in a specific syntax. People are much more interested currently in how fabulous it is that Joe Schmoe can get *better* search results without a structured query language.

> (5) no comment.
I don't actually think 5 was meant to be offensive -- it just means that the data mining community is more willing to accept solutions with no basis in theory; heuristics that seem to work in practice but nobody knows why. Likewise, there are a lot of methods around that make conceptual sense, but don't seem to work when you try them in real systems --- either our machines aren't fast enough yet, our data sets aren't big enough yet, or they're just refining results at too minor a level to be noticeable (i.e. statistically significant) in the face of bigger retrieval issues.

(no subject)

Date: 2009-03-14 04:25 pm (UTC)
From: [identity profile] gwillen.livejournal.com
From this description I would say that the dataminers would claim most of what Google does as datamining, but Google calls it machine learning. You say po-tay-to...

(no subject)

Date: 2009-03-14 07:13 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Oh, not offensive.
But many machine learning people say the same thing about their field in relation to Statistics. While it's true that Statisticians are more interested in proving theorems, it's hard to delineate any further than that. The two fields care about the same things, but with somewhat different weighting.

I'd say machine learning people would absolutely accept "solutions with no basis in theory"... but most of them would make an effort to understand why it works. And I suspect the same is true of data miners. I think machine learners are generally more interested in fancy techniques, but only when they "work"!

(no subject)

Date: 2009-03-14 07:18 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
<< Do you consider Google to have a natural language interface? >>

I won't say yes or no. We haven't defined terms. But Google's interface is definitely a step in that direction, compared to database query languages. They check your spelling!

(no subject)

Date: 2009-03-15 02:24 am (UTC)
From: [identity profile] mapjunkie.livejournal.com
The way I think of it is, although both disciplines branch out into each other's territory, there is a clear difference:

1) Data mining is about knowledge discovery in data. It focuses primarily on describing previously unknown relationships in a given data set.

2) Machine learning is about improving on a particular task through examples instead of direct instruction.

Consider unsupervised learning. Data miners ask what knowledge of value could be uncovered statistically, while a machine learning practitioner might ask what kinds of bounds they could place on the distributions, given the data seen.

I would say that data mining is an application of machine learning, drawing also from visualization, algorithms, and more conventional statistics.
