data mining vs machine learning
Mar. 14th, 2009 12:45 am
"Data Mining" and "Machine Learning" are separate communities, which means they go to different conferences, etc. (I didn't meet any data mining people at NIPS). I would like to understand why, since the central idea is the same: do statistics in order to answer queries.
Today I chatted with a PhD student in data mining about how his field differs from machine learning / statistics. What he said can be summarized as follows:
(1) data mining concerns huge data sets, in which issues like memory management, indexing, and data summaries are important
(2) data mining cares a great deal about "pre-processing" (which apparently can include problems such as named-entity recognition/co-reference resolution)
(3) data mining cares about structured objects
(4) data mining cares about catering to users who are not able/willing to type their queries in a formal language
(5) data mining only cares about making "systems that work" (in quotes because this is very ambiguous)
(6) data mining doesn't have a very high standard for reproducibility
Here are my thoughts:
(1) machine learning deals with such issues too, though perhaps with some of them only recently. Online learning (the idea of keeping (near-)sufficient statistics that can be updated as each new data point arrives) has surely been around for a good while.
(2) I really don't understand this comment.
(3) lots of machine learning is about structured objects! This is my favorite kind, and includes almost all of NLP and bioinformatics.
(4) I guess this means they are reaching towards HCI, specifically natural language-ish interfaces.
(5) no comment.
(6) I guess this is probably because the data they work with is largely proprietary, since one of their applications is business intelligence. Nevertheless, I wonder what it's like to work in a field where the research is not reproducible. If your results can't be reproduced, what do people cite you for?
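To make the online-learning idea in my point (1) concrete, here is a minimal sketch of keeping sufficient statistics that are updated one observation at a time, so the full data set never has to sit in memory. It uses Welford's running mean/variance update as a stand-in example; the class name `RunningStats` and its fields are my own illustrative choices, not anything from a particular library.

```python
class RunningStats:
    """Running mean and variance kept as three sufficient statistics.

    Hypothetical illustration: only (n, mean, m2) are stored, never the data.
    """

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the statistics (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance; defined as 0.0 until two points have been seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:   # stands in for a stream too big to store
    stats.update(x)
print(stats.n, stats.mean, stats.variance())
```

Each update costs constant time and memory, which is the property that matters for the "huge data sets" case, whichever community is handling it.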
(no subject)
Date: 2009-03-14 03:18 pm (UTC)
Do you consider Google to have a natural language interface? I mean, you type in words, you get documents back. Whereas some of the other top examples in information retrieval systems -- whose users are doctors and paralegals -- can get you exactly what you want, or operate at very high levels of recall, but you have to spend five minutes formulating a query using the right search operators in a specific syntax. People are much more interested currently in how fabulous it is that Joe Schmoe can get *better* search results without a structured query language.
> (5) no comment.
I don't actually think 5 was meant to be offensive -- it just means that the data mining community is more willing to accept solutions with no basis in theory; heuristics that seem to work in practice but nobody knows why. Likewise, there are a lot of methods around that make conceptual sense, but don't seem to work when you try them in real systems --- either our machines aren't fast enough yet, our data sets aren't big enough yet, or they're just refining results at too minor a level to be noticeable (i.e. statistically significant) in the face of bigger retrieval issues.
(no subject)
Date: 2009-03-14 07:13 pm (UTC)
But many machine learning people say the same thing about their field in relation to Statistics. While it's true that Statisticians are more interested in proving theorems, it's hard to delineate any further than that. The two fields care about the same things, but with somewhat different weighting.
I'd say machine learning people would absolutely accept "solutions with no basis in theory"... but most of them would make an effort to understand why it works. And I suspect the same is true of data miners. I think machine learners are generally more interested in fancy techniques, but only when they "work"!
(no subject)
Date: 2009-03-14 07:18 pm (UTC)
I won't say yes or no. We haven't defined terms. But Google's interface is definitely a step in that direction, compared to database query languages. They check your spelling!