personality questionnaires & data mining
Oct. 11th, 2005 08:41 pm
How meaningful are the MBTI dimensions?
Wouldn't we be better off doing data mining on personality questionnaires, in order to find an optimal set of personality dimensions?
If we have a questionnaire with k questions, the space of possible answers is A^k, where A is the set of admissible answers for an individual question. We could simplify this and say that A = {0,1} (yes-or-no questions). The consequence is that the possible answers form a (discrete) hypercube.
A personality dimension would be a linear combination of these answers.
A collection of personality dimensions spans a linear subspace of R^k (into which A^k embeds).
The interesting question is:
Given an integer n, and a sample of completed questionnaires, how do you find the optimal linear subspace in n dimensions? In general, you would define a set of variables that you want to predict. But let's say we want to predict all the variables equally, i.e. we want the subspace that provides the least-lossy compression of the data (measured by, say, least-squares reconstruction error). Since we're maximizing meaningful information over all possible n-dimensional subspaces, I wonder if this corresponds to minimizing entropy. But it seems that maximizing entropy would just maximize noise.
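For what it's worth, under squared error this optimization is exactly what principal component analysis solves. A minimal numpy sketch on made-up yes/no data (the 200×12 questionnaire below is random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: 200 completed questionnaires, k = 12 yes/no questions.
X = (rng.random((200, 12)) < 0.5).astype(float)

def best_subspace(X, n):
    """Orthonormal basis for the n-dimensional linear subspace that
    minimizes total squared reconstruction error (PCA via SVD)."""
    Xc = X - X.mean(axis=0)          # center the answers first
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n]                    # each row is one "personality dimension"

dims = best_subspace(X, 3)
scores = (X - X.mean(axis=0)) @ dims.T   # each person's 3 coordinates
print(dims.shape, scores.shape)          # (3, 12) (200, 3)
```

Each row of `dims` is a linear combination of the k questions, and projecting a person's centered answers onto those rows gives their coordinates along the recovered dimensions.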
Of course, the approach of linear combinations ignores complex interactions between the variables (e.g. given Q1, Q2 is positively correlated with Q3; but given ~Q1, Q2 and Q3 are negatively correlated). But we can always solve this problem by adding extra variables (e.g. a variable that equals "NOT (Q2 XOR Q3)", which measures their correlation): I wonder if all the logical relationships remain preserved when you do the linear regression (under the Boolean extension from {0,1} to [0,1] where "AND" becomes multiplication). Another interesting question is "which logical dependencies are expressible with a set of conjunctions?"
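That interaction example is easy to demonstrate: construct data where Q2 and Q3 agree given Q1 and disagree given ~Q1, and the pairwise correlation between Q2 and Q3 vanishes, while the added "NOT (Q2 XOR Q3)" variable recovers the relationship. A sketch (the data is synthetic, built to match the example above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Given Q1, Q2 and Q3 agree; given NOT Q1, they disagree.
q1 = rng.integers(0, 2, n)
q2 = rng.integers(0, 2, n)
q3 = np.where(q1 == 1, q2, 1 - q2)   # agreement flips with q1

agree = 1 - (q2 ^ q3)                # the extra variable "NOT (Q2 XOR Q3)"

corr = np.corrcoef(np.vstack([q1, q2, q3, agree]))
print(corr[1, 2])   # Q2 vs Q3: near 0 -- the interaction is invisible pairwise
print(corr[0, 3])   # Q1 vs "agree": 1.0 -- the extra variable exposes it
```

So a linear model over the raw questions misses this dependency entirely, but becomes able to use it once the derived variable is appended as a new column.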
I'm now imagining that a good algorithm would be to create a graph with questions as nodes, and edges as strongly positive pairwise correlations. Good dimensions will show up as clusters (dense subgraphs), i.e. just let the dimension be the sum of all questions in the cluster. Good subspaces can be found by finding partitions that cut through the fewest edges (sort of similar to min-cut) while still being more or less balanced in the size of the clusters.
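A toy version of that graph idea, with made-up data: two hypothetical latent traits each drive three noisy yes/no questions, edges are drawn between strongly positively correlated question pairs, and clusters are read off as connected components (a crude stand-in for the balanced min-cut partitioning described above):

```python
import numpy as np

rng = np.random.default_rng(2)
n_people = 500

# Hypothetical data: two latent traits, each driving three yes/no questions.
trait_a = rng.integers(0, 2, n_people)
trait_b = rng.integers(0, 2, n_people)

def noisy(trait, flip=0.1):
    """A question that copies its trait, with some answers flipped."""
    return np.where(rng.random(n_people) < flip, 1 - trait, trait)

X = np.column_stack([noisy(trait_a) for _ in range(3)] +
                    [noisy(trait_b) for _ in range(3)])

# Graph: nodes = questions, edges = strongly positive pairwise correlations.
corr = np.corrcoef(X, rowvar=False)
adj = corr > 0.5
np.fill_diagonal(adj, False)

def components(adj):
    """Connected components of the correlation graph, by depth-first search."""
    k = adj.shape[0]
    seen, clusters = set(), []
    for start in range(k):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = int(stack.pop())
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(np.flatnonzero(adj[v]))
        clusters.append(sorted(comp))
    return clusters

clusters = components(adj)
print(clusters)   # the two trait clusters separate cleanly here
dims = np.column_stack([X[:, c].sum(axis=1) for c in clusters])
```

Each recovered dimension is just the sum of the questions in its cluster, as proposed; real questionnaire data would of course need the more careful balanced-cut treatment rather than a hard correlation threshold.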
-
Conclusion: I need to take a machine learning class.
And by "class", I mean a good book.
And by "good", I mean "a book that answers my questions, without too much reading effort required".
(no subject)
Date: 2005-10-11 09:13 pm (UTC)Here's a link to a chap who applied this to answers to his online political-beliefs survey, with interesting results.
http://ex-parrot.com/~chris/wwwitter/20030731-a_little_knowledge.html
(no subject)
Date: 2005-10-11 09:44 pm (UTC)The Libertarians' "World's Smallest Political Quiz" has 2 axes: social freedom and economic freedom. But I believe that the pragmatism dimension is much more realistic.
Axes
Date: 2005-10-12 03:58 pm (UTC)Note that the labels he puts on the axes are just his interpretations of the data. The axes are really defined by the questions and their resultant weights.
Also of related interest is his 2005 version, which evidently got a larger and less insular sample of respondents. The axes turned out quite a bit different there.
http://www.politicalsurvey2005.com/
(no subject)
Date: 2005-10-12 07:44 am (UTC)I imagine that some of the components would correspond to the subculture that you're in (people learn to like the styles they are around) while others would correspond to a personal "cognitive/emotional profile".
(no subject)
Date: 2005-10-12 06:37 am (UTC)http://www.stanford.edu/class/cs229/materials.html
Chapter 10 is PCA, but it's pretty dense material.