[personal profile] gusl
How meaningful are the MBTI dimensions?

Wouldn't we be better off doing data mining on personality questionnaires, in order to find an optimal set of personality dimensions?

If we have a questionnaire with k questions, the space of possible answers is A^k, where A is the set of admissible answers for an individual question. We could simplify this and say that A = {0,1} (yes-or-no questions). The consequence is that the possible answers form a (discrete) hypercube.
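For concreteness, the vertices of that hypercube can be enumerated directly (a throwaway sketch; `k = 3` is just an illustrative choice):

```python
from itertools import product

k = 3  # three yes/no questions
# every possible completed questionnaire = a vertex of the k-dimensional hypercube
answers = list(product([0, 1], repeat=k))
```

There are 2^k such vertices, so the answer space blows up fast with questionnaire length.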

A personality dimension would be a linear combination of these answers.
A collection of personality dimensions spans a linear subspace of R^k (the real space in which the hypercube sits).
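In other words, a dimension is a weight vector over the questions, and a respondent's score on it is a dot product. A tiny sketch (the weights here are made up purely for illustration):

```python
# a "dimension" assigns a weight to each question; a respondent's score
# on that dimension is the dot product of weights and answers
weights = [1.0, -0.5, 0.0, 2.0]   # hypothetical weights for some trait
answers = [1, 0, 1, 1]            # one respondent's yes/no answers
score = sum(w * a for w, a in zip(weights, answers))
```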

The interesting question is:

Given an integer n and a sample of completed questionnaires, how do you find the optimal n-dimensional linear subspace? In general, you would define a set of variables that you want to predict. But let's say we want to predict all the variables equally well, i.e. we want the subspace that provides the least-lossy compression of the data (measured by, say, least squares). Since we're maximizing meaningful information over all possible n-dimensional subspaces, I wonder if this corresponds to minimizing entropy. But it seems that maximizing entropy would just maximize noise.
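As far as I can tell, the least-squares version of this is exactly what PCA computes: the top-n principal components of the centered data span the n-dimensional subspace minimizing total squared reconstruction error. A sketch with NumPy on made-up toy data (6 respondents, 4 yes/no questions):

```python
import numpy as np

def best_subspace(X, n):
    """Top-n principal directions of the centered data: the n-dimensional
    subspace minimizing total squared reconstruction error (PCA)."""
    Xc = X - X.mean(axis=0)                      # center each question
    # rows of Vt are orthonormal directions, ordered by variance explained
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n]                                # shape (n, k)

def reconstruct(X, dims):
    """Project the data onto the subspace and map back to answer space."""
    mu = X.mean(axis=0)
    return (X - mu) @ dims.T @ dims + mu

# toy data: 6 respondents, 4 yes/no questions
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1],
              [0, 0, 1, 0]], dtype=float)

dims = best_subspace(X, 2)
err = np.sum((X - reconstruct(X, dims)) ** 2)   # least-squares loss
```

The rows of `dims` are the candidate "personality dimensions": orthonormal weight vectors over the questions.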

Of course, the approach of linear combinations ignores complex interactions between the variables (e.g. given Q1, Q2 is positively correlated with Q3; but given ~Q1, Q2 and Q3 are negatively correlated). But we can always solve this problem by adding extra variables (e.g. a variable that equals "NOT (Q2 XOR Q3)", which measures whether Q2 and Q3 agree): I wonder whether all the logical relationships are preserved when you do the linear regression (under the Boolean extension from {0,1} to [0,1], where "AND" becomes multiplication). Another interesting question is: which logical dependencies are expressible with a set of conjunctions?
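Under that extension, XOR(a, b) = a + b - 2ab, so the agreement variable NOT(Q2 XOR Q3) has a closed form that is a polynomial in the original answers. A quick sanity check:

```python
def not_xor(q2, q3):
    # On {0,1}: equals 1 exactly when q2 == q3.
    # Multilinear extension to [0,1] (with AND as multiplication):
    # XOR(a, b) = a + b - 2ab, so NOT(XOR) = 1 - a - b + 2ab.
    return 1 - q2 - q3 + 2 * q2 * q3

# augment each questionnaire row with the interaction variable
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
augmented = [(q2, q3, not_xor(q2, q3)) for q2, q3 in rows]
```

The new column is a degree-2 polynomial, not a linear function, of Q2 and Q3; that is precisely why it has to be added as an extra variable before fitting anything linear.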

I'm now imagining that a good algorithm would be to create a graph with questions as nodes and an edge between each pair of questions whose answers are strongly positively correlated. Good dimensions will show up as clusters (dense subgraphs): just let the dimension be the sum of all questions in the cluster. Good subspaces can be found by finding partitions that cut through the fewest edges (similar to min-cut) while keeping the clusters roughly balanced in size.
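The simplest version of this idea can be sketched directly: threshold the question-by-question correlation matrix and take connected components as clusters. (The threshold of 0.5 and the toy data are arbitrary choices of mine; a real version would want min-cut-style balanced partitioning rather than plain components.)

```python
import numpy as np

def correlation_clusters(X, threshold=0.5):
    """Group questions whose pairwise correlation exceeds `threshold`:
    connected components of the thresholded correlation graph."""
    k = X.shape[1]
    C = np.corrcoef(X, rowvar=False)   # k x k matrix of question correlations
    parent = list(range(k))            # union-find over questions
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(k):
        for j in range(i + 1, k):
            if C[i, j] > threshold:    # edge => merge the two components
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(k):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def dimension_score(X, cluster):
    """A cluster's 'dimension' is just the sum of its questions' answers."""
    return X[:, cluster].sum(axis=1)

# toy data: Q0/Q1 always agree, Q2/Q3 always agree
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 1, 1]], dtype=float)
clusters = correlation_clusters(X)
```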

-

Conclusion: I need to take a machine learning class.

And by "class", I mean a good book.

And by "good", I mean "a book that answers my questions, without too much reading effort required".

(no subject)

Date: 2006-10-19 04:02 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
hey, how did you learn about PCA, being an amateur and all?

(no subject)

Date: 2006-10-19 04:15 pm (UTC)
From: [identity profile] marknau.livejournal.com
I like analyzing data, so I know just enough about a variety of statistical methods to be dangerous.
