[personal profile] gusl
How meaningful are the MBTI dimensions?

Wouldn't we be better off doing data mining on personality questionnaires, in order to find an optimal set of personality dimensions?

If we have a questionnaire with k questions, the space of possible answers is A^k, where A is the set of admissible answers for an individual question. We could simplify this and say that A = {0,1} (yes-or-no questions). The consequence is that the possible answers form a (discrete) hypercube.
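For concreteness, the vertices of that hypercube can be enumerated directly (a throwaway sketch; `k = 3` is just an illustrative choice):

```python
from itertools import product

k = 3  # three yes/no questions
# every possible completed questionnaire = a vertex of the k-dimensional hypercube
answers = list(product([0, 1], repeat=k))
```

There are 2^k such vertices, so the answer space blows up fast with questionnaire length.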

A personality dimension would be a linear combination of these answers.
A collection of personality dimensions spans a linear subspace of R^k (the real space in which the hypercube sits).
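In other words, a dimension is a weight vector over the questions, and a respondent's score on it is a dot product. A tiny sketch (the weights here are made up purely for illustration):

```python
# a "dimension" assigns a weight to each question; a respondent's score
# on that dimension is the dot product of weights and answers
weights = [1.0, -0.5, 0.0, 2.0]   # hypothetical weights for some trait
answers = [1, 0, 1, 1]            # one respondent's yes/no answers
score = sum(w * a for w, a in zip(weights, answers))
```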

The interesting question is:

Given an integer n and a sample of completed questionnaires, how do you find the optimal n-dimensional linear subspace? In general, you would define a set of variables that you want to predict. But let's say we want to predict all the variables equally well, i.e. we want the subspace that provides the least-lossy compression of the data (measured by, say, least squares). Since we're maximizing meaningful information over all possible n-dimensional subspaces, I wonder if this corresponds to minimizing entropy. But it seems that maximizing entropy would just maximize noise.
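As far as I can tell, the least-squares version of this is exactly what PCA computes: the top-n principal components of the centered data span the n-dimensional subspace minimizing total squared reconstruction error. A sketch with NumPy on made-up toy data (6 respondents, 4 yes/no questions):

```python
import numpy as np

def best_subspace(X, n):
    """Top-n principal directions of the centered data: the n-dimensional
    subspace minimizing total squared reconstruction error (PCA)."""
    Xc = X - X.mean(axis=0)                      # center each question
    # rows of Vt are orthonormal directions, ordered by variance explained
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n]                                # shape (n, k)

def reconstruct(X, dims):
    """Project the data onto the subspace and map back to answer space."""
    mu = X.mean(axis=0)
    return (X - mu) @ dims.T @ dims + mu

# toy data: 6 respondents, 4 yes/no questions
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1],
              [0, 0, 1, 0]], dtype=float)

dims = best_subspace(X, 2)
err = np.sum((X - reconstruct(X, dims)) ** 2)   # least-squares loss
```

The rows of `dims` are the candidate "personality dimensions": orthonormal weight vectors over the questions.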

Of course, the approach of linear combinations ignores complex interactions between the variables (e.g. given Q1, Q2 is positively correlated with Q3; but given ~Q1, Q2 and Q3 are negatively correlated). But we can always solve this problem by adding extra variables (e.g. a variable that equals "NOT (Q2 XOR Q3)", which measures whether Q2 and Q3 agree): I wonder whether all the logical relationships are preserved when you do the linear regression (under the Boolean extension from {0,1} to [0,1], where "AND" becomes multiplication). Another interesting question is: which logical dependencies are expressible with a set of conjunctions?
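Under that extension, XOR(a, b) = a + b - 2ab, so the agreement variable NOT(Q2 XOR Q3) has a closed form that is a polynomial in the original answers. A quick sanity check:

```python
def not_xor(q2, q3):
    # On {0,1}: equals 1 exactly when q2 == q3.
    # Multilinear extension to [0,1] (with AND as multiplication):
    # XOR(a, b) = a + b - 2ab, so NOT(XOR) = 1 - a - b + 2ab.
    return 1 - q2 - q3 + 2 * q2 * q3

# augment each questionnaire row with the interaction variable
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
augmented = [(q2, q3, not_xor(q2, q3)) for q2, q3 in rows]
```

The new column is a degree-2 polynomial, not a linear function, of Q2 and Q3; that is precisely why it has to be added as an extra variable before fitting anything linear.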

I'm now imagining that a good algorithm would be to create a graph with questions as nodes and an edge between each pair of questions whose answers are strongly positively correlated. Good dimensions will show up as clusters (dense subgraphs): just let the dimension be the sum of all questions in the cluster. Good subspaces can be found by finding partitions that cut through the fewest edges (similar to min-cut) while keeping the clusters roughly balanced in size.
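The simplest version of this idea can be sketched directly: threshold the question-by-question correlation matrix and take connected components as clusters. (The threshold of 0.5 and the toy data are arbitrary choices of mine; a real version would want min-cut-style balanced partitioning rather than plain components.)

```python
import numpy as np

def correlation_clusters(X, threshold=0.5):
    """Group questions whose pairwise correlation exceeds `threshold`:
    connected components of the thresholded correlation graph."""
    k = X.shape[1]
    C = np.corrcoef(X, rowvar=False)   # k x k matrix of question correlations
    parent = list(range(k))            # union-find over questions
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(k):
        for j in range(i + 1, k):
            if C[i, j] > threshold:    # edge => merge the two components
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(k):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def dimension_score(X, cluster):
    """A cluster's 'dimension' is just the sum of its questions' answers."""
    return X[:, cluster].sum(axis=1)

# toy data: Q0/Q1 always agree, Q2/Q3 always agree
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 1, 1]], dtype=float)
clusters = correlation_clusters(X)
```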

-

Conclusion: I need to take a machine learning class.

And by "class", I mean a good book.

And by "good", I mean "a book that answers my questions, without too much reading effort required".

(no subject)

Date: 2006-10-19 04:02 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
hey, how did you learn about PCA, being an amateur and all?

(no subject)

Date: 2006-10-19 04:15 pm (UTC)
From: [identity profile] marknau.livejournal.com
I like analyzing data, so I know just enough about a variety of statistical methods to be dangerous.
