If you’ve taken any number of personality tests online, at some point you’ve probably wondered what’s going on under the hood. Who writes these tests? What sort of qualifications do they have? And how accurate are these things, really?

Since personality tests have caught on as a handy tool for everything from online marketing to voter manipulation, you can kill an almost limitless amount of time taking soul-searching quizzes on the internet. While we expect that two minutes spent on a BuzzFeed quiz might pay off in some fun, shareable tidbit, on some level we also know that there’s a difference between “What Muppet Are You?” and the sort of serious assessment that we might use to, say, choose a career. But what’s that difference, exactly? And how do we know which assessments we can actually expect to be accurate?

Developing an accurate personality test is a time-consuming and technical process. Ideally, the people who write personality tests will have a deep knowledge of their subject matter and will spend considerable amounts of time creating their assessment and analyzing their results. Most of the quizzes you’ll find online have been written in a few hours by someone who was only guessing at what they were doing. In contrast, tests with real predictive value are the result of considerable effort and expertise.

Test development is complex, but it also follows a step-by-step process. Understanding this process can help you to understand how all those personality tests you’ve been taking actually work—and which ones you can rely on for accurate information about yourself.

So buckle up. You’re about to get an inside look into how a personality test is made.

What will we measure?

Although it may sound obvious, the first step in creating a personality assessment is figuring out what you’re actually going to be assessing. More specifically, we want to identify what are called constructs—specific characteristics or phenomena that we can define and measure. Example constructs might include dominance, anxiety, emotionality, or motivation. 

Some constructs can be measured directly, like a person's weight or blood pressure. Unfortunately, in psychology, things are rarely so simple. Personality traits can’t be measured in any concrete way, and they are often much harder to define. Even experts in traits like extraversion can have trouble agreeing on the definition of what they’re studying! Though it’s futile to try to achieve perfect clarity, a solid test development process always begins with an effort to define the attribute to be assessed as clearly as humanly possible.

If your assessment is based on an existing theory of personality, your constructs are defined for you. For instance, Myers and Briggs theorized that the four key constructs for describing personality were Extraversion/Introversion, Sensing/Intuition, Thinking/Feeling, and Judging/Perceiving. So, creating a personality test based on Myers and Briggs’ theory means developing a deep knowledge of these constructs so that you can understand how you might measure them.

On the other hand, some assessments are built from scratch, without any established theory to define the constructs to be used. For instance, a corporation might wish to develop an assessment that predicts employee success. We might imagine that personal characteristics like motivation, cooperation, intelligence, and creativity would all be important for success on the job. Each of these ideas would need to be fully fleshed out and defined as a distinct construct to be assessed—before we even begin to think about creating the assessment itself.

How will we measure it?

The next step is to decide what our assessment will look like. Some personality assessments use multiple choice questions. Others ask test-takers to rate their agreement with various statements on a point-based scale. Others get even more creative, asking respondents to choose a favorite picture from a selection of options, or put words or statements in order based on how well they apply.

Whatever the format of the test items, the process begins by creating a sort of master list of possible items. Usually, test developers use previous research to help them create items that will relate to their constructs. For instance, we know that Extraversion has to do with one’s energy level when interacting with people, so we might create an item that asks respondents to rate their level of agreement with the statement, “I feel energized when socializing in groups.”

Writing good test items is exceptionally difficult, and often only 10-20% of the items created will pass muster when they are subjected to statistical testing. For this reason, in the initial development stages, we create many more items than we anticipate including in the final assessment.
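To get a feel for what that pass rate implies, here is a small back-of-the-envelope sketch in Python. The 60-item target is a made-up figure for illustration, and the function name is hypothetical.

```python
import math

def items_to_draft(final_items: int, pass_rate_percent: int) -> int:
    """Estimate how many draft items are needed so that about
    `final_items` survive, given the share expected to pass."""
    return math.ceil(final_items * 100 / pass_rate_percent)

# Hypothetical goal: a 60-item final assessment.
target = 60
print(items_to_draft(target, 20))  # optimistic case: 20% of items pass
print(items_to_draft(target, 10))  # pessimistic case: 10% of items pass
```

In other words, a modest final assessment can easily call for several hundred draft items.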

Who will take it?

Some personality assessments are created for specific populations, like students at a school or employees at a company. Others are developed for use by the public at large. Whatever the target audience, now that we have some items, it’s time to give them a test run with the intended population. In this step, test developers recruit as many people as possible to complete the items they have created. The more data they can collect, the better.

How well does it work?

We’ve decided what we want to measure. We’ve written items. And we’ve asked some volunteers to respond to those items. Now it’s time to see if our assessment is actually working.

This is the step that separates the professionals from the amateurs. Most anyone with an interest in psychology can think up an idea for a personality quiz. They can probably write some test questions. And, if they have some technical skills, they can even get that quiz up on the internet for other people to take.

But does that test measure what it’s supposed to? Probably not. Even professional test developers are rarely able to just throw together an assessment that works. The difference is that the pros have methods for analyzing and improving tests until they actually produce meaningful results.

There are two key concepts in test development: reliability and validity. In order to have any real value, an assessment must be tested and proven in both areas.

A reliable assessment will produce consistent results. If a person takes a reliable personality assessment in January and then again in July, their scores will be roughly the same.

A valid assessment measures the construct that it is intended to measure. If a well-respected CEO takes an assessment of her leadership ability, we would expect that she would receive high scores. If her scores were low, we would question the test’s validity because it would not be consistent with our real-life observations.

To conceptualize reliability and validity, imagine that you have just bought a new scale and decide to try it out by weighing yourself 5 times in a row. If it is reliable, it will produce the same weight reading each time. If it is valid, it will show weight readings that are close to your actual weight. A scale that is valid but not reliable will give readings that are close to your true weight, but different each time. A scale that is reliable but not valid will always show the same weight reading, but it will always be wrong.
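The scale analogy can be put into numbers. In this rough Python sketch, the true weight and all of the readings are invented for illustration. The average error (bias) stands in for validity, and the spread of the readings stands in for reliability.

```python
from statistics import mean, pstdev

TRUE_WEIGHT = 150.0  # hypothetical true weight, in pounds

# Five made-up readings from three imaginary scales.
scales = {
    "reliable and valid": [150.1, 149.9, 150.0, 150.1, 149.9],
    "valid, not reliable": [144.0, 156.0, 147.0, 153.0, 150.0],
    "reliable, not valid": [162.0, 162.1, 161.9, 162.0, 162.0],
}

def describe(readings, true_value):
    """Return (bias, spread): how far off the readings are on average,
    and how much they vary from one weighing to the next."""
    bias = mean(readings) - true_value
    spread = pstdev(readings)
    return bias, spread

results = {name: describe(r, TRUE_WEIGHT) for name, r in scales.items()}
for name, (bias, spread) in results.items():
    print(f"{name}: bias={bias:+.2f} lb, spread={spread:.2f} lb")
```

A small bias with a large spread is the "valid but not reliable" scale. A large bias with a small spread is the "reliable but not valid" one.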

How do we determine these factors for a personality test? The simple, if somewhat boring, answer is that we have several statistical tests that we apply to the data we’ve gathered after trying out our assessment on a sample group.

Correlation is a measure of whether two factors tend to occur together. For instance, height is correlated with weight, because taller people usually weigh more. Measures of correlation are used in many ways in test development to test both reliability and validity. For instance, if scores on a test of earning potential are found to be correlated with real-life income, that would indicate the test is valid. If we were formulating a test of extraversion, we might check if our scores correlate with the number of friends a person has.

Test-retest correlation is a measure of how well two sets of scores correlate when the same group of people has taken the same assessment on two separate occasions. A high test-retest correlation indicates good reliability.
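Both kinds of check come down to the same calculation. Below is a minimal Python sketch that uses a hand-written Pearson correlation. All of the scores, and the variable names, are made up for illustration and are not real data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: -1 (perfect inverse) through 0 (no linear
    relationship) to +1 (perfect positive relationship)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Validity check: do extraversion scores track a real-world outcome?
extraversion_scores = [12, 18, 25, 31, 36, 40]  # six hypothetical people
number_of_friends = [2, 3, 3, 5, 6, 8]
validity_r = pearson_r(extraversion_scores, number_of_friends)

# Reliability check: the same six people, tested twice.
january_scores = [22, 35, 28, 41, 30, 19]
july_scores = [24, 33, 29, 40, 27, 21]
retest_r = pearson_r(january_scores, july_scores)

print(f"validity correlation:    {validity_r:.2f}")
print(f"test-retest correlation: {retest_r:.2f}")
```

Values near 1 suggest the two sets of numbers rise and fall together. Values near 0 suggest they are unrelated.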

Cronbach’s alpha is a measure of reliability that looks at whether items that are supposed to be related conceptually are correlated in actual sample data. The calculation takes some work, but interpreting it is simple. Alpha usually falls between zero and one, with higher values indicating that the items hang together better (it can dip below zero when items are negatively related to one another). Most test developers are satisfied when alpha is .8 or above.
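For the curious, the standard formula is alpha = k / (k - 1) × (1 - sum of the item variances / variance of the total scores), where k is the number of items. Here is a minimal Python sketch of that formula. The five respondents and their answers are invented for illustration.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one row per respondent, one column per item.
    Returns Cronbach's alpha for the set of items."""
    k = len(responses[0])                      # number of items
    item_columns = list(zip(*responses))       # regroup answers by item
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_scores = [sum(row) for row in responses]
    total_variance = variance(total_scores)
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Made-up answers to three related items, rated 1 (disagree) to 5 (agree).
sample = [
    [5, 4, 5],
    [4, 4, 4],
    [2, 1, 2],
    [3, 3, 2],
    [1, 2, 1],
]
alpha = cronbach_alpha(sample)
print(f"alpha = {alpha:.2f}")
```

Because people who score high on one item here tend to score high on the others, alpha comes out high. Items that moved independently of each other would pull it toward zero.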

Factor analysis is a complex way to analyze test items to figure out how they are organized conceptually. To understand factor analysis, imagine that you have made up 100 personality test questions, completely at random. You have no idea how many constructs you are testing for, or what those constructs are; you’ve simply written some questions that you think might be interesting to have people answer. Once you’ve had people answer them, though, you begin to wonder what your assessment might be measuring.

A factor analysis will show you how all the items you’ve written may be conceptually grouped. It will determine how many factors there are in your assessment (that is, how many potential constructs it might be measuring). And it will determine which items correlate with each factor. So, your factor analysis might group the following items:

  • I give money to the needy
  • I volunteer for charity
  • I am eager to help my neighbors

Seeing this, you might logically conclude that your assessment measures a construct related to altruism, and that these three items could be a potential starting point for an altruism scale.
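A real factor analysis needs specialized statistical software, but the intuition behind it can be sketched with something much cruder: grouping items whose answers correlate strongly with one another. The Python below is that simplified stand-in, not factor analysis itself. The two "sociability" items, the eight respondents, their answers, and the 0.7 cutoff are all assumptions made for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def group_items(items, threshold=0.7):
    """Crude grouping: items linked by a correlation at or above
    `threshold` (directly or through a chain) land in the same group."""
    names = list(items)
    unassigned = set(names)
    groups = []
    for name in names:
        if name not in unassigned:
            continue
        unassigned.discard(name)
        group, stack = [], [name]
        while stack:
            current = stack.pop()
            group.append(current)
            for other in names:
                if other in unassigned and \
                        pearson_r(items[current], items[other]) >= threshold:
                    unassigned.discard(other)
                    stack.append(other)
        groups.append(sorted(group))
    return groups

# Made-up answers from eight respondents, rated 1 to 5.
items = {
    "I give money to the needy":            [5, 5, 4, 4, 2, 2, 1, 1],
    "I volunteer for charity":              [5, 4, 4, 3, 2, 2, 1, 2],
    "I am eager to help my neighbors":      [4, 5, 4, 4, 2, 1, 1, 1],
    "I enjoy large parties":                [5, 1, 4, 2, 5, 1, 4, 2],
    "I start conversations with strangers": [5, 2, 4, 2, 4, 1, 4, 1],
}
groups = group_items(items)
for number, group in enumerate(groups, start=1):
    print(f"Group {number}: {group}")
```

With this invented data, the three helping items end up in one group and the two social items in another, which mirrors how a factor analysis might separate an altruism factor from a sociability factor.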

In actual test development, factor analysis can clarify the constructs that are assessed by a test, or even help us to discover new ones. For instance, we used factor analysis to discover and define the 23 facets of personality type measured in the TypeFinder Personality Assessment.

All these methods require expertise and advanced software, as well as plenty of time to test, revise, and test again. Most test developers spend many months developing items, administering them to sample populations, running statistical tests, and then revising based on those results before finally coming up with an assessment that has been proven valid and reliable. It’s not a process that can be done casually or quickly—which is why a good personality assessment can be hard to find.

So the next time you take a personality quiz and wonder, “What the heck does that result mean?” you’ll know: Probably nothing. But if you take a quiz that tells you to move to Paris, marry your true love, or become a famous rock star, then hey—don’t let us stop you. Sometimes, the best thing a personality test can do is tell us what we already know.


Molly Owens
Molly Owens is the founder and CEO of Truity. She is a graduate of UC Berkeley and holds a master's degree in counseling psychology. She began working with personality assessments in 2006, and in 2012 founded Truity with the goal of making robust, scientifically validated assessments more accessible and user-friendly. Molly is an ENTP and lives in the San Francisco Bay Area, where she enjoys elaborate cooking projects, murder mysteries, and exploring with her husband and son.