Achieving a Better Understanding of Reliability and Validity

March 3, 2022

by Pam Boney


Without a grasp of concepts like maximum torque, number of cylinders, engine displacement, and horsepower, it would be difficult for someone looking for a new vehicle to compare and contrast different brands and models while searching for their optimal new ride.  For instance, understanding the difference between 750 horsepower and 700 horsepower enables the consumer to imagine the added power of the stronger engine.  However, torque may be more important than horsepower to a person interested in using their vehicle to tow a substantial weight. Determining exactly what attributes are important for the purchase decision can take a surprising amount of study and consideration. This assertion about the utility of being familiar with product attributes — and the relative importance of each of those attributes for the needs at hand — holds true for just about any product (including personality assessments). 

For personality assessments such as those created by Tilt 365, two primary characteristics that convey some of the strengths of the given assessment are “validity” and “reliability.”  To fully understand why our company has been rated a Top 20 company in the Training Industry for multiple years, one must have knowledge of these two psychometric concepts.  As with most words, these terms can have drastically different meanings when used in everyday conversation as opposed to in more formal settings (e.g., where they are being used by a professor to defend the merits of a particular exam format).

Definitions and Background for Each Concept

An accessible example for illustration purposes is the scale at the doctor’s office.  If someone knows that they weigh 150 pounds (perhaps by having recently used numerous other scales and having received that same number each time), but the scale says they weigh 300 pounds, we have pretty good reason to believe that this scale’s measurement is invalid.  But if the same person steps on a different scale and it reads 150 pounds, we have much less reason to doubt that the measure is accurate, and thus we are more willing to think that the scale is serving as a valid measurement tool.

Now let’s suppose the same person eats well for a month and neither gains nor loses any weight.  When they get weighed again after those thirty days and the scale reads “150 pounds,” we have witnessed the scale demonstrating reliability (a synonym for this type of reliability would be “consistency”; in technical language, this is called Test-Retest Reliability).  Even if the scale read 300 pounds the first time and then still read 300 pounds after the thirty days of eating well, we could still say it demonstrates reliability (even though the reading is clearly invalid/inaccurate).

This demonstrates an interesting nuance between the two terms:  A measurement can be reliable but not necessarily valid (as just discussed above), but a measurement cannot be both valid and unreliable. Continuing our scale example: Suppose one steps on a scale and gets a reading of 150, then steps off and back on and gets a reading of 300, and then off and on again for a reading of -25. We would have great reason to conclude that the scale is unreliable, because it doesn’t give the same number each time. We can therefore also conclude that the scale is not valid, because a valid scale would show the correct weight each time (more or less). This holds even though the scale showed the “correct” weight the first time that we tried it.
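For readers who like to see ideas in code, the three scales described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `describe_scale` and the 5-pound `tolerance` cutoff are made up for this example, and real psychometric work relies on far more careful statistics than a simple cutoff.

```python
from statistics import mean, pstdev

TRUE_WEIGHT = 150  # pounds; the known weight from the example above


def describe_scale(readings, true_value=TRUE_WEIGHT, tolerance=5):
    """Return (is_reliable, is_valid) for repeated readings from one scale.

    `tolerance` is an arbitrary cutoff in pounds, chosen only for illustration.
    """
    # Reliability: do the repeated readings agree with each other?
    spread = pstdev(readings)
    is_reliable = spread <= tolerance

    # Validity: is the average reading close to the true weight?
    # A scale that is not reliable is not counted as valid either.
    bias = abs(mean(readings) - true_value)
    is_valid = is_reliable and bias <= tolerance

    return is_reliable, is_valid


good_scale = [150, 150, 150]     # consistent and correct
stuck_scale = [300, 300, 300]    # consistent but wrong
erratic_scale = [150, 300, -25]  # right once, but all over the place

for name, readings in [("good", good_scale),
                       ("stuck", stuck_scale),
                       ("erratic", erratic_scale)]:
    reliable, valid = describe_scale(readings)
    print(f"{name}: reliable={reliable}, valid={valid}")
```

The "stuck" scale comes out reliable but not valid, and the "erratic" scale comes out neither reliable nor valid, even though its first reading happened to be correct.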

If we extend this to an academic setting, we want a math test to be written in such a way that someone who understands 95% of the content will get approximately, for instance, a 95 on the test.  And if they take the same test a month later (while not forgetting anything learned and also while not learning any more of the content), we want them to get roughly a 95 again (i.e., we want the test to be valid/accurate and reliable/consistent).
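One common way to put a number on this kind of consistency is to correlate the scores from the two sittings: the closer the correlation is to 1, the more consistently the test ranks the same students. The sketch below computes a Pearson correlation by hand. The scores are invented purely for illustration, and this is a general textbook approach, not a description of how any particular assessment (including ours) is evaluated.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from each mean
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Hypothetical scores for five students on the same math test,
# taken one month apart with no learning or forgetting in between.
first_sitting = [95, 82, 70, 88, 61]
second_sitting = [94, 84, 68, 90, 63]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(f"Test-retest correlation: {test_retest_r:.2f}")
```

With these made-up numbers the correlation comes out close to 1, which is what we would hope to see from a consistent test when nothing about the students has changed.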

Test-Retest Reliability and Change

In both the weight and the math-test example above, you may have noticed an important stipulation: We are assuming that nothing about the underlying state of the individual (specifically, their weight or their understanding of math content) is changing. As you can imagine, determining whether a measure is reliable (at least in this particular way) is much trickier when we allow for or even expect changes in the underlying attribute being measured: If we change our diet and exercise routines for several months and then return to the scale, we would be suspicious of the scale’s reliability if it didn’t read a different value. Likewise, we would raise our eyebrows if a diligent student received the same score on the same assessment of math understanding at the beginning and at the end of the (developmentally appropriate) math course on that topic.

There are many different kinds of reliability and many ways of demonstrating evidence for the validity of a measure, and we’ve only skimmed the surface above. Because of that variety, there is less difficulty than you might think in dealing with questions such as these: If we don’t have a trusted, preexisting measure, how can we even determine what the “correct” value of some attribute is? And, as we’ve raised here, if we expect an attribute to change over time, how can we assess a measure’s reliability? Suffice it to say: These are nuanced and complicated topics, and, as with the car example at the opening of this discussion, it can take a great deal of knowledge to even know how to ask the right questions to get at the information that we need.

Settings in Which Either (or Both) are Useful and/or Necessary

In an academic setting, creating an invalid and/or unreliable exam might cause extreme frustration and tank a student’s GPA.  But if we are dealing with an instrument designed to set off an alarm when carbon monoxide levels are getting too high in an enclosed factory, an invalid and/or unreliable instrument can be the difference between life and death for workers.  In such a situation, determining validity and reliability for the instrument involves experiments where CO levels are adjusted.  In the context of personality assessments, there is a wealth of research and methodology (beyond the scope of this blog) for determining whether a given assessment is valid and/or reliable.   


The Validity and Reliability of Tilt Assessments

Extensive, rigorous testing was conducted on each of Tilt365's currently available strengths assessment tools (i.e., the True Tilt Personality Profile™, the Positive Influence Predictor™, and the Team Climate Profile™) prior to deployment.  Furthermore, just as automobile companies periodically release new versions of vehicle models, Tilt’s science team is constantly working to make its assessments ever more robust.

If our assessments are truly reliable (in the scientific sense discussed in this blog), you will get different results after there have been changes within yourself and/or within your team.  If people have what we call a growth mindset, they are constantly trying to be ever more agile so that they have a better notion of what behavior is appropriate or necessary in any given scenario.  Beyond the simple assessments of reliability and validity that we discuss above, we are continuously monitoring and checking our assessments to ensure that, as appropriate, they will reflect intrapersonal changes as you hone and cultivate your strengths over time.


When learning advanced mathematics or the intricate mechanics of a luxury sedan, the hardest part can be achieving a grasp of new terms and concepts.  As mentioned earlier in this blog, this post was written for the exact purpose of enlightening site visitors about a couple of the words we use extensively (“validity” and “reliability”). 

Unlike many of the topics that we cover in our blogs, primary academic references aren’t really necessary or relevant to the broad, general information provided in this current post.  For readers wanting to explore the information in this blog in a deeper manner, please refer to a solid textbook on statistics, such as Validity and Reliability: 2016 Edition, or peer-reviewed articles, such as Golafshani (2003), Lakshmi and Mohideen (2013), or Bannigan and Watson (2009).