Which method involves administering the same test twice to the same group?

Reliability tells you how consistently a method measures something. When you apply the same method to the same sample under the same conditions, you should get the same results. If not, the method of measurement may be unreliable or bias may have crept into your research.

There are four main types of reliability. Each can be estimated by comparing different sets of results produced by the same method.

Type of reliability | Measures the consistency of…
Test-retest | The same test over time.
Interrater | The same test conducted by different people.
Parallel forms | Different versions of a test which are designed to be equivalent.
Internal consistency | The individual items of a test.

Test-retest reliability

Test-retest reliability measures the consistency of results when you repeat the same test on the same sample at a different point in time. You use it when you are measuring something that you expect to stay constant in your sample.

A test of color blindness for trainee pilot applicants should have high test-retest reliability, because color blindness is a trait that does not change over time.

Why it’s important

Many factors can influence your results at different points in time: for example, respondents might experience different moods, or external conditions might affect their ability to respond accurately.

Test-retest reliability can be used to assess how well a method resists these factors over time. The smaller the difference between the two sets of results, the higher the test-retest reliability.

How to measure it

To measure test-retest reliability, you conduct the same test on the same group of people at two different points in time. Then you calculate the correlation between the two sets of results.
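
As a minimal sketch of this calculation, the Python snippet below computes a Pearson correlation between two administrations of the same test. The scores and the helper name `pearson_r` are hypothetical and chosen only for illustration. Other correlation measures may suit your data better.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Assumes both lists vary; a constant list would make this undefined.
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five participants at two points in time.
scores_time_1 = [98, 105, 110, 121, 93]
scores_time_2 = [101, 103, 112, 118, 95]

test_retest_r = pearson_r(scores_time_1, scores_time_2)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A value close to 1 would suggest that participants kept roughly the same relative standing across the two administrations.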

Test-retest reliability example

You devise a questionnaire to measure the IQ of a group of participants (a property that is unlikely to change significantly over time). You administer the test two months apart to the same group of people, but the results are significantly different, so the test-retest reliability of the IQ questionnaire is low.

Improving test-retest reliability

When designing tests or questionnaires, try to formulate questions, statements, and tasks in a way that won’t be influenced by the mood or concentration of participants.
When planning your methods of data collection, try to minimize the influence of external factors, and make sure all samples are tested under the same conditions.
Remember that participants can be expected to change over time and that recall bias may occur; take both into account.

Interrater reliability

Interrater reliability (also called interobserver reliability) measures the degree of agreement between different people observing or assessing the same thing. You use it when data is collected by researchers assigning ratings, scores or categories to one or more variables, and it can help mitigate observer bias.

In an observational study where a team of researchers collect data on classroom behavior, interrater reliability is important: all the researchers should agree on how to categorize or rate different types of behavior.

Why it’s important

People are subjective, so different observers’ perceptions of situations and phenomena naturally differ. Reliable research aims to minimize subjectivity as much as possible so that a different researcher could replicate the same results.

When designing the scale and criteria for data collection, it’s important to make sure that different people will rate the same variable consistently with minimal bias. This is especially important when there are multiple researchers involved in data collection or analysis.

How to measure it

To measure interrater reliability, different researchers conduct the same measurement or observation on the same sample. Then you calculate the correlation between their different sets of results. If all the researchers give similar ratings, the test has high interrater reliability.
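
When ratings are categories rather than numeric scores, a correlation is not always a natural fit. The sketch below swaps in Cohen's kappa, a statistic not named in this article, which compares two raters' observed agreement with the agreement expected by chance. The category labels, the ratings and the function name `cohens_kappa` are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who categorized the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters chose the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical classroom-behavior codes from two observers for eight intervals.
observer_1 = ["on-task", "off-task", "on-task", "on-task",
              "off-task", "on-task", "off-task", "on-task"]
observer_2 = ["on-task", "off-task", "on-task", "off-task",
              "off-task", "on-task", "off-task", "on-task"]

kappa = cohens_kappa(observer_1, observer_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

For numeric rating scales, correlating the raters' scores as described above remains a reasonable starting point.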

Interrater reliability example

A team of researchers observe the progress of wound healing in patients. To record the stages of healing, rating scales are used, with a set of criteria to assess various aspects of wounds. The results of different researchers assessing the same set of patients are compared, and there is a strong correlation between all sets of results, so the test has high interrater reliability.

Improving interrater reliability

Clearly define your variables and the methods that will be used to measure them.
Develop detailed, objective criteria for how the variables will be rated, counted or categorized.
If multiple researchers are involved, ensure that they all have exactly the same information and training.

Parallel forms reliability

Parallel forms reliability measures the correlation between two equivalent versions of a test. You use it when you have two different assessment tools or sets of questions designed to measure the same thing.

Why it’s important

If you want to use multiple different versions of a test (for example, to avoid respondents repeating the same answers from memory), you first need to make sure that all the sets of questions or measurements give reliable results.

In educational assessment, it is often necessary to create different versions of tests to ensure that students don’t have access to the questions in advance. Parallel forms reliability means that, if the same students take two different versions of a reading comprehension test, they should get similar results in both tests.

How to measure it

The most common way to measure parallel forms reliability is to produce a large set of questions to evaluate the same thing, then divide these randomly into two question sets.

The same group of respondents answers both sets, and you calculate the correlation between the results. High correlation between the two indicates high parallel forms reliability.
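
The procedure above can be sketched roughly as follows: split a pool of items at random into two forms, total each respondent's score on each form, and correlate the totals. The response matrix, the fixed seed and the helper names (`split_into_forms`, `form_totals`, `pearson_r`) are assumptions made for illustration only.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_into_forms(n_items, seed=0):
    """Randomly divide item indices into two forms of (near-)equal size."""
    item_ids = list(range(n_items))
    random.Random(seed).shuffle(item_ids)  # seeded so the split is repeatable
    half = n_items // 2
    return sorted(item_ids[:half]), sorted(item_ids[half:])


def form_totals(responses, item_ids):
    """Total score per respondent over the given items."""
    return [sum(row[i] for i in item_ids) for row in responses]


# Hypothetical data: 6 respondents (rows) answering 6 items (columns), scored 1-5.
responses = [
    [5, 4, 5, 4, 5, 5],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 4],
    [1, 2, 1, 1, 2, 1],
    [4, 4, 4, 5, 4, 4],
    [2, 3, 2, 3, 3, 2],
]

form_a, form_b = split_into_forms(n_items=6)
parallel_forms_r = pearson_r(form_totals(responses, form_a),
                             form_totals(responses, form_b))
print(f"form A items: {form_a}, form B items: {form_b}")
print(f"parallel forms correlation: {parallel_forms_r:.2f}")
```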

Parallel forms reliability example

A set of questions is formulated to measure financial risk aversion in a group of respondents. The questions are randomly divided into two sets, and the respondents are randomly divided into two groups. Both groups take both tests: group A takes test A first, and group B takes test B first. The results of the two tests are compared, and the results are almost identical, indicating high parallel forms reliability.

Improving parallel forms reliability

Ensure that all questions or test items are based on the same theory and formulated to measure the same thing.

Internal consistency

Internal consistency assesses the correlation between multiple items in a test that are intended to measure the same construct.

You can calculate internal consistency without repeating the test or involving other researchers, so it’s a good way of assessing reliability when you only have one data set.

Why it’s important

When you devise a set of questions or ratings that will be combined into an overall score, you have to make sure that all of the items really do reflect the same thing. If responses to different items contradict one another, the test might be unreliable.

To measure customer satisfaction with an online store, you could create a questionnaire with a set of statements that respondents must agree or disagree with. Internal consistency tells you whether the statements are all reliable indicators of customer satisfaction.

How to measure it

Two common methods are used to measure internal consistency.

Average inter-item correlation: For a set of measures designed to assess the same construct, you calculate the correlation between the results of all possible pairs of items and then calculate the average.
Split-half reliability: You randomly split a set of measures into two sets. After testing the entire set on the respondents, you calculate the correlation between the two sets of responses.
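
Both methods can be illustrated with a short sketch. The snippet below computes the average inter-item correlation across all item pairs and a split-half correlation from one random split. The data, the seed and the function names are hypothetical. The split-half value depends on which random split is drawn, so treat a single split as a rough estimate.

```python
import random
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_inter_item_correlation(responses):
    """Mean Pearson correlation over all possible pairs of items."""
    n_items = len(responses[0])
    columns = [[row[i] for row in responses] for i in range(n_items)]
    pairs = list(combinations(columns, 2))
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)


def split_half_correlation(responses, seed=0):
    """Correlation between respondents' totals on two random halves of the items."""
    item_ids = list(range(len(responses[0])))
    random.Random(seed).shuffle(item_ids)  # seeded so the split is repeatable
    half = len(item_ids) // 2
    first, second = item_ids[:half], item_ids[half:]
    totals_first = [sum(row[i] for i in first) for row in responses]
    totals_second = [sum(row[i] for i in second) for row in responses]
    return pearson_r(totals_first, totals_second)


# Hypothetical data: 6 respondents (rows) rating 6 statements (columns) from 1 to 5.
responses = [
    [5, 4, 5, 4, 5, 5],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 4],
    [1, 2, 1, 1, 2, 1],
    [4, 4, 4, 5, 4, 4],
    [2, 3, 2, 3, 3, 2],
]

avg_r = average_inter_item_correlation(responses)
split_r = split_half_correlation(responses)
print(f"average inter-item correlation: {avg_r:.2f}")
print(f"split-half correlation: {split_r:.2f}")
```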

Internal consistency example

A group of respondents are presented with a set of statements designed to measure optimistic and pessimistic mindsets. They must rate their agreement with each statement on a scale from 1 to 5. If the test is internally consistent, an optimistic respondent should generally give high ratings to optimism indicators and low ratings to pessimism indicators. The correlation is calculated between all the responses to the “optimistic” statements, but the correlation is very weak. This suggests that the test has low internal consistency.

Improving internal consistency

Take care when devising questions or measures: those intended to reflect the same concept should be based on the same theory and carefully formulated.

Which type of reliability applies to my research?

It’s important to consider reliability when planning your research design, collecting and analyzing your data, and writing up your research. The type of reliability you should calculate depends on the type of research and your methodology.

What is my methodology? | Which form of reliability is relevant?
Measuring a property that you expect to stay the same over time. | Test-retest
Multiple researchers making observations or ratings about the same topic. | Interrater
Using two different tests to measure the same thing. | Parallel forms
Using a multi-item test where all the items are intended to measure the same variable. | Internal consistency

If possible and relevant, you should statistically calculate reliability and state this alongside your results.

Frequently asked questions about types of reliability

What’s the difference between reliability and validity?

Reliability and validity are both about how well a method measures something:

Reliability refers to the consistency of a measure (whether the results can be reproduced under the same conditions).
Validity refers to the accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research, you also have to consider the internal and external validity of your experiment.

How can I minimize observer bias in my research?

You can use several tactics to minimize observer bias.

Use masking (blinding) to hide the purpose of your study from all observers.
Triangulate your data with different data collection methods or sources.
Use multiple observers and ensure interrater reliability.
Train your observers to make sure data is consistently recorded between them.
Standardize your observation procedures to make sure they are structured and clear.

Why are reproducibility and replicability important?

Reproducibility and replicability are related terms.

A successful reproduction shows that the data analyses were conducted in a fair and honest manner.
A successful replication shows that the reliability of the results is high.

Why is bias in research a problem?

Research bias affects the validity and reliability of your research findings, leading to false conclusions and a misinterpretation of the truth. This can have serious implications in areas like medical research where, for example, a new form of treatment may be evaluated.

Cite this Scribbr article

Middleton, F. (2022, November 30). The 4 Types of Reliability in Research | Definitions & Examples. Scribbr. Retrieved January 2, 2023, from https://www.scribbr.com/methodology/types-of-reliability/

What is test

Test-Retest Reliability: Used to assess the consistency of a measure from one time to another. Parallel-Forms Reliability: Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.

What is it called when a test is administered at two different points in time?

Test-retest reliability is measured by administering a test twice at two different points in time. This type of reliability assumes that there will be no change in the quality or construct being measured.

What are the 3 types of reliability?

Reliability refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).