3.7 Quantitative rigour

Rigour refers to the extent to which researchers strive to improve the quality of their study. In quantitative research, rigour is established by assessing validity and reliability. 55 These concepts affect the quality of the findings and their applicability to broader populations.


Validity refers to the accuracy of a measure. It is the extent to which a study or test accurately measures what it sets out to measure. There are three main types of validity – content, construct and criterion validity.

  • Content validity: Content validity examines whether the instrument adequately covers all aspects of the content that it should with respect to the variable under investigation. 56 This type of validity can be assessed through expert judgment and by examining the coverage of items or questions in the measure. 56 Face validity is a subset of content validity in which experts are consulted to determine whether a measurement tool accurately captures what it is supposed to measure. 56 Two common methods for testing content validity are the content validity index (CVI) and the content validity ratio (CVR). The CVI of an item is calculated as the number of experts rating the item “very relevant” divided by the total number of experts. Values range from 0 to 1: an item with a CVI above 0.79 is considered relevant, an item with a CVI between 0.70 and 0.79 needs revision, and an item with a CVI below 0.70 is eliminated. 57 The CVR ranges from −1 to 1; a higher score indicates greater agreement among panel members. It is calculated as CVR = (Ne – N/2)/(N/2), where Ne is the number of panellists rating an item as “essential” and N is the total number of panellists. 57 A study by Mousazadeh et al. (2017) investigated the content validity, face validity and reliability of the Sociocultural Attitudes Towards Appearance Questionnaire-3 (SATAQ-3) among female adolescents. 58 To ensure face validity, the questionnaire was given to 25 female adolescents, a psychologist and three nurses, who were asked to evaluate the items with respect to problems, ambiguity, relativity, proper terms and grammar, and understandability. For content validity, 15 experts in psychology and nursing were asked to assess the qualitative content validity. To determine the quantitative content validity, the content validity index and content validity ratio were calculated. 58
  • Construct validity: A construct is an idea or theoretical concept that is based on empirical observations but is not directly measurable. Examples of constructs include physical functioning and social anxiety. Construct validity therefore determines whether an instrument measures the underlying construct of interest and discriminates it from other related constructs. 55 It expresses how confident we can be that the instrument captures a particular construct. 55 This type of validity can be assessed using factor analysis or other statistical techniques. For example, Pinar (2005) evaluated the reliability and construct validity of the SF-36 in Turkish cancer patients. 59 The SF-36 is widely used to measure quality of life or health status in sick and healthy populations. Principal components factor analysis with varimax rotation confirmed the presence of the seven domains in the SF-36: physical functioning, role limitations due to physical and emotional problems, mental health, general health perception, bodily pain, social functioning, and vitality. It was concluded that the Turkish version of the SF-36 was a suitable instrument that could be employed in cancer research in Turkey. 59
  • Criterion validity: Criterion validity is the relationship between an instrument score and some external criterion. This criterion is considered the “gold standard” and has to be a widely accepted measure that shares the same characteristics as the assessment tool. 55 The validity of a new diagnostic test is judged on two principal measures: sensitivity and specificity. 60 Sensitivity refers to the probability that the test correctly identifies those with the disease, while specificity refers to the probability that the test correctly identifies those without the disease. 60 For example, the reverse transcriptase polymerase chain reaction (RT-PCR) test is the gold standard for COVID-19 testing, but its results take from several hours to days to become available. Rapid antigen tests are diagnostic tools that can be used at the point of care, with results available within 30 minutes. 61, 62 The validity of these rapid antigen tests was therefore determined against the gold standard. 61, 62 Two published articles that assessed the validity of the rapid antigen test reported sensitivities of 71.43% and 78.3% and specificities of 99.68% and 99.5%, respectively. 61, 62 This indicates that the tests were less effective in identifying those who have the disease but highly effective in identifying those who do not. While it is important to assess the accuracy of the instruments used, it is also imperative to determine whether the measure and findings are reliable.
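
The calculations mentioned in the list above (CVI, CVR, sensitivity and specificity) can be sketched in a few lines of Python. This is a minimal illustration rather than a validated tool: the function names and all example numbers are hypothetical, and the sensitivity and specificity formulas use the usual true/false positive and negative counts, which the text describes only in words.

```python
# Illustrative helpers only; names and example data are hypothetical.

def item_cvi(ratings, relevant_label="very relevant"):
    """Item-level CVI: proportion of experts who rated the item 'very relevant'."""
    if not ratings:
        raise ValueError("at least one expert rating is required")
    return sum(r == relevant_label for r in ratings) / len(ratings)


def classify_cvi(cvi):
    """Apply the cut-offs quoted in the text to a single item's CVI."""
    if cvi > 0.79:
        return "relevant"
    if cvi >= 0.70:
        return "needs revision"
    return "eliminate"


def content_validity_ratio(n_essential, n_panellists):
    """CVR = (Ne - N/2) / (N/2); ranges from -1 to 1."""
    if n_panellists <= 0:
        raise ValueError("the panel must have at least one member")
    half = n_panellists / 2
    return (n_essential - half) / half


def sensitivity(true_positives, false_negatives):
    """Proportion of people with the disease whom the test detects."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Proportion of people without the disease whom the test clears."""
    return true_negatives / (true_negatives + false_positives)


# Made-up example: 9 of 10 experts rate an item "very relevant".
ratings = ["very relevant"] * 9 + ["somewhat relevant"]
cvi = item_cvi(ratings)
print(cvi, classify_cvi(cvi))            # 0.9 relevant

# Made-up example: 12 of 15 panellists rate an item "essential".
print(content_validity_ratio(12, 15))    # 0.6

# Made-up results for a new test compared against a gold standard.
print(sensitivity(true_positives=80, false_negatives=20))    # 0.8
print(specificity(true_negatives=990, false_positives=10))   # 0.99
```

In this made-up example the test misses 20% of true cases while wrongly flagging only 1% of disease-free people, the same pattern of lower sensitivity and high specificity reported for the rapid antigen tests.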


Reliability refers to the consistency of a measure. It is the ability of a measure or test to reproduce a consistent result over time and across different observers. 55 A reliable measurement tool produces consistent results, even when different observers administer the test or when the test is conducted on different occasions. 55, 56 Reliability can be assessed by examining test-retest reliability, inter-rater reliability, and internal consistency.

  • Test-retest reliability: Test-retest reliability refers to the degree of consistency between the outcomes of the same test or measure taken by the same participants at different times. It estimates how consistent a measurement is when it is repeated. The intraclass correlation coefficient (ICC) is often used to determine test-retest reliability. 56 For example, a study evaluating the reliability of a new tool for measuring pain might administer the tool to a group of patients at two different time points and compare the results. If the results are consistent across the two time points, this would indicate that the tool has good test-retest reliability. However, it is important to note that reliability decreases when the time between administrations of the test is too long. An adequate interval between tests ranges from 10 to 14 days. 56 The article by Pinar (2005) demonstrated this by assessing test-retest stability using the ICC. The retest was conducted two weeks after the first test, as two weeks was considered to be the optimum retest interval. 59 This interval is long enough for participants to forget their initial responses but not so long that most health domains would change. 59
  • Inter-observer (between observers) reliability: Also known as inter-rater reliability, this is the level of agreement between two or more observers on the results of an instrument or test. It is the most commonly used method of assessing equivalence, that is, whether different observers obtain the same result. 55, 56 For example, a study evaluating the reliability of a new tool for measuring depression might have two different raters or observers independently score the same patient on the tool and then compare the results. If the results are consistent across the two raters, this would indicate that the tool has good inter-rater reliability. The kappa coefficient is a measure used to assess the agreement between the raters. 56 It can have a maximum value of 1.00; the higher the value, the greater the concordance between the raters. 56
  • Internal consistency: Internal consistency refers to the extent to which different items or questions in a test or questionnaire are consistent with one another. It is also known as homogeneity, which indicates whether each component of an instrument measures the same characteristic. 55 This type of reliability can be assessed by calculating Cronbach’s alpha (α) coefficient, which measures the correlation between different items or questions. Cronbach’s α is expressed as a number between 0 and 1, and a value of 0.7 or above is considered acceptable. 55 For example, Pinar (2005) assessed the reliability of the SF-36 using the internal consistency test (Cronbach’s α coefficient). The results showed that Cronbach’s α coefficient for the eight subscales of the SF-36 ranged between 0.79 and 0.90, confirming the internal consistency of the subscales. 59
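
Two of the reliability statistics named in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used in the cited studies: the text refers only to "the Kappa coefficient", so Cohen's kappa for two raters is assumed here; the Cronbach's α formula is the standard variance-based one, which the text does not spell out; and all function names and data are hypothetical. The ICC is left out because it has several variants and the text does not say which one applies.

```python
from collections import Counter
from statistics import variance


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of subjects")
    n = len(rater_a)
    # Proportion of subjects on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (observed - expected) / (1 - expected)


def cronbach_alpha(responses):
    """Cronbach's alpha; `responses` holds one list of k item scores per respondent."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) and of each respondent's total score.
    item_variances = [variance(item) for item in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up data: two raters classify ten patients (1 = depressed, 0 = not).
rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(round(cohen_kappa(rater_a, rater_b), 2))   # 0.6

# Made-up data: five respondents answer a three-item scale.
responses = [[3, 4, 3], [4, 5, 5], [2, 2, 3], [5, 5, 4], [1, 2, 1]]
print(round(cronbach_alpha(responses), 2))       # 0.95
```

In this made-up example the raters agree on 8 of 10 patients, yet kappa is only 0.6 because half of that agreement would be expected by chance. The α of 0.95 is above the 0.7 threshold quoted in the text.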

Now that you have an understanding of quantitative methodology, use the Padlet below to write a research question that can be answered quantitatively.




An Introduction to Research Methods for Undergraduate Health Profession Students Copyright © 2023 by Faith Alele and Bunmi Malau-Aduli is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.