Interpreting medical evidence

Last updated: September 6, 2023

CME information and disclosures

Physicians may earn CME/MOC credit by searching for an answer to a clinical question on our platform, reading content in this article that addresses that question, and completing an evaluation in which they report the question and the impact of what has been learned on clinical practice.

AMBOSS designates this Internet point-of-care activity for a maximum of 0.5 AMA PRA Category 1 Credit(s)™. Physicians should claim only credit commensurate with the extent of their participation in the activity.

For answers to questions about AMBOSS CME, including how to redeem CME/MOC credit, see "Tips and Links" at the bottom of this article.

Summary

Critical appraisal and evidence-based medicine involve the practical application of clinical epidemiology concepts in order to guide clinical decision-making. This requires an evaluation of the quality and applicability of existing research studies to individual clinical scenarios. Appropriate interpretation of the results of a research study in the right context requires a basic understanding of the following foundational concepts (found in the “Epidemiology” article): types of epidemiological studies (e.g., observational studies, experimental studies), common study designs (e.g., case series, cohort studies, case-control studies, randomized controlled trials), causal relationships in research studies, and other reasons for observed associations (e.g., random errors, systematic errors, confounding). This article focuses on an approach to critical appraisal and on epidemiological concepts often encountered in studies of clinical interventions, i.e., measures of association (e.g., relative risk, odds ratios, absolute risk reduction, number needed to treat), measures used to evaluate screening and diagnostic tests (e.g., sensitivity, specificity, positive predictive value, negative predictive value), precision, and validity.

The following concepts are discussed separately: measures of disease frequency (e.g., incidence rates, prevalence) commonly used in studies of population health, foundational statistical concepts (e.g., measures of central tendency, measures of dispersion, normal distribution, confidence intervals), and guidance on conducting research projects.

See also “Epidemiology,” “Statistical analysis of data,” and “Population health.”

Evidence appraisal

Evidence-based medicine ^[2]

Definition: The practice of medicine in which the physician uses clinical decision-making methods based on the best available current research from peer-reviewed clinical and epidemiological studies with the aim of producing the most favorable outcome for the patient.
Application in clinical practice
- Define the patient's clinical problem (can be formulated as a PICO question).
- Search for sources of information about the clinical problem.
- Perform a critical appraisal of relevant research studies.
- Apply the information
  - Before discussing the research findings with the patient, consider how and to what extent the researched options can improve patient care.
  - Present comprehensive but synthesized evidence to the patient using clear and understandable language.
- Engage in shared decision-making, considering the individual patient's risk profile and preferences.

Levels of evidence ^[3]^[4]

Definition: a method used in evidence-based medicine to determine the strength of the findings from a clinical and/or epidemiological study
Methods: Several different systems exist for assigning levels of evidence.

Levels of evidence ^[4]
Level		Source of evidence
I		Findings from at least one high-quality randomized controlled study
II	II.1	Findings from at least one high-quality, nonrandomized controlled study
	II.2	Findings from a case-control study or cohort study
	II.3	Findings from multiple time-series studies or important results from large uncontrolled studies
III		Expert opinions

Grades of clinical recommendation ^[5]

A system developed by the US Preventive Services Task Force (USPSTF) to rate clinical evidence and create guidelines for clinical practice based on medical evidence. ^[3]

Grades of Recommendation ^[5]
Grade	Net benefit	Level of certainty	Recommendation
A	Substantial	High	Recommended for patients
B	Moderate/substantial	High	Recommended for patients
C	Small	Moderate to high	Recommended only for certain patients
D	Zero/negative	Moderate to high	Not recommended/discouraged for patients
I	Cannot be determined	Low or lacking	Evidence is insufficient to assess the benefits and harms. Might be due to poor quality, conflicting evidence, or complete lack of evidence. Patients should fully understand the service being offered before accepting it.
Levels of certainty
- High: Further research is unlikely to influence the recommendation.
- Moderate: Further research may influence the recommendation.
- Low: Information is generally insufficient to assess harms and benefits.

Critical appraisal of research studies

Applications

Clinical practice (evidence-based medicine)
- Evaluation of the literature relevant to an individual patient's condition
- Review of updated guidelines on diagnosis and management of medical conditions
- Clinical decision-making
Research and academia
- Gathering background information for a research study
- Serving as a reviewer for a medical journal
- Participation in a journal club

Procedure

Perform an overall assessment and an in-depth analysis of the different study sections. ^[6]^[7]

Questions to ask when critically appraising a research paper ^[8]
Relevant questions to address
Overall assessment	Importance: Is the content relevant to patient care? How does this contribute to the existing literature? Novelty: Does the paper evaluate new diagnostic or therapeutic modalities? Does the paper evaluate existing diagnostic or therapeutic modalities in a new population or setting?
Title/abstract	What is the research question? Does the abstract appropriately summarize the main methods and results of the paper?
Introduction	Is the review of the prior literature appropriate/relevant? Are the study objectives/aims clearly stated? Are relevant hypotheses described?
Methods	Study design: What is the study design? Is the chosen study design the most appropriate for the research question? Participant selection: What were the study inclusion criteria? What were the study exclusion criteria? Are there potential sources of selection bias? Study procedures (vary by study type): For observational studies: How and when were data collected (e.g., surveys, electronic health records)? For RCTs: How were participants randomized? Was there blinding? What were the intervention and control procedures? For systematic reviews and meta-analyses: Are any important studies missing? Is adequate detail provided for specific studies? Are any summary measures included (for meta-analyses)? Data collection: What were the relevant exposures? What was the primary outcome? Were additional outcomes measured? Were data collected on potential confounders? Data analysis: What statistical tests were used, and were they appropriate?
Results	Population size: How many individuals participated in the study? What was the response rate (for survey studies)? How many participants were lost to follow-up (prospective cohort studies, RCTs)? Participant characteristics: What were the baseline characteristics of the participants? Were there significant differences between study groups? Analysis: Were appropriate effect sizes presented? Did the analyses adjust for appropriate confounding variables? Presentation: Were the results reported according to current guidelines (see EQUATOR Network reporting guidelines in “Tips and Links”)?
Discussion	Did the authors interpret their results in the context of the existing literature? Are the study conclusions appropriate based on the findings? Is the study generalizable? Are the study limitations appropriately addressed?
Other	Do the study authors have any relevant conflicts of interest? Who funded the study? Was the study reviewed and approved by an Institutional Review Board?

Reporting guidelines are available for different study types, e.g., CONSORT for randomized trials, STROBE for observational studies, and PRISMA for systematic reviews.

Measures of association

Measures of association can be used to quantify the strength of a relationship between two variables. See also “Measures of disease frequency.”

Two-by-two table

The degree of association between exposure and disease is typically evaluated using a two-by-two table, which compares the presence/absence of disease with the history of exposure to a risk factor.

Two-by-two table
	Disease (outcome)	No disease (no outcome)	Total
Exposure (risk factor)	a	b	a + b
No exposure (no risk factor)	c	d	c + d
Total	a + c	b + d	a + b + c + d

Risk

Risk factor: a variable or attribute that increases the probability of developing a disease or injury ^[9]
Absolute risk: the likelihood of an event occurring under specific conditions ^[3]
- Commonly expressed as a percentage
- Equal to the cumulative incidence, which can be calculated as follows: incidence rate × the time of follow-up
- Aim: to measure the probability of an individual in a study population developing an outcome
- Used in: cohort studies
- Formula: (number of new cases)/(total individuals in a study group) = (a + c)/(a + b + c + d)
Relative risk: See “Estimates of association strength.”
Attributable risk: See “Estimates of population impact.”

Formulas of common measures of association

Measures that help quantify the strength of association
- Relative risk (RR): (a/(a + b))/(c/(c + d))
- Odds ratio (OR): (a/c)/(b/d) = ad/bc
Measures that help quantify the impact of an association on a population
- Attributable risk (AR): a/(a + b) - c/(c + d)
- Absolute risk reduction (ARR): c/(c + d) – a/(a + b)
- Relative risk reduction (RRR): 1 - RR
- Number needed to treat (NNT): 1/ARR
- Number needed to harm (NNH): 1/AR
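
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a validated statistical tool: the function name `association_measures` and the cell counts are hypothetical, and zero cells (which would cause division by zero) are not handled.

```python
def association_measures(a, b, c, d):
    """Measures of association from a two-by-two table.

    a: exposed, outcome      b: exposed, no outcome
    c: unexposed, outcome    d: unexposed, no outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    ar = risk_exposed - risk_unexposed    # attributable risk
    arr = risk_unexposed - risk_exposed   # absolute risk reduction
    return {
        "RR": rr,
        "OR": (a * d) / (b * c),
        "AR": ar,
        "ARR": arr,
        "RRR": 1 - rr,
        # NNT is only meaningful when the exposure lowers risk (ARR > 0),
        # NNH only when it raises risk (AR > 0).
        "NNT": 1 / arr if arr > 0 else None,
        "NNH": 1 / ar if ar > 0 else None,
    }


# Hypothetical harmful exposure: 30/100 exposed vs. 10/100 unexposed develop the outcome.
harm = association_measures(a=30, b=70, c=10, d=90)
# Hypothetical treatment: 10/100 treated vs. 20/100 controls develop the outcome.
benefit = association_measures(a=10, b=90, c=20, d=80)

print(harm)
print(benefit)
```

In the first table the exposure triples the risk (RR of about 3, NNH of about 5); in the second the treatment halves it (RR of 0.5, NNT of about 10).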

Estimates of association strength

Relative risk (RR; risk ratio) ^[3]^[10]

Description: the likelihood of an outcome in one group exposed to a potential risk factor compared to the risk in another group that has not been exposed
Purpose
- To measure how strongly a risk factor is associated with an outcome (e.g., death, injury, disease)
- To help establish disease etiology
Used in: cohort studies and randomized controlled trials
Formula: (incidence of disease in exposed group)/(incidence of disease in unexposed group) = (a/(a + b))/(c/(c + d))
Interpretation
- RR = 1: Exposure neither increases nor decreases the risk of the defined outcome.
- RR > 1: Exposure increases the risk of the outcome.
- RR < 1: Exposure decreases the risk of the outcome.

Odds ratio (OR) ^[11]

Description
- Comparison of the odds of an event occurring in one group against the odds of an event occurring in another group
- Odds: the probability of an event occurring divided by the probability of this event not occurring
- Calculated using the two-by-two table
Purpose: to measure the strength of an association between a risk factor and an outcome
Used in: case-control studies
Formula
- Odds ratio of exposure: compares the odds of exposure among individuals with an outcome (e.g., disease) against the odds of exposure among individuals without an outcome
  - Odds of exposure in individuals with disease (i.e., case group): (exposure in individuals with disease)/(no exposure in individuals with disease) = a/c
  - Odds of exposure in individuals without disease (i.e., control group): (exposure in individuals without disease)/(no exposure in individuals without disease) = b/d
  - Odds ratio: (odds of exposure in individuals with disease)/(odds of exposure in individuals without disease) = (a/c)/(b/d) = ad/bc = (a/b)/(c/d)
Interpretation
- OR = 1: The outcome is equally likely in exposed and unexposed individuals.
- OR > 1: The outcome is more likely to occur in exposed individuals.
- OR < 1: The outcome is less likely to occur in exposed individuals.
Rare disease assumption
- Case-control studies do not track participants over time, so they cannot be used to calculate relative risk.
- However, the assumption can be made that if an outcome (e.g., disease prevalence) is rare, the incidence of that outcome is low and the OR is approximately the same as the RR.
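
The rare disease assumption can be illustrated numerically. The sketch below uses two hypothetical cohort-style tables (so that the RR is available for comparison) with the same risk ratio but very different outcome frequencies; the function names and counts are assumptions made for illustration only.

```python
def relative_risk(a, b, c, d):
    """RR = (a/(a + b)) / (c/(c + d))."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """OR = ad/bc."""
    return (a * d) / (b * c)


# Hypothetical tables: the outcome occurs in 40% vs. 20% (common)
# and in 0.4% vs. 0.2% (rare) of exposed vs. unexposed individuals.
common = dict(a=40, b=60, c=20, d=80)
rare = dict(a=4, b=996, c=2, d=998)

for label, table in (("common outcome", common), ("rare outcome", rare)):
    rr = relative_risk(**table)
    oratio = odds_ratio(**table)
    print(f"{label}: RR = {rr:.2f}, OR = {oratio:.2f}")
```

The RR is 2 in both tables, but the OR is about 2.67 for the common outcome and about 2.00 for the rare one, which is why the OR is only a reasonable stand-in for the RR when the outcome is rare.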

Hazard ratio (HR)

Description: a measure of the effect of an intervention on an outcome at any given point in time during the study period ^[12]^[13]
Purpose: to compare how quickly an event occurs in exposed individuals (e.g., the treatment group) and in unexposed individuals (e.g., the control group)
Used in: survival analysis
Formula: (observed number of events in exposed group/expected number of events in exposed group) at time (t) / (observed number of events in unexposed group/expected number of events in unexposed group) at time (t) ^[13]
Interpretation
- HR = 1: no relationship
- HR > 1: The outcome of interest is more likely to occur in exposed individuals.
- HR < 1: The outcome of interest is less likely to occur in exposed individuals.

The RR is the risk of an event occurring by the end of the study period (i.e., cumulative risk), while the HR is the risk of an event occurring at any point in time during the study period (i.e., instantaneous risk). ^[13]

The RR, OR, and HR are usually displayed with a corresponding p-value. By convention, they are considered statistically significant if the p-value is < 0.05.

Estimates of population impact

Attributable risk (AR) ^[14]

Description: the absolute difference between the risk of an outcome occurring in exposed individuals and unexposed individuals
Purpose: to measure the excess risk of an outcome that can be attributed to the exposure
Used in: cohort studies
Formulas
- Exposure AR: (incidence risk in exposed group) - (incidence risk in unexposed group) = a/(a + b) - c/(c + d)
- Population AR: (incidence risk in the study population) - (incidence risk in the unexposed group) = (a + c)/(a + b + c + d) - c/(c + d)

Attributable risk percent (ARP) ^[14]

Description: the proportion of disease incidence among exposed individuals that can be attributed to the risk factor
Purpose: to determine the proportion of cases in the exposed population that can be attributed to the risk factor
Used in: cohort studies and case-control studies
Formulas: ((incidence risk among exposed) - (incidence risk among unexposed))/(incidence risk among exposed) x 100
- ARP = (RR - 1)/RR x 100
- The RR cannot be calculated for case-control studies, so the OR (an estimate of the RR) can be used to calculate the attributable risk percent: ARP = (OR - 1)/OR x 100.
- Alternatively, ARP = AR/(incidence of disease in the exposed group) x 100 = (a/(a + b) – c/(c + d)) / (a/(a + b)) x 100
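
As a quick check that the risk-based and ratio-based ARP formulas agree, here is a small sketch with hypothetical risks of 30% in the exposed group and 10% in the unexposed group (the function names are illustrative, not standard).

```python
def arp_from_risks(risk_exposed, risk_unexposed):
    """ARP = (risk in exposed - risk in unexposed) / risk in exposed x 100."""
    return (risk_exposed - risk_unexposed) / risk_exposed * 100


def arp_from_ratio(ratio):
    """ARP = (RR - 1)/RR x 100; an OR can be passed instead when the outcome is rare."""
    return (ratio - 1) / ratio * 100


risk_exposed, risk_unexposed = 0.30, 0.10   # hypothetical incidence risks
rr = risk_exposed / risk_unexposed

print(f"ARP from risks: {arp_from_risks(risk_exposed, risk_unexposed):.1f}%")
print(f"ARP from RR:    {arp_from_ratio(rr):.1f}%")
```

Both routes give roughly 66.7%, i.e., about two-thirds of the cases among exposed individuals are attributable to the exposure in this made-up example.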

Relative risk reduction (RRR)

Description: the proportion by which an intervention reduces the risk in the exposure group relative to the risk in the nonexposure group
Purpose: to determine how much the treatment reduces the risk of negative outcomes
Used in: cohort studies and cross-sectional studies
Formulas
- 1 - RR
- Alternatively, RRR = ((incidence risk in unexposed group) - (incidence risk in exposed group))/(incidence risk of disease in the unexposed group) = (c/(c + d) - a/(a + b)) / (c/(c + d))
Example: Vaccine effectiveness is expressed as an RRR: (risk among unvaccinated - risk among vaccinated)/(risk among unvaccinated) × 100. ^[10]

Absolute risk reduction (ARR; risk difference)

Description: the difference between the risk in the exposure group after an intervention and the risk in the nonexposure group (e.g., risk of death)
Purpose: to show the risk without treatment as well as the risk reduction associated with treatment
Used in: cohort studies, cross-sectional studies, and clinical trials
Formula: (absolute risk in the unexposed group) - (absolute risk in the exposed group) = c/(c + d) - a/(a + b)

Number needed to treat (NNT)

Description
- The number of individuals that must be treated, in a particular time period, for one person to benefit from treatment (i.e., to not develop the disease)
- Inversely related to the effectiveness of a treatment
Purpose: to compare the effectiveness of different treatments
Used in: clinical trials
Formula: 1/ARR

Number needed to harm (NNH)

Description
- The number of individuals who need to be exposed to a certain risk factor before one person develops an outcome
- Directly correlates with the safety of the exposure (i.e., the higher the NNH, the safer the exposure)
Purpose: to determine the potential harms of an intervention
Used in: clinical trials
Formula: 1/AR

Number needed to screen (NNS)

Description: the number of individuals who need to be screened in a particular time period in order to prevent one death or adverse event ^[15]
Formula (same as NNT): 1/ARR
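
A minimal sketch of the NNT calculation from trial counts follows. Rounding up to the next whole person is a common reporting convention that is assumed here rather than taken from this article, and the function name and event counts are hypothetical.

```python
from fractions import Fraction
from math import ceil


def number_needed_to_treat(events_control, n_control, events_treated, n_treated):
    """NNT = 1/ARR, rounded up to the next whole person.

    Fractions keep the arithmetic exact, so 1/ARR is not pushed across
    an integer boundary by floating-point error before rounding up.
    """
    arr = Fraction(events_control, n_control) - Fraction(events_treated, n_treated)
    if arr <= 0:
        raise ValueError("ARR must be positive for the NNT to be meaningful")
    return ceil(1 / arr)


# Hypothetical trials with 100 participants per arm.
nnt_a = number_needed_to_treat(20, 100, 15, 100)  # ARR = 5 percentage points
nnt_b = number_needed_to_treat(12, 100, 9, 100)   # ARR = 3 percentage points
print(nnt_a, nnt_b)
```

The same arithmetic applies to the NNS, with screened and unscreened groups in place of treated and control groups.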

Evaluation of screening or diagnostic tests

Overview

Before a diagnostic modality (e.g., laboratory study, imaging study, diagnostic criteria) can be used in clinical practice, it needs to be determined how well the modality can distinguish between individuals with the disease and individuals without the disease.
A test is compared to the gold standard test using a two-by-two table.
A two-by-two table can be used to calculate a test's sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).

Features of a two-by-two table summarizing screening or diagnostic test results
	Disease	No disease	Interpretation
Positive test result	True positive (TP)	False positive (FP)	All subjects with positive test results (TP + FP) PPV = TP/(TP + FP)
Negative test result	False negative (FN)	True negative (TN)	All subjects with negative test results (FN + TN) NPV = TN/(FN + TN)
Interpretation	All subjects with disease (TP + FN) Sensitivity (true positive rate) = TP/(TP + FN) False negative rate = FN/(TP + FN)	All subjects without disease (FP + TN) Specificity (true negative rate) = TN/(FP + TN) False positive rate = FP/(FP + TN)	All subjects (TP + FP + FN + TN)

Two-by-two contingency table

Example 2 x 2 table of a diagnostic test ^[16]

Diagnostic test for tuberculosis (TB)
	Patients with TB	Patients without TB	Total
Positive test result	800 (TP)	400 (FP)	1200 (TP + FP)
Negative test result	200 (FN)	3600 (TN)	3800 (FN + TN)
Total	1000 (TP + FN)	4000 (FP + TN)	5000 (TP + FP + FN + TN)

Interpretation
- Sensitivity = TP/(TP + FN) = 800/(800 + 200) = 80%
- Specificity = TN/(FP + TN) = 3600/(400 + 3600) = 90%
- False positive rate = FP/(FP + TN) = 400/(400 + 3600) = 10%
- False negative rate = FN/(TP + FN) = 200/(800 + 200) = 20%
- PPV = TP/(TP + FP) = 800/(800 + 400) = 66.7%
- NPV = TN/(FN + TN) = 3600/(200 + 3600) = 94.7%
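
The interpretation above can be reproduced with a short script. The counts come from the tuberculosis example table; the function name `diagnostic_metrics` is an illustrative choice.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Test characteristics and predictive values from a two-by-two table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (tp + fn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (fn + tn),
    }


# Counts from the example table: 800 TP, 400 FP, 200 FN, 3600 TN.
tb = diagnostic_metrics(tp=800, fp=400, fn=200, tn=3600)
for name, value in tb.items():
    print(f"{name}: {value:.1%}")
```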

Pretest probability

Description: the probability that a patient has a specific disease before the result of the test is known
Features
- The pretest probability of a disease is determined by its prevalence in a particular group.
- A test subject's pretest probability affects posttest probabilities (i.e., NPV, PPV) but does not affect test characteristics.
  - A higher pretest probability decreases the NPV and increases the PPV.
  - A lower pretest probability increases the NPV and decreases the PPV.
Relation between pretest probability and odds
- Pretest probability = pretest odds /(pretest odds + 1)
- Pretest odds = pretest probability /(1 - pretest probability)
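
The two conversions are inverses of each other, as the following sketch shows (the function names are illustrative; the 20% pretest probability is a hypothetical value).

```python
def probability_to_odds(probability):
    """Odds = probability / (1 - probability)."""
    if not 0 <= probability < 1:
        raise ValueError("probability must be in the range [0, 1)")
    return probability / (1 - probability)


def odds_to_probability(odds):
    """Probability = odds / (odds + 1)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (odds + 1)


pretest_probability = 0.20                      # hypothetical: 1 in 5 patients has the disease
pretest_odds = probability_to_odds(pretest_probability)
print(f"odds = {pretest_odds:.2f}")             # 1 diseased for every 4 non-diseased
print(f"back to probability = {odds_to_probability(pretest_odds):.2f}")
```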

Test characteristics

Description
- The intrinsic properties of a test that do not change based on pretest probability
- Test characteristics include sensitivity, specificity, false positive rate, false negative rate, positive likelihood ratio, and negative likelihood ratio.

Sensitivity and specificity

Overview of sensitivity and specificity of screening and diagnostic tests
	Sensitivity (true positive rate)	Specificity (true negative rate)
Description	The proportion of individuals with the disease who actually test positive, i.e., P(positive test|disease) when expressed as a conditional probability.	The proportion of individuals without the disease who actually test negative, i.e., P(negative test|no disease) when expressed as a conditional probability.
Features	A test with high sensitivity yields a low false negative rate. Tests with high sensitivity are often used for screening purposes.	A test with high specificity yields a low false positive rate. Tests with high specificity can be used to confirm the diagnosis following a positive screening test.

A highly sensitive test can rule out a disease if negative, and a highly specific test can rule in a disease if positive.

Figures: cutoff value of a highly sensitive test; cutoff value of a highly specific test

Likelihood ratio ^[16]^[17]

Description
- A measure used to determine the utility of a diagnostic test in clinical practice
- Represents the probability of a test result in someone with the disease over the probability of the test result in someone without the disease
Interpretation
- Reflects how much more likely a disease is in a person with a given test result compared to their pretest probability
  - A likelihood ratio > 1 is associated with the presence of a disease.
  - A likelihood ratio < 1 is associated with absence of a disease.
  - If the likelihood ratio is 1, the posttest probability is similar to the pretest probability, and therefore the test has poor clinical utility.
- Likelihood ratio x pretest odds = posttest odds ^[16]^[17]
- A nomogram can also be used to convert pretest probability to posttest probability using likelihood ratios.
Types
- Positive likelihood ratio (LR⁺)
  - Ratio of the sensitivity rate (true positive rate) to the false positive rate
  - LR⁺ = (TP rate)/(FP rate) = sensitivity/(1 - specificity)
  - An LR⁺ > 10 indicates that the test is excellent at ruling in (confirming) a disease.
- Negative likelihood ratio (LR⁻)
  - Ratio between the false negative rate and the specificity (true negative rate)
  - LR⁻ = (FN rate)/(TN rate) = (1 - sensitivity)/specificity
  - An LR⁻ < 0.1 indicates that the test is excellent at ruling out (screening for) a disease.

When comparing diagnostic tests with similar sensitivity or tests with similar specificity, likelihood ratios are used to determine the relative clinical utility.
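
A minimal sketch of how likelihood ratios move a pretest probability to a posttest probability is shown below. It assumes a test with 80% sensitivity and 90% specificity and a pretest probability of 20% (the values implied by the tuberculosis example table); the function names are illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) for a test."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def posttest_probability(pretest_probability, likelihood_ratio):
    """Posttest odds = likelihood ratio x pretest odds, converted back to a probability."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = likelihood_ratio * pretest_odds
    return posttest_odds / (posttest_odds + 1)


lr_positive, lr_negative = likelihood_ratios(sensitivity=0.80, specificity=0.90)
after_positive = posttest_probability(0.20, lr_positive)
after_negative = posttest_probability(0.20, lr_negative)

print(f"LR+ = {lr_positive:.1f}, LR- = {lr_negative:.2f}")
print(f"probability of disease after a positive test: {after_positive:.1%}")
print(f"probability of disease after a negative test: {after_negative:.1%}")
```

With these inputs, the posttest probability after a positive result equals the PPV of the example table (about 66.7%), and the posttest probability after a negative result equals 1 - NPV (about 5.3%).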

Posttest probability (predictive value) ^[17]^[18]

Description: the probability that a patient has a particular disease after a diagnostic test is carried out, i.e., P(disease status|test result) when expressed as a conditional probability.
Features
- Combines pretest probability (e.g., based on disease prevalence) and test characteristics (e.g., sensitivity, specificity, likelihood ratios) to quantify the likelihood of a patient having a disease
- Can be determined using formulas or nomograms
- PPV, 1 - PPV, NPV, and 1 - NPV are posttest probabilities.
Relation between posttest probability and odds
- Posttest probability = posttest odds /(posttest odds + 1)
- Posttest odds = posttest probability /(1 – posttest probability)

Positive predictive value (PPV)

Description: the proportion of individuals who test positive for a disease who actually have the disease, i.e., P(disease|positive test) when expressed as a conditional probability
Features
- Directly correlates with pretest probability
- The PPV increases with increasing prevalence of a disease in the population. ^[19]
Formula
- PPV = TP/(TP + FP) (see “Overview of sensitivity and specificity of screening and diagnostic tests”)
- The probability that an individual who tested positive actually does not have the disease, i.e., P(no disease|positive test) = 1 - PPV
- PPV can also be calculated using test characteristics and pretest probability or pretest odds of the disease. ^[20]
  - PPV = sensitivity / [sensitivity + ((1 - specificity) / pretest odds)]
  - Alternatively, PPV = LR⁺ / [LR⁺+ (1/pretest odds)]

Negative predictive value (NPV)

Description: the proportion of individuals who test negative for a disease who actually do not have the disease, i.e., P(no disease|negative test) when expressed as a conditional probability
Features
- NPV inversely correlates with pretest probability.
- NPV decreases with increasing prevalence of the disease.
Formula
- NPV = TN/(FN + TN) (see “Overview of sensitivity and specificity of screening and diagnostic tests”)
- The probability that an individual who tested negative actually has the disease, i.e., P(disease|negative test) = 1 - NPV
- NPV can also be calculated using test characteristics and pretest probability or pretest odds of the disease. ^[20]
  - NPV = specificity / [specificity + ((1 - sensitivity) x pretest odds)]
  - Alternatively, NPV = 1 / [1 + (LR⁻ x pretest odds)]

Unlike sensitivity and specificity, which are determined solely by the diagnostic test itself, predictive values are also influenced by disease prevalence.
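
To make the effect of prevalence concrete, the sketch below holds sensitivity (80%) and specificity (90%) fixed and varies only the prevalence. The prevalence values and the function name are hypothetical choices for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence               # true positives per person tested
    fn = (1 - sensitivity) * prevalence         # false negatives
    fp = (1 - specificity) * (1 - prevalence)   # false positives
    tn = specificity * (1 - prevalence)         # true negatives
    return tp / (tp + fp), tn / (tn + fn)


for prevalence in (0.01, 0.20, 0.50):
    ppv, npv = predictive_values(0.80, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

As prevalence rises, the PPV rises and the NPV falls, even though the test itself is unchanged.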

Effect of prevalence on post-test probabilities

Cutoff values ^[16]

Definition: dividing points on measuring scales where the test results are divided into different categories
- Positive: has the condition of interest
- Negative: does not have the condition of interest
Features: Sensitivity, specificity, PPVs, and NPVs vary according to the criterion and/or the cutoff values of the data.
Interpretation: What happens when a cutoff value is raised or lowered depends on whether the test in question requires a high value (e.g., tumor marker for cancer, lipase for pancreatitis) or a low value (e.g., hyponatremia, agranulocytosis).
- Lowering or raising a cutoff value for a high value test:
  - Decreased cutoff value (i.e., broadening the inclusion criteria): lower specificity, higher sensitivity, lower PPV, higher NPV
  - Increased cutoff value (i.e., narrowing the inclusion criteria): higher specificity, lower sensitivity, higher PPV, lower NPV
- Lowering or raising a cutoff value for a low value test:
  - Decreased cutoff value (i.e., narrowed inclusion criteria): higher specificity, lower sensitivity, higher PPV (decrease in false positives > decrease in true positives), lower NPV (increase in false negatives > increase in true negatives)
  - Increased cutoff value (i.e., broadened inclusion criteria): lower specificity, higher sensitivity, lower PPV (increase in false positives > increase in true positives), higher NPV (decrease in false negatives > decrease in true negatives)
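
The effect of moving a cutoff can be sketched with a small made-up data set for a high value test, in which results at or above the cutoff count as positive. The marker values, cutoffs, and function name are all hypothetical, and cutoffs that produce no positive or no negative results (which would divide by zero) are not handled.

```python
def classify(diseased, healthy, cutoff):
    """Treat values >= cutoff as positive (a 'high value' test)."""
    tp = sum(value >= cutoff for value in diseased)
    fn = len(diseased) - tp
    fp = sum(value >= cutoff for value in healthy)
    tn = len(healthy) - fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


# Hypothetical marker values; the two groups overlap between 4 and 6.
diseased = [4, 5, 6, 7, 8, 9]
healthy = [1, 2, 3, 4, 5, 6]

low_cutoff = classify(diseased, healthy, cutoff=4)
high_cutoff = classify(diseased, healthy, cutoff=7)
print("cutoff 4:", low_cutoff)
print("cutoff 7:", high_cutoff)
```

Lowering the cutoff from 7 to 4 raises sensitivity and NPV while lowering specificity and PPV, matching the pattern described for a high value test.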

Figures: cutoff value of an optimal screening test; cutoff value of a highly sensitive test; cutoff value of a highly specific test

Receiver operating characteristic curve (ROC curve) ^[16]^[21]

Description: a graph that compares the sensitivity and specificity of a diagnostic test
Features
- Every diagnostic test generally involves a tradeoff between sensitivity and specificity.
- Sensitivity and specificity are inversely related: for a given test, as the sensitivity increases, the specificity decreases, and vice versa.
- The ROC curve shows the tradeoff between clinical sensitivity and specificity for every possible cutoff value, which allows the ability of the test to correctly diagnose subjects to be evaluated.
- The y-axis represents the sensitivity (i.e., true positive rate) and the x-axis corresponds to 1 - specificity (i.e., the false positive rate).
  - A test is considered more accurate the more closely the curve follows the y-axis.
  - A test is considered less accurate if the curve is closer to the diagonal.
- The area under the ROC curve (AUROC) can also be used for test comparison; the larger the AUROC, the more clinically useful the test. ^[22]
  - AUROC close to 1.0 indicates that the test has high combined sensitivity and specificity.
  - AUROC close to 0.5 indicates poor discriminative ability.
- Cutoff values
  - Normal ROC
    - Low cutoff value: low sensitivity (high FN) and high specificity (low FP)
    - High cutoff value: high sensitivity (low FN) and low specificity (high FP)
  - Inverse ROC
    - Low cutoff value: high sensitivity (low FN) and low specificity (high FP)
    - High cutoff value: low sensitivity (high FN) and high specificity (low FP)
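
A minimal sketch of how ROC points and the AUROC can be computed by hand is given below. It assumes a hypothetical marker for which values at or above the cutoff count as positive, sweeps every observed value as a cutoff, and integrates with the trapezoidal rule; the data and function names are made up for illustration.

```python
def roc_points(diseased, healthy):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0)]  # cutoff above every value: nobody tests positive
    for cutoff in cutoffs:
        tpr = sum(value >= cutoff for value in diseased) / len(diseased)
        fpr = sum(value >= cutoff for value in healthy) / len(healthy)
        points.append((fpr, tpr))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical marker values with partial overlap between the groups.
diseased = [4, 5, 6, 7, 8, 9]
healthy = [1, 2, 3, 4, 5, 6]

overlap_auc = auroc(roc_points(diseased, healthy))
perfect_auc = auroc(roc_points([5, 6], [1, 2]))
print(f"overlapping groups: AUROC = {overlap_auc:.3f}")
print(f"perfectly separated groups: AUROC = {perfect_auc:.3f}")
```

The overlapping groups give an AUROC of 0.875, while groups that do not overlap at all give 1.0.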

Receiver operating characteristic (ROC) curve

Screening tests

Used to identify disease in asymptomatic individuals (e.g., mammogram for breast cancer, Pap smear for cervical cancer)
Should have a low LR⁻ and a high sensitivity

Potential bias in studies evaluating screening tests
	Lead-time bias	Length-time bias
Description	A type of bias in which survival time is overestimated because of early diagnosis through screening and does not reflect an actual delay in mortality Lead time: the length of time between the initial detection of disease and the expected outcome (i.e., death or onset of clinical symptoms) Lead-time bias occurs when survival times are chosen as an endpoint of screening trials.	A type of bias in which survival time is overestimated because screening tests have a higher probability of detecting slowly progressive cases, which have a longer asymptomatic phase and better prognosis than rapidly progressive cases.
Example	A CT scan detects a malignant tumor earlier than a conventional x-ray. However, early treatment does not improve survival. Therefore, any apparent advantage in 5-year survival rates of patients diagnosed via CT scan in comparison to those diagnosed using x-ray is the result of lead-time bias.	Slow-growing tumors are typically less aggressive than fast-growing tumors and remain asymptomatic for a longer period of time. Therefore, the proportion of slow-growing tumors is overrepresented in screening tests. Because patients with slowly progressive disease have longer survival than those with fast-growing tumors, the benefits of screening are overestimated.
Solutions	Mortality rates rather than survival times are the gold standard for evaluating screening tests.	Use a randomized controlled trial to allocate subjects into screening and control groups.
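
Lead-time bias comes down to simple arithmetic, sketched here with hypothetical ages for a patient whose age at death is the same whether or not the disease is found by screening.

```python
def survival_after_diagnosis(age_at_diagnosis, age_at_death):
    """Survival time as it would be measured in a study: death minus diagnosis."""
    return age_at_death - age_at_diagnosis


# Hypothetical patient: early detection does not change the age at death.
AGE_AT_DEATH = 70
AGE_AT_SYMPTOMATIC_DIAGNOSIS = 67
AGE_AT_SCREENING_DIAGNOSIS = 63

survival_symptomatic = survival_after_diagnosis(AGE_AT_SYMPTOMATIC_DIAGNOSIS, AGE_AT_DEATH)
survival_screened = survival_after_diagnosis(AGE_AT_SCREENING_DIAGNOSIS, AGE_AT_DEATH)
lead_time = AGE_AT_SYMPTOMATIC_DIAGNOSIS - AGE_AT_SCREENING_DIAGNOSIS

print(f"survival after symptomatic diagnosis: {survival_symptomatic} years")
print(f"survival after screening diagnosis:   {survival_screened} years")
print(f"apparent gain explained by lead time: {lead_time} years")
```

Measured survival rises from 3 to 7 years, yet the patient dies at the same age; the entire 4-year difference is lead time, which is why mortality rates are preferred over survival times as an endpoint.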

Confirmatory tests

Confirms disease in individuals with signs or symptoms of the disease (e.g., biopsy for breast cancer or cervical cancer)
Usually performed after a screening test to confirm a diagnosis
Should have a high LR⁺ and a high specificity

Precision and validity

Precision (reliability) ^[3]^[23]

Definition: the reproducibility of test results on the same sample under similar conditions
Features
- A test with a high precision will have minimal random error.
- Precision improves with decreased standard deviation and increased power of a statistical test.
Methods of estimating precision
- Interrater reliability: the extent to which a test yields the same results when performed by different researchers
- Parallel test reliability: the extent to which two tests measuring the same concepts with different items or questions yield the same results when repeated on the same subjects
- Test-retest reliability: the extent to which a test yields the same results when repeated on the same subjects

Validity (accuracy) ^[3]

Definition: the correspondence between test results and what the test was developed to measure
Features
- A test with high validity will have minimal systematic error and bias.
- Sensitivity and specificity are measures of validity.
Types
- Internal validity
  - The extent to which a study is free of error (most often in the form of bias) and the results are therefore true for the study sample
  - High internal validity can be achieved by:
    - Controlling for age, sex, and other characteristics
    - Refining measurement instruments to reduce systematic errors (bias) to a minimum
- External validity
  - The extent to which study results can be extrapolated from a sample population to the general population (generalizability)
  - A study with high external validity has the following characteristics:
    - The study results can be reproduced in different sample groups.
    - High internal validity

Reliability and validity

Related One-Minute Telegram

One-Minute Telegram 82-2023-2/3: Don’t trust, always verify: AI generates fake medical citations

References

Masic I, Miokovic M, Muhamedagic B. Evidence Based Medicine - New Approaches and Challenges. Acta Informatica Medica. 2008; 16 (4): p.219. doi: 10.5455/aim.2008.16.219-225.
Raymond S. Greenberg. Medical Epidemiology: Population Health and Effective Health Care, 5th Edition. McGraw-Hill; 2015.
Burns PB, Rohrich RJ, Chung KC. The Levels of Evidence and Their Role in Evidence-Based Medicine. Plast Reconstr Surg. 2015; 128 (1): p.305-310. doi: 10.1097/prs.0b013e318219c171.
U.S. Preventive Services Task Force Procedure Manual. https://www.uspreventiveservicestaskforce.org/uspstf/sites/default/files/inline-files/procedure-manual_2016%20%281%29.pdf. Updated: December 1, 2015. Accessed: August 10, 2020.
Schulz KF, Altman DG, Moher D. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ. 2010; 340: p.c332. doi: 10.1136/bmj.c332.
Elm E von, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. Strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ. 2007; 335 (7624): p.806-808. doi: 10.1136/bmj.39335.541782.ad.
Young JM, Solomon MJ. How to critically appraise an article. Nat Clin Pract Gastroenterol Hepatol. 2009; 6 (2): p.82-91. doi: 10.1038/ncpgasthep1331.
Miquel Porta. A Dictionary of Epidemiology. Oxford University Press; 2014.
An Introduction to Applied Epidemiology and Biostatistics.
Szumilas M. Explaining odds ratios. Journal of the Canadian Academy of Child and Adolescent Psychiatry = Journal de l'Academie canadienne de psychiatrie de l'enfant et de l'adolescent. 2010; 19 (3): p.227-229.
Xue X, Xie X, Gunter M, et al. Testing the proportional hazards assumption in case-cohort analysis. BMC Med Res Methodol. 2013; 13 (1). doi: 10.1186/1471-2288-13-88.
Brody T. Biostatistics—Part I. Elsevier; 2016: p.203-226.
Kirkwood B, Sterne J. Essential Medical Statistics. Wiley-Blackwell; 2003.
Rembold CM. Number needed to screen: development of a statistic for disease screening. BMJ. 1998; 317 (7154): p.307-312. doi: 10.1136/bmj.317.7154.307.
Florkowski CM. Sensitivity, specificity, receiver-operating characteristic (ROC) curves and likelihood ratios: communicating the performance of diagnostic tests. Clin Biochem Rev. 2008; 29 (Suppl 1): p.S83-S87.
Parikh R, et al. Likelihood ratios: Clinical application in day-to-day practice. Indian J Ophthalmol. 2009; 57 (3): p.217. doi: 10.4103/0301-4738.49397.
Kanchanaraksa S. Evaluation of Diagnostic and Screening Tests: Validity and Reliability. The Johns Hopkins University Bloomberg School of Public Health. 2008.
Parikh R, Mathai A, Parikh S, Chandra Sekhar G, Thomas R. Understanding and using sensitivity, specificity and predictive values. Indian J Ophthalmol. 2008; 56 (1): p.45. doi: 10.4103/0301-4738.37595.
Webb MPK, Sidebotham D. Bayes' formula: a powerful but counterintuitive tool for medical decision-making. BJA Educ. 2020; 20 (6): p.208-213. doi: 10.1016/j.bjae.2020.03.002.
Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982; 143 (1): p.29-36. doi: 10.1148/radiology.143.1.7063747.
Ekelund S. ROC Curves—What are They and How are They Used? Point of Care: The Journal of Near-Patient Testing & Technology. 2012; 11 (1): p.16-21. doi: 10.1097/poc.0b013e318246a642.
Hulley SB, Cummings SR, Browner WS, Grady D, Newman TB. Designing Clinical Research. Lippincott Williams & Wilkins; 2013.
Contributor Disclosures - Interpreting medical evidence. All of the relevant financial relationships listed for the following individuals have been mitigated: Jan Schlebes (medical editor, is a shareholder in Fresenius SE & Co KGaA). None of the other individuals in control of the content for this article reported relevant financial relationships with ineligible companies. For details, please review our full conflict of interest (COI) policy.
