Regression, Probability, Sampling & Parametric Tests
This mathematically intensive unit transitions from descriptive statistics to predictive modeling and inferential statistics. It covers Linear Regression (curve fitting via the least squares method) to predict outcomes. It explores the mathematics of Probability and key distributions (Binomial, Normal, Poisson). Crucially, it introduces the core concepts of clinical research: Hypothesis Testing (Null vs. Alternative hypotheses, Type I and Type II errors), Sampling, and Parametric Tests like Student’s t-test and ANOVA to determine statistical significance.
Syllabus & Topics
- 1. Regression Analysis & Least Squares: Correlation shows relationships; Regression allows us to PREDICT the value of a dependent variable (y) based on an independent variable (x). Curve fitting by the Method of Least Squares: finds the line of best fit by minimizing the sum of the squares of the vertical deviations (errors) of the data points from the line. Equation of the regression line: y = a + bx (to predict y from x) or x = a + by (to predict x from y). ‘b’ is the regression coefficient (slope), ‘a’ is the intercept. Pharmaceutical Example: plotting a standard calibration curve for a drug (Absorbance ‘y’ vs. Concentration ‘x’) to predict unknown concentrations.
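The least squares calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a validated analytical method; the calibration readings below are made-up example values.

```python
# Least-squares fit of a calibration line y = a + b*x, then inverting it
# to estimate an unknown concentration from a measured absorbance.
# All data values are hypothetical, for illustration only.

def least_squares(x, y):
    """Return intercept a and slope b minimizing the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope b = S_xy / S_xx (covariance over variance of x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    a = mean_y - b * mean_x   # line passes through (mean_x, mean_y)
    return a, b

# Hypothetical calibration standards: concentration (mcg/mL) vs absorbance
conc = [2.0, 4.0, 6.0, 8.0, 10.0]
absorb = [0.21, 0.39, 0.61, 0.80, 1.01]

a, b = least_squares(conc, absorb)
# Predict the concentration of an unknown sample from its absorbance
unknown_abs = 0.50
unknown_conc = (unknown_abs - a) / b
print(f"y = {a:.3f} + {b:.3f}x, estimated concentration = {unknown_conc:.2f} mcg/mL")
```

Note that for the calibration use case the fitted line is inverted (solving for x given y), since the unknown is the concentration, not the absorbance.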
- 2. Probability & Distributions: Probability: the mathematical chance that a specific event will occur (ranges from 0 to 1). (1) Binomial Distribution: used for discrete events with only TWO possible outcomes (Success/Failure, e.g., patient cured or not cured). Properties defined by ‘n’ (trials) and ‘p’ (probability of success). (2) Poisson Distribution: used for rare, discrete events occurring in a fixed interval of time or space (e.g., number of bacterial mutations per million cells). Parameter is λ (mean rate). (3) Normal (Gaussian) Distribution: the most important continuous distribution. Symmetrical, bell-shaped curve. Mean = Median = Mode. 68% of data falls within 1 SD of the mean, 95% within 2 SD, 99.7% within 3 SD.
- 3. Populations, Samples & Hypothesis Testing: Population: the entire group you want to draw conclusions about (e.g., all diabetics in India). Sample: a smaller, representative subset of the population actually studied to save time and money. Null Hypothesis (H₀): the assumption of ‘no difference’ or ‘no effect’ (e.g., Drug A is equal in efficacy to Placebo). Alternative Hypothesis (H₁): what the researcher hopes to prove (e.g., Drug A is superior to Placebo). The goal of statistical testing is to gather enough evidence to REJECT the Null Hypothesis.
- 4. Errors in Hypothesis Testing: (1) Type I Error (False Positive / Alpha error, α): rejecting a Null Hypothesis that is actually TRUE. (You conclude the drug works when it’s actually just a placebo effect.) Usually set at α = 0.05 (5% risk). (2) Type II Error (False Negative / Beta error, β): accepting (failing to reject) a Null Hypothesis that is actually FALSE. (You conclude the drug doesn’t work, but in reality it does.) Standard Error of Mean (SEM): shows how closely the sample mean estimates the true population mean. SEM = SD / √n. It shrinks as sample size increases.
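The SEM formula is a one-liner; a short sketch using the stdlib `statistics` module, with invented blood-pressure readings, makes the SD vs. SEM distinction concrete.

```python
import statistics

# SEM = SD / sqrt(n): how precisely the sample mean estimates the population mean.
# Hypothetical systolic blood pressures (mmHg) from a sample of 9 patients.
bp = [118, 122, 125, 119, 130, 127, 121, 124, 126]
n = len(bp)
sd = statistics.stdev(bp)     # sample SD (uses the n - 1 denominator)
sem = sd / n ** 0.5
print(f"mean = {statistics.mean(bp):.1f}, SD = {sd:.2f}, SEM = {sem:.2f}")
# Because SEM shrinks as 1/sqrt(n), quadrupling the sample size halves the SEM.
```

The SD describes the spread of individual patients; the SEM describes the uncertainty of the mean itself, which is why it is the SEM that shrinks as more patients are enrolled.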
- 5. Parametric Tests & The t-test: Parametric tests rely on restrictive assumptions: data is normally distributed, variances are homogeneous, and data is continuous (interval/ratio). Gosset’s Student’s t-test (for small samples, n < 30): compares the MEANS of groups to see if they are statistically significantly different. Types: (1) Unpaired (Independent) t-test: compares means of TWO completely DIFFERENT groups (e.g., 10 rats given Drug A vs. 10 different rats given Placebo). (2) Paired t-test: compares means of the SAME group measured TWICE (before and after treatment, e.g., a patient’s blood pressure before medication vs. 2 weeks after medication).
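The unpaired t-test can be computed by hand from the group means and variances. This sketch uses the pooled-variance form (which assumes homogeneous variances, as parametric tests require); the rat data and the stated critical value of 2.101 for 18 df at α = 0.05 (two-tailed) are the only external inputs, and the data are invented for illustration.

```python
import statistics

# Unpaired (independent) two-sample t-test with pooled variance, computed by hand.
# Hypothetical data: fall in blood pressure (mmHg) in 10 rats on Drug A vs 10 on placebo.
drug    = [12, 15, 11, 14, 16, 13, 15, 12, 14, 13]
placebo = [ 8, 10,  7,  9, 11,  8, 10,  9,  7,  9]

n1, n2 = len(drug), len(placebo)
m1, m2 = statistics.mean(drug), statistics.mean(placebo)
v1, v2 = statistics.variance(drug), statistics.variance(placebo)

# Pooled variance: weighted average of the two sample variances
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / (pooled * (1 / n1 + 1 / n2)) ** 0.5
df = n1 + n2 - 2

# Two-tailed critical value for df = 18 at alpha = 0.05 is about 2.101
print(f"t = {t:.2f} on {df} df; significant at 5%: {abs(t) > 2.101}")
```

If |t| exceeds the critical value for the given degrees of freedom, the null hypothesis of equal means is rejected.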
- 6. Analysis of Variance (ANOVA): t-tests are limited to comparing TWO groups. If you have 3 or more groups (e.g., Drug A vs. Drug B vs. Placebo), running multiple t-tests increases the risk of a Type I error. Solution: ANOVA. It compares the variance BETWEEN the groups to the variance WITHIN the groups (the F-ratio). (1) One-Way ANOVA: tests ONE independent variable (e.g., analyzing blood pressure reduction across 3 different drug dosage groups). (2) Two-Way ANOVA: tests TWO independent variables simultaneously (e.g., analyzing drug efficacy across 3 different dosages AND across 2 different age groups). If ANOVA is significant (p < 0.05), you use Post-Hoc tests like the Least Significant Difference (LSD) test to determine EXACTLY which specific groups differ from each other.
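The between-vs-within logic of one-way ANOVA can be worked through by hand: partition the total sum of squares, divide by the degrees of freedom, and form the F-ratio. The dosage-group numbers below are invented, and the quoted critical value (about 3.89 for 2 and 12 df at α = 0.05) is a standard table value.

```python
import statistics

# One-way ANOVA by hand: compare variation BETWEEN group means with
# variation WITHIN groups. Hypothetical BP reduction (mmHg) in three
# dosage groups of 5 subjects each.
groups = {
    "low":    [5, 7, 6, 8, 6],
    "medium": [9, 11, 10, 12, 10],
    "high":   [14, 13, 15, 16, 14],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = statistics.mean(all_values)
k = len(groups)                 # number of groups
n_total = len(all_values)

# Between-group sum of squares: each group mean vs the grand mean
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups.values())
# Within-group sum of squares: each value vs its own group mean
ss_within = sum((v - statistics.mean(g)) ** 2
                for g in groups.values() for v in g)

ms_between = ss_between / (k - 1)        # df_between = k - 1 = 2
ms_within = ss_within / (n_total - k)    # df_within = N - k = 12
f_ratio = ms_between / ms_within
print(f"F({k - 1}, {n_total - k}) = {f_ratio:.1f}")
# If F exceeds the critical value (about 3.89 for df 2, 12 at alpha = 0.05),
# a post-hoc test such as LSD is then used to locate the differing pairs.
```

A large F means the group means are spread out far more than the noise within groups would explain, which is exactly the evidence ANOVA uses to reject the null hypothesis of equal means.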
Exam Prep Questions
Q1. What is the key difference between Correlation and Regression?
Correlation only describes the strength and direction of the relationship between two variables (e.g., taller people tend to weigh more). It does not imply cause and effect. Regression goes a step further: it establishes a mathematical equation (like y = a + bx). This allows you to PREDICT the value of the dependent variable (Y) based on a specific input of the independent variable (X).
Q2. Why do we always test the “Null Hypothesis” instead of just proving what we want to prove?
In statistics, it is far easier to quantify the evidence against a specific assumption than to prove a positive claim outright. The basis of scientific rigor is skepticism. We assume the treatment had NO effect (the Null Hypothesis). We then calculate the probability (p-value) of getting results at least as extreme as ours if that “no effect” assumption were true. If that probability is very small (usually < 5%, p < 0.05), we can confidently REJECT the null hypothesis and support our Alternative claim.
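This p-value logic can be made concrete with a tiny worked example: assume the null hypothesis, compute how likely a result at least as extreme as the observed one would be, and reject H₀ if that probability falls below 0.05. The patient numbers are hypothetical.

```python
import math

# p-value under the null: suppose 10 patients each have a 50% chance of
# improving regardless of treatment (H0), and 9 of the 10 actually improve.
# The one-tailed p-value is P(X >= 9) for X ~ Binomial(n=10, p=0.5).
n, p = 10, 0.5
p_value = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(9, n + 1))
print(f"P(9 or more of 10 improve under H0) = {p_value:.4f}")
# About 0.0107 < 0.05, so we would reject the null hypothesis.
```

Seeing 9 of 10 improvements is so unlikely under the “no effect” assumption that the assumption itself becomes untenable, which is the entire mechanism of hypothesis testing.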
Q3. When should I use a Paired t-test vs. an Unpaired t-test?
Look at the subjects. Are the two datasets coming from the EXACT SAME individuals/animals? If yes (e.g., measuring Mouse #1’s blood sugar on Monday, and then Mouse #1’s blood sugar on Friday after treatment), you use a Paired t-test.
If the datasets are from completely SEPARATE independent groups (e.g., 10 mice got the drug, 10 completely different mice got the control), you must use an Unpaired t-test.
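For the paired case, the trick is to reduce the two measurements per subject to a single column of differences and test whether the mean difference is zero. A minimal sketch with invented before/after blood-pressure readings:

```python
import statistics

# Paired t-test by hand: work on the within-subject differences.
# Hypothetical systolic BP (mmHg) for the SAME 8 patients, before and after treatment.
before = [150, 145, 160, 155, 148, 152, 158, 149]
after  = [142, 140, 150, 148, 144, 146, 150, 143]

diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)

# t = mean difference / (SD of differences / sqrt(n)), with n - 1 df
t = mean_d / (sd_d / n ** 0.5)
print(f"t = {t:.2f} on {n - 1} df")
# Two-tailed critical value for 7 df at alpha = 0.05 is about 2.365.
```

Pairing removes between-patient variability from the comparison, which is why a paired design can detect the same effect with far fewer subjects than an unpaired one.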
