Regression, Probability, Sampling & Parametric Tests
This mathematically intensive unit transitions from descriptive statistics to predictive modeling and inferential statistics. It covers Linear Regression (curve fitting via the least squares method) to predict outcomes. It explores the mathematics of Probability and key distributions (Binomial, Normal, Poisson). Crucially, it introduces the core concepts of clinical research: Hypothesis Testing (Null vs. Alternative hypotheses, Type I and Type II errors), Sampling, and Parametric Tests like Student’s t-test and ANOVA to determine statistical significance.
Syllabus & Topics
- 1. Regression Analysis & Least Squares: Correlation shows relationships; Regression allows us to PREDICT the value of a dependent variable (y) based on an independent variable (x). Curve fitting by the Method of Least Squares: finds the line of best fit by minimizing the sum of the squares of the vertical deviations (errors) of the data points from the line. Equation of the regression line: y = a + bx (to predict y from x) or x = a + by (to predict x from y). ‘b’ is the regression coefficient (slope), ‘a’ is the intercept. Pharmaceutical Example: plotting a standard calibration curve for a drug (Absorbance ‘y’ vs. Concentration ‘x’) to predict unknown concentrations.
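The least squares calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a validated analytical method; the calibration readings below are made-up example values.

```python
# Least-squares fit of a calibration line y = a + b*x, then inverting it
# to estimate an unknown concentration from a measured absorbance.
# All data values are hypothetical, for illustration only.

def least_squares(x, y):
    """Return intercept a and slope b minimizing the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope b = S_xy / S_xx (covariance over variance of x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    a = mean_y - b * mean_x   # line passes through (mean_x, mean_y)
    return a, b

# Hypothetical calibration standards: concentration (mcg/mL) vs absorbance
conc = [2.0, 4.0, 6.0, 8.0, 10.0]
absorb = [0.21, 0.39, 0.61, 0.80, 1.01]

a, b = least_squares(conc, absorb)
# Predict the concentration of an unknown sample from its absorbance
unknown_abs = 0.50
unknown_conc = (unknown_abs - a) / b
print(f"y = {a:.3f} + {b:.3f}x, estimated concentration = {unknown_conc:.2f} mcg/mL")
```

Note that for the calibration use case the fitted line is inverted (solving for x given y), since the unknown is the concentration, not the absorbance.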
- 2. Probability & Distributions: Probability: the mathematical chance that a specific event will occur (ranges from 0 to 1). (1) Binomial Distribution: used for discrete events with only TWO possible outcomes (Success/Failure, e.g., patient cured or not cured). Properties defined by ‘n’ (trials) and ‘p’ (probability of success). (2) Poisson Distribution: used for rare, discrete events occurring in a fixed interval of time or space (e.g., number of bacterial mutations per million cells). Parameter is λ (mean rate). (3) Normal (Gaussian) Distribution: the most important continuous distribution. Symmetrical, bell-shaped curve. Mean = Median = Mode. 68% of data falls within 1 SD of the mean, 95% within 2 SD, 99.7% within 3 SD.
- 3. Populations, Samples & Hypothesis Testing: Population: the entire group you want to draw conclusions about (e.g., all diabetics in India). Sample: a smaller, representative subset of the population actually studied to save time and money. Null Hypothesis (H₀): the assumption of ‘no difference’ or ‘no effect’ (e.g., Drug A is equal in efficacy to Placebo). Alternative Hypothesis (H₁): what the researcher hopes to prove (e.g., Drug A is superior to Placebo). The goal of statistical testing is to gather enough evidence to REJECT the Null Hypothesis.
- 4. Errors in Hypothesis Testing: (1) Type I Error (False Positive / Alpha error, α): rejecting a Null Hypothesis that is actually TRUE. (You conclude the drug works when it’s actually just a placebo effect.) Usually set at α = 0.05 (5% risk). (2) Type II Error (False Negative / Beta error, β): accepting (failing to reject) a Null Hypothesis that is actually FALSE. (You conclude the drug doesn’t work, but in reality it does.) Standard Error of Mean (SEM): shows how closely the sample mean estimates the true population mean. SEM = SD / √n. It shrinks as sample size increases.
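The SEM formula is a one-liner; a short sketch using the stdlib `statistics` module, with invented blood-pressure readings, makes the SD vs. SEM distinction concrete.

```python
import statistics

# SEM = SD / sqrt(n): how precisely the sample mean estimates the population mean.
# Hypothetical systolic blood pressures (mmHg) from a sample of 9 patients.
bp = [118, 122, 125, 119, 130, 127, 121, 124, 126]
n = len(bp)
sd = statistics.stdev(bp)     # sample SD (uses the n - 1 denominator)
sem = sd / n ** 0.5
print(f"mean = {statistics.mean(bp):.1f}, SD = {sd:.2f}, SEM = {sem:.2f}")
# Because SEM shrinks as 1/sqrt(n), quadrupling the sample size halves the SEM.
```

The SD describes the spread of individual patients; the SEM describes the uncertainty of the mean itself, which is why it is the SEM that shrinks as more patients are enrolled.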
- 5. Parametric Tests & The t-test: Parametric tests rely on restrictive assumptions: data is normally distributed, variances are homogeneous, and data is continuous (interval/ratio). Gosset’s Student’s t-test (for small samples, n < 30): compares the MEANS of groups to see if they are statistically significantly different. Types: (1) Unpaired (Independent) t-test: compares means of TWO completely DIFFERENT groups (e.g., 10 rats given Drug A vs. 10 different rats given Placebo). (2) Paired t-test: compares means of the SAME group measured TWICE (before and after treatment, e.g., a patient’s blood pressure before medication vs. 2 weeks after medication).
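The unpaired t-test can be computed by hand from the group means and variances. This sketch uses the pooled-variance form (which assumes homogeneous variances, as parametric tests require); the rat data and the stated critical value of 2.101 for 18 df at α = 0.05 (two-tailed) are the only external inputs, and the data are invented for illustration.

```python
import statistics

# Unpaired (independent) two-sample t-test with pooled variance, computed by hand.
# Hypothetical data: fall in blood pressure (mmHg) in 10 rats on Drug A vs 10 on placebo.
drug    = [12, 15, 11, 14, 16, 13, 15, 12, 14, 13]
placebo = [ 8, 10,  7,  9, 11,  8, 10,  9,  7,  9]

n1, n2 = len(drug), len(placebo)
m1, m2 = statistics.mean(drug), statistics.mean(placebo)
v1, v2 = statistics.variance(drug), statistics.variance(placebo)

# Pooled variance: weighted average of the two sample variances
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / (pooled * (1 / n1 + 1 / n2)) ** 0.5
df = n1 + n2 - 2

# Two-tailed critical value for df = 18 at alpha = 0.05 is about 2.101
print(f"t = {t:.2f} on {df} df; significant at 5%: {abs(t) > 2.101}")
```

If |t| exceeds the critical value for the given degrees of freedom, the null hypothesis of equal means is rejected.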
- 6. Analysis of Variance (ANOVA): t-tests are limited to comparing TWO groups. If you have 3 or more groups (e.g., Drug A vs. Drug B vs. Placebo), running multiple t-tests increases the risk of a Type I error. Solution: ANOVA. It compares the variance BETWEEN the groups to the variance WITHIN the groups (the F-ratio). (1) One-Way ANOVA: tests ONE independent variable (e.g., analyzing blood pressure reduction across 3 different drug dosage groups). (2) Two-Way ANOVA: tests TWO independent variables simultaneously (e.g., analyzing drug efficacy across 3 different dosages AND across 2 different age groups). If ANOVA is significant (p < 0.05), you use Post-Hoc tests like the Least Significant Difference (LSD) test to determine EXACTLY which specific groups differ from each other.
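The between-vs-within logic of one-way ANOVA can be worked through by hand: partition the total sum of squares, divide by the degrees of freedom, and form the F-ratio. The dosage-group numbers below are invented, and the quoted critical value (about 3.89 for 2 and 12 df at α = 0.05) is a standard table value.

```python
import statistics

# One-way ANOVA by hand: compare variation BETWEEN group means with
# variation WITHIN groups. Hypothetical BP reduction (mmHg) in three
# dosage groups of 5 subjects each.
groups = {
    "low":    [5, 7, 6, 8, 6],
    "medium": [9, 11, 10, 12, 10],
    "high":   [14, 13, 15, 16, 14],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = statistics.mean(all_values)
k = len(groups)                 # number of groups
n_total = len(all_values)

# Between-group sum of squares: each group mean vs the grand mean
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups.values())
# Within-group sum of squares: each value vs its own group mean
ss_within = sum((v - statistics.mean(g)) ** 2
                for g in groups.values() for v in g)

ms_between = ss_between / (k - 1)        # df_between = k - 1 = 2
ms_within = ss_within / (n_total - k)    # df_within = N - k = 12
f_ratio = ms_between / ms_within
print(f"F({k - 1}, {n_total - k}) = {f_ratio:.1f}")
# If F exceeds the critical value (about 3.89 for df 2, 12 at alpha = 0.05),
# a post-hoc test such as LSD is then used to locate the differing pairs.
```

A large F means the group means are spread out far more than the noise within groups would explain, which is exactly the evidence ANOVA uses to reject the null hypothesis of equal means.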
Exam Prep Questions
Q1. What is the key difference between Correlation and Regression?
Correlation only describes the strength and direction of the relationship between two variables (e.g., taller people tend to weigh more). It does not imply cause and effect. Regression goes a step further: it establishes a mathematical equation (like y = a + bx). This allows you to PREDICT the value of the dependent variable (Y) based on a specific input of the independent variable (X).
Q2. Why do we always test the “Null Hypothesis” instead of just proving what we want to prove?
In statistics, it is far easier to quantify the evidence against a specific assumption than to prove a positive claim outright. The basis of scientific rigor is skepticism. We assume the treatment had NO effect (the Null Hypothesis). We then calculate the probability (p-value) of getting results at least as extreme as ours if that “no effect” assumption were true. If that probability is very small (usually < 5%, p < 0.05), we can confidently REJECT the null hypothesis and support our Alternative claim.
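This p-value logic can be made concrete with a tiny worked example: assume the null hypothesis, compute how likely a result at least as extreme as the observed one would be, and reject H₀ if that probability falls below 0.05. The patient numbers are hypothetical.

```python
import math

# p-value under the null: suppose 10 patients each have a 50% chance of
# improving regardless of treatment (H0), and 9 of the 10 actually improve.
# The one-tailed p-value is P(X >= 9) for X ~ Binomial(n=10, p=0.5).
n, p = 10, 0.5
p_value = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(9, n + 1))
print(f"P(9 or more of 10 improve under H0) = {p_value:.4f}")
# About 0.0107 < 0.05, so we would reject the null hypothesis.
```

Seeing 9 of 10 improvements is so unlikely under the “no effect” assumption that the assumption itself becomes untenable, which is the entire mechanism of hypothesis testing.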
Q3. When should I use a Paired t-test vs. an Unpaired t-test?
Look at the subjects. Are the two datasets coming from the EXACT SAME individuals/animals? If yes (e.g., measuring Mouse #1’s blood sugar on Monday, and then Mouse #1’s blood sugar on Friday after treatment), you use a Paired t-test.
If the datasets are from completely SEPARATE independent groups (e.g., 10 mice got the drug, 10 completely different mice got the control), you must use an Unpaired t-test.
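For the paired case, the trick is to reduce the two measurements per subject to a single column of differences and test whether the mean difference is zero. A minimal sketch with invented before/after blood-pressure readings:

```python
import statistics

# Paired t-test by hand: work on the within-subject differences.
# Hypothetical systolic BP (mmHg) for the SAME 8 patients, before and after treatment.
before = [150, 145, 160, 155, 148, 152, 158, 149]
after  = [142, 140, 150, 148, 144, 146, 150, 143]

diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)

# t = mean difference / (SD of differences / sqrt(n)), with n - 1 df
t = mean_d / (sd_d / n ** 0.5)
print(f"t = {t:.2f} on {n - 1} df")
# Two-tailed critical value for 7 df at alpha = 0.05 is about 2.365.
```

Pairing removes between-patient variability from the comparison, which is why a paired design can detect the same effect with far fewer subjects than an unpaired one.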
