Unit 1: Introduction, Central Tendency, Dispersion & Correlation

Semester 8

BP801T

This unit builds the foundation of statistical analysis. It introduces statistics, biostatistics, and frequency distributions, then covers how to summarize data using Measures of Central Tendency (Mean, Median, Mode) and how to quantify variability using Measures of Dispersion (Range, Variance, Standard Deviation), with pharmaceutical examples. It concludes with Correlation: measuring the strength and direction of the linear relationship between two variables using Karl Pearson’s coefficient, followed by a brief introduction to multiple correlation.

Syllabus & Topics

1. Introduction to Statistics & Biostatistics: Statistics: The science of collecting, organizing, presenting, analyzing, and interpreting numerical data to make decisions. Biostatistics: The application of statistical principles to biological, medical, and public health data. Applications in Pharmacy: (1) Evaluating the efficacy of a new drug vs. placebo. (2) Analyzing quality control data (tablet weights, dissolution times) in manufacturing. (3) Determining shelf-life and stability. (4) Epidemiological studies on disease prevalence.
2. Frequency Distribution: Frequency: The number of times a particular value or category occurs in a dataset. Frequency Distribution Table: Organizes data into classes or categories with their corresponding frequencies. Key terms: (1) Class Limits: The lowest and highest values that can fall in a class. (2) Class Interval (Width): Difference between the upper and lower class boundaries. (3) Class Mark (Midpoint): Average of the lower and upper class limits. Uses: Condenses large amounts of raw data into a manageable table, which is often the basis for a histogram.
3. Measures of Central Tendency – Mean, Median, Mode: Central Tendency: A single value that describes a dataset by identifying its central position. (1) Arithmetic Mean (Average): Sum of all observations divided by the number of observations (x̄ = Σx / n). Advantage: Uses all data points. Disadvantage: Highly affected by outliers. (2) Median: The middle value when the data are arranged in ascending or descending order. If n is even, it is the average of the two middle values. Advantage: Not affected by extreme outliers, so it is preferred for skewed data. (3) Mode: The most frequently occurring value in the dataset. A dataset can be bimodal (two modes) or have no mode. Pharmaceutical Example: Calculating the mean weight of 20 paracetamol tablets sampled from a batch.
4. Measures of Dispersion – Range & Standard Deviation: Dispersion: The spread, variability, or scatter of the data around the central value. The mean alone is insufficient; two datasets can have the same mean but vastly different spreads. (1) Range: Difference between the highest and lowest values (R = Max – Min). Simple, but it depends only on the two extreme values. (2) Variance: The average of the squared deviations from the mean (σ² for a population, dividing by N; s² for a sample, dividing by n – 1). (3) Standard Deviation (SD): The square root of the variance. Formula for sample SD: s = √[ Σ(x – x̄)² / (n – 1) ]. It represents, roughly, the typical distance of the observed data points from the mean. Pharmaceutical Example: A lower SD in tablet dissolution times indicates a more consistent manufacturing process (better quality control).
5. Correlation – Definition & Types: Correlation: A statistical technique that determines if, and how strongly, two continuous variables are related. It does NOT imply causation. Types: (1) Positive Correlation: As variable X increases, variable Y increases (e.g., dose of an antihypertensive drug and the magnitude of blood pressure drop). (2) Negative Correlation: As X increases, Y decreases (e.g., patient age and renal clearance rate). (3) Zero/No Correlation: No linear relationship exists (e.g., patient height and drug efficacy).
6. Karl Pearson’s Coefficient of Correlation (r): A mathematical measure of the strength and direction of the linear relationship between two variables, denoted by ‘r’. Range: The value of ‘r’ always lies between -1 and +1, inclusive. +1 indicates perfect positive correlation, -1 indicates perfect negative correlation, and 0 indicates no linear correlation. Formula: r = [nΣxy – (Σx)(Σy)] / √[ (nΣx² – (Σx)²)(nΣy² – (Σy)²) ]. This formula is essential for university exams. Pharmaceutical Problem Example: Finding the correlation coefficient between the ‘concentration of a drug’ (X) and the ‘absorbance measured by a UV spectrophotometer’ (Y) to check that the data follow the linear relationship predicted by the Beer-Lambert law.
7. Multiple Correlation: Examines the relationship between one dependent variable and two or more independent variables simultaneously. For example, how a patient’s ‘Reduction in Blood Pressure’ (dependent variable Y) is correlated with both ‘Drug Dose’ (independent variable X1) and ‘Body Weight’ (independent variable X2). Denoted by ‘R’, which ranges from 0 to 1. It helps in understanding complex physiological or pharmaceutical systems where outcomes rarely depend on a single factor.
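
For the two-predictor case described under Multiple Correlation, R can be obtained from the three pairwise Pearson coefficients using the standard identity R² = (r_y1² + r_y2² – 2·r_y1·r_y2·r_12) / (1 – r_12²). The Python sketch below is only an illustration: the function name `multiple_r` and the three coefficient values are hypothetical and do not come from any real study.

```python
import math

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of Y with X1 and X2, from pairwise Pearson r values.

    r_y1: correlation between Y and X1
    r_y2: correlation between Y and X2
    r_12: correlation between X1 and X2 (must not be exactly +1 or -1)
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)

# Hypothetical pairwise correlations (illustrative values only):
# Y = reduction in blood pressure, X1 = drug dose, X2 = body weight.
r_bp_dose = 0.80
r_bp_weight = 0.30
r_dose_weight = 0.20

R = multiple_r(r_bp_dose, r_bp_weight, r_dose_weight)
print(f"R = {R:.3f}")
```

R is never smaller than the larger of the two individual correlations with Y, because adding a second predictor cannot reduce how well Y is described.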

Learning Objectives

Calculate Central Tendency: Given a small dataset, accurately calculate the Mean, Median, and Mode.

Calculate Standard Deviation: Understand and apply the formula to calculate the Sample Standard Deviation.

Interpret Spread: Explain the importance of measuring dispersion in pharmaceutical quality control (why the Mean alone isn’t enough).

Define Correlation: Differentiate between positive, negative, and zero correlation with pharmaceutical examples.

Apply Karl Pearson’s Equation: Use the sum of squares and sum of products to calculate the correlation coefficient (r) for a given set of two variables.
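
To practise the last objective, the Python sketch below applies the formula from the syllabus, r = [nΣxy – (Σx)(Σy)] / √[ (nΣx² – (Σx)²)(nΣy² – (Σy)²) ], by building the same sums you would tabulate by hand in an exam. The concentration and absorbance values are hypothetical calibration data invented for illustration, and `pearson_r` is simply a name chosen for this example.

```python
import math

def pearson_r(x, y):
    """Karl Pearson's r from raw sums (the exam formula)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # sum of products
    sum_x2 = sum(a * a for a in x)              # sum of squares of x
    sum_y2 = sum(b * b for b in y)              # sum of squares of y
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
    )
    return numerator / denominator

# Hypothetical UV calibration data (illustrative only).
concentration = [2, 4, 6, 8, 10]              # X, e.g. in µg/mL
absorbance = [0.11, 0.20, 0.31, 0.40, 0.52]   # Y

r = pearson_r(concentration, absorbance)
print(f"r = {r:.4f}")
```

A value close to +1 here would indicate a strong positive linear relationship between concentration and absorbance, which is what the Beer-Lambert law predicts over its valid range.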

Exam Prep Questions

Q1. Why do we divide by (n-1) instead of “n” when calculating Sample Standard Deviation?

When calculating the variance or standard deviation for an entire POPULATION (everyone/everything), you divide the sum of squared deviations by “N”. However, in research, we usually take a SAMPLE. Deviations are then measured from the sample mean, which sits closer to the sample’s own values than the true population mean does, so dividing by “n” systematically UNDERESTIMATES the population variance. Dividing by (n-1), known as Bessel’s correction, removes this bias and gives an “unbiased estimate” of the population variance; the sample standard deviation is its square root.
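
The bias can be checked exactly on a toy case. The sketch below takes a hypothetical three-value population, lists every possible sample of size 2 (drawn with replacement), and averages the sample variance computed with each divisor. The population values and the helper name `variance` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def variance(data, divisor):
    """Sum of squared deviations from the data's own mean, divided by `divisor`."""
    mean = Fraction(sum(data), len(data))
    return sum((x - mean) ** 2 for x in data) / divisor

population = [2, 4, 6]                            # hypothetical toy population
pop_var = variance(population, len(population))   # population variance: divide by N

# All 9 possible samples of size n = 2, drawn with replacement.
samples = list(product(population, repeat=2))
avg_var_n = sum(variance(s, 2) for s in samples) / len(samples)          # divide by n
avg_var_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)  # divide by n - 1

print(pop_var, avg_var_n, avg_var_n_minus_1)  # 8/3 4/3 8/3
```

Averaged over all samples, dividing by n gives 4/3, which is below the true variance of 8/3, while dividing by (n-1) recovers 8/3 exactly.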

Q2. If the correlation coefficient “r” is 0, does that mean there is absolutely no relationship between the variables?

Not necessarily. Karl Pearson’s “r” only measures LINEAR (straight-line) relationships. If r = 0, there is no linear relationship. However, the variables could still have a very strong NON-LINEAR (e.g., U-shaped or curved) relationship. Always plot a scatter graph to visually inspect the data before concluding.
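
A quick way to see this is a perfectly U-shaped dataset. In the Python sketch below, y is exactly x², so the relationship could not be stronger, yet r comes out as zero. The data are hypothetical, and r is computed in the deviation-from-mean form, which is algebraically equivalent to the raw-sum formula in the syllabus.

```python
import math

x = [-2, -1, 0, 1, 2]
y = [value**2 for value in x]   # y depends entirely on x, but not linearly

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Deviation form of Pearson's r:
# r = Σ(x - x̄)(y - ȳ) / √[ Σ(x - x̄)² · Σ(y - ȳ)² ]
sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sum_sq_x = sum((a - mean_x) ** 2 for a in x)
sum_sq_y = sum((b - mean_y) ** 2 for b in y)

r = sum_products / math.sqrt(sum_sq_x * sum_sq_y)
print(r)
```

The positive and negative products cancel because the curve is symmetric about x = 0, which is why a scatter plot is needed to reveal the pattern.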

Q3. When should I use the Median instead of the Mean?

The Mean is preferred for normally distributed (symmetrical) data because it uses every data point. However, the Mean is highly sensitive to extreme outliers. If you have a badly skewed dataset (e.g., 9 patients lose 1 kg, but 1 patient loses 50 kg), the Mean is dragged up to 5.9 kg and misrepresents the group. The Median (the middle value) stays at 1 kg, unaffected by that extreme 50 kg outlier, making it a better measure of central tendency for skewed data.
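
The weight-loss example above can be checked with Python’s built-in `statistics` module; the variable names are chosen for this illustration only.

```python
import statistics

# 9 patients lose 1 kg each, 1 patient loses 50 kg.
weight_loss_kg = [1] * 9 + [50]

mean_loss = statistics.mean(weight_loss_kg)
median_loss = statistics.median(weight_loss_kg)

print(mean_loss, median_loss)  # 5.9 1.0
```

No patient in the group actually lost anything close to 5.9 kg, whereas the median of 1 kg describes nine of the ten patients exactly.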