Basic Statistical Analysis of Biological Data

Descriptive Statistics, Mean and Standard Deviation

The Normal Distribution

A normal distribution, also called a Gaussian distribution, is a statistical distribution of data that is symmetric and bell-shaped, with its peak at the average value (mean). It is considered the most commonly encountered probability distribution in statistics and the natural sciences.

[Figure: The Normal Distribution. Image source: SyQue.com]

Two statistical quantities characterize any set of data that has a normal distribution: the arithmetic mean, which is where the peak of the curve occurs, and the standard deviation, which is a measure of data dispersion around the mean, or, in other words, a measure of the girth of the bell-shaped curve.

We know that many biological variables fit the normal distribution well, and unless some complicating factor exists, it is usually safe to assume that a biological variable will fit the normal distribution. This basic assumption allows us to apply the empirical rule, or 68-95-99.7 rule, to biological data:

For every normal curve, regardless of its mean or standard deviation,

  • About 68% of the data points fall within 1 standard deviation of the mean.
  • About 95% of the data points fall within 2 standard deviations of the mean.
  • About 99.7% of the data points fall within 3 standard deviations of the mean.
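The percentages in the empirical rule can be checked numerically. The sketch below relies on the standard result that, for any normal curve, the fraction of values within k standard deviations of the mean equals erf(k / √2). The function name `fraction_within` is our own choice for illustration.

```python
import math

def fraction_within(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    # For a normal curve, P(|X - mean| <= k * SD) = erf(k / sqrt(2)),
    # independent of the particular mean and standard deviation.
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

The rounded figures 68%, 95% and 99.7% are these values to the nearest convenient digit.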

So, the mean and standard deviation are all you need to adequately describe the data you collect, assuming the data fit the normal distribution.

Descriptive Statistics

Mean and standard deviation are descriptive statistics. Descriptive statistics are statistics used to describe and summarize data collected in a survey or an experiment. The use of descriptive statistics to summarize data is generally considered a basic requirement when presenting data.

Four specific types of measures are considered descriptive statistics:

  • a measure of central tendency of the data, such as the arithmetic mean, median, and mode.
  • a measure of statistical dispersion of the data, such as standard deviation, variance, and range. Statistics of dispersion are used to give a single number that describes how compact or spread out a distribution of observations is.
  • measures of shape of the distribution, such as skewness.
  • measures that describe the most unusual members of a population, such as the minimum and maximum values observed, or sample quantiles.

For the purposes of this monograph, we are most interested in measures of central tendency and statistical dispersion.

Measures of Central Tendency

Measures of central tendency are single numbers that describe the "center" of the data set. There are three basic measures of central tendency: arithmetic mean, median and mode.

Arithmetic Mean. The arithmetic mean is the average value of a group of data. It is the most common measure of central tendency. It is calculated by dividing the sum of the observations by the number of observations. If the data are taken from all members of a population, the mean of that data set is more specifically called a population mean. If the data are taken from a subset of members of a population, i.e., a sample of the population, the mean of that data set is more specifically called a sample mean and is regarded as an estimate of the population mean.

Population mean: $\mu = \frac{\sum X_i}{N}$

Sample mean: $\bar{x} = \frac{\sum x_i}{n}$

where $\mu$ is the population mean, $\sum X_i$ is the sum of all the population observations, $N$ is the number of population members, $\bar{x}$ is the sample mean, $\sum x_i$ is the sum of all the sample observations, and $n$ is the number of sample observations.
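As a minimal sketch of the calculation, the snippet below computes a mean by hand and with Python's standard `statistics` module. The measurements are invented for illustration. The arithmetic is the same for a population and a sample; what differs is whether the result is the population mean itself or an estimate of it.

```python
import statistics

# Hypothetical measurements (say, leaf lengths in mm); the values are made up.
observations = [12.1, 11.8, 12.6, 13.0, 12.5]

# Sum of the observations divided by the number of observations.
mean_by_hand = sum(observations) / len(observations)

# The standard library gives the same result.
mean_library = statistics.mean(observations)

print(round(mean_by_hand, 2), round(mean_library, 2))
```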

For highly skewed data, which do not fit a normal distribution, the arithmetic mean does not work well as a measure of central tendency, because a few extreme values pull it away from the bulk of the observations. In these cases the median and mode should also be reported.

Median. The median is the middle value when the data from an experiment are sorted from lowest to highest. For an even number of data points, the median is the arithmetic mean of the two middle values in the sorted list. The median is a better measure of central tendency when dealing with highly skewed distributions.
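The following sketch illustrates the odd and even cases, and how a single extreme value moves the mean far more than the median. All data values are made up for illustration.

```python
import statistics

odd_sample = [3, 1, 7, 5, 9]    # sorted: 1 3 5 7 9, middle value is 5
even_sample = [3, 1, 7, 5]      # sorted: 1 3 5 7, middle values are 3 and 5

median_odd = statistics.median(odd_sample)     # 5
median_even = statistics.median(even_sample)   # (3 + 5) / 2 = 4.0

# A right-skewed sample: one observation is far larger than the rest.
skewed_sample = [2, 3, 3, 4, 5, 40]
skewed_mean = statistics.mean(skewed_sample)      # 57 / 6 = 9.5
skewed_median = statistics.median(skewed_sample)  # (3 + 4) / 2 = 3.5

print(median_odd, median_even, skewed_mean, skewed_median)
```

Here the mean (9.5) is larger than five of the six observations, while the median (3.5) still sits in the middle of the bulk of the data.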

Mode. The mode is the most common value in a data set. The mode is particularly useful when distinguishing multimodal from unimodal distributions.
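As a small illustration with invented counts, Python's `statistics.mode` returns the single most common value, and `statistics.multimode` (available from Python 3.8) returns every value tied for most common, which is one simple way to notice that a data set has more than one peak.

```python
import statistics

# Hypothetical counts (say, number of petals per flower); values are made up.
unimodal = [4, 5, 5, 5, 6, 6, 7]
bimodal = [3, 3, 3, 4, 5, 8, 8, 8, 9]

single_mode = statistics.mode(unimodal)      # 5 occurs most often
all_modes = statistics.multimode(bimodal)    # 3 and 8 are tied

print(single_mode, all_modes)  # 5 [3, 8]
```

For continuous measurements, exact repeats are rare, so in practice the mode is usually read from a histogram rather than computed this way.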

In a normal distribution, the mean, median and mode all have the same value. In skewed distributions they generally differ.

Measures of Statistical Dispersion

Measures of statistical dispersion are single numbers that describe how the data are dispersed around the mean. Commonly-used measures of statistical dispersion are variance and standard deviation. In each case, the larger the number, the wider the spread of data around the mean, or, in other words, the wider the girth of the bell-shaped curve in a normal distribution of data.

Variance. The variance is the most basic measure of how far a set of numbers is spread out from the mean. It is defined as the average of the squared differences from the mean, or in other words, the "average sum of squares". A sum of squares is the sum of the squared differences between each data point and the mean of all the data points. That is,

sum of squares $= \sum (X_i - \mu)^2$

where $\sum$ denotes summation, $X_i$ represents each of the data points in the data set, taken individually, and $\mu$ is the mean.
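A minimal sketch of the sum-of-squares calculation, using a small invented data set chosen so the arithmetic is easy to follow by hand:

```python
# Hypothetical data set; the values are made up for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)   # 40 / 8 = 5.0

# Squared difference between each data point and the mean:
# 9, 1, 1, 1, 0, 0, 4, 16
squared_differences = [(value - mean) ** 2 for value in data]

sum_of_squares = sum(squared_differences)
print(sum_of_squares)  # 32.0
```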

Since variance is the average sum of squares, one calculates the population variance by dividing the sum of squares of the population data by the number of observations in the population, N:

$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$

where $\sigma^2$ is the population variance and $N$ is the number of elements in the population.

One calculates the sample variance by dividing the sum of squares of the sample data by n-1:

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

where $s^2$ is the sample variance, $\bar{x}$ is the sample mean and $n$ is the number of elements in the sample. Using this formula, the sample variance can be considered an unbiased estimate of the true population variance. Therefore, if you need to estimate population variance based on data from a random sample, this is the formula to use.
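The two variance formulas differ only in the divisor. The sketch below computes both by hand on a small invented data set and compares them with `statistics.pvariance` (divide by the number of observations) and `statistics.variance` (divide by one less than the number of observations) from Python's standard library.

```python
import statistics

# Hypothetical data set; the values are made up for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
mean = sum(data) / n
sum_of_squares = sum((value - mean) ** 2 for value in data)  # 32.0

# Treating the data as an entire population: divide by n.
population_variance = sum_of_squares / n        # 32 / 8 = 4.0

# Treating the data as a sample from a larger population: divide by n - 1.
sample_variance = sum_of_squares / (n - 1)      # 32 / 7, about 4.57

print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
```

The sample variance is always somewhat larger than the population formula would give for the same numbers, and the gap shrinks as the sample grows.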

Standard deviation. Standard deviation is the most commonly used measure of dispersion of data around a mean, and it is reported far more frequently than the variance. Arithmetically, standard deviation is defined as the square root of the variance. The population standard deviation is equal to the square root of the population variance:

$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$

where $\sigma$ is the population standard deviation.

The sample standard deviation is equal to the square root of the sample variance.

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

where $s$ is the sample standard deviation.
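As a brief sketch, both standard deviations are available in Python's standard library: `statistics.pstdev` for the population formula and `statistics.stdev` for the sample formula. The data set is invented for illustration.

```python
import math
import statistics

# Hypothetical data set; the values are made up for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Square root of the population variance (divisor is the number of observations).
population_sd = statistics.pstdev(data)   # sqrt(4.0) = 2.0

# Square root of the sample variance (divisor is one less than that).
sample_sd = statistics.stdev(data)        # sqrt(32 / 7), about 2.14

# Each is the square root of the corresponding variance.
assert math.isclose(population_sd, math.sqrt(statistics.pvariance(data)))
assert math.isclose(sample_sd, math.sqrt(statistics.variance(data)))

print(population_sd, round(sample_sd, 2))
```

When the data are a sample and the goal is to describe the wider population, the sample version is the one to report.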

Repeating from above, assuming the data fit the normal distribution, which is usually a safe assumption for biological data, when we know the standard deviation of the data, we can conclude that about 68% (roughly two-thirds) of the data are within one standard deviation of the mean, about 95% are within two standard deviations of the mean, and about 99.7% (almost all) are within three standard deviations of the mean.
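One way to see this in practice is to simulate normally distributed measurements and count how many fall within one, two and three standard deviations of the mean. This is only a sketch: the mean of 10, the standard deviation of 2, the sample size and the random seed are arbitrary choices, and `fraction_within_sd` is a helper name of our own.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated measurements drawn from a normal distribution (mean 10, SD 2).
values = [rng.gauss(10.0, 2.0) for _ in range(50_000)]

mean = statistics.fmean(values)
sd = statistics.stdev(values)

def fraction_within_sd(data, center, spread, k):
    """Fraction of observations lying within k * spread of center."""
    inside = sum(1 for v in data if abs(v - center) <= k * spread)
    return inside / len(data)

observed = {k: fraction_within_sd(values, mean, sd, k) for k in (1, 2, 3)}
for k, fraction in observed.items():
    print(f"within {k} SD: {fraction:.3f}")
```

The observed fractions should land close to 0.68, 0.95 and 0.997, with small differences due to random sampling. Real data that deviate strongly from these proportions are a hint that the normal assumption may not hold.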

Putting It All Together

One should always use the proper descriptive statistics when presenting data. Descriptive statistics can describe a data set simply and concisely. Assuming the data are normally distributed, which is generally a safe assumption with most biological data, the mean and standard deviation fully characterize the distribution of data points collected in a survey or an experiment and should be reported when presenting any data set. Moreover, when taking data from a sample of a population rather than from every member of the population, the sample mean and sample standard deviation are not only descriptors of the sample data set collected, but are also estimates of the population mean and population standard deviation of the variable being measured.
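A minimal sketch of how such a summary might be produced for a sample is shown below. The function name `summarize`, the "mean ± SD (n = ...)" layout and the data are our own illustrative choices, not a required reporting format.

```python
import statistics

def summarize(sample):
    """Return the sample mean and sample standard deviation as a short text summary."""
    n = len(sample)
    if n < 2:
        # The sample standard deviation divides by n - 1, so it needs at least 2 values.
        raise ValueError("need at least two observations")
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample formula (divisor n - 1)
    return f"{mean:.2f} ± {sd:.2f} (n = {n})"

# Hypothetical sample; the values are made up for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = summarize(sample)
print(summary)  # 5.00 ± 2.14 (n = 8)
```

Stating the sample size alongside the mean and standard deviation lets readers judge how much weight the estimates can bear.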
