Two statistical quantities characterize any set of data that has a normal distribution: the **arithmetic mean**, which is where the peak of the curve occurs, and the **standard deviation**, which is a measure of data dispersion around the mean, or, in other words, a measure of the girth of the bell-shaped curve.

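We know that many biological variables fit the normal distribution well, and unless some complicating factor exists, it is usually safe to assume that a biological variable will fit the normal distribution. This basic assumption allows us to apply the **empirical rule**, or **68-95-99.7 rule**, to biological data: about 68% of the observations fall within one standard deviation of the mean, about 95% fall within two standard deviations, and about 99.7% fall within three standard deviations.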

So, the mean and standard deviation are all you need to adequately describe the data you collect, assuming the data fit the normal distribution.

For the purposes of this monograph, we are most interested in measures of central tendency and statistical dispersion.

##### Measures of Central Tendency

Measures of central tendency are single numbers that describe the "center" of the data set. There are three basic measures of central tendency: **arithmetic mean**, **median** and **mode**.

**Arithmetic Mean.** The arithmetic mean is the average value of a group of data. It is the most common measure of central tendency. It is calculated by dividing the sum of the observations by the number of observations. If the data are taken from all members of a population, the mean of that data set is more specifically called a **population mean**. If the data are taken from a subset of members of a population, i.e., a *sample* of the population, the mean of that data set is more specifically called a **sample mean** and is regarded as an *estimate* of the population mean.

**Population mean:** $\mu = \dfrac{\sum X_i}{N}$

**Sample mean:** $\bar{x} = \dfrac{\sum x_i}{n}$

where $\mu$ is the population mean, $\sum X_i$ is the sum of all the population observations, $N$ is the number of population members, $\bar{x}$ is the sample mean, $\sum x_i$ is the sum of all the sample observations, and $n$ is the number of sample observations.
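As a minimal sketch of the calculation, the following Python snippet computes a sample mean by dividing the sum of the observations by their count. The data values and the helper name `arithmetic_mean` are hypothetical, chosen only for illustration.

```python
# Hypothetical sample of measurements (illustrative values only).
sample = [21.5, 23.1, 19.8, 22.4, 20.7]


def arithmetic_mean(values):
    """Return the sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


sample_mean = arithmetic_mean(sample)
print(round(sample_mean, 2))
```

Because these values come from a sample rather than the whole population, the result would be read as an estimate of the population mean.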

For data that do not fit a normal distribution, that is, data that are highly skewed, the arithmetic mean does not work well as a measure of central tendency. In these cases the median and mode should also be reported.

**Median.** The median is the middle value when the data from an experiment are sorted from lowest to highest. For an even number of data points, the median is the arithmetic mean of the two middle values in the sorted list. The median is a better measure of central tendency when dealing with highly skewed distributions.
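The effect of skew can be sketched with Python's standard `statistics` module. The data set below is hypothetical and deliberately contains one extreme value, so that the mean is pulled upward while the median stays near the bulk of the data.

```python
import statistics

# Hypothetical right-skewed data: one extreme value (40) dominates the mean.
data = [2, 3, 3, 4, 5, 6, 40]

mean_value = statistics.mean(data)      # sensitive to the extreme value
median_value = statistics.median(data)  # middle value of the sorted list

# With an even number of points, the median is the mean of the two middle values.
even_median = statistics.median([2, 3, 4, 6])

print(mean_value, median_value, even_median)
```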

**Mode.** The mode is the most common value in a data set. The mode is particularly useful when distinguishing multimodal from unimodal distributions.
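A short sketch of finding modes, again using Python's standard `statistics` module (`multimode` requires Python 3.8 or later). Both data sets are hypothetical; the second is constructed to have two equally common values.

```python
import statistics

# Hypothetical data sets (illustrative values only).
unimodal = [1, 2, 2, 3, 2, 4]
bimodal = [1, 1, 1, 2, 3, 5, 5, 5]

single_mode = statistics.mode(unimodal)    # the single most common value
all_modes = statistics.multimode(bimodal)  # every value tied for most common

print(single_mode, all_modes)
```

A result with more than one entry from `multimode` is a hint that the distribution may be multimodal, although with real measurements one would usually bin continuous data first.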

In a normal distribution, the mean, median and mode all have the same value. In skewed distributions, these three measures generally differ from one another.

##### Measures of Statistical Dispersion

Measures of statistical dispersion are single numbers that describe how the data are dispersed around the mean. Commonly-used measures of statistical dispersion are **variance** and **standard deviation**. In each case, the larger the number, the wider the spread of data around the mean, or, in other words, the wider the girth of the bell-shaped curve in a normal distribution of data.

**Variance.** The variance is the most basic measure of how far a set of numbers is spread out from the mean. It is defined as the average of the squared differences from the mean, or in other words, the "average sum of squares". A sum of squares is the sum of the squared differences between each data point and the mean of all the data points. That is,

**Sum of squares:** $\sum (X_i - \mu)^2$

where $\sum$ denotes summation, $X_i$ represents each of the data points in the data set, taken individually, and $\mu$ is the mean.
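The sum of squares can be sketched in a few lines of Python. The data values are hypothetical and were chosen so that the mean is a whole number.

```python
# Hypothetical data set (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)

# Square each deviation from the mean, then add the squares together.
sum_of_squares = sum((x - mean) ** 2 for x in data)

print(mean, sum_of_squares)
```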

Since variance is the *average* sum of squares, one calculates the **population variance** by dividing the sum of squares of the population data by the number of observations in the population, N:

$\sigma^2 = \dfrac{\sum (X_i - \mu)^2}{N}$

where $\sigma^2$ is the population variance and $N$ is the number of elements in the population.

One calculates the **sample variance** by dividing the sum of squares of the sample data by n-1:

$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$

where $s^2$ is the sample variance, $\bar{x}$ is the sample mean and $n$ is the number of elements in the sample. Using this formula, the sample variance can be considered an unbiased estimate of the true population variance. Therefore, if you need to estimate population variance based on data from a random sample, this is the formula to use.
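To make the difference between the two divisors concrete, here is a minimal sketch that computes both variances by hand and with Python's standard `statistics` module. The data values are hypothetical; in practice one would treat a data set as either a population or a sample, not both.

```python
import statistics

# Hypothetical data set (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n
sum_of_squares = sum((x - mean) ** 2 for x in data)

# Population variance: divide the sum of squares by N.
population_variance = sum_of_squares / n

# Sample variance: divide the sum of squares by n - 1.
sample_variance = sum_of_squares / (n - 1)

# The standard library implements the same two definitions.
library_population = statistics.pvariance(data)
library_sample = statistics.variance(data)

print(population_variance, sample_variance)
```

The sample variance is always the larger of the two for the same data, and the gap narrows as the number of observations grows.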

**Standard deviation.** Standard deviation is the most commonly used measure of dispersion of data around a mean, and it is reported far more frequently than the variance. Arithmetically, standard deviation is defined as the square root of the variance. The **population standard deviation** is equal to the square root of the population variance:

$\sigma = \sqrt{\dfrac{\sum (X_i - \mu)^2}{N}}$

where $\sigma$ is the population standard deviation.

The **sample standard deviation** is equal to the square root of the sample variance.

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean.
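As a brief sketch, both standard deviations can be obtained with Python's standard `statistics` module and checked against the square root of the corresponding variance. The data values are hypothetical.

```python
import math
import statistics

# Hypothetical data set (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = statistics.pstdev(data)  # square root of the population variance
sample_sd = statistics.stdev(data)       # square root of the sample variance

# Confirm that each standard deviation is the square root of its variance.
matches_population = math.isclose(population_sd, math.sqrt(statistics.pvariance(data)))
matches_sample = math.isclose(sample_sd, math.sqrt(statistics.variance(data)))

print(population_sd, sample_sd)
```

Unlike the variance, the standard deviation is expressed in the same units as the original measurements, which is one reason it is the more commonly reported of the two.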

Repeating from above, if the data fit the normal distribution, which is usually a safe assumption for biological data, then knowing the standard deviation lets us conclude that about 68% (roughly two-thirds) of the data lie within one standard deviation of the mean, about 95.4% lie within two standard deviations, and about 99.7% (almost all) lie within three standard deviations.
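These percentages follow from the shape of the normal curve. As a sketch, the fraction of a normal distribution lying within k standard deviations of the mean equals erf(k / √2), which can be evaluated with Python's standard `math` module. The helper name `fraction_within` is hypothetical.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Percentages for one, two and three standard deviations, rounded to one decimal.
percentages = {k: round(100 * fraction_within(k), 1) for k in (1, 2, 3)}
print(percentages)
```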

One should always use the proper descriptive statistics when presenting data. Descriptive statistics can describe a data set simply and concisely. Assuming the data are normally distributed, which is generally a safe assumption with most biological data, the **mean** and **standard deviation** fully characterize the distribution of data points collected in a survey or an experiment and should be reported when presenting any data set. Moreover, when taking data from a sample of a population rather than from every member of the population, the **sample mean** and **sample standard deviation** are not only descriptors of the sample data set collected, but also estimates of the population mean and population standard deviation of the variable being measured.