Section 2 – Estimating Proportions within a Binary Population

April 27, 2012

This section is a common-sense, intuitive presentation. It explains the main concepts and provides some usable specifics, but without any mathematical foundation.

2.1. Common Sense Observations

You do not need to be a math major or a professional statistician to have an intuitive appreciation of the following.

  • In order to estimate the proportions of some attribute within a population, it would be helpful if you could rely on the proportions observed within a sample of the population.
  • Randomness is important. If you want to rely on a sample, it is important that the sample be random. This means that the sampling was done in such a way that each member of the underlying population had an equal chance of being selected for the sample. In political polling this requirement of randomness is violated if the pollster only calls landlines. In an e-discovery document evaluation context, this requirement is violated if the sample is based only on the earliest documents in chronological order.
  • The size of the sample is important. As the size of a random sample increases, there is greater “confidence” that the observed sample proportion will be “close” to the actual population proportion. If you were to toss a fair coin ten times, it would not be that surprising to get only 3 or fewer heads (a sample proportion of 30% or less). But if there were 1,000 tosses, most people would agree – based on intuition and general experience – that it would be very unlikely to get only 300 or fewer heads. In other words, with the larger sample size, it is generally apparent that the sample proportion will be closer to the actual “population” proportion of 50%.
  • While the sample proportion might be the best estimate of the total population proportion, you would not be very confident that this is exactly the population proportion. For example, assume a political pollster samples 400 voters and finds 208 for Candidate A and 192 for Candidate B. This leads to an estimate of 52% as A’s support in the population. However, it is unlikely that A’s actual support will be exactly 52%. The pollster will be more confident saying that A’s actual support is somewhere between 47% and 57%. And the pollster will be very confident saying that A’s actual support is somewhere between 42% and 62%. So, there is a tradeoff between the confidence and the range around the observed proportion.
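The coin-toss intuition in the list above can be checked with exact binomial arithmetic. The following is a minimal Python sketch and is not part of the original material; the function name prob_at_most_heads is hypothetical, and the calculation assumes a fair coin with independent tosses.

```python
from math import comb


def prob_at_most_heads(max_heads: int, tosses: int) -> float:
    """Exact probability of getting max_heads or fewer heads with a fair coin."""
    # Every one of the 2**tosses head/tail sequences is equally likely, so
    # count the sequences that contain k heads for k = 0 .. max_heads.
    favorable = sum(comb(tosses, k) for k in range(max_heads + 1))
    return favorable / 2 ** tosses


p_small = prob_at_most_heads(3, 10)       # 176/1024, roughly 17%
p_large = prob_at_most_heads(300, 1000)   # vanishingly small

print(f"10 tosses, 3 or fewer heads:      {p_small:.4f}")
print(f"1,000 tosses, 300 or fewer heads: {p_large:.3e}")
```

With ten tosses the outcome happens about one time in six, while with 1,000 tosses the same 30% sample proportion is practically impossible, which is the effect of sample size described above.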

The value that math adds is that it provides a standard way of quantifying and discussing the intuitive concepts of confidence and closeness, and relating these to sample size.

2.2. Explanation of Statistical Terminology

Building on the preceding example involving political polling, the standard terminology for presenting the population estimate would be something like this:

“Based on the sample size of 400 voters, A’s support is estimated to be 52% with a confidence level of 95% and a margin of error of ±5%.”

Can we decode this?

  • Sample size is just what it says – the number of observations in the sample.
  • Margin of error of ±5% means that the pollster is referring to a range of 5% in each direction around the sample proportion. The range in this case is from 47% = 52% – 5% to 57% = 52% + 5%.
  • It is also possible to state the conclusion by simply stating the range, and without using the term “margin of error”: “Based on the sample size of 400 voters, A’s support is estimated to be in the range from 47% to 57% with a confidence level of 95%.”
  • When presented this way, using an explicit range, the explicit range is referred to as a confidence range or confidence interval. As compared to the margin of error, the confidence range has the advantage that it does not have to be exactly symmetrical around the sample proportion.
  • This leaves the term, confidence level. Obviously, 95% sounds pretty good. 98% or 99% would sound even better. Is 95% high enough? 90%?
  • Here is the derivation of the confidence level concept: The pollster in our example took a sample of 400 from the underlying population. That was just one of a very large number of “size 400” samples that could have theoretically been drawn from the population. When we say that the confidence level in this case is 95%, we are saying that 95% of the theoretically possible “size 400” samples are within 5% of the actual proportion. Thus, we are saying that 95% of the time, any particular “size 400” sample that is actually selected will be within 5% of the actual proportion.
  • A simple guideline – If you use a confidence level of X%, you should expect (100 – X)% of your conclusions to be incorrect. So, if you use a confidence level of 95%, you should expect 5% of your conclusions to be incorrect.
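The idea that a confidence level describes the share of all possible “size 400” samples that land close to the actual proportion can be illustrated by simulation. The sketch below is an illustration only: the actual proportion of 52%, the number of simulated samples, the seed, and all names are assumptions chosen for the example.

```python
import random

TRUE_PROPORTION = 0.52   # hypothetical actual support for Candidate A
SAMPLE_SIZE = 400
MARGIN = 0.05
NUM_SAMPLES = 2000       # number of simulated "size 400" samples


def simulate_coverage(seed: int = 1) -> float:
    """Fraction of simulated samples whose proportion is within MARGIN."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(NUM_SAMPLES):
        # Draw one random sample: each sampled voter independently supports
        # A with probability TRUE_PROPORTION.
        supporters = sum(
            rng.random() < TRUE_PROPORTION for _ in range(SAMPLE_SIZE)
        )
        sample_proportion = supporters / SAMPLE_SIZE
        # Small tolerance so boundary cases are not lost to float rounding.
        if abs(sample_proportion - TRUE_PROPORTION) <= MARGIN + 1e-9:
            hits += 1
    return hits / NUM_SAMPLES


coverage = simulate_coverage()
print(f"Share of samples within 5 points of the actual proportion: {coverage:.3f}")
```

Under these assumptions the printed share should come out close to 95%, with a small amount of variation from one seed to another.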

2.3. Sample Size, Margin of Error and Confidence Level are Interdependent

Without getting into the math, it is fair to say – and probably intuitively obvious – that sample size, margin of error/confidence range and confidence level are interdependent. You normally want to increase the confidence level, but that requires increasing the sample size and/or increasing the margin of error. This creates tradeoffs, because you would prefer to reduce the sample size (save time and work) and/or reduce the margin of error (narrow the range).

Following are tables that present some frequently used combinations of these interdependent values. These are all based on the basic normal approximation technique discussed further in Sections 4 and 5, and the “Normal Approx – Basic” page of the accompanying spreadsheet.

The basic technique may be inappropriate in some situations, but it is easy to implement. With the widespread availability of computers, there are more refined techniques that will increase the confidence level, reduce the sample size and/or reduce the margin of error.

  • Essentially, the basic technique makes the simplifying – and conservative – assumptions that the population is infinite and that the underlying population proportion is 50%.
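To see why assuming a 50% population proportion is the conservative choice, the sketch below computes a normal-approximation margin of error for several assumed proportions. It assumes the usual textbook formula (critical value times the square root of p(1 – p)/n) and an infinite population; the function name and the chosen inputs are illustrative, not taken from the spreadsheet.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sample_size: int, confidence: float,
                    proportion: float = 0.5) -> float:
    """Normal-approximation margin of error for an assumed proportion."""
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * sqrt(proportion * (1 - proportion) / sample_size)


for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    moe = margin_of_error(400, 0.95, p)
    print(f"assumed proportion {p:.1f}: margin of error {moe:.4f}")
```

The margin of error is largest at an assumed proportion of 0.5, so results computed on that assumption should not understate the margin for any other proportion.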

The first table shows – for specific sample sizes – different possible “pairs” of margin of error and confidence level. In each case, the first four results are based on standard margins of error of 1%, 2%, 5% and 10%, solving for the confidence level. The following four results are based on standard confidence levels of 90%, 95%, 98% and 99%.

The pollster who reported a 95% confidence level with a 5% margin of error on a sample size of 400 was reporting consistently with the two highlighted cases for a sample size of 400 (a 0.0500 margin of error with a 0.9545 confidence level, and a 0.0490 margin of error with a 0.9500 confidence level), allowing for conservative rounding. The pollster could have just as accurately said that A’s actual support was between 50% and 54% with a 57% confidence level. Or the pollster could have said that A’s actual support was between 45% and 59% with a 99% confidence level. Once you have results for a sample of a given size, you can equivalently report small margins of error (tight ranges) with low levels of confidence, or large margins of error (wide ranges) with higher levels of confidence.

Sample Size   Margin of Error   Confidence Level
100           0.0100            0.1585
100           0.0200            0.3108
100           0.0500            0.6827
100           0.1000            0.9545
100           0.0822            0.9000
100           0.0980            0.9500
100           0.1163            0.9800
100           0.1288            0.9900
400           0.0100            0.3108
400           0.0200            0.5763
400           0.0500            0.9545
400           0.1000            0.9999
400           0.0411            0.9000
400           0.0490            0.9500
400           0.0582            0.9800
400           0.0644            0.9900
1,500         0.0100            0.5614
1,500         0.0200            0.8787
1,500         0.0500            0.9999
1,500         0.1000            1.0000
1,500         0.0212            0.9000
1,500         0.0253            0.9500
1,500         0.0300            0.9800
1,500         0.0333            0.9900
5,000         0.0100            0.8427
5,000         0.0200            0.9953
5,000         0.0500            1.0000
5,000         0.1000            1.0000
5,000         0.0116            0.9000
5,000         0.0139            0.9500
5,000         0.0164            0.9800
5,000         0.0182            0.9900
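The rows of the table above can be approximated outside the spreadsheet. The following Python sketch assumes that the basic technique is the standard normal approximation with a population proportion of 50% and an infinite population; the function names are hypothetical, and the spreadsheet may differ in rounding details.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def confidence_level(sample_size: int, margin: float) -> float:
    """Confidence level implied by a margin of error (proportion 0.5)."""
    standard_error = sqrt(0.25 / sample_size)   # sqrt(p * (1 - p) / n)
    z = margin / standard_error
    return 2 * STD_NORMAL.cdf(z) - 1


def margin_of_error(sample_size: int, confidence: float) -> float:
    """Margin of error implied by a confidence level (proportion 0.5)."""
    z = STD_NORMAL.inv_cdf((1 + confidence) / 2)
    return z * sqrt(0.25 / sample_size)


for n in (100, 400, 1500, 5000):
    # First four rows: fixed margins of error, solve for confidence level.
    for margin in (0.01, 0.02, 0.05, 0.10):
        print(f"{n:>5,}  {margin:.4f}  {confidence_level(n, margin):.4f}")
    # Next four rows: fixed confidence levels, solve for margin of error.
    for conf in (0.90, 0.95, 0.98, 0.99):
        print(f"{n:>5,}  {margin_of_error(n, conf):.4f}  {conf:.4f}")
```

The printed values should agree with the table to within rounding in the last decimal place.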

Next is a table that shows the required sample sizes for different standard values of margin of error and confidence level.

Confidence Level   Margin of Error   Sample Size
0.9000             0.0100            6,764
0.9000             0.0200            1,691
0.9000             0.0500            271
0.9000             0.1000            68
0.9500             0.0100            9,604
0.9500             0.0200            2,401
0.9500             0.0500            385
0.9500             0.1000            97
0.9800             0.0100            13,530
0.9800             0.0200            3,383
0.9800             0.0500            542
0.9800             0.1000            136
0.9900             0.0100            16,588
0.9900             0.0200            4,147
0.9900             0.0500            664
0.9900             0.1000            166

The highlighted combination (confidence level 0.9500, margin of error 0.0500) shows that the required sample size for an “exact” 95% confidence level and 5% margin of error is 385.
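The required sample sizes can be sketched the same way, by solving the normal-approximation margin of error formula for the sample size and rounding up. This assumes a population proportion of 50% and an infinite population, consistent with the basic technique described earlier; the function name required_sample_size is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(confidence: float, margin: float) -> int:
    """Smallest sample size meeting the margin at the given confidence level."""
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    # Solve margin = z * sqrt(0.25 / n) for n, then round up to a whole unit.
    return ceil((z / (2 * margin)) ** 2)


for conf in (0.90, 0.95, 0.98, 0.99):
    for margin in (0.01, 0.02, 0.05, 0.10):
        n = required_sample_size(conf, margin)
        print(f"{conf:.4f}  {margin:.4f}  {n:,}")
```

Rounding up rather than to the nearest whole number is why a 95% confidence level with a 5% margin of error calls for 385 rather than 384.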

All of these results can be reproduced using the accompanying spreadsheet, as explained in Section 5.