Revised February 18, 2015
Thanks to the following EDRM members, without whom Release 2 of Statistical Sampling Applied to Electronic Discovery would not exist:
Thank you also to Bill Dimm, Hot Neuron LLC, for additional comments and feedback.
The purpose of this document is to provide guidance regarding the use of statistical sampling in e-discovery contexts. This is an update/enhancement of material that was originally developed and posted on the EDRM website in 2012.
E-discovery participants recognize that, when used appropriately, statistical sampling can optimize resources and improve quality. However, an ongoing educational challenge is to meet the needs of two audiences within the e-discovery community.
Therefore, some of the material is covered twice. The earlier material is definitional and conceptual, and is intended for a broad audience. The later material and the accompanying spreadsheet provide additional, more technical information, to people in e-discovery roles who become responsible for developing further expertise.
The accompanying spreadsheet is EDRM Statistics Examples 20150123.xlsm.
As introductory matters, Subsection 1.2 provides a set of definitions related to statistical sampling, and Subsection 1.3 provides examples of e-discovery situations that warrant use of sampling.
Sections 2, 3, 4 and 5 examine four specific areas of statistics. The 2012 release focused only on the first of these, which is the problem of estimating the proportions of a binary population. If the observations of a population can have only two values, such as Responsive or Not Responsive, what can the proportion of each within a random sample tell us about the proportions within the total population?
The three new areas in this 2014 release are these.
These topics are presented in basic, non-technical ways in Sections 2, 3, 4 and 5.
Section 6 presents some important guidelines and considerations for using statistical sampling in e-discovery. These recommendations are intended to help avoid misuse or improper use of statistical sampling.
Sections 7, 8 and 9 are more technical. They present more formally the math that underlies the earlier material, and make use of the accompanying Excel spreadsheet.
The purpose of this section is to define, in advance, certain terms and concepts that will be used in the ensuing discussions.
The EDRM provides a great overall guide as to the individual steps and processes of e-discovery. For purposes of outlining when sampling is important and how it can be effective, the relevant portions are found largely in the middle of the EDRM.
Generally speaking, the further to the left/top in the EDRM you are sampling, the more you are assessing inclusion of all material for review, the ability to review, the types of documents to review, and other items related to management of the process. The further to the bottom/right in the EDRM you go, the more you are assessing quality control and comprehensiveness of the process. Did you review everything you need to? Have you caught all privilege? Etc. Since the purposes differ, which impacts the method used to sample, it is necessary to address each portion separately.
One basic reason to use statistical sampling is to develop an estimate of proportions within a binary population. In addition to the estimate, itself, we want to quantify our “confidence” in the estimate according to established standards. This section provides a common-sense, intuitive explanation of this process. It presents the main concepts and provides some useable specifics, but without formal math. Formal math is presented in Sections 7 and 8 for readers who are interested.
One need not be a math major or a professional statistician to have an intuitive appreciation of the following.
The value that math adds is that it provides a standard way of quantifying and discussing the intuitive concepts of confidence and closeness, and relating these to sample size.
Building on the preceding example involving political polling, the standard terminology for presenting the population estimate would be something like this:
Based on the sample size of 400 voters, A’s support is estimated to be 52% with a confidence level of 95% and a margin of error of ±5%.
Can we decode this?
One further definitional point that bears repeating is that the margin of error is a proportion of the population, and not a proportion of the estimate. Using the political polling example above, where A’s support is estimated to be 52% with a confidence level of 95% and a margin of error of ±5%, assume the sample is from a voting population of 10 million. The 52% sample proportion leads to a “point estimate” within the population of 52% of 10 million = 5,200,000. Applied to the population, the margin of error is ±5% of 10 million = ±500,000 and the confidence interval is from 4,700,000 to 5,700,000. It is not correct to say that the margin of error is ±5% of the 5,200,000 point estimate, or ±260,000.
Without getting into the math, it is fair to say – and hopefully intuitively obvious – that sample size, margin of error/confidence range and confidence level are interdependent. You want to increase the confidence level, but that requires increasing the sample size and/or increasing the margin of error. This creates tradeoffs, because you would prefer to reduce the sample size (save time and work) and/or reduce the margin of error (narrow the range).
Following are two tables that illustrate this interdependence. (These tables are derived using a very basic technique, as discussed briefly in Subsection 2.4, and then more fully discussed further in Section 8, and the accompanying spreadsheet.^{7})
Table 1 shows different possible “pairs” of margin of error and confidence level assuming sample sizes of 400 and 1,500.
Table 1

| Sample Size | Margin of Error | Conf Level |
| --- | --- | --- |
| 400 | 0.0100 | 0.3108 |
| 400 | 0.0200 | 0.5763 |
| 400 | 0.0300 | 0.7699 |
| 400 | 0.0500 | 0.9545 |
| 400 | 0.0750 | 0.9973 |
| 400 | 0.1000 | 0.9999 |
| 1,500 | 0.0100 | 0.5614 |
| 1,500 | 0.0200 | 0.8787 |
| 1,500 | 0.0300 | 0.9799 |
| 1,500 | 0.0500 | 0.9999 |
| 1,500 | 0.0750 | 1.0000 |
| 1,500 | 0.1000 | 1.0000 |
The pollster who reported a 5% margin of error with a 95% confidence level on a sample size of 400 was reporting consistently with the row showing a 0.0500 margin of error and a 0.9545 confidence level, allowing for conservative rounding. With a sample size of 400, the pollster could have just as accurately reported a 2% margin of error with a 57% confidence level or a 10% margin of error with a 99% confidence level. Once you have results for a sample of a given size, you can equivalently report small margins of error (tight ranges) with low levels of confidence, or large margins of error (wide ranges) with higher levels of confidence.
Table 1 also shows that increasing the sample size will reduce margin of error and/or increase confidence level.
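The interdependence shown in Table 1 can be verified with a short calculation. The sketch below is illustrative only (it is not part of the accompanying spreadsheet); it uses Python's standard library and the conservative assumption that the population proportion is 0.5, which is the basic technique described in Subsection 2.4 and derived in Section 8:

```python
from math import sqrt
from statistics import NormalDist

def confidence_level(n, margin_of_error, p=0.5):
    """Confidence level for a given sample size and margin of error,
    using the normal approximation with the conservative p = 0.5."""
    sigma = sqrt(p * (1 - p) / n)        # standard deviation of the sample proportion
    z = margin_of_error / sigma          # margin of error in standard-deviation units
    return 2 * NormalDist().cdf(z) - 1   # central probability between -z and +z

print(round(confidence_level(400, 0.05), 4))    # matches Table 1: 0.9545
print(round(confidence_level(1500, 0.03), 4))   # matches Table 1: 0.9799
```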
Table 2 shows the required sample sizes for different standard values of margin of error and confidence level.
Table 2

| Conf Level | Margin of Error | Sample Size |
| --- | --- | --- |
| 0.9000 | 0.0100 | 6,764 |
| 0.9000 | 0.0200 | 1,691 |
| 0.9000 | 0.0500 | 271 |
| 0.9000 | 0.1000 | 68 |
| 0.9500 | 0.0100 | 9,604 |
| 0.9500 | 0.0200 | 2,401 |
| 0.9500 | 0.0500 | 385 |
| 0.9500 | 0.1000 | 97 |
| 0.9800 | 0.0100 | 13,530 |
| 0.9800 | 0.0200 | 3,383 |
| 0.9800 | 0.0500 | 542 |
| 0.9800 | 0.1000 | 136 |
The combination of a 0.9500 confidence level and a 0.0500 margin of error shows that the required sample size for an exact 95% confidence level and 5% margin of error is actually 385.
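The values in Table 2 can be reproduced with the same basic technique. The following sketch (illustrative only, not part of the accompanying spreadsheet) uses Python's standard library; the underlying derivation appears in Section 8:

```python
from math import ceil
from statistics import NormalDist

def table2_sample_size(conf_level, margin_of_error):
    """Required sample size under the basic (p = 0.5) technique."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # two-tailed z value
    return ceil(0.25 * (z / margin_of_error) ** 2)       # 0.25 = 0.5 * (1 - 0.5)

# Reproduce all twelve rows of Table 2.
for cl in (0.90, 0.95, 0.98):
    for me in (0.01, 0.02, 0.05, 0.10):
        print(cl, me, table2_sample_size(cl, me))
```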
Consider a situation where 385 electronic documents are sampled for relevance to a particular discovery demand and only three documents are relevant. The sample proportion is thus only 3/385 = 0.007792 = 0.78%. Using Table 2, this would imply a 95% confidence level with a margin of error of ±5%. The confidence range would thus be calculated as from 0.78% – 5% = -4.22% to 0.78% + 5% = 5.78%, and this of course makes no sense. The population proportion cannot possibly be negative. Also, since there were some relevant documents in the sample, the population proportion cannot possibly be zero.
There would be a similar problem if there had been 382 relevant documents in the sample of 385.
This is a practical example that illustrates the limitations of the math behind Tables 1 and 2. Another mathematical approach is needed in these situations, and fortunately there are approaches that work. Using one of the more common techniques,^{8} we can say that the estimated population proportion is 0.78% with a 95% confidence level and a confidence range from 0.17% to 2.32%.
Notice that this confidence range is not symmetrical around 0.78%. (0.78% – 0.17% = 0.61% while 2.32% – 0.78% = 1.54%.) This is not a case where we can use the term (or concept) “margin of error” to indicate the same distance on either side of the sample proportion.
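One common exact technique, the Clopper-Pearson interval, finds the interval endpoints by searching for the population proportions whose binomial tail probabilities match the desired confidence level. Whether this is precisely the technique behind the figures quoted above is not stated, so treat the following as an illustrative sketch rather than a reproduction:

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for a Binomial(n, p) variable."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def clopper_pearson(x, n, conf=0.95, tol=1e-9):
    """Exact (Clopper-Pearson) two-sided interval for a binomial
    proportion, found by bisection on the binomial CDF."""
    alpha = 1 - conf

    def solve(cond, lo, hi):
        # Bisection: cond(p) is True below the boundary, False above it.
        while hi - lo > tol:
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if cond(mid) else (lo, mid)
        return (lo + hi) / 2

    # Lower bound: largest p with P(X >= x) <= alpha/2.
    lower = 0.0 if x == 0 else solve(
        lambda p: binom_cdf(x - 1, n, p) >= 1 - alpha / 2, 0.0, 1.0)
    # Upper bound: smallest p with P(X <= x) <= alpha/2.
    upper = 1.0 if x == n else solve(
        lambda p: binom_cdf(x, n, p) > alpha / 2, 0.0, 1.0)
    return lower, upper

lo, hi = clopper_pearson(3, 385)
print(f"{lo:.4f} to {hi:.4f}")   # an asymmetric interval around 3/385 = 0.0078
```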
Thus, it is important to remember that the simple, symmetric margin-of-error framework does not apply when the sample proportion is at or near the extremes of 0% or 100%.
If we explain only the simple math, we leave the incorrect impression that this is all one has to know. If we explain more, we go beyond what most non-mathematicians are willing to engage and digest. We resolve this dilemma by keeping things as simple as possible in Sections 2, 3, 4 and 5, and then providing more advanced material in Sections 7 and 8, and the accompanying spreadsheet.
Not every situation requires an estimate of the population proportions. In some situations, it is more important to be confident that the population proportion is zero or very close to zero than to develop an actual estimate. For example, if a set of 2,000 documents has been reviewed by a human reviewer, we might want to use sampling to develop a level of confidence that the human reviewer’s error rate is not worse than some pre-established tolerance level, such as 10%. Our concern is that the error rate not be 10% or more. Since we are not concerned with the question of whether the actual rate is 2% or 3% or 5% or whatever, this enables smaller sample sizes.
Sampling problems of this sort are addressed in an area of math known as acceptance sampling. This section provides a basic introduction. More formal math is presented in Section 9 for readers who are interested.
We can understand intuitively that if we take a sample of the documents, and there are zero errors in that sample, we can get some confidence that the total error rate in the population of 2,000 documents is low. In quantitative terms, the problem could be framed as follows.
Acceptance sampling has developed as the mathematical approach to addressing these types of questions, and has traditionally been employed in the context of quality control in manufacturing operations. The types of underlying math are the same as those used in proportion estimation.
Table 3 shows the required sample sizes for different population sizes, confidence levels and unacceptable error rates. The row for a population size of 2,000, a confidence level of 0.9500 and an unacceptable error rate of 0.1000 shows that a sample size of only 29 will meet the criteria in the example as posed.
Table 3

| Pop Size | Conf Level | Unacceptable Error Rate | Sample Size |
| --- | --- | --- | --- |
| 2,000 | 0.9000 | 0.1000 | 22 |
| 2,000 | 0.9000 | 0.0500 | 45 |
| 2,000 | 0.9000 | 0.0100 | 217 |
| 2,000 | 0.9500 | 0.1000 | 29 |
| 2,000 | 0.9500 | 0.0500 | 58 |
| 2,000 | 0.9500 | 0.0100 | 277 |
| 2,000 | 0.9800 | 0.1000 | 37 |
| 2,000 | 0.9800 | 0.0500 | 75 |
| 2,000 | 0.9800 | 0.0100 | 354 |
| 100,000 | 0.9000 | 0.1000 | 22 |
| 100,000 | 0.9000 | 0.0500 | 45 |
| 100,000 | 0.9000 | 0.0100 | 229 |
| 100,000 | 0.9500 | 0.1000 | 29 |
| 100,000 | 0.9500 | 0.0500 | 59 |
| 100,000 | 0.9500 | 0.0100 | 298 |
| 100,000 | 0.9800 | 0.1000 | 38 |
| 100,000 | 0.9800 | 0.0500 | 77 |
| 100,000 | 0.9800 | 0.0100 | 389 |
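The sample sizes in Table 3 can be reproduced by finding the smallest sample for which, if the population error rate were at the unacceptable level, the probability of observing zero errors would fall to 1 minus the confidence level or below. The following is an illustrative Python sketch (not from the accompanying spreadsheet) using exact hypergeometric, i.e. without-replacement, probabilities:

```python
def accept_zero_sample_size(pop_size, conf_level, bad_rate):
    """Smallest sample size n such that a sample with ZERO observed errors
    gives at least conf_level confidence that the population error rate is
    below bad_rate.  Uses the hypergeometric probability of no bad items."""
    bad = round(bad_rate * pop_size)   # errors in the population at the threshold rate
    prob_zero = 1.0                    # P(no bad items among the first n draws)
    n = 0
    while prob_zero > 1 - conf_level:
        # Draw one more document without replacement; multiply by the
        # chance that it, too, is a good (error-free) item.
        prob_zero *= (pop_size - bad - n) / (pop_size - n)
        n += 1
    return n

print(accept_zero_sample_size(2000, 0.95, 0.10))   # Table 3 gives 29
```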
Rigorous quality control (“QC”) review using acceptance sampling might not have been a standard procedure in legal discovery in the past, especially when the entire coding was performed by humans. The advent of machine coding has increased the recognition that QC is a vital part of the e-discovery process.
This example shows what can be done, but also just scratches the surface. An important extension, using this example, is to find a sampling approach that also minimizes the probability that we mistakenly reject a reviewed set when the actual error rate is at an acceptable level.
As noted, a more advanced technical discussion of acceptance sampling appears in Section 9.
For some time now, the legal profession and the courts have been embracing, or at least accepting, the use of technologies that offer the benefit of avoiding 100% human review of a corpus. Statistical sampling serves the important role of evaluating the performance of these technologies. After a brief discussion of key concepts and terminology, this section discusses the statistical issues that will be encountered and should be understood in these situations.
These concepts and definitions are specific enough to this section that it was premature to list them in Subsection 1.2. Different observers use some of these terms in different ways, and the goal here is not to judge that usage. The goal is simply to be clear about their meanings within this discussion.
In the following table, the columns reflect classification according to the gold standard human expert (“Actual”) and the rows reflect classification according to the computer classifier (“Predicted”).

| According to Computer Classifier (“Predicted”) | Actual Responsive | Actual Not Responsive | Total |
| --- | --- | --- | --- |
| Responsive | True Positive (TP) | False Positive (FP) | Predicted Responsive (PR = TP + FP) |
| Not Responsive | False Negative (FN) | True Negative (TN) | Predicted Not Responsive (PN = FN + TN) |
| Total | Actual Responsive (AR = TP + FN) | Actual Not Responsive (AN = FP + TN) | TOTAL (T) |
An important observation arising from the definitions in Subsection 4.1 is that sampling for recall presents a greater challenge than sampling for precision, elusion or yield.
When sampling for precision, the underlying population – predicted responsives – is a known population based on the work done by the classifier. The same is true when sampling for elusion, where the underlying population is the predicted non-responsives, and when sampling for yield, where the underlying population is the full population.
However, when sampling for recall, the underlying population – actual responsives — is not, itself, a known population until there has been gold standard review. As a result, and as will be explained, sampling for recall requires larger sample sizes than sampling for the other key metrics in order to achieve the same levels of confidence. We will discuss two common techniques for sampling for recall.
One technique has been referred to as the “Direct Method”.^{11} The essence of the direct method is to sample as many documents as necessary from a full corpus to find a sample of the required size of actual responsives. Even though the AR population is not known, a sample of actual responsives can be isolated by starting with a sample from the full population and then using human review to isolate the actual responsives from the actual non-responsives.
Thus, the required amount of human review will depend on the yield. For example, if the intent is to estimate recall based on a sample of size 400, and the actual responsives are 50% of the total population, human reviewers would have to review approximately 800 documents (i.e., 400/0.50) in order to isolate the 400 that could be used to estimate recall. (The number is approximate because the process requires review of as many documents as necessary until 400 responsives are actually found.)
Similarly, if the actual responsives are only 10% of the total population, human reviewers would have to review approximately 4,000 (i.e., 400/.10 documents) in order to isolate the 400 responsives that could be used to estimate recall.
This reality regarding sampling for estimation of recall was understood in In re Actos.^{12} The parties agreed that the initial estimate of yield (termed “richness” in the In re Actos order) would be based on a sample of size 500 (the “Control Set”).^{13} They further agreed that the sample should be increased, as necessary, “until the Control Set contains at least 385 relevant documents” to assure that the “error margin on recall estimates” would not exceed 5% at a 95% confidence level.^{14}
A second technique for estimating recall is based on expressing recall as a function of combinations of precision, elusion and/or yield. For example,
Precision = TP/Positives, so TP = Precision * Positives
Elusion = FN/Negatives, so FN = Elusion * Negatives
Recall = TP/(TP + FN), so Recall = (Precision * Positives) / (Precision * Positives + Elusion * Negatives)
We can estimate precision and elusion using binary proportion techniques, and then put those estimates into the above formula to get an estimate of recall.
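The combination of the three formulas above can be sketched as a short calculation. The counts used here are hypothetical, chosen purely for illustration:

```python
def recall_from_precision_elusion(precision, elusion, positives, negatives):
    """Estimate recall from precision and elusion estimates plus the known
    counts of predicted positives and predicted negatives."""
    tp = precision * positives   # estimated true positives
    fn = elusion * negatives     # estimated false negatives
    return tp / (tp + fn)

# Hypothetical numbers: 100,000 predicted positives at 80% precision,
# 900,000 predicted negatives at 1% elusion.
print(recall_from_precision_elusion(0.80, 0.01, 100_000, 900_000))
```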
However, it is not correct to say that this estimate of recall has the same confidence level as the individual estimates of precision and elusion. The additional math required to express the confidence levels and confidence intervals is beyond the scope of this material, but suffice it to say that this type of approach will not necessarily reduce the overall necessary sample sizes substantially relative to the direct method.
In many situations, high recall will coincide with low elusion and vice versa. Since elusion is easier to sample than recall, for the reasons noted above, some commentators recommend elusion testing as an alternative to a recall calculation.
However, it is not always the case that the high recall/low elusion relationship will hold. For example, if a population has a 1% prevalence rate and the documents identified as non-responsive by the classifier have a 1% elusion rate, the classifier performed poorly. The classifier did not perform better than random guessing. Elusion would be “low”, but this would not indicate high recall.
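The scenario described can be made concrete with hypothetical counts (chosen purely for illustration): a classifier that randomly labels half of a 1%-prevalence corpus as non-responsive will show a "low" 1% elusion rate while achieving only 50% recall, no better than chance.

```python
# Hypothetical corpus: 1,000,000 documents, 1% prevalence (10,000 responsive).
total = 1_000_000
actual_responsive = 10_000            # 1% prevalence (yield)

predicted_negative = 500_000          # half the corpus labeled non-responsive at random
fn = int(actual_responsive * 0.5)     # a random split leaves half the responsives behind
tp = actual_responsive - fn

elusion = fn / predicted_negative     # 5,000 / 500,000 = 1%
recall = tp / actual_responsive       # 5,000 / 10,000 = 50%
print(elusion, recall)
```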
Grossman and Cormack also reference this issue.^{15}
Having stated in Subsection 4.1 that it is not the purpose of this material to discuss particular classification technologies, it is still useful to make one observation that updates/corrects a point made in the EDRM statistics materials from 2012.
In discussing the use of sampling to create a seed set for the purpose of machine learning, those materials stated that it is “recognized that it is important to the process that this sample be unbiased and random.” It is no longer appropriate, if it ever was, to make this generalization.
Basically, there are multiple approaches to machine learning. They do not all use the same algorithms and protocols. Different designers and vendors use different techniques. There may be approaches under which the use of a random seed set is optimal, but there also may be approaches under which some form of judgmental sampling is more effective. One might say that the optimal protocols for seed set selection are vendor specific.
This provides a good lesson in a basic point about sampling. Randomness is not an inherently good quality. Random sampling is not inherently superior to judgmental sampling. Random sampling makes sense when your goal is to apply some specific mathematical techniques (such as confidence intervals and acceptance sampling), and those techniques depend on the specific assumption of randomness. Random sampling does not necessarily make sense when your goal is to work within a framework that is predicated on different assumptions about the incoming data.
While statistical sampling can be very powerful, it is also important that it not be used incorrectly. This section discusses some common sense considerations that might not be obvious to people with limited exposure to the use of statistics in practical contexts. The goal is to provide guidance addressed at preventing problems.
We can use the term “culling” to describe the process of removing documents from the population prior to review on the basis that those documents are believed to be non-responsive. Issues arise in the degree of certainty about non-responsiveness.
Practitioners understand that the actual standards for responsiveness or relevance can change in the course of a review. This change in standards might be based on information and observations garnered in the early stages of the review. If this is the case, then of course it would not be sound to use a sample based on one set of standards to estimate proportions under different standards.
Some calls are close calls. This does not, in itself, undermine the validity of statistical sampling, as long as the calls are being made under a consistent standard.
Do not assume that the proportion of responsive documents in a deduplicated population is the same as the proportion that had been in the pre-deduplicated population.^{16} This would only be true if the deduplication process reduced the numbers of responsives and non-responsives by the same percentage, and there is ordinarily no basis for knowing that.
As a simple example, assume a pre-deduplicated population of 500,000, of which 100,000 are responsive and 400,000 are non-responsive, for a 20% yield rate. (Of course, these amounts would not actually be known prior to sampling and/or full review.) The deduplication process removes 50,000 responsives and 350,000 non-responsives, resulting in a deduplicated population of 100,000, of which 50,000 are responsive, for a 50% yield rate. (Again, these amounts would not actually be known prior to sampling and/or full review.)
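In code form, with the figures from the example:

```python
# Figures from the example: deduplication removes different fractions of
# responsive and non-responsive documents, so the yield rate changes.
resp_before, nonresp_before = 100_000, 400_000
yield_before = resp_before / (resp_before + nonresp_before)    # 20%

resp_after = resp_before - 50_000         # 50,000 responsive duplicates removed
nonresp_after = nonresp_before - 350_000  # 350,000 non-responsive duplicates removed
yield_after = resp_after / (resp_after + nonresp_after)        # 50%

print(yield_before, yield_after)
```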
This may seem like an obvious point, but it is worth repeating because it leads to some important lessons.
It is often the case that a document can also be considered a collection of documents. A common example is an email plus its attachments. Another example would be an email thread – a back-and-forth conversation with multiple messages. Different practitioners may have different approaches to review and production in these situations.
While one may take the basic position that the full document (the email with all attachments, the full email thread) is responsive if any of it is responsive, there will be circumstances where classification of the component documents is necessary. For example, a non-privileged pdf could be attached to an otherwise privileged email to an attorney. Or a responsive email thread could include messages about the (irrelevant) company picnic – it might not be problematic to produce the full thread, but it might be determined that the company picnic parts should not be included if this is part of a machine learning seed set.
In other words, there may be good analytical reasons to analyze the component documents as distinct documents.
It is not the purpose of this material to opine generally on practice and approaches in this area. The important point from a statistics perspective is to be aware that this can result in ambiguities. The question of, “What percentage of documents is responsive?” is different if an email with attachments is considered one document or multiple documents. Depending on the need and circumstances, either question might make sense, but be aware of the need to handle them differently. Do not assume that a sampling result based on one definition provides a valid estimate for the other definition.
It is possible that there are readily identifiable sub-populations within the population. Judgment can be used to determine sub-populations of interest. An example in e-discovery is sub-populations based on custodian.
As noted above, when stating a statistical conclusion, also state the basis for that conclusion in terms of sample results and statistical methodology. Section 8, for example, explains that there are several distinct methods for development of confidence intervals, and variation within methods.
A reader who is competent in statistics ought to be able to reproduce the stated conclusion based on the input provided.
A person who is classifying a previously classified document for purposes of validation or quality control should not know the previous classification. E.g., when sampling for the purpose of estimating the precision of a predictive coding process, the sample will be drawn only from the “positives” generated by that process. However, the reviewer should not be informed in advance that the documents have already been classified as positive.
If you want to make a representation or an argument based on sample results, and you end up taking multiple samples, be prepared to show all of the samples. Do not limit your demonstration to the samples that are most supportive of your position.
Cherry picking of samples would be unsound statistical practice, and there may be questions of legal ethics.
For example, if you enter a sampling process with a plan that involves making a decision based on a sample of size 400, you cannot decide after looking at that sample to sample an additional 600 and then make the decision based on the total sample of size 1,000.
This entire discussion is limited to sampling situations where the sample space contains only two possible outcomes. Particular observations and conclusions presented here cannot necessarily be extended to cases where there are more than two possible outcomes. For example,
Consult your statistics consultant in these situations.
It is not the primary intent of these EDRM materials to present all the requisite statistical theory at the level of the underlying formulas. The amount of explanation that would be necessary to provide a “non-math” audience with a correct understanding is extensive, and would not necessarily be of interest to most members of that audience.
However, there were readers of the 2012 material who did request more rigor in terms of the statistical formulas. The basic goal in Sections 7, 8 and 9, therefore is to thread limited but technically correct paths through statistical materials, sufficient to explain confidence calculations and acceptance sampling. In addition, there is an Excel spreadsheet, EDRM Statistics Examples 20150123.xlsm, that implements most of the formulas using sample data.
The target audience for this section is mainly those who are working in e-discovery, and who already have some interest and experience with math at the college level. These could be people in any number of e-discovery roles, who have decided, or who have been called upon, to refresh and enhance their skills in this area. This material is written from the perspective of guiding this target audience. Section 7 covers basic points about the key distributions. Section 8 applies this material to calculate confidence intervals and related values. Section 9 explains acceptance sampling.
This material avoids some of the formal mathematical formulas – formulas involving factorials and integrals, for example. Instead, it presents the Microsoft Excel functions that can be used to calculate values. These avoid the more technical notation while still enabling discussion of concepts. Together with the actual spreadsheet, these should assist the reader who seeks to apply the material using Excel. References are to Excel 2010 or later.
The three main probability distributions that should be understood are the binomial distribution, the hypergeometric distribution, and the normal distribution. These are covered in standard college textbooks on probabilities and statistics. Wikipedia has articles on all of these, although, of course, Wikipedia must be used with caution.
This is the conceptually easiest model. The binomial distribution models what can happen if there are n trials of a process, each trial can only have two outcomes, and the probability of success for each trial is the same.
Pr(X = x) = BINOM.DIST(x,n,p,FALSE) | (7.2.1) |
Pr(X ≤ x) = BINOM.DIST(x,n,p,TRUE) | (7.2.2) |
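For readers who prefer code to spreadsheet functions, the binomial probabilities can be computed directly. This illustrative Python sketch (standard library only, not part of the accompanying spreadsheet) mirrors Formulas 7.2.1 and 7.2.2:

```python
from math import comb

def binom_pmf(x, n, p):
    """Pr(X = x), the Excel BINOM.DIST(x, n, p, FALSE) of Formula 7.2.1."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """Pr(X <= x), the Excel BINOM.DIST(x, n, p, TRUE) of Formula 7.2.2."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

print(binom_pmf(2, 10, 0.5))   # probability of exactly 2 successes in 10 fair trials
```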
The hypergeometric distribution models what can happen if there are n trials of a process, and each trial can only have two outcomes, but the trials are drawn from a finite population. They are drawn from this population “without replacement”, meaning that they are not returned to the population and thus cannot be selected again.
Pr(X = x) = HYPGEOM.DIST(x,n,M,N,FALSE) | (7.3.1) |
Pr(X ≤ x) = HYPGEOM.DIST(x,n,M,N,TRUE) | (7.3.2) |
HYPGEOM.DIST(x,n,M,N,TRUE) ~ BINOM.DIST(x,n,(M/N),TRUE) | (7.3.3) |
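Similarly, the hypergeometric probabilities of Formula 7.3.1 can be computed directly, and the approximation in Formula 7.3.3 checked, with an illustrative Python sketch (the particular numbers are arbitrary):

```python
from math import comb

def hypgeom_pmf(x, n, M, N):
    """Pr(X = x) when drawing n items without replacement from a population
    of N containing M successes (the Excel HYPGEOM.DIST of Formula 7.3.1)."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Formula 7.3.3: a small sample from a large population is nearly binomial
# with p = M/N.  Compare the two at x = 3, n = 20, M = 5,000, N = 100,000.
h = hypgeom_pmf(3, 20, 5_000, 100_000)
b = comb(20, 3) * 0.05**3 * 0.95**17
print(h, b)   # the two values agree closely
```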
Mean: np | (7.4.1) |
Standard Deviation: (np(1-p))^{0.5} | (7.4.2) |
(The 0.5 exponent indicates square root.)
X̄ = X/n | (7.4.3) |
Mean: p | (7.4.4) |
Standard Deviation: (p(1-p)/n)^{0.5} | (7.4.5) |
The normal distribution is the familiar “bell curve”. It is more abstract than the binary binomial and hypergeometric distributions. However, it has some very useful characteristics.
Pr(X ≤ x) = NORM.DIST(x,µ,σ,TRUE) | (7.5.1) |
x = µ + z σ | (7.5.2) |
In other words, x is expressed as being z standard deviations away from the mean. Stated equivalently, z is the number of standard deviations that x is away from the mean.
The total probability under the curve is 1.00, the total of all possible outcomes. Also, for the normal distribution, the probability of being under any specific part of the curve depends only on z, whatever the values of µ or σ.
Table 4: Normal Distribution Probabilities

| z | Pr (X ≤ µ + zσ) |
| --- | --- |
| -4 | 0.000032 |
| -3 | 0.001350 |
| -2 | 0.022750 |
| -1 | 0.158655 |
| 0 | 0.500000 |
| 1 | 0.841345 |
| 2 | 0.977250 |
| 3 | 0.998650 |
| 4 | 0.999968 |
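The probabilities in Table 4, and the fact that they depend only on z regardless of µ and σ, can be checked with Python's standard-library NormalDist (an illustrative sketch):

```python
from statistics import NormalDist

# NORM.DIST(x, mu, sigma, TRUE) corresponds to NormalDist(mu, sigma).cdf(x).
std = NormalDist()                 # standard normal: mean 0, standard deviation 1
for z in range(-4, 5):
    print(z, round(std.cdf(z), 6))

# Any normal distribution gives the same probability at the same z:
other = NormalDist(mu=100, sigma=15)
print(other.cdf(100 + 2 * 15))     # same as std.cdf(2)
```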
Z = (X- µ)/σ | (7.5.3) |
z = (x- µ)/σ | (7.5.4) |
Thus,
Pr(X ≤ x) = Pr(Z ≤ z) = NORM.DIST(z,0,1,TRUE) | (7.5.5) |
NORM.DIST(z,0,1,TRUE) = NORM.S.DIST(z,TRUE) | (7.5.6) |
NORM.DIST(-z,0,1,TRUE) = 1 – NORM.DIST(z,0,1,TRUE) | (7.6.1) |
BINOM.DIST(x,n,p,TRUE) ~ NORM.DIST(x/n,p,(p(1-p)/n)^{0.5},TRUE) | (7.7.1) |

Equivalently,

BINOM.DIST(x,n,p,TRUE) ~ NORM.DIST(z,0,1,TRUE) | (7.7.2) |

where

z = (x/n – p)/((p(1-p)/n) ^{0.5}) | (7.7.3) |
BINOM.DIST(x,n,p,FALSE) ~ NORM.DIST((x+0.5)/n,p,(p(1-p)/n)^{0.5},TRUE) – NORM.DIST((x-0.5)/n,p,(p(1-p)/n)^{0.5},TRUE) | (7.7.4) |
BINOM.DIST(x,n,p,TRUE) ~ NORM.DIST((x+0.5)/n,p,(p(1-p)/n)^{0.5},TRUE) | (7.7.5) |
Equivalently,
BINOM.DIST(x,n,p,TRUE) ~ NORM.DIST(z,0,1,TRUE) | (7.7.6) |
where
z = ((x+0.5)/n – p)/((p(1-p)/n) ^{0.5}) | (7.7.7) |
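The quality of the continuity-corrected approximation in Formula 7.7.5 can be checked directly. This illustrative Python sketch compares the exact binomial cumulative probability with the normal approximation (the particular numbers are arbitrary):

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(x, n, p):
    """Exact Pr(X <= x) for Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def normal_approx_cdf(x, n, p):
    """Normal approximation with continuity correction (Formula 7.7.5)."""
    sigma = sqrt(p * (1 - p) / n)
    return NormalDist(mu=p, sigma=sigma).cdf((x + 0.5) / n)

exact = binom_cdf(190, 400, 0.5)
approx = normal_approx_cdf(190, 400, 0.5)
print(exact, approx)   # the two values agree closely
```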
Pr(X ≤ x) = NORM.DIST(x,µ,σ,TRUE) | (7.8.1) |
x = NORM.INV(prob,µ,σ) | (7.8.2) |
z = NORM.INV(prob,0,1) | (7.8.3) |
and then calculate x as
x = z σ + µ | (7.8.4) |
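The inverse relationships in Formulas 7.8.2 through 7.8.4 have direct standard-library equivalents (an illustrative sketch; the µ and σ values are arbitrary):

```python
from statistics import NormalDist

# NORM.INV(prob, mu, sigma) corresponds to NormalDist(mu, sigma).inv_cdf(prob).
z = NormalDist().inv_cdf(0.975)   # Formula 7.8.3
print(round(z, 4))                # the familiar 1.96 of 95% confidence

# Formula 7.8.4: x = z * sigma + mu gives the same x as inverting directly.
mu, sigma = 0.5, 0.025
x = z * sigma + mu
print(x, NormalDist(mu, sigma).inv_cdf(0.975))
```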
The most basic approach – the one behind the figures in Tables 1 and 2 in Section 2 – is known as the Wald method. We explain this and then reference some techniques that are generally considered superior.
Pr (X ≤ µ-e) = NORM.DIST(µ-e, µ, σ,TRUE) | (8.1.1) |
Pr (µ-e ≤ X ≤ µ+e) = NORM.DIST(µ+e, µ, σ,TRUE) – NORM.DIST(µ-e, µ, σ,TRUE) | (8.1.2) |
Pr (X ≥ µ + e) = 1 – NORM.DIST(µ+e, µ, σ,TRUE) | (8.1.3) |
The total area under the curve represents the total probability of all outcomes. I.e., the total area adds up to 1.00. The middle area – the “central region” – represents the probability of an outcome between (µ – e) and (µ + e). The areas on the left and right are referred to as the “tails”. The size of the left tail represents the probability of an outcome less than (µ – e). The size of the right tail represents the probability of an outcome greater than (µ + e).
Pr (X ≤ p-e) = NORM.DIST(p-e, p, (p(1-p)/n)^{0.5},TRUE) | (8.1.4) |
Pr (p-e ≤ X ≤ p+e) = NORM.DIST(p+e, p, (p(1-p)/n)^{0.5},TRUE) – NORM.DIST(p-e, p, (p(1-p)/n)^{0.5},TRUE) | (8.1.5) |
Pr (X ≥ p + e) = 1 – NORM.DIST(p+e, p, (p(1-p)/n)^{0.5},TRUE) | (8.1.6) |
Subsection 8.1 shows that if we know the mean and the standard deviation, we can determine the probability that some observed sample result will be within some interval around the mean. Of course, the problem when sampling is the opposite – we know the observed sample result and we want to quantify the confidence that the actual mean is within some interval around the observed sample result.
We envision two normal curves, one on either side of the observed sample proportion, p̂. p̂ (pronounced “p-hat”) is calculated as p̂= x/n and is an estimate of the actual proportion p.
Pr (X ≥ x/n) = (1-CL)/2 = 1 – NORM.DIST(x/n, pL, (pL(1- pL)/n)^{0.5},TRUE) | (8.2.1) |
Pr (X ≤ x/n) = (1-CL)/2 = NORM.DIST(x/n, pU, (pU(1- pU)/n)^{0.5},TRUE) | (8.2.2) |
The Wald Method makes the simplifying assumption that the standard deviation components in Formulas 8.2.1 and 8.2.2 can both be approximated by the known quantity (p̂(1-p̂)/n)^{0.5}, resulting in the following formulas.
Pr (X ≥ x/n) = (1-CL)/2 = 1 – NORM.DIST(p̂, pL, (p̂(1-p̂)/n)^{0.5},TRUE) | (8.2.3) |
Pr (X ≤ x/n) = (1-CL)/2 = NORM.DIST(p̂, pU, (p̂(1-p̂)/n)^{0.5},TRUE) | (8.2.4) |
This simplifying assumption also implies that pL and pU are equidistant from p̂ such that
p̂ – pL = pU – p̂ | (8.2.5) |
(1-CL)/2 = 1- NORM.DIST(p̂, pL, (p̂(1-p̂)/n)^{0.5}, TRUE) | |
1 – (1-CL)/2 = NORM.DIST(p̂, pL, (p̂(1-p̂)/n)^{0.5}, TRUE) | |
1 – (1-CL)/2 = NORM.DIST((p̂ – pL) / (p̂(1-p̂)/n)^{0.5}, 0, 1, TRUE) | |
1 – (1-CL)/2 = NORM.DIST((pU – p̂) / (p̂(1-p̂)/n)^{0.5}, 0, 1, TRUE) | |
pU = NORM.INV(1-(1-CL)/2, p̂, (p̂(1-p̂)/n)^{0.5}) | (8.2.6) |
pU – p̂ = NORM.INV(1-(1-CL)/2, p̂, (p̂(1-p̂)/n)^{0.5}) – p̂ | |
ME = NORM.INV(1-(1-CL)/2, p̂, (p̂(1-p̂)/n)^{0.5}) – p̂ | (8.2.7) |
pL = NORM.INV((1-CL)/2, p̂, (p̂(1-p̂)/n)^{0.5}) | (8.2.8) |
ME = p̂ – NORM.INV((1-CL)/2, p̂, (p̂(1-p̂)/n)^{0.5}) | (8.2.9) |
CL = NORM.DIST(p̂+ME, p̂, (p̂(1-p̂)/n)^{0.5}, TRUE) – NORM.DIST(p̂-ME, p̂, (p̂(1-p̂)/n)^{0.5},TRUE) | (8.2.10) |
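The Wald formulas 8.2.6 through 8.2.9 translate directly into Python. The following is a minimal sketch (the function name and the example of 150 responsive documents in a sample of 1,000 are illustrative assumptions):

```python
import math
from statistics import NormalDist

def wald_interval(x, n, cl):
    """Wald confidence interval (Formulas 8.2.6 - 8.2.9)."""
    p_hat = x / n
    sd = math.sqrt(p_hat * (1 - p_hat) / n)
    nd = NormalDist(p_hat, sd)
    p_upper = nd.inv_cdf(1 - (1 - cl) / 2)   # (8.2.6)
    p_lower = nd.inv_cdf((1 - cl) / 2)       # (8.2.8)
    me = p_upper - p_hat                     # (8.2.7)
    return p_lower, p_upper, me

pl, pu, me = wald_interval(150, 1000, 0.95)
print(pl, pu, me)  # interval is symmetric around p-hat = 0.15
```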
What if we have not yet taken a sample? Instead of using any of formulas 8.2.6 through 8.2.10 as presented above, simply use 0.50 in place of p̂. This will provide conservative results in the sense that ME will be greater, or CL will be lower, than for any other value of p̂. This was the technique used to generate the values in Table 1 in Subsection 2.3. Or, for a smaller but still conservative result, use any value that is closer to 0.50 than the “worst case” anticipated sample proportion.
Finally, when solving for a sample size that will produce a desired CL and ME, one cannot start with a sample average (because that would already depend on having used some sample size). Thus, solve for n in terms of CL, ME and a hypothetical p.
p+ME = NORM.INV(1-(1-CL)/2, p, (p(1-p)/n)^{0.5}) | |
(p+ME-p)/((p(1-p)/n)^{0.5})= NORM.INV(1-(1-CL)/2, 0,1) | |
ME/NORM.INV(1-(1-CL)/2,0,1) = (p(1-p)/n)^{0.5} | |
(ME/NORM.INV(1-(1-CL)/2,0,1))^{2} = p(1-p)/n | |
n= p(1-p)/((ME/NORM.INV(1-(1-CL)/2,0,1))^{2}) | (8.2.11) |
Formula 8.2.11 provides the sample size, given a desired CL and ME. The quantity p(1-p) is maximized – and thus the value of n is conservatively maximized – at p = 0.5. By using this maximum sample size, we are sure to meet the desired confidence level and margin of error. This is the basis for Table 2 in Subsection 2.3. If p turns out to be less than 0.5 or more than 0.5, the confidence level will be greater and/or the margin of error will be lower.
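Formula 8.2.11 can be sketched in Python as follows. Rounding up to the next whole document is this sketch's choice (the function name is illustrative):

```python
import math
from statistics import NormalDist

def wald_sample_size(cl, me, p=0.5):
    """Formula 8.2.11: sample size for a desired CL and ME.
    p = 0.5 is the conservative worst case."""
    z = NormalDist(0, 1).inv_cdf(1 - (1 - cl) / 2)
    return math.ceil(p * (1 - p) / (me / z) ** 2)

print(wald_sample_size(0.95, 0.05))  # 95% confidence, +/- 5% -> 385
print(wald_sample_size(0.95, 0.02))  # 95% confidence, +/- 2% -> 2401
```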
The Wald method is presented here because it is in common use, but it is generally regarded as inferior to the Wilson and Binomial techniques discussed next. This technique should not be used if n is “too small” or if p is “too close” to either 0 or 1.
We state without proof or mathematical justification that the following constraints should both be satisfied when using the Wald method.
n > 9p/(1-p) | (8.2.12) |
n > 9(1-p)/p | (8.2.13) |
Thus, if p = 0.10, n should be at least 81. If p = 0.01, n should be at least 891.
In developing a confidence interval using the Wald method, there were two significant simplifying assumptions: the assumed standard deviation was based on the observed sample proportion, p̂, and pL and pU were assumed to be equidistant from p̂. The Wilson method again envisions two normal curves, but does not make these simplifying assumptions. For this reason, the illustrated curves are different in shape from one another.
Pr (X ≥ x/n) = (1-CL)/2 = 1 – NORM.DIST(x/n, pL, (pL(1- pL)/n)^{0.5},TRUE) | (8.3.1) |
Pr (X ≤ x/n) = (1-CL)/2 = NORM.DIST(x/n, pU, (pU(1- pU)/n)^{0.5},TRUE) | (8.3.2) |
(1-CL)/2 = NORM.DIST(x/n, pU, (pU(1- pU)/n)^{0.5},TRUE) | |
(x/n) = NORM.INV((1-CL)/2, pU, (pU(1- pU)/n)^{0.5}) | |
(x/n – pU) = NORM.INV((1-CL)/2, 0, 1) (pU(1- pU)/n)^{0.5} | |
(x/n – pU)^{2} = NORM.INV((1-CL)/2, 0, 1)^{2} (pU(1- pU)/n) | |
(x/n)^{2} – 2(x/n) pU + (pU)^{2} = (NORM.INV((1-CL)/2, 0, 1)^{2} / n)(pU – pU^{2}) | |
0 = (pU)^{2} (1 + NORM.INV((1-CL)/2, 0, 1)^{2} / n) – pU (2(x/n) + NORM.INV((1-CL)/2, 0, 1)^{2} / n) + (x/n)^{2} | (8.3.3) |
Equation (8.3.3) is a quadratic equation in pU, with constants a, b and c.
a = 1 + NORM.INV((1-CL)/2, 0, 1)^{2} / n | (8.3.4) |
b = –(2(x/n) + NORM.INV((1-CL)/2, 0, 1)^{2} / n) | (8.3.5) |
c = (x/n)^{2} | (8.3.6) |
So,
pU = (-b ± (b^{2} – 4ac)^{0.5})/(2a) | (8.3.7) |
A similar derivation of pL will yield the same result, so pU is the higher root and pL is the lower root.
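The quadratic-formula solution can be sketched in Python as follows (the function name and the 150-in-1,000 example are illustrative assumptions; the coefficients follow the a, b, c of Formulas 8.3.4 through 8.3.6):

```python
import math
from statistics import NormalDist

def wilson_interval(x, n, cl):
    """Wilson interval: the two roots of quadratic (8.3.3), via (8.3.7)."""
    p_hat = x / n
    z = NormalDist(0, 1).inv_cdf(1 - (1 - cl) / 2)
    a = 1 + z**2 / n
    b = -(2 * p_hat + z**2 / n)
    c = p_hat**2
    disc = math.sqrt(b**2 - 4 * a * c)
    return (-b - disc) / (2 * a), (-b + disc) / (2 * a)   # (pL, pU)

pl, pu = wilson_interval(150, 1000, 0.95)
print(pl, pu)  # slightly asymmetric around 0.15, unlike the Wald interval
```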
Instead of using two normal curves, as in the Wald and Wilson methods, we can take the same perspective based on two binomial distributions. Instead of calculating a sample proportion, we need only the observed number of “successes”, e.g., x responsive documents in a sample of n documents. The initial equations parallel those of the other methods.
Pr (X ≥ x) = (1-CL)/2 = 1 – BINOM.DIST(x-1,n,pL,TRUE) | (8.4.1) |
Pr (X ≤ x) = (1-CL)/2 = BINOM.DIST(x,n,pU,TRUE) | (8.4.2) |
The Binomial technique is sometimes referred to as the Clopper-Pearson interval. Because it reflects the actual sampling model and not a normal approximation, it is sometimes also referred to as an “exact” method. One can also apply a Hypergeometric analogue to the Binomial, but this is outside the scope of this material.
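Because Formulas 8.4.1 and 8.4.2 cannot be inverted in closed form, a numeric search is needed. The following sketch uses simple bisection, relying on the fact that the binomial CDF decreases as p grows (the function names and the 15-in-100 example are illustrative assumptions, not from the spreadsheet):

```python
import math

def binom_cdf(x, n, p):
    """Excel's BINOM.DIST(x, n, p, TRUE)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def solve_decreasing(f, target):
    """Bisection: find p in (0,1) with f(p) = target, f decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, cl):
    """Clopper-Pearson ('exact') interval per Formulas 8.4.1 and 8.4.2."""
    alpha2 = (1 - cl) / 2
    p_upper = solve_decreasing(lambda p: binom_cdf(x, n, p), alpha2)
    p_lower = solve_decreasing(lambda p: binom_cdf(x - 1, n, p), 1 - alpha2)
    return p_lower, p_upper

p_lower, p_upper = clopper_pearson(15, 100, 0.95)
print(p_lower, p_upper)
```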
Subsection 7.4 provides the mean and standard deviation for the average of the binomial distribution. The corresponding values for the average of the hypergeometric distribution, where M is the number of successes and N is the population size, are
Mean: M/N | (8.5.1) |
Standard Deviation: (((M/N)(1-(M/N))/n)((N-n)/(N-1)))^{0.5} | (8.5.2) |
These values can be used in place of the binomial values in the normal approximations in Subsections 8.2 and 8.3 to reflect the finite population impact. Also, HYPGEOM.DIST can be used in place of BINOM.DIST in Subsection 8.4.
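The finite-population adjustment can be sketched in Python. The example values (M = 200 successes in a population of N = 2,000, samples of n = 100) are illustrative assumptions; `hyper_cdf` plays the role of Excel's HYPGEOM.DIST:

```python
import math

def hyper_pmf(k, n, M, N):
    """Pr(X = k): Excel's HYPGEOM.DIST(k, n, M, N, FALSE)."""
    return math.comb(M, k) * math.comb(N - M, n - k) / math.comb(N, n)

def hyper_cdf(x, n, M, N):
    """Pr(X <= x): Excel's HYPGEOM.DIST(x, n, M, N, TRUE)."""
    return sum(hyper_pmf(k, n, M, N) for k in range(x + 1))

# Standard deviation of the sample proportion: binomial vs. Formula 8.5.2
M, N, n = 200, 2000, 100
p = M / N
sd_binomial = math.sqrt(p * (1 - p) / n)
sd_hyper = math.sqrt((p * (1 - p) / n) * (N - n) / (N - 1))
print(sd_binomial, sd_hyper)  # the finite-population sd is slightly smaller
```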
The most basic approach is to solve for confidence level, margin of error or sample size in terms of the other two. When this is done, the math makes the conservative assumptions that (1) the proportion of successes is 0.50, and (2) the underlying population size is infinite.
Greater precision is possible if the actual proportion of successes and/or the size of the finite population are known. The Excel examples help to demonstrate this. The tradeoff is that this requires more intricate math. Over the course of a project, one can start with conservative standard guidelines and evolve toward a more precise picture as more is known.
Section 3 highlighted an example in which we wanted to establish with 95% confidence that the defect rate in a population of size 2,000 is less than 10%. How big must the sample be such that zero defects in the sample establishes this level of confidence?
Let us first assume the population is infinite or very large, so that the defect rate does not change once a sample is drawn from the population. We define u as the unacceptable defect rate. The probability that any single draw is not a defect is therefore (1-u). The probability of zero defects in two draws is thus
(1-u)*(1-u) = (1-u)^{2} | (9.1.1) |
The probability of zero defects in n draws is
(1-u)^{n} | (9.1.2) |
The probability of one or more defects in n draws is
1 – (1-u)^{n} | (9.1.3) |
If u is 10% in this example, and our approach is to reject the population if there are one or more defects in the samples, we will have 95% confidence of seeing one or more defects if
1 – (1-.10)^{n} ≥ .95 | (9.1.4) |
Equivalently,
(1-.10)^{n} ≥ (1-.95) | (9.1.5) |
With this formulation, we can solve for the lowest necessary n using logarithms.
ln ((1-.10)^{n}) = ln (1-.95) | |
n ln (1-.10) = ln (1-.95) | |
n = ln (.05)/ln(.90) | |
n = -2.99573/(-0.10536) = 28.43316 | |
n = 29 | (9.1.6) |
In other words, if the defect rate is actually 10%, and we take samples of size 29, we will see at least one defect 95% of the time. If our rule is to accept the lot if we see zero defects, we will incorrectly accept this unacceptable defect rate less than 5% of the time.
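The logarithm-based solution in Formulas 9.1.4 through 9.1.6 can be sketched in one line of Python (the function name is illustrative; rounding up to the next whole document mirrors the derivation above):

```python
import math

def zero_defect_sample_size(u, cl):
    """Smallest n such that zero observed defects rules out an unacceptable
    defect rate u at confidence CL (Formula 9.1.5, solved with logarithms)."""
    return math.ceil(math.log(1 - cl) / math.log(1 - u))

print(zero_defect_sample_size(0.10, 0.95))  # -> 29, matching Formula 9.1.6
```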
We can generalize this problem in two ways. First, instead of specifying a confidence level CL, such as 95%, it is more meaningful to specify (1-CL) as the maximum probability of accepting an unacceptable defect rate, MaxAccUn. Second, recognize that (1-u)^{n} is the binomial probability of zero observations in a sample of size n if the rate is u. So, Formula 9.1.5 can be stated as
BINOM.DIST(0,n,0.10,TRUE) ≤ 0.05 | (9.1.7) |
BINOM.DIST(0,n,u,TRUE) ≤ MaxAccUn | (9.1.8) |
Because acceptance sampling of this sort will typically involve limited, finite populations, it makes sense to present this relationship using the hypergeometric distribution. Our goal is to find the lowest n such that
HYPGEOM.DIST(0,n,(u*N),N,TRUE) ≤ MaxAccUn | (9.1.9) |
Or, specifically in this case, with u = 10%, N = 2000 and MaxAccUn = 5%,
HYPGEOM.DIST(0,n,200,2000,TRUE) ≤ 0.05 | (9.1.10) |
With the hypergeometric, we cannot use logarithms for a direct solution. The example in the Excel spreadsheet uses a VBA function that searches for n. Table 3 in Section 3 shows some examples.
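A search equivalent to the spreadsheet's VBA function can be sketched in Python (this is an illustrative re-implementation, not the spreadsheet's code): simply increment n until Formula 9.1.9 is satisfied.

```python
import math

def hyper_cdf(x, n, M, N):
    """Pr(X <= x): Excel's HYPGEOM.DIST(x, n, M, N, TRUE)."""
    def pmf(k):
        return math.comb(M, k) * math.comb(N - M, n - k) / math.comb(N, n)
    return sum(pmf(k) for k in range(x + 1))

def zero_defect_n_finite(u, N, max_acc_un):
    """Lowest n satisfying Formula 9.1.9 for a finite population of size N."""
    M = round(u * N)   # number of defects at the unacceptable rate
    for n in range(1, N + 1):
        if hyper_cdf(0, n, M, N) <= max_acc_un:
            return n

n = zero_defect_n_finite(0.10, 2000, 0.05)
print(n)  # at or slightly below the infinite-population answer of 29
```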
The basic problem with a rule that accepts only on zero defects is that you might find a defect, and thus reject a lot, even though the underlying defect rate is acceptable.
In the example that we have been using, what if the actual defect rate is only 1%, i.e., 20 defects in the population of 2000, and this is an acceptable rate? The probability of one or more defects in a sample of size 29 is
Pr(X ≥ 1) = 1 – Pr(X = 0) | (9.2.1) |
Pr(X ≥ 1) = 1 – HYPGEOM.DIST(0,29,20,2000,TRUE) | (9.2.2) |
Pr(X ≥ 1) = 1 – 0.7456 = 0.2544 | (9.2.3) |
In other words, we are happy to have a test that rejects 95% of the time if the defect rate is an unacceptable 10%, but we are not happy that the same test rejects more than 25% of the time even when the defect rate is an acceptable 1%.
This leads to an expansion of the set of definitions.
u = Unacceptable defect rate (as previously defined)
a = Acceptable defect rate
MaxAccUn = Maximum prob of accepting lot with unacceptable defect rate (as previously defined)
MaxRejAcc = Maximum prob of rejecting lot with acceptable defect rate
x = highest number of defects that will cause us to accept the lot (previously always zero)
n = required sample size (as previously defined)
The task now is to solve for the lowest x, and the associated n, that satisfy both of the following relationships.
HYPGEOM.DIST(x,n,(u*N),N,TRUE) ≤ MaxAccUn | (9.2.4) |
1 – HYPGEOM.DIST(x,n,(a*N),N,TRUE) ≤ MaxRejAcc | (9.2.5) |
HYPGEOM.DIST(x,n,(a*N),N,TRUE) ≥ 1 – MaxRejAcc | (9.2.6) |
Now we are solving for two numbers, x and n, and there is certainly no direct method of calculation. The accompanying spreadsheet provides a VBA function that is an array function, meaning that it solves for more than one value. It shows that, in the case where N = 2000, u = 10%, a = 1%, MaxAccUn = 5% and MaxRejAcc = 5%, the testing can be done with a sample size of 61 and a rule that you accept the lot if there are up to 2 defects and reject the lot if there are more than 2 defects.
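The two-constraint search can be sketched in Python as follows (an illustrative re-implementation of the idea, not the spreadsheet's VBA code). For each candidate x, it finds the smallest n meeting Formula 9.2.4; since further increasing n only makes Formula 9.2.6 harder to satisfy, the first n is the only candidate worth checking before moving to the next x.

```python
import math

def hyper_cdf(x, n, M, N):
    """Pr(X <= x): Excel's HYPGEOM.DIST(x, n, M, N, TRUE)."""
    def pmf(k):
        if k > M or n - k > N - M:
            return 0.0
        return math.comb(M, k) * math.comb(N - M, n - k) / math.comb(N, n)
    return sum(pmf(k) for k in range(x + 1))

def acceptance_plan(N, u, a, max_acc_un, max_rej_acc):
    """Lowest x, and associated n, satisfying Formulas 9.2.4 and 9.2.6."""
    for x in range(0, N):
        for n in range(x + 1, N + 1):
            if hyper_cdf(x, n, round(u * N), N) <= max_acc_un:   # (9.2.4)
                if hyper_cdf(x, n, round(a * N), N) >= 1 - max_rej_acc:  # (9.2.6)
                    return x, n
                break   # larger n only worsens 9.2.6; try the next x
    return None

x_acc, n_req = acceptance_plan(2000, 0.10, 0.01, 0.05, 0.05)
print(x_acc, n_req)
```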
The accompanying Excel spreadsheet, EDRM Statistics Examples 20150123.xlsm, implements relevant calculations supporting Sections 7, 8 and 9. This spreadsheet was developed using Microsoft Excel 2013.
Notice – This spreadsheet is an .xlsm, meaning that it contains VBA code (macros). You may have to adjust your security settings in order to view and use them.
Caveat – This spreadsheet is intended to assist in learning. EDRM does not warrant the accuracy of this spreadsheet.
There is a Notes page, with some descriptive information that appears here. There are then pages for each of Sections 7, 8 and 9. Basically, these pages provide examples for (most of) the numbered formulas that appear in those sections.