Section 1 – Introduction

April 27, 2012

1.1. Basic Concepts and Definitions

The purpose of this section is to define, in advance, certain terms and concepts that will be used in the ensuing discussions.

  • Sampling – The process of inferring information about a full population based on observations of a subset of the population.
    • Sample – The subset is referred to as the “sample”.
    • Population – The total group from which the sample is drawn. Might also be referred to as the “universe”.
  • Statistical sampling – Sampling that is done according to certain constraints and procedures, and thus conforms to certain mathematical models (“statistical models”) that can be used to quantify the implications of the sample observations for the total population.
  • Judgmental sampling – In this context, the “opposite” of statistical sampling. Sampling that does not adhere to the constraints of statistical sampling, and thus cannot be used to reach the same quantitative conclusions as statistical sampling. Also known as informal sampling, intuitive sampling, or heuristic sampling.
  • Member (of the population) – Each individual unit or entity within the population.
  • Observation – When a member of the population is selected for the sample, that member of the population is said to have been “observed”. The sample is composed of observations.
  • Attribute (of interest) – Members of a population, such as a collection of electronically stored documents, will have many characteristics or “attributes”, for example date, file type, and source/custodian. However, the purpose of statistical sampling is typically not to infer information about all of these. The purpose is typically limited to inferring information about one attribute of interest. In e-discovery, the attribute of interest is often “responsiveness”. Another example of an attribute that may be of interest is whether the document is privileged.
    • As a general point, many attributes, such as dates and custodians, are easily known and aggregated by the computer. It is easy to know about these attributes for the full population. The purpose of sampling will typically be to learn about attributes that require some work to evaluate.
  • Sample space – All the possible outcomes of an observation. More precisely, all the possible values of an attribute.
    • Where the attribute of interest is responsiveness, the possible values are “Responsive” or “Not Responsive”.
  • Binary – When a sample space has only two possible outcomes (True or False, Heads or Tails, Responsive or Not Responsive), the attribute can be referred to as “binary”. Another term for this is “dichotomous”.
    • It is not binary if there are three possible outcomes.
  • Proportion(s) – In a situation involving a binary attribute, this refers to the percentage of each outcome.
    • The sample proportion(s) are the observed percentages within the sample, such as 60% Responsive and 40% Not Responsive. The sum has to be 100%.
    • We can also refer to the “underlying” population proportion(s) or the “actual” population proportion(s). This, of course, is the information that we do not know and are trying to estimate.
    • (In this document, proportions/percentages might be expressed in decimal form as well as percentage form. E.g., “0.60” is the same as 60%.)
  • Randomness – A critical requirement for statistical sampling. A random sample of a population is a sample in which each member of the population has an equal probability of being selected in the sample.
    • Pseudo-randomness – Since true randomness is hard to implement, pseudo-randomness refers to the use of techniques that mimic the effect of random selection, and that are thus viewed as adequate where there is a mathematical assumption of randomness.
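
The definitions above (population, sample, binary attribute, sample proportion, pseudo-randomness) can be illustrated with a minimal Python sketch. The population below is hypothetical and invented purely for illustration, and the function name `sample_proportion` is an assumption, not something defined in this document. The standard-library `random` module is a pseudo-random generator, so this is pseudo-random selection in the sense described above.

```python
import random

def sample_proportion(population, sample_size, seed=None):
    """Draw a simple random sample and return the observed proportion of True values."""
    # Pseudo-random generator; passing a seed makes the draw reproducible.
    rng = random.Random(seed)
    # Selection without replacement: every member has an equal chance of being chosen.
    sample = rng.sample(population, sample_size)
    # True counts as 1 and False as 0, so the sum is the number of responsive observations.
    return sum(sample) / sample_size

# Hypothetical population: 10,000 documents, of which 3,000 are responsive (True).
population = [True] * 3000 + [False] * 7000

p_hat = sample_proportion(population, 400, seed=42)
print(f"Sample proportion responsive:     {p_hat:.3f}")
print(f"Sample proportion not responsive: {1 - p_hat:.3f}")
```

In practice the underlying population proportion (30% here) is unknown; the sketch fixes it only so that the sample proportion can be compared against it.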

1.2. Mathematical Techniques

In terms of math, the focus of this document is on the problem of estimating the proportions of a binary population. This is generally amenable to analysis using simple, well-established and well-understood statistical techniques. If the observations of a population can have only two values, such as Responsive or Not Responsive, what can the proportion of each within a random sample tell us about the proportions within the total population? A basic, non-technical presentation is in Section 2. Additional information and Excel examples are presented in Section 4.
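
As a rough illustration of the kind of question posed above, the following Python sketch computes a confidence interval for a binary proportion using the normal approximation. This is one common textbook approach and is an assumption here; it is not necessarily the method presented in Sections 2 and 4. The function name and the counts (240 responsive documents in a sample of 400) are hypothetical.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a binary proportion.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    Returns (sample proportion, lower bound, upper bound).
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    # Standard error of a sample proportion under simple random sampling.
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp the bounds so they stay within the valid range for a proportion.
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical sample: 240 of 400 sampled documents were Responsive.
p_hat, low, high = proportion_confidence_interval(240, 400)
print(f"Sample proportion: {p_hat:.3f}, interval: {low:.3f} to {high:.3f}")
```

With these hypothetical counts the sample proportion is 0.60 and the interval runs from about 0.55 to 0.65. The normal approximation is known to be less reliable for small samples or for proportions close to 0 or 1.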

1.3. Potential e-Discovery Situations that Warrant Sampling

Here are some examples of e-discovery situations where statistical sampling may be useful.

  • Preservation/Collection – Determining whether a particular source should be preserved and collected for an investigation or civil case, e.g., a large group share (folder) located on a corporate network. Sampling files from this location may be a cost-effective means for a case team to assess whether the contents of this source location are material to their case.
  • Processing – Sampling the error files and text-not-found files identified by the processing software can serve as part of a quality assurance methodology.
  • Analysis – Sampling the results of a keyword search that was executed against a document universe. Sampling both the keyword-responsive and the keyword non-responsive (negative) results can provide an estimate of the quality of the search strategy that was applied.
    • As an example, a case team can use this information to refine or remove “noisy” terms that yield primarily non-relevant results.
    • Sampling from the documents that fell outside the keyword search results (the negative results) can provide further guidance to the case team on the quality of the search strategy. Using sampling, a case team may estimate the percentage of material defects, that is, actually relevant documents, in the negative population. The case team may use this insight to expand the keyword terms to capture additional relevant documents.
  • Review – A case team may use sampling to estimate the quality of the document decisions made by particular reviewers, or by an entire review team, when performing manual review. A case team may also use sampling as part of a validation methodology when using automated machine coding technology. Sampling the coding decisions of the automated system provides an estimate of the number of incorrect decisions the machine has applied to the document population. If the case team determines that the material defect rate, or incorrect coding decision rate, is above an established acceptable threshold, the machine system may need to be re-trained to provide more accurate results. Another round of sampling, after the system has been re-trained and new document decisions applied, may determine whether acceptable quality has been achieved.
  • Production – Sampling can be used as an efficient means to estimate the quality of the production images generated. Sampling and then visually inspecting the images can help a case team determine whether the production quality is acceptable or whether the production images need to be re-generated with different settings applied.
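
To make the Analysis and Review examples above more concrete, here is a minimal Python sketch of estimating the defect rate in a keyword-negative population and comparing it with an acceptable threshold. All figures (sample size, count of relevant documents found, population size, threshold) are hypothetical, the function name is an assumption, and the normal-approximation upper bound is only one possible choice rather than a method prescribed by this document.

```python
import math

def estimate_missed_documents(relevant_in_sample, sample_size,
                              negative_population_size, z=1.96):
    """Estimate how many relevant documents remain in the keyword-negative set."""
    rate = relevant_in_sample / sample_size
    # Normal-approximation margin (assumed method; z=1.96 is roughly 95% confidence).
    margin = z * math.sqrt(rate * (1 - rate) / sample_size)
    upper_rate = min(1.0, rate + margin)
    return {
        "estimated_rate": rate,
        "upper_rate": upper_rate,
        # Scale the rates up to the full negative population.
        "estimated_missed": round(rate * negative_population_size),
        "upper_missed": round(upper_rate * negative_population_size),
    }

# Hypothetical figures: 12 relevant documents found in a random sample of 600
# drawn from 50,000 documents that did not hit on any keyword.
result = estimate_missed_documents(12, 600, 50_000)

acceptable_rate = 0.05  # hypothetical threshold set by the case team
needs_refinement = result["upper_rate"] > acceptable_rate

print(f"Estimated defect rate: {result['estimated_rate']:.1%}")
print(f"Estimated relevant documents missed: {result['estimated_missed']}")
print(f"Search terms need refinement: {needs_refinement}")
```

The same pattern could be applied to the Review example by treating incorrect coding decisions as the defects. For very low defect rates the normal approximation becomes unreliable, so the upper bound here should be read as indicative only.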

1.4. Guidelines and Considerations

Section 3 presents some important guidelines and considerations when using statistical sampling in the e-discovery process. These recommendations are intended to help avoid misuse or improper use of statistical sampling.

1.5. Areas Not Covered

The following are topics that relate to this subject, but are outside the scope of the present document. Some or all of these may be covered in future releases.

  • Judgmental sampling.
  • Legal issues that may arise when attempting to use statistical sampling in order to avoid full evaluation of a population. For example, you may be able to use sampling to validly estimate that only 1% of a particular custodian’s documents will be responsive. This does not necessarily mean that it is legally reasonable to unilaterally exclude that custodian from further analysis. As another example, parties might agree to use sampling in concept, but might not agree on what sample sizes are reasonable.
  • Statistical sampling techniques in situations where the sample space contains more than two possible outcomes.