An EDRM White Paper – part of the EDRM White Paper Series
Culling with selection criteria is a standard way to manage the otherwise overwhelming volume of data in litigation discovery.1 Courts and experts are endorsing the emerging best practice of iteratively measuring selection results, borrowing from established and effective data management practices outside the litigation setting. By insisting on this practice, in-house counsel and their advisers can both contain the cost of discovery and reduce or eliminate the risk of a challenge to their search and selection methodology.
The use of search terms and other selection criteria (e.g., date restrictions or other filters) by parties in litigation at each and every stage of discovery, from preservation to collection to privilege review, is commonplace. The courts and other experts recognize this practice as a necessary alternative to the impossible task of reviewing every possible document in the overwhelmingly large collections of electronic data subject to consideration. And, over time, the process for choosing and applying search terms has evolved as parties, the bench and the bar have looked for better and more cost-effective methods to handle discovery. But as the use of selection criteria has become an indispensable tool in the e-discovery toolkit, its application has come under increasing scrutiny by opposing parties and the courts in order to ensure that appropriate search criteria are being used and that responsive materials are not systematically missed in the rote application of search terms. In the last couple of years, there have been several high-profile e-discovery cases that dealt with the choice of search terms and their application.2
As these cases show, there can be significant risks – of both increased discovery costs and various discovery sanctions – if a litigant fails to properly calibrate its selection criteria.
The good news is that a reasonable, defensible, best-practice approach to selection criteria and cost-effective discovery are not mutually exclusive. An iterative approach to calibrating selection criteria will not only be easier to defend from attack by opposing parties, but in many cases will reduce the overall cost of discovery by eliminating more irrelevant documents from processing and review. Corporate counsel don’t have to choose between containing discovery costs and increasing the defensibility of their process – with the iterative approach to selection criteria, they can do both.
The goal for producing parties is to carve from the overall universe of electronic documents a relevant subset of documents for review and ultimate production in a legally defensible and reasonable manner. Litigants faced with document requests quickly realize that there is an impossible volume of data that must be assessed. In most cases, the entirety of electronic documents cannot be produced without some level of selection and review. Neither the producing party nor the requesting party would be able to handle the volume, the effort might not be proportionate to the value of the matter, and the production of non-responsive, unrelated business information might be harmful to producing parties.
This task is usually accomplished by selecting documents via some search mechanism. This is not new. From the beginning of discovery, counsel and parties have been making choices of where and what to search for responsive documents. What text-searchable electronic documents have done is abstract the process.3 A party may choose to look at one custodian’s files over another because they are more likely to have responsive documents. A party may not open a filing cabinet or box because the label indicates it does not have any documents of value — a reasonable decision. However, applying search terms to a large volume of data does not necessarily have that same effect – especially if the party does not know the effectiveness of those criteria.
While the courts and other experts have recognized searches as a necessary tool in discovery, they have also recognized that text searching is a method with inherent difficulties. Almost always, search terms are both over- and under-inclusive. People may use a term out of context that creates a pool of non-responsive documents; likewise, employees may use jargon unknown to counsel (leading to responsive documents not being hit by search terms). Even the use of date ranges, which are seemingly objective and non-controversial, can cause errors: what happens when an e-document is misdated or not dated at all? Just as counsel take reasonable steps to ensure quality control on review calls made by human reviewers in discovery, they should consider undertaking quality control steps to ensure their selection criteria are properly calibrated to effectively recall relevant documents without collecting too many irrelevant documents.
Two important concepts are related to measuring the efficacy of searches: recall and precision. Recall refers to the comprehensiveness of a search – has it identified all of the ultimately responsive documents you sought? Precision refers to the accuracy of the search – of the documents returned by your search, how many were ultimately deemed responsive, and how many were false positives or false hits?
An effective search balances both good recall and good precision, which is sometimes referred to as “targeting” a result set. Judge Grimm refers to the issues of precision and recall in the Stanley decision when he notes that “[c]ommon sense suggests that even a properly designed and executed keyword search may prove to be over-inclusive or under-inclusive…”4 It is possible that a search result can have poor recall (being under-inclusive), or poor precision (being over-inclusive), or both.
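These two measures can be computed directly from review counts. A minimal sketch in Python, using purely hypothetical counts (the white paper itself prescribes no particular tooling):

```python
def recall(responsive_returned, responsive_total):
    """Fraction of all responsive documents the search actually found."""
    return responsive_returned / responsive_total

def precision(responsive_returned, total_returned):
    """Fraction of the documents returned that are responsive."""
    return responsive_returned / total_returned

# Hypothetical example: the search returns 10,000 documents, of which
# 3,000 are ultimately responsive; 4,000 responsive documents exist
# in the full collection.
r = recall(3_000, 4_000)      # 0.75 -- one quarter of responsive docs missed
p = precision(3_000, 10_000)  # 0.30 -- 70% of the review set is false hits
```

A result like this is under-inclusive (poor recall) and over-inclusive (poor precision) at the same time, which is exactly the combination the quoted passage warns about.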
The ultimate test of search effectiveness looks at responsiveness rates in review sets and ensures that the searches have not left responsive documents out of review altogether. A high percentage of ultimately responsive documents in the total universe of documents reviewed indicates good precision – not too many non-responsive documents were returned by the search. A low responsiveness rate would indicate poor precision, meaning many false hits were returned.
Looking at the responsiveness rates of review sets alone cannot help determine whether good recall was achieved, however. Good recall can only be determined by looking at the total universe of documents, or samples of those documents, and not just the sets actually returned by the searches. Ensuring comprehensive searches typically can be accomplished by selective expansion and broadening of search criteria, and by sampling what the searches are returning and not returning. While the case related to a privilege determination and not to responsiveness, Judge Grimm expressly refers to this practice in the Stanley case, “The only prudent way to test the reliability of the keyword search is to perform some appropriate sampling of the documents determined to be privileged and those determined not to be in order to arrive at a comfort level that the categories are neither over-inclusive nor under-inclusive.”5
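The sampling practice Judge Grimm describes can be sketched in code. The function name and the `is_responsive` callback below are hypothetical stand-ins for the review team's tools and judgment; the point is simply that reviewing a random sample of the documents the search did *not* return lets a party project how much responsive material was left behind:

```python
import random

def estimate_missed(null_set, sample_size, is_responsive, seed=42):
    """Estimate how many responsive documents a search left behind by
    reviewing a random sample of the documents it did NOT return."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(null_set, min(sample_size, len(null_set)))
    hits = sum(1 for doc in sample if is_responsive(doc))
    rate = hits / len(sample)
    # Project the sampled responsiveness rate onto the whole null set.
    return rate, rate * len(null_set)
```

If the projected number of missed documents is material, the selection criteria should be broadened and the measurement repeated.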
A number of undesirable outcomes can occur if precision and recall are out of balance. Good recall accompanied by poor precision can result in an extremely over-inclusive review set, basically rendering the search attempt useless. This can occur when overly broad, untested terms are used in the search, or when automated search tools – automated methods for expanding searches by employing a thesaurus, ontology, taxonomy, derived related terms and concepts, alternate spellings, homonyms, and so on – expand searches in a nonselective manner. When these expansions are employed without human interaction and selective implementation, an unintended explosion of the search result can occur. The resulting flood of documents may reduce the likelihood of missing relevant documents by increasing recall, but it also renders the search useless by attaining such poor precision that virtually no targeting has occurred, as evidenced by responsiveness rates dropping precipitously low. Even excessively broad search results do not guarantee perfect recall, as there can still be responsive documents in the small number of documents not returned by the overbroad searches. Sampling documents not returned by overly broad searches is still prudent.
A search outcome with good precision but poor recall is easy to imagine. An extremely narrow search might achieve extremely high percentages of ultimately responsive documents in review – good precision – but large numbers of ultimately responsive documents might be missed by the search and left entirely out of review – poor recall. This scenario may be cost-effective, but it is ultimately incomplete, possibly noncompliant, and it increases the risk of defensibility challenges. A search with good recall and poor precision lessens the likelihood of a defensibility challenge, but it may be anything but reasonable, practical or cost-effective. These types of results can be caused by a lack of effort to determine the actual effectiveness of search terms, instead relying on an assumption that searches will return an expected result.6
A possible outcome when parties employ the commonly accepted approach of brainstorming searches and applying them in an untested manner is both poor recall and poor precision. But this isn’t even the biggest problem with this method of determining and applying search criteria. The biggest problem is determining that there is a problem. Looking at the records reviewed, if the search resulted in a large volume of records but very few responsive documents – is this because there are few responsive documents, or because the search used the wrong terms or the wrong location? A search that results in a large number of responsive documents may be the result of a search that was targeted correctly – or because any net in so rich a fishery of responsive material would have captured a great many documents, and perhaps many more were left behind in the data sea.
The goal of any e-discovery selection criteria is a search result where both good recall and good precision are achieved – a review set with a high responsiveness rate, accompanied by a reasonable assurance that everything that was supposed to be reviewed has in fact been included for review.
Historically, there have been two common approaches to the application of searches in an e-discovery project, namely: a brainstormed list of search terms; or an agreed-upon set of search terms. In reality, one is an outgrowth of the other, as both are simply brainstormed lists of search terms, the latter being brainstormed by a larger group representing multiple parties, and the former being determined by one party only. The brainstorming practice involves a group of individuals consulting and ultimately guessing (in ostensibly a reasonably educated way) about what search terms should be used to retrieve candidate documents for responsiveness. The problem is that information retrieval experts have repeatedly found that because of the difficulties in accurately predicting which search terms will bring about good precision and good recall, a brainstormed approach to determining search criteria may not prove effective in text searching scenarios.
While brainstormed lists determined by one party have other weaknesses, their primary fault is that they are prone to challenge in hindsight. Even the best searches will miss certain records that in a perfect world would be selected and will include irrelevant documents that one does not want to process or review, but searches based on brainstormed terms are particularly hard to defend – why were some terms chosen and not others?7 This is never truer than when an opposing party is asking specific questions about terms found in the responsive documents (or in deposition) that were not considered when the initial documents were collected. This recall failure (otherwise known as too many false negatives) can be an easy target for opposing parties – they only bring the challenges after some discovery has already been conducted, so they have better information about the content of the documents and possible appropriate terms. In essence they have sampled (i.e., tested) the responsive data and are challenging whether the search was complete. While the Qualcomm v. Broadcom case has become infamous for a number of e-discovery and other failures, one of the most basic problems for which the court chastised the Qualcomm lawyers was not using the most obvious search terms to identify and collect obviously relevant material.8 Even in cases where lawyers and parties brainstorm selection criteria in good faith, their imperfect knowledge of how people actually created the documents, and of the terminology they used, unnecessarily limits their results.
Agreed-upon search lists do have the advantage that they are harder to challenge. But while those in favor of agreed-upon terms might suggest that once agreement is reached on search strategy, all problems associated with an ineffective search are avoided, this is not the case. First, using agreed-upon selection criteria can handcuff a responding party and make it very difficult to search its own systems for potentially valuable information to support its own case. Second, it can cause the responding party to spend an inordinate amount of money processing and reviewing irrelevant documents.9 Fewer disputes may arise about agreed-upon search terms, not because the searches are significantly better than when one party chooses the terms, but because once a party agrees to using a certain set of terms, that party loses its right to complain – which may explain judges’ support for early negotiation and agreement on selection criteria, even when parties have not yet collected the data, much less examined it. Judges prefer to minimize disputes and conflict over discovery, and they may be unaware of the specific costs caused by the agreements or the actual effectiveness of the agreed-upon terms.
Of course, even agreed-upon search terms are sometimes questioned after the fact when one side or the other discovers the outcome of the searches to be different than what had been expected, diminishing the perceived value of the agreement in the first instance. In the Kipperman decision, the court ruled that the defendant had implicitly agreed to plaintiff’s brainstormed search terms by neglecting to modify the searches when given the opportunity.10 The court let what turned out to be an overbroad search stand. If the defendant had put some effort into testing the effectiveness of the searches prior to review, there might have been a different outcome. A similar outcome arose in the Fannie Mae case, where the producing party had been given the opportunity to participate in the determination of “appropriate search terms,” but failed to do so.11 See In re Fannie Mae Sec. Litig., 552 F.3d 814 (D.C. Cir. 2009). The court let the overbroad search result stand, which resulted in an extremely expensive review. Id. at *8. Notably, in both of these instances parties were given the opportunity to develop more effective searches, but did not do so.
There is a better way, and to find it one should look to the industry experts (The Sedona Conference, the TREC Legal Track and EDRM), as well as to recent court decisions. These experts in turn have been looking to the information retrieval world beyond e-discovery, where these search inadequacies have long been identified and where experts have been addressing the problems for quite some time. The challenges remain, but an awareness of them allows for planning and choosing methodologies we might use to meet those challenges. Importantly, the standard is reasonableness, not perfection,12 coupled with the thoughtful use of search tools. The Sedona guidelines state that “Lawyers must recognize that, just as important as utilizing the automated tools, is tuning the process in and by which a legal team uses such tools, including a close involvement of lead counsel. This may require an iterative process which importantly utilizes feedback and learning as tools, and allows for measurement of results.”13
Whatever search technology one chooses to employ, it is important that one does so in the context of an iterative process. The process should include how the tools are to be employed; what human input and decision-making is required; iterative sampling and measurement of results; and validation that the technology worked as expected. The process must allow counsel to use the findings and alter the terms – there is no point in testing the search terms if it is prohibitively expensive to change them. To test and compensate for poor recall or precision, the process must iterate through multiple attempts at conducting the search correctly. By testing the data (both what is selected and what is not selected), one can mitigate the risk of systematically missing data and create documentation regarding the reasonableness of the process, while at the same time reduce the amount of money wasted on processing and reviewing irrelevant documents. In fact, in larger reviews, the iterative approach can reduce enough data that it will pay for itself many times over.
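The iterative loop described above can be sketched in code. Everything here is an illustrative assumption: the callables (`run_search`, `measure_precision`, `measure_elusion`, `refine`) stand in for whatever search tools, sampling protocols and human judgment the review team actually uses.

```python
def calibrate(corpus, terms, run_search, measure_precision,
              measure_elusion, refine,
              target_precision=0.5, max_elusion=0.05, max_rounds=5):
    """Hypothetical calibration loop: run the search, measure precision
    on the hits and elusion on the null set, then let counsel refine the
    terms and repeat until both measurements are acceptable."""
    for round_no in range(1, max_rounds + 1):
        hits = run_search(corpus, terms)
        null_set = [doc for doc in corpus if doc not in hits]
        p = measure_precision(hits)      # responsive share of the hits
        e = measure_elusion(null_set)    # responsive share left behind
        if p >= target_precision and e <= max_elusion:
            return terms, hits, round_no  # calibrated: lean and defensible
        terms = refine(terms, p, e)       # counsel adjusts based on findings
    raise RuntimeError("searches did not converge; revisit the strategy")
```

The loop's byproduct, a record of each round's terms and measurements, is exactly the kind of documentation discussed next.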
Search process and verification efforts should be documented – documentation is one of the benefits of undertaking the process in the first place. The documentation should include an explanation of the general process within which the search technology was applied, including all steps taken to measure and validate results, and all corresponding changes made to search strategies based upon those results. This should be maintained in anticipation of explaining the process to the court if requested.14 This documentation will serve as evidence if it becomes necessary to assert that search terms are or are not appropriate, and the resulting empirical evidence may be bolstered with expert testimony.15
Search technology can also be employed effectively in instances other than selecting a review set. Criteria can be used as a means to classify or prioritize documents – a privilege screen is an example of this kind of use. A review set has been determined, and potentially privileged documents need to be carved out from documents deemed potentially responsive but not likely to be privileged. Privilege classification criteria can be used to perform this division. Classification criteria can also be run to group documents by subject matter for review efficiency.
Generally, classification criteria have a higher tolerance for inaccuracy: when they are intended solely to improve review efficiency, and not to determine inclusion in or exclusion from review entirely, the risks are lower – a mistake changes when a document is reviewed, not whether it is reviewed at all.16 Counsel should remember when using classification criteria to undertake privilege review that if privileged materials are produced inadvertently, the party may be obligated to establish that it used reasonable precautions,17 and a party may need to demonstrate that it employed a reasonable method for settling upon certain classification criteria. This is very different than the use of selection criteria in collection or processing, where the objecting party generally has that obligation.18
Criteria can also be employed in a quality control process. For example, to help ensure consistency of review calls and markings prior to final production, criteria can be run across production sets to try to identify anomalies.
In an effort to reduce costs by reducing volumes of data, it is becoming more and more common for parties to apply search terms earlier and earlier in the discovery process, including in preservation and collection filters. Of course, in many instances this practice is simply the same old wine in a new bottle; processing has simply become portable and easier to use behind a company’s firewall. Regardless of where in the EDRM lifecycle data selection is perceived to be occurring, practitioners should be careful when applying criteria. Be mindful that the realities of searching data still hold, and the consequences may be even more severe at the earliest stages of discovery. Remember that inadequate selection criteria that exclude documents from review have effectively determined those documents to be non-responsive – possibly incorrectly. At the preservation and collection phase, correcting these mistakes long after the fact can be quite expensive. If necessary, take a conservative approach and err on the side of over-inclusion until a more controlled environment presents itself where the selection criteria can be tested and validated using iterative samples. Counsel are wise to heed the tone of Judge Peck’s decision in the Gross case: “This opinion should serve as a wake-up call to the Bar in this District about the need for careful thought, quality control, testing, and cooperation with opposing counsel in designing search terms…it appears that the message has not reached many members of our Bar.”19
Selection criteria are here to stay and, in fact, have always been with us. Increased scrutiny by courts and opposing parties requires counsel to ensure that the process they use to calibrate search terms is reasonable. Counsel should look to iterative search and selection procedures to better choose their search terms and control costs. Refining selection criteria is not only about being more defensible and finding more responsive records; it is also about discarding irrelevant material early in the process and containing costs. Put differently, fewer search terms (and less time spent thinking about, analyzing, and refining them) do not necessarily lead to smaller search results and lower cost.
Unless otherwise noted, all opinions expressed in the EDRM White Paper Series materials are those of the authors, of course, and not of EDRM, EDRM participants, the author’s employers, or anyone else.