EDRM Evergreen/Processing/Selection
Data culling is the umbrella term for the technical tactics and processes used to reduce a large document population to a much smaller set. The data culling process consists of two phases, Preparation and Selection. Selection is the process of choosing the documents that will ultimately make it into document review.
Outlined below are the common data culling techniques employed today. Each strategy can reduce the number of non-responsive files and emails substantially, a critical step in controlling the explosive growth of electronic information.
Filtering can narrow a dataset by selecting responsive files based on file-level criteria such as metadata. The custodian list, file type and timeframe associated with the matter are standard criteria. When using time as a culling criterion, it is important to remember that several time stamps can be associated with any one file (e.g., for application files, create, modified and accessed dates may be available for use, and for email files, time sent, received, created and last modified may be usable). Being specific about which of these time stamps is to be used as the filter parameter will help to ensure the resultant dataset meets expectations.
Other methods of data culling can also yield significant results. System files can be culled from a responsive data collection. Known system files can be recognized by their MD5 hash values, which act as a digital fingerprint for weeding out files that are not of interest. Such a mechanism allows large system files to be set aside as demonstrably non-responsive so that the size of the document collection is correspondingly reduced, a significant cost-reducing strategy. Because these common file types offer no intrinsic value to the matter, segregating the operating system files from a source hard drive early in the processing phase can greatly cut down the data set and help minimize processing time. Other file types, such as internet cookies (files stored to a person’s hard drive as the person uses the internet), can also be programmatically segregated. These options help hone the dataset to the most relevant documents.
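The hash-based culling described above can be sketched roughly as follows. This is a minimal illustration, not any particular vendor's method: the function names `md5_of_file` and `cull_known_files` are hypothetical, and the set of known hashes is assumed to come from whatever reference list of system files the processing team uses.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def cull_known_files(digest_by_path, known_hashes):
    """Split paths into (kept, culled) by whether their digest is a known system file.

    digest_by_path: mapping of file path to hex MD5 digest (hypothetical input shape).
    known_hashes: set of hex MD5 digests for known system files (assumed available).
    """
    kept, culled = [], []
    for path, digest in digest_by_path.items():
        if digest.lower() in known_hashes:
            culled.append(path)
        else:
            kept.append(path)
    return kept, culled
```

In practice the digests would be computed once with `md5_of_file` during processing and stored, so that the culled files remain tracked and accounted for instead of silently disappearing from the collection.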
One of the most common culling strategies is the use of search terms, such as the names of key employees implicated by a lawsuit or investigation and specific words likely to be at play in the litigation, whether contained within the text of a document or within the metadata. These methodologies are helpful in wading through vast quantities of file types and sources, and should be considered as one of the first tactics employed by the legal team. The ability to use technical tools does not, of course, replace the legal analysis required of every case; it does, however, create an environment whereby the electronic information may become more manageable, from both a cost and strategic perspective.
Text Search Criteria
Keyword searches are a commonly used technique in the legal industry. Search terms to be applied to a given data collection are typically determined by the legal team, and approval is obtained from the opposing party and/or government regulators. The goal of the term list is to narrow the dataset so that it includes all relevant documents and segregates documents that are non-responsive. The choice of terms is crucial to a successful result. Many suppliers of electronic discovery services can provide guidance on choosing terms that will produce the result the team expects. Additionally, by applying sampling techniques along with search term lists, the legal team can get a sense of the responsiveness of particular search terms.
Boolean operators such as “AND,” “OR,” and “NOT” are frequently employed to help further refine a keyword search.
Proximity or adjacency operators such as “NEAR,” “PRE,” and “WITHIN” are frequently employed to help further refine a keyword search.
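As a rough illustration of how Boolean and proximity operators narrow a search, the sketch below tests a single document. The helper names (`tokenize`, `within`, `pre`) are hypothetical, and the exact semantics of operators such as “NEAR,” “PRE,” and “WITHIN” vary between search tools; here `within` is assumed to mean "in either order, at most N words apart" and `pre` is assumed to mean "the first term precedes the second by at most N words".

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def positions(tokens, term):
    """Return every index at which the term occurs in the token list."""
    term = term.lower()
    return [i for i, token in enumerate(tokens) if token == term]

def within(tokens, first, second, distance):
    """True if the two terms occur at most `distance` words apart, in either order."""
    return any(abs(i - j) <= distance
               for i in positions(tokens, first)
               for j in positions(tokens, second))

def pre(tokens, first, second, distance):
    """True if `first` occurs before `second`, at most `distance` words earlier."""
    return any(0 < j - i <= distance
               for i in positions(tokens, first)
               for j in positions(tokens, second))

tokens = tokenize("The merger agreement was signed before the audit began.")

# Boolean: merger AND audit AND NOT draft
boolean_hit = "merger" in tokens and "audit" in tokens and "draft" not in tokens

# Proximity: "merger" and "audit" are six words apart in this sentence
near_hit = within(tokens, "merger", "audit", 6)
tight_hit = within(tokens, "merger", "audit", 5)
```

In this example `boolean_hit` and `near_hit` are true while `tight_hit` is false, which shows how tightening the distance turns a hit into a miss.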
Keyword searches should be tried and refined early in the process. It is useful to understand not only the hit rate, but also the relevancy rate of such searches. In other words, a keyword search that returns 15,000 documents that match the keywords but only yields three relevant documents may not be the best use of time or client resources. Understanding what makes a document relevant can translate into a more focused search strategy. Keywords are a great starting place, but a good search strategy will enable you to go after more meaningful documents faster by leveraging your knowledge of the matter or your discovery of key documents. Many tools today enable reviewers to use key documents to identify similar documents. Keyword searches can also be useful for generating “potentially privileged” sets of documents for early review.
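The difference between hit rate and relevancy rate can be made concrete with a little arithmetic. The sketch below uses the 15,000 hits and three relevant documents from the example above; the population size of 100,000 documents is an assumption added purely for illustration, and the function names are hypothetical.

```python
def hit_rate(hit_count, population_size):
    """Fraction of the searched population that matched the keywords."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    return hit_count / population_size

def relevancy_rate(relevant_count, hit_count):
    """Fraction of the keyword hits that turned out to be relevant."""
    if hit_count == 0:
        return 0.0
    return relevant_count / hit_count

# 15,000 hits in an assumed population of 100,000 documents
print(f"hit rate: {hit_rate(15_000, 100_000):.0%}")          # prints "hit rate: 15%"

# Only 3 of those 15,000 hits are relevant
print(f"relevancy rate: {relevancy_rate(3, 15_000):.2%}")    # prints "relevancy rate: 0.02%"
```

When the full set of hits has not been reviewed, the same `relevancy_rate` calculation can be applied to a reviewed random sample of the hits to estimate how responsive a term is before committing review time to it.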
Metadata Search Criteria
In the world of Electronic Data Discovery, processing electronic files potentially enables the search of Metadata. New E-discovery amendments explicitly recognize the existence and discovery of this type of data, which exists primarily in two distinct forms: Application Metadata and System Metadata. Basically, Metadata is information about an electronic file that traces the history, access or use of that file.
Application Metadata may include important items such as:
- The original author of a document.
- Anyone who revises a document.
- When the document was initially created, modified, saved or printed.
- Prior versions of the document, deletions and hidden comments.
Application metadata is information that is not visible on a printed page but is embedded within the document and remains with the file if it is copied from one media type to another. This data typically can be viewed only from electronic files in native file format through the use of tools designed to search for this information. Legacy paper documents would be scanned into an electronic format, such as TIFF or PDF files. These formats allow Metadata information to be applied to the electronic version of a scanned paper document, and protect that information from being modified.
System metadata is not embedded in the individual files. Instead, it is stored externally within the computer’s operating system files, and retrieving it requires the services of a computer forensics expert.
In the past, the use of Metadata in the discovery process has been the exception rather than the rule. Until recently, the ability to use this type of data effectively has been limited in part by resources and by the lack of a clear understanding within the court and litigation systems of how deeply Metadata can and does affect evidence. Only recently has it been realized that missed opportunities for discovery of Metadata may have affected the outcome of many cases. Another prohibitive factor in using Metadata was cost. However, due to recent advances in technology for this type of data processing and analysis, it has become much more cost-effective, easier to use, and more widely called for. As Metadata is increasingly used in the E-Discovery process, it is having a decided impact on the outcome of these proceedings. These advances can greatly reduce the need for, and cost of, more specialized forensic services, often limiting them to cases where a full range of application or system metadata needs to be recovered.
Filtering Criteria
Data Filtering
In short, data filtering is a step in the data work flow that reduces the amount of data that move on to each subsequent step in the work flow. Definitions of data “filtering” can be found at http://www.edrm.net/wiki/index.php/Filtering. In addition to the textual search filtering methods described above, data from email servers, network shares, custodian workstations and other sources can be filtered using several methods. In many work-flows these other methods of data filtering are done prior to or in conjunction with keyword searching.
File Type Filtering
Generally, file type filtering separates the sub-set of potentially relevant user-created files from other files that by definition cannot have been created by the user (e.g. operating system files, application files). Very thorough file type filtering can be done using the National Software Reference Library (NSRL) Reference Data Set (RDS). It is essentially a list of digital file signatures, defined by their MD5 or SHA-1 hash, from known software applications that have no forensic value. Although this method was originally designed for criminal investigations, it can be used effectively for non-criminal electronic discovery purposes. For a complete explanation of this extremely thorough filtration methodology please see http://www.nsrl.nist.gov/. In other instances the involved parties may agree to a list of user-created file extensions to be used (e.g. .doc, .pst, .pdf, etc). The data is filtered to extract only these agreed-upon file types, leaving behind all other file types. This is a useful method that generally reduces workstation data by a large percentage, thus decreasing the costs of further data processing and review. This method also has its disadvantages. For instance, if the agreed-upon list includes “.xls” but the custodian/user has changed that standard file extension to “.xQ2” (and associated that file extension with the correct software application), those potentially very important files will be left behind during filtering and therefore will not be reviewed or produced, even if relevant. “File Signature” analysis can also be used for file type filtering (http://www.edrm.net/wiki/index.php/File_signature).
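The renamed “.xQ2” problem described above can be illustrated with a small sketch that compares extension filtering with a basic file signature check. The function names and the input shape (a mapping of file name to the first bytes of the file) are hypothetical, and the signature table is deliberately tiny: it covers only the PDF header, the OLE2 compound file header used by legacy Office formats such as .doc and .xls, and the ZIP header used by newer Office formats. Real signature analysis tools use far larger tables.

```python
from pathlib import PurePath

# A few well-known leading byte sequences ("magic numbers").
SIGNATURES = {
    b"%PDF": "pdf",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",  # legacy .doc/.xls/.ppt container
    b"PK\x03\x04": "zip",                          # also .docx/.xlsx containers
}

def sniff_type(header):
    """Return a label for the file's leading bytes, or None if unrecognized."""
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    return None

def select_user_files(headers_by_name, allowed_extensions, allowed_types):
    """Keep files whose extension OR sniffed signature is on the agreed list."""
    selected = []
    for name, header in headers_by_name.items():
        extension = PurePath(name).suffix.lower()
        if extension in allowed_extensions or sniff_type(header) in allowed_types:
            selected.append(name)
    return selected

files = {
    "budget.xQ2": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 8,  # renamed spreadsheet
    "report.pdf": b"%PDF-1.4",
    "kernel.dll": b"MZ\x90\x00",                                      # Windows executable header
}

by_extension_only = select_user_files(files, {".xls", ".pdf"}, set())
with_signatures = select_user_files(files, {".xls", ".pdf"}, {"ole2", "pdf"})
```

Here `by_extension_only` contains only `report.pdf`, while `with_signatures` also picks up `budget.xQ2`, the renamed file that extension filtering alone would leave behind.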
Custodian Filtering
Supersets of data collected from their original storage locations often contain data from custodians who are not relevant to the matter at hand. Data collected from an email server is a prime example. If an entire email server is collected (e.g. via the existing backup methodology) it may contain dozens of custodian mailboxes. If only five of those mailboxes are potentially relevant, only those five must be moved to the next step in the data work flow. There are several methods for filtering this superset of email data down to just the relevant custodian mailboxes. Another example comes from file share location(s) on a network file server. If the entire file server is collected it may contain data from dozens of custodians, yet only five custodians’ data may be relevant and need to move to the next step in the data work flow. In some instances each custodian will only have rights to store data in a single, identifiable folder. In some instances, with some data types, custodian filtering can be done using the “author” or “to,” “from,” “cc” metadata fields.
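Custodian filtering on the “to,” “from,” and “cc” fields might look roughly like the sketch below. The message structure (a dictionary with `from`, `to` and `cc` keys), the function names and the example addresses are all assumptions made for illustration; real processing tools expose these fields in their own formats.

```python
def message_addresses(message):
    """Collect every address in the from, to and cc fields, lowercased."""
    addresses = {message.get("from", "")}
    addresses.update(message.get("to", []))
    addresses.update(message.get("cc", []))
    return {address.lower() for address in addresses if address}

def filter_by_custodian(messages, custodians):
    """Keep messages sent by, sent to, or copied to any listed custodian."""
    wanted = {address.lower() for address in custodians}
    return [m for m in messages if message_addresses(m) & wanted]

messages = [
    {"id": 1, "from": "alice@example.com", "to": ["carol@example.com"], "cc": []},
    {"id": 2, "from": "dave@example.com", "to": ["erin@example.com"], "cc": ["Alice@Example.com"]},
    {"id": 3, "from": "dave@example.com", "to": ["erin@example.com"], "cc": []},
]

relevant = filter_by_custodian(messages, ["alice@example.com"])
```

With these inputs `relevant` holds messages 1 and 2: the custodian sent the first and was copied on the second. Address matching is case-insensitive here, which is a design choice for this sketch; a real matter may also need to map aliases and display names to a single custodian.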
Date Filtering
Data can be filtered using various date-related metadata fields such as “date sent” or “date received” for emails and “last modified date” for non-email user-created files. Example: A custodian email box contains emails dated from January 1, 2005 through today, but the involved parties have agreed that only data dated from January 1, 2006 through today is potentially relevant and therefore must be reviewed. The email can be filtered on the “date sent” and/or “date received” value, and only the emails dated January 1, 2006 through today will be moved to the next step in the data work flow.
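The January 1, 2006 example can be sketched as follows, and it also shows why the choice of date field matters. The record layout (`date_sent` and `date_received` keys) and the function name `filter_by_date` are hypothetical; time zones and time-of-day values, which real data would carry, are ignored here for simplicity.

```python
from datetime import date

def filter_by_date(items, fields, start, end):
    """Keep items where any of the named date fields falls within [start, end]."""
    kept = []
    for item in items:
        values = [item[field] for field in fields if item.get(field) is not None]
        if any(start <= value <= end for value in values):
            kept.append(item)
    return kept

emails = [
    {"id": "A", "date_sent": date(2005, 6, 1), "date_received": date(2005, 6, 1)},
    {"id": "B", "date_sent": date(2005, 12, 31), "date_received": date(2006, 1, 1)},
    {"id": "C", "date_sent": date(2006, 3, 15), "date_received": date(2006, 3, 15)},
]

start, end = date(2006, 1, 1), date.today()

on_sent = filter_by_date(emails, ["date_sent"], start, end)
on_received = filter_by_date(emails, ["date_received"], start, end)
on_either = filter_by_date(emails, ["date_sent", "date_received"], start, end)
```

Email B was sent on December 31, 2005 but received on January 1, 2006, so it is excluded when filtering on “date sent” alone and included when filtering on “date received” or on either field. This is the kind of boundary case that makes it worth agreeing on the filter field in advance.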
Concept and Classification Criteria
While keyword searches are designed to return only the content that exactly matches the search, concept searching is designed to identify content that is conceptually similar to the search phrase. When using concept searching, it is critical to understand how the concepts are created as there are many different approaches to concept search on the market today. Careful consideration must be given to the ways in which concept searching is applied because concept searching does not deterministically conclude the presence or absence of targeted data that may be evidence.
Concept searching tools can be quite effective when applied to responsive datasets during the review phase. (See Review Node.) Further, it is important to establish the method to be used during processing because this will drive the data set (native files, metadata, body text, etc.) that is generated and supplied. The integration between processing technologies and review technologies must also be well established based on the requirements of the case. In some situations this can be a seamless transition from one phase to the next and in other situations the requirements can make this transition complex. The flexibility of the vendor solutions, particularly for processing documents proactively, during organizational discovery where the requirements for a case may not be known during processing, is a key consideration.
De-Duplication
In addition to the technical aspect of analyzing files for deduplication, it is important to determine the dataset against which the files should be compared. Deduplication can occur at the custodian level or globally across all users and data sets, and different types of data may be subject to different types of deduplication; for example, data from file servers may be deduplicated globally, while email may be deduplicated at the custodian level. In its broadest application, deduplication is applied across an entire data collection (global deduplication). In this case, only one copy of each file will be provided. For example, if a company-wide memo is sent over email, only the first instance of that memo that is processed will be available for review. Through the use of metadata, it can still be determined who received this email, but only one instance of the memo is made available for review.
This technology has enormous significance from a cost savings standpoint, and since deduplication can occur at various stages of processing (tape restoration, keyword search and/or upload), the cost savings often apply to several processing stages.
While deduplication of exact duplicates is the most precise way to deduplicate a dataset, files that are materially similar are often not bit-level duplicates, so they remain in the dataset and may each be reviewed and produced. In an attempt to reduce the occurrence of such near duplicates and to avoid the costs associated with reviewing them, a number of technologies are emerging that promise to manage this issue. If files that are similar to each other are grouped together by such technologies, it may also be possible to manage the review process so that the same reviewer always reviews similar documents (for example, various draft versions of a Word document, or threads of emails). Collectively, these technologies are being referred to in the market as “Near Deduplication” technologies.
If near deduplication technologies mature to the point that it is possible to assure that no material evidence will be omitted, and courts accept the elimination of near duplicates as an appropriate data culling strategy, the application of these technologies may grow in popularity.
Deduplication can occur at the custodian level as well. In the case above, where global deduplication would yield one instance of a file, custodian-level deduplication would yield one unique instance for each custodian. This use of deduplication can be important in cases where the fact that the memo was in a person’s email collection is pertinent. For example, in a white collar criminal case, deletion activity can be important. Each custodian’s collection would contain one original copy of every document that was found within their collection, giving the review team a complete view of each custodian’s data collection. Deduplication can also be applied to any subset of data. For example, it is possible to identify duplicate files within a single custodian’s data, such as only their email, or within a specific time period. The specifics of the matter at hand determine the appropriate criteria, and the automated means of executing those criteria ensure that each document is accounted for and tracked while eliminating the need for redundant review. Applying deduplication criteria can significantly reduce the amount of data to be reviewed.
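The difference between global and custodian-level deduplication can be sketched as below. This is a simplified illustration under stated assumptions: exact duplicates are detected by hashing the full file content with SHA-1, the record layout `(custodian, name, content)` and the function name `dedupe` are hypothetical, and real tools typically hash selected email fields instead of raw bytes.

```python
import hashlib

def dedupe(records, scope="global"):
    """Return (kept, holders) for a list of (custodian, name, content) records.

    scope="global": keep the first instance of each document across all custodians.
    scope="custodian": keep the first instance of each document per custodian.
    holders maps each content hash to every custodian who held a copy, so the
    information about who had the document survives deduplication.
    """
    if scope not in ("global", "custodian"):
        raise ValueError("scope must be 'global' or 'custodian'")
    seen = set()
    kept = []
    holders = {}
    for custodian, name, content in records:
        digest = hashlib.sha1(content).hexdigest()
        holders.setdefault(digest, []).append(custodian)
        key = digest if scope == "global" else (custodian, digest)
        if key not in seen:
            seen.add(key)
            kept.append((custodian, name))
    return kept, holders

records = [
    ("alice", "memo.msg", b"company-wide memo"),
    ("bob", "memo.msg", b"company-wide memo"),
    ("alice", "memo (copy).msg", b"company-wide memo"),
    ("bob", "notes.txt", b"meeting notes"),
]

global_kept, holders = dedupe(records, scope="global")
custodian_kept, _ = dedupe(records, scope="custodian")
```

With these records, global deduplication keeps one copy of the memo (the first one processed, from alice) plus bob's notes, while custodian-level deduplication keeps one copy of the memo for each of alice and bob. In both cases `holders` still records that alice and bob each held the memo.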
[updated Jan. 29, 2008]

