Processing Guide

Aims: Perform actions on ESI to allow for metadata preservation, itemization, normalization of format, and data reduction via selection for review.

Goal: Identify ESI items appropriate for review and production as per project requirements.

* Although represented as a linear workflow, moving from left to right, this process is often iterative. The feedback loops have been omitted from the diagram for graphic simplicity.


1. Overall Processing

At a point in the e-discovery lifecycle (“Lifecycle”) following preservation, identification and collection, it often becomes necessary to “process” data before it can be moved to the next steps of the Lifecycle. Some primary goals of processing are to discern at an item level exactly what data is contained in the universe submitted; to record all item-level metadata as it existed prior to processing; and to enable defensible reduction of data by “selecting” only appropriate items to move forward to review. All of this must happen with strict adherence to process auditing, quality control, analysis and validation, and chain of custody considerations.

Data may arrive at the processing stage in various formats which then need to be restored before subsequent work can be done (tapes, backups, etc.); individual files and e-mail may need to be extracted from container files (PST, NSF, zip, rar, etc.); and certain types of data may need to be converted to facilitate further processing (legacy mail formats; legacy file formats). During these processing stages individual items are cataloged and their associated metadata is captured.
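As an illustration of extraction and cataloging, the following is a minimal sketch, assuming a zip container held in memory. The function name catalog_zip, the record fields and the sample file names are hypothetical and chosen only for this example; other container types (PST, NSF, rar) would need their own tooling.

```python
import hashlib
import io
import zipfile


def catalog_zip(container_bytes, container_name):
    """Itemize a zip container: one catalog record per contained file."""
    records = []
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories are not items in their own right
            data = archive.read(info)
            records.append({
                "container": container_name,
                "path": info.filename,
                "size": info.file_size,
                # Timestamp as stored in the zip: (year, month, day, h, m, s)
                "modified": info.date_time,
                "sha256": hashlib.sha256(data).hexdigest(),
            })
    return records


def build_sample_zip():
    """Create a small in-memory zip so the sketch runs without external files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("memo.txt", "Quarterly forecast")
        archive.writestr("notes/todo.txt", "Call counsel")
    return buffer.getvalue()


catalog = catalog_zip(build_sample_zip(), "custodian_a.zip")
for record in catalog:
    print(record["path"], record["size"], record["sha256"][:12])
```

The metadata is read from the container before anything is written elsewhere, which reflects the goal of recording item-level metadata as it existed prior to processing.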

Rarely is it necessary to review all items that are submitted for processing. A number of data reduction opportunities are usually available. Processing is further broken into four main sub-processes: Assessment, Preparation, Selection, and Output. Assessment may allow for a determination that certain data need not move forward. Preparation involves performing activities against the data which will later allow for specific item-level selection to occur (extraction, indexing, hashing, etc.). Selection involves de-duplication, searching, and analytical methods for choosing specific items which will be moved forward. Output allows for transport of reviewable items to the next phases of the Lifecycle.

1.1. Assessment

Assessment is a critical first step in the workflow as it allows the processing team to ensure that the processing phase is aligned with the overall e-discovery strategy, identify any processing optimizations that may result in substantive cost savings, and minimize the risks associated with processing. A critical aspect of this step is to ensure that the processing methodology will yield the expected results in terms of effort, time and costs, as well as the expected output data streams.

It is imperative that an appropriate QA strategy be developed at this initial phase before undertaking the actual processing tasks. This should include methodology, goals, expectations, reporting and exceptions handling. A critical element of success is developing protocols for timely communication/reporting with data custodians/users regarding any issues as they arise so corrective measures can be undertaken as quickly as practicable.

Issues to examine:

  • What data streams are to be processed
  • What complexities/trouble spots are typically associated with these data streams, including additional information that may be required for proper processing
  • What processing methodologies and/or vendors are most likely to be successful
  • Development of specific agreements regarding all processing steps, including de-dupe methodology, culling strategy (metadata based or other), search strategy, etc.
  • Any risk factors involved (unanticipated data types, source data/media errors, unexpected volumes, etc.)
  • QA methodologies during and post-processing
  • Exceptions handling
  • Reporting/audit trails
  • Acceptability criteria
  • Target formats and media
  • Communications/reporting protocols (timing and details)
  • Delivery/production schedules, and hand off protocols, including rolling delivery if feasible
  • Roles and responsibilities
  • Clear definition of success

1.2. Preparation

During assessment a determination is made as to which classes of data need to be moved forward through processing. At that point there may be a number of activities required to enable handling and reduction of that data. Possible activities include:

  • Restoration of backups and other archival sets of data
  • Conversion of legacy formats of e-mail or other file types
  • Extraction of container files (including e-mail and compressed file sets)
  • Cataloging and itemization of all extracted files, e-mail, attachments and loose files
  • De-duplication hashing
  • Near de-duplication hashing
  • Similarity hashing
  • Concept identification and extraction
  • Full text indexing
  • Exception identification and handling

Once the data chosen to be moved forward through processing has undergone a number of the above activities, the “selection” of data to be included in a review set can occur.

1.3. Selection

One of the primary reasons for “processing” data in an e-discovery project is so that a reasonable selection can be made of data that should be moved forward into an attorney review stage. Selection by its nature reduces the amount of data that ultimately needs to be reviewed. Once data has gone through “preparation” there are a number of techniques for selecting the items to move forward, and thus also identifying those to leave behind. De-duplication and some forms of near de-duplication can be used to suppress redundant data from being reviewed multiple times. Search terms can be applied as part of a validated approach to find certain items for review while leaving others behind. Concept extraction and other forms of document similarity identification can be used to classify items being moved forward into review.
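A minimal sketch of two selection techniques, de-duplication suppression followed by search terms, is given here. The items, the search terms and the function name select_for_review are hypothetical, and the substring match stands in for whatever validated search approach a project actually agrees on.

```python
import hashlib

# Hypothetical prepared items and search terms.
ITEMS = [
    {"id": "DOC-001", "text": "Draft merger agreement attached."},
    {"id": "DOC-002", "text": "Lunch on Friday?"},
    {"id": "DOC-003", "text": "Draft merger agreement attached."},
    {"id": "DOC-004", "text": "Board minutes discussing the acquisition."},
]
SEARCH_TERMS = ["merger", "acquisition"]


def select_for_review(items, terms):
    """Suppress exact duplicates, then split the rest by search term hits."""
    seen = set()
    selected, suppressed, not_selected = [], [], []
    for item in items:
        digest = hashlib.sha256(item["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            suppressed.append(item["id"])  # duplicate of an earlier item
            continue
        seen.add(digest)
        text = item["text"].lower()
        if any(term in text for term in terms):
            selected.append(item["id"])
        else:
            not_selected.append(item["id"])
    return selected, suppressed, not_selected


selected, suppressed, not_selected = select_for_review(ITEMS, SEARCH_TERMS)
print("selected:", selected)
print("suppressed:", suppressed)
print("not selected:", not_selected)
```

Every item lands in exactly one of the three lists, so the items left behind remain identifiable and available for later sampling.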

1.4. Output

The data that has been selected to move forward to review may be transformed into any number of formats depending on the requirements of the downstream review platforms, passed on to a review platform in its existing format, or exported in a native format.

The culmination of all the previous efforts, this step is also the last opportunity to identify and correct any issues that arise during processing. It would be advisable to implement final QA procedures that match the results of the processing against previously mapped expectations, including identifying and explaining exceptions. Often, last minute visual inspection of statistically significant samples of the data is part of this process. Any significant variances from expectations need to be accounted for, audit reports correlated with the results produced and differences flagged. A surprising number of projects produce results that are not anticipated due to any number of issues such as a lack of sufficiently accurate information regarding source data streams or poorly defined target formats. An undue haste to produce output can quickly backfire and escalate the overall costs of production when re-processing is required.
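One simple form of final QA is a count reconciliation, sketched here under stated assumptions. The category names and the figures in RUN are hypothetical. The only logic illustrated is that every submitted item should be accounted for and every exception should carry an explanation.

```python
def reconcile(counts):
    """Return a list of variances; an empty list means the counts reconcile."""
    accounted = (
        counts["output"]
        + counts["suppressed_duplicates"]
        + counts["not_selected"]
        + counts["exceptions"]
    )
    variances = []
    if accounted != counts["submitted"]:
        variances.append(
            f"submitted {counts['submitted']} items but accounted for {accounted}"
        )
    if counts["exceptions"] > counts["explained_exceptions"]:
        unexplained = counts["exceptions"] - counts["explained_exceptions"]
        variances.append(f"{unexplained} exception(s) lack an explanation")
    return variances


# Hypothetical figures for a single processing run.
RUN = {
    "submitted": 1000,
    "output": 240,
    "suppressed_duplicates": 310,
    "not_selected": 440,
    "exceptions": 8,
    "explained_exceptions": 5,
}

variances = reconcile(RUN)
for variance in variances:
    print("VARIANCE:", variance)
```

A check of this kind does not replace visual inspection of samples. It only flags arithmetic gaps between what was submitted and what the audit reports show.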

2. Overall Analysis / Validation

Throughout the four phases of processing there are opportunities to analyze the data or results of certain sub-processes to ensure that overall results are what was intended, or that decisions as to the handling of the data are valid and appropriate. Some possible analysis / validation opportunities are as follows:

  • Assessment – During this phase representative samples of certain data types may be looked at to ascertain exactly what types of data they are, and to consider how likely they are to be potentially relevant and worthy of further processing. Other data type samples may be looked at to ascertain what levels of processing and preparation are required to appropriately perform selection of subsets of that data.
  • Preparation – During this phase of processing representative samples may be looked at to ascertain the effectiveness of different types of data preparation. This is not necessarily to ascertain whether any sub-process technically worked as expected (which would fall into a quality control consideration), but rather to ascertain whether the application of that sub-process makes sense and adds value in practice. A simple example is text indexing. A QC process may check to see that a PDF document was properly indexed. An analysis/validation process would attempt to ascertain whether the index-able text of that PDF was in fact document content or simply metadata about a graphic image with no searchable text of the content of the image.
  • Selection – The selection phase may offer the most useful opportunities for analysis of data. Testing sample results of applied search terms can greatly enhance the value and accuracy of review sets that are selected via search terms. The review of samples of non-selected data can ensure comprehensiveness of a selection process. The review of items designated for de-duplication or near de-duplication suppression can alert a discovery team to potential flaws in its approach. And application of automated analytic tools can teach a discovery team about the data, enabling them to make better informed decisions on selection of a review set.
  • Output – Looking at samples of output data can add a level of comfort that the overall processing endeavor is garnering results as expected. To the extent that unexpected results are seen in output the process can be appropriately modified.
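Several of the opportunities above depend on drawing samples. The following is a minimal sketch of a reproducible random sample, for example of non-selected items. The function name draw_sample, the item IDs and the seed are hypothetical, and the sketch does not address how large a sample needs to be.

```python
import random


def draw_sample(item_ids, sample_size, seed):
    """Draw a reproducible simple random sample of item IDs."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated and audited
    population = sorted(item_ids)  # stable order regardless of input order
    k = min(sample_size, len(population))
    return rng.sample(population, k)


# Hypothetical IDs of items that were not selected for review.
NOT_SELECTED = [f"DOC-{n:04d}" for n in range(1, 501)]

sample = draw_sample(NOT_SELECTED, sample_size=25, seed=2024)
print(len(sample), "items drawn for validation review")
```

Recording the seed alongside the sample allows the same draw to be reproduced later if the selection process is questioned.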

3. Overall Quality Control

Validation is the testing of results to ensure that appropriate high level processing and selection decisions have been made, and that ultimate results match the intent of the discovery team. Quality Control (“QC”) involves testing to see that specific technical processes were performed as expected, regardless of what the results show. For example, a QC task may check to see that a particular set of data was properly indexed for full text searching; that certain search terms were applied to those indices; and that resulting search hit items were identified. A validation process would check whether the search terms applied garnered items that were false hits or instead returned items that were relevant to the case based upon stated criteria.
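The distinction can be sketched in a few lines, assuming hypothetical inputs: a list of cataloged item IDs, a list of indexed item IDs, and reviewer judgments on a sample of search hits. The function names are invented for this illustration.

```python
def qc_unindexed_items(cataloged_ids, indexed_ids):
    """QC: report items that were cataloged but never reached the index."""
    return sorted(set(cataloged_ids) - set(indexed_ids))


def validation_false_hit_rate(reviewed_hits):
    """Validation: share of reviewed search hits judged not relevant.

    reviewed_hits maps item ID to True (relevant) or False (false hit).
    Returns None when no hits were reviewed.
    """
    if not reviewed_hits:
        return None
    false_hits = sum(1 for relevant in reviewed_hits.values() if not relevant)
    return false_hits / len(reviewed_hits)


# Hypothetical data for illustration only.
missing = qc_unindexed_items(["A", "B", "C"], ["A", "C"])
rate = validation_false_hit_rate({"A": True, "C": False})

print("not indexed:", missing)
print("false hit rate:", rate)
```

The first function asks whether the technical process ran as expected. The second asks whether the results of that process serve the intent of the discovery team.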

4. Overall Reporting

To meet the needs of project management, status reporting, exception reporting, chain of custody and defensibility, it is important that processing systems track the work performed on all items submitted to processing. Every item should have tracking information as to the various tasks that have been performed on it. In addition, systems should be able to roll up this item-level tracking information to show reports representing overall status, or the status of any particular item or group of items, whether they are ultimately moved forward into review or not. It is important to be able to document all high level processing and selection decisions pertaining to the processed data, and the system should be able to show the effect of those processing and selection decisions on the data.
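Item-level tracking with a roll-up can be sketched as follows. This is a minimal in-memory illustration, assuming a hypothetical AuditTrail class with invented task and outcome labels. A real system would persist these records and protect them from alteration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


class AuditTrail:
    """Record every task performed on every item and roll the records up."""

    def __init__(self):
        self._events = defaultdict(list)

    def record(self, item_id, task, outcome):
        """Append one tracking entry for an item."""
        self._events[item_id].append({
            "task": task,
            "outcome": outcome,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def history(self, item_id):
        """Return all tracking entries for one item, oldest first."""
        return list(self._events.get(item_id, []))

    def status_rollup(self):
        """Count items by the outcome of their most recent task."""
        return Counter(events[-1]["outcome"] for events in self._events.values())


trail = AuditTrail()
trail.record("DOC-001", "extract", "ok")
trail.record("DOC-001", "select", "moved_to_review")
trail.record("DOC-002", "extract", "ok")
trail.record("DOC-002", "select", "not_selected")
trail.record("DOC-003", "extract", "exception")

rollup = trail.status_rollup()
print(dict(rollup))
print([entry["task"] for entry in trail.history("DOC-001")])
```

Items that were not moved forward appear in the roll-up alongside those that were, so the effect of each selection decision remains visible.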