Aims: Perform actions on ESI to allow for metadata preservation, itemization, normalization of format, and data reduction via selection for review.
Goal: Identify ESI items appropriate for review and production as per project requirements.
* Although represented as a linear workflow, moving from left to right, this process is often iterative. The feedback loops have been omitted from the diagram for graphic simplicity.
At a point in the e-discovery lifecycle (“Lifecycle”) following preservation, identification, and collection, it often becomes necessary to “process” data before it can move to the next steps of the Lifecycle. The primary goals of processing include discerning, at an item level, exactly what data is contained in the universe submitted; recording all item-level metadata as it existed prior to processing; and enabling defensible reduction of data by “selecting” only appropriate items to move forward to review. All of this must happen with strict adherence to process auditing, quality control, analysis and validation, and chain-of-custody considerations.
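By way of illustration only, the item-level record captured at processing time might resemble the following minimal sketch. The field names are assumptions for the example, not a standard schema, and real systems track far more metadata.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ESIItem:
    """Illustrative record of one item and the metadata captured before processing."""
    item_id: str                               # identifier assigned at cataloging time
    source_path: str                           # original location within the collected source
    custodian: str                             # custodian associated with the source
    sha256: Optional[str] = None               # content hash captured prior to any conversion
    file_size: Optional[int] = None            # size in bytes as received
    modified_date: Optional[datetime] = None   # last-modified date as it existed pre-processing
    selected_for_review: bool = False          # set later during the selection sub-process
```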
Data may arrive at the processing stage in formats that must first be restored before subsequent work can be done (tapes, backups, etc.); individual files and e-mail may need to be extracted from container files (PST, NSF, zip, rar, etc.); and certain types of data may need to be converted to facilitate further processing (legacy mail and file formats). During these processing stages, individual items are cataloged and their associated metadata is captured.
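The following is a minimal sketch of extracting items from one kind of container mentioned above (a zip archive) and capturing basic item-level metadata as it exists in the container. PST/NSF extraction requires specialized tooling not shown here, and the dictionary keys are illustrative.

```python
import hashlib
import zipfile

def catalog_zip_container(container_path):
    """Extract each item from a zip container and capture basic item-level metadata."""
    catalog = []
    with zipfile.ZipFile(container_path) as container:
        for info in container.infolist():
            if info.is_dir():
                continue
            data = container.read(info.filename)
            catalog.append({
                "source_container": container_path,
                "item_path": info.filename,
                "size_bytes": info.file_size,
                "modified": info.date_time,        # as recorded inside the container
                "sha256": hashlib.sha256(data).hexdigest(),
            })
    return catalog

# Example: catalog_zip_container("collected/custodian_a.zip")
```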
Rarely is it necessary to review all items that are submitted for processing; a number of data reduction opportunities are usually available. Processing is further broken into four main sub-processes: Assessment, Preparation, Selection, and Output. Assessment may allow for a determination that certain data need not move forward; Preparation involves performing activities against the data (extraction, indexing, hashing, etc.) that will later allow specific item-level selection to occur; Selection involves de-duplication, searching, and analytical methods for choosing the specific items that will move forward; and Output transports reviewable items to the next phases of the Lifecycle.
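As a minimal sketch of two of the preparation activities named above, the example below hashes each item for later de-duplication and builds a crude inverted index for later search-term selection. Real platforms use dedicated indexing engines; the function and variable names are illustrative.

```python
import hashlib
import re
from collections import defaultdict

def prepare(items):
    """items: iterable of (item_id, text) pairs; returns content hashes and an inverted index."""
    hashes = {}
    index = defaultdict(set)
    for item_id, text in items:
        # Hash the item content so exact duplicates can be suppressed during selection.
        hashes[item_id] = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # Index each token so search terms can be applied during selection.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(item_id)
    return hashes, index
```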
Assessment is a critical first step in the workflow: it allows the processing team to ensure that the processing phase is aligned with the overall e-discovery strategy, to identify any processing optimizations that may result in substantial cost savings, and to minimize the risks associated with processing. A key aspect of this step is ensuring that the processing methodology will yield the expected results in terms of effort, time, and cost, as well as the expected output data streams.
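An assessment-stage estimate of effort, time, and cost can be as simple as the sketch below. The throughput, per-gigabyte cost, and reduction figures are placeholders for the example, not industry benchmarks.

```python
def estimate_processing(volume_gb, throughput_gb_per_hour=25.0, cost_per_gb=10.0,
                        expected_reduction=0.5):
    """Rough assessment-stage estimate of processing time, cost, and output volume."""
    return {
        "hours": volume_gb / throughput_gb_per_hour,
        "cost": volume_gb * cost_per_gb,
        "expected_output_gb": volume_gb * (1.0 - expected_reduction),
    }

# Example: estimate_processing(500) -> {'hours': 20.0, 'cost': 5000.0, 'expected_output_gb': 250.0}
```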
It is imperative that an appropriate QA strategy be developed at this initial phase, before undertaking the actual processing tasks. This should include methodology, goals, expectations, reporting, and exception handling. A critical element of success is developing protocols for timely communication and reporting with data custodians/users regarding any issues as they arise, so corrective measures can be undertaken as quickly as practicable.
Issues to examine:
During assessment a determination is made as to which classes of data need to be moved forward through processing. At that point there may be a number of activities required to enable handling and reduction of that data. Possible activities include the following:
Once the data chosen to move forward through processing has undergone a number of the above activities, the “selection” of data to be included in a review set can occur.
One of the primary reasons for “processing” data in an e-discovery project is so that a reasonable selection can be made of data that should be moved forward into an attorney review stage. Selection by its nature reduces the amount of data that ultimately needs to be reviewed. Once data has gone through “preparation” there are a number of techniques for selecting the items to move forward, and thus also identifying those to leave behind. De-duplication and some forms of near de-duplication can be used to suppress redundant data from being reviewed multiple times. Search terms can be applied as part of a validated approach to find certain items for review while leaving others behind. Concept extraction and other forms of document similarity identification can be used to classify items being moved forward into review.
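A minimal sketch of item-level selection, using the content hashes and inverted index produced in the preparation sketch above, might look like the following. De-duplication keeps one representative item per hash, and search terms then determine which unique items move forward; this is illustrative only and omits validation of the search terms themselves.

```python
def select_for_review(hashes, index, search_terms):
    """hashes: item_id -> content hash; index: token -> set of item_ids; returns selected item ids."""
    # Suppress duplicates: keep one representative item per content hash.
    seen, unique_items = set(), []
    for item_id, digest in hashes.items():
        if digest not in seen:
            seen.add(digest)
            unique_items.append(item_id)
    # Apply search terms: an item is selected if it hits any term.
    hits = set()
    for term in search_terms:
        hits |= index.get(term.lower(), set())
    return [item_id for item_id in unique_items if item_id in hits]
```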
The data selected to move forward to review is transformed into any number of formats, depending on the requirements of the downstream review platforms; in certain circumstances it is simply passed on to a review platform in its existing format, or exported in its native format.
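As a minimal sketch of the output step, the example below assumes a review platform that accepts native files alongside a simple CSV manifest; actual load-file formats, and whether items are converted or passed through natively, are dictated by the downstream platform. The catalog keys ("path", "sha256") are hypothetical.

```python
import csv
import shutil
from pathlib import Path

def export_natives(selected, catalog, out_dir):
    """selected: item ids; catalog: item_id -> metadata dict with 'path' and optional 'sha256' keys."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "manifest.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["item_id", "exported_file", "sha256"])
        for item_id in selected:
            meta = catalog[item_id]
            dest = out / Path(meta["path"]).name
            shutil.copy2(meta["path"], dest)   # pass the native file through unchanged
            writer.writerow([item_id, dest.name, meta.get("sha256", "")])
```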
The culmination of all the previous efforts, this step is also the last opportunity to identify and correct any issues that arose during processing. It is advisable to implement final QA procedures that match the results of processing against previously mapped expectations, including identifying and explaining exceptions. Often, last-minute visual inspection of statistically significant samples of the data is part of this process. Any significant variances from expectations need to be accounted for, audit reports correlated with the results produced, and differences flagged. A surprising number of projects produce unanticipated results due to issues such as insufficiently accurate information about source data streams or poorly defined target formats. Undue haste to produce output can quickly backfire and escalate the overall costs of production when re-processing is required.
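The sketch below illustrates one way to draw a random sample of output items for visual inspection and to flag a variance between expected and actual counts. The sample size and tolerance are placeholders; a defensible sampling plan should be set case by case.

```python
import random

def qa_check(output_items, expected_count, sample_size=50, tolerance=0.01):
    """Return a random inspection sample and a simple variance check against expectations."""
    sample = random.sample(list(output_items), min(sample_size, len(output_items)))
    variance = abs(len(output_items) - expected_count) / max(expected_count, 1)
    return {
        "items_to_inspect": sample,
        "variance": variance,
        "within_tolerance": variance <= tolerance,
    }
```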
Throughout the four phases of processing there are opportunities to analyze the data, or the results of certain sub-processes, to ensure that overall results are as intended and that decisions about the handling of the data are valid and appropriate. Some possible analysis and validation opportunities are as follows:
Validation is the testing of results to ensure that appropriate high-level processing and selection decisions have been made and that the ultimate results match the intent of the discovery team. Quality control (“QC”) involves testing whether specific technical processes were performed as expected, regardless of what the results show. For example, a QC task may check that a particular set of data was properly indexed for full-text searching, that certain search terms were applied to those indices, and that the resulting search hits were identified. A validation process, by contrast, would check whether the items returned by those search terms were actually relevant to the case based upon stated criteria, or whether they were false hits.
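The contrast can be sketched as follows: the QC check asks whether the mechanical steps ran as expected, while the validation check samples the results and asks whether they match the discovery team's intent. The names `index`, `hit_ids`, and `judge_relevance` are illustrative, not a real API.

```python
import random

def qc_search_terms(index, search_terms, hit_ids):
    """QC: confirm every applied term exists in the index and every recorded hit is an indexed item."""
    missing_terms = [t for t in search_terms if t.lower() not in index]
    indexed_ids = set().union(*index.values()) if index else set()
    stray_hits = [i for i in hit_ids if i not in indexed_ids]
    return {"missing_terms": missing_terms, "stray_hits": stray_hits}

def validate_hits(hit_ids, judge_relevance, sample_size=25):
    """Validation: sample the hits and estimate the proportion that are actually relevant."""
    sample = random.sample(list(hit_ids), min(sample_size, len(hit_ids)))
    relevant = sum(1 for item_id in sample if judge_relevance(item_id))
    return relevant / len(sample) if sample else 0.0
```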
To meet the needs of project management, status reporting, exception reporting, chain of custody, and defensibility, it is important that processing systems track the work performed on all items submitted for processing. Every item should carry tracking information about the various tasks that have been performed on it. In addition, systems should be able to roll up this item-level tracking information into reports showing overall status, or the status of any particular item or group of items, whether or not they are ultimately moved forward into review. It is important to be able to document all high-level processing and selection decisions pertaining to the processed data, and the system should be able to show the effect of those decisions on the data.
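A minimal sketch of item-level task tracking with a roll-up report is shown below, assuming an in-memory log; production systems would persist these events along with operators, tools, and versions. The class and method names are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timezone

class ProcessingAudit:
    def __init__(self):
        # item_id -> list of (timestamp, task, outcome) events
        self.events = defaultdict(list)

    def record(self, item_id, task, outcome):
        """Record that a task (extract, index, hash, search, export, ...) ran against an item."""
        self.events[item_id].append((datetime.now(timezone.utc), task, outcome))

    def rollup(self):
        """Summarize, per task, how many items had each outcome."""
        summary = defaultdict(lambda: defaultdict(int))
        for entries in self.events.values():
            for _, task, outcome in entries:
                summary[task][outcome] += 1
        return {task: dict(counts) for task, counts in summary.items()}

# Example:
# audit = ProcessingAudit()
# audit.record("ITEM-0001", "extract", "ok")
# audit.record("ITEM-0001", "index", "ok")
# audit.record("ITEM-0002", "extract", "failed")
# audit.rollup()  # {'extract': {'ok': 1, 'failed': 1}, 'index': {'ok': 1}}
```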