Collection - Cost Drivers
From EDRM
Volume of Data
Because many companies manage terabytes of information, only some of which is potentially relevant, it is essential to use a culling methodology to reduce the data to a responsive set. A company may control thousands of computers and hundreds of complex servers. Not all of those systems contain data that is relevant to a pending litigation. Efficiently separating the relevant from the irrelevant is key to controlling the costs of an electronic discovery project. This is illustrated by the graphic.
This process may be illustrated by the following example. A corporation generates three backup tape sessions each month for its email and file server information. Each backup contains approximately one terabyte (1 billion pages), much of which is duplicative information. There are 20 target users whose data is being requested from January 1, 1999 to December 31, 2003. The request is for their documents and correspondence related to an employee matter.
The first step is to extract the data using one of the tape extraction methodologies described above (NE restoration vs. NNE extraction).
The next step is to process the data so that it is searchable and reviewable by target user.
One methodology is to ingest all of the data (all three terabytes) into a search engine and then provide the search criteria to limit the data to target users.
The other methodology is to locate the Target Users' data in its raw form and then extract that data for further processing.
In this example, a date filter should be applied to minimize the responsive data set. There are several things to consider when deciding on date issues. The meta-data fields used for email date filtering should be both "Sent Date" and "Received Date." The meta-data fields used for user files should be "Last Modification Date" and "Create Date." (Note: these dates may have been modified or may not be valid, depending on the collection methodology.)
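A minimal sketch of such a date filter is shown here. The dictionary keys (sent_date, received_date, last_modified, created) are hypothetical names for the meta-data fields listed above, and keeping an item when either of its dates falls in range is an assumption; the parties may agree on a stricter rule.

```python
from datetime import date

# Date range taken from the example request (January 1, 1999 to December 31, 2003).
RANGE_START = date(1999, 1, 1)
RANGE_END = date(2003, 12, 31)

def in_range(d):
    """True when d is a date inside the requested range (inclusive)."""
    return d is not None and RANGE_START <= d <= RANGE_END

def email_is_responsive(item):
    # Check both email date fields; either one in range keeps the item (assumption).
    return in_range(item.get("sent_date")) or in_range(item.get("received_date"))

def file_is_responsive(item):
    # Check both user-file date fields; either one in range keeps the item (assumption).
    return in_range(item.get("last_modified")) or in_range(item.get("created"))
```

Because a missing date is treated as out of range, items whose dates were lost during collection are dropped by this sketch; a real workflow would more likely route them to an exception list for manual handling.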
Keyword searching for the employment subject matter can then be applied by the processor of the data. It is recommended that some statistical sampling be done at this stage to determine whether the keywords may over-produce. Most vendors and tools can provide statistics before the keyword list is finalized, so that you can gauge whether the results are likely to be relevant.
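One way to picture the sampling step is sketched below: draw a random sample of the documents hit by a keyword, have a reviewer judge each one, and estimate what share of all hits is relevant. The function name, the sample size of 400, and the normal-approximation margin of error are illustrative assumptions, not a prescribed protocol.

```python
import math
import random

def estimate_precision(hit_ids, is_relevant, sample_size=400, seed=0):
    """Estimate the share of keyword hits that is relevant from a random sample.

    hit_ids: identifiers of the documents returned by the keyword.
    is_relevant: callable standing in for the reviewer's judgment (hypothetical).
    Returns (estimate, margin), where margin is an approximate 95% margin of error.
    """
    ids = list(hit_ids)
    if not ids:
        raise ValueError("no hits to sample")
    rng = random.Random(seed)
    n = min(sample_size, len(ids))
    sample = rng.sample(ids, n)
    relevant = sum(1 for doc_id in sample if is_relevant(doc_id))
    p = relevant / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, margin
```

A low estimate suggests the keyword over-produces and should be narrowed before the full data set is processed.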
Attorney review - After narrowing the dataset using the above technologies, the quantity of data needing attorney review should be quite limited. This is important, as attorney review can often be the most expensive part of preparing electronic documents for production. Typically the attorney review will identify relevant and non-relevant documents, identify documents that are privileged and therefore should not be produced, and sometimes classify the documents by the issues pertinent to the case. There are a number of different choices available for the review stage, including review of extracted text, native files, or TIFF or PDF images. Regardless of the review approach used, it is important that the reviewers have the ability to examine the native file with the native application if necessary.
Once the reviewers have selected the documents to be produced, the responsive data can be produced in numerous ways, such as paper, TIFF/PDF, load file with TIFF/PDF, native file, etc.
These steps are illustrated by the graphic below:
Location of Data
The location of the data is an important factor in determining the strategy and cost of a collection methodology. Items that are not in daily use, such as CDs/DVDs, backup tapes, and removable hard drives, can often be sent to the law firm or the vendor for processing. However, for security, legal and other reasons, some things simply cannot be sent. In many cases computer forensic vendors are asked to go into an organization and create forensic images of the target computers’ hard drives, either so that users do not know that their computers were captured or so that their business day is not affected by the capture process. Backup tapes create special issues because they can be stored anywhere, including offsite locations. Accurate information is needed about the organization’s disaster recovery plan and about whether any deviations from or exceptions to this plan have been made. In a large organization, where a single disaster recovery system can use hundreds of backup tapes, it may be difficult to locate and process all of the backup tapes.
Other portable storage devices create similar priorities: quickly identifying and locating them, and determining the best approach to restore relevant data. Additionally, offsite data housed by third parties or at the home of a custodian presents unique challenges. Therefore, identifying the location and type of information to be collected is one of the most important steps in controlling costs in the electronic discovery process.
Keyword Searching
Expert Assistance
Keyword searching, done properly, is truly an art form. It is recommended that an electronic evidence expert (with court experience) work very closely with the legal team throughout this process. This process is designed to help the client design the most efficient search methodology in order to minimize the number of producible records (email, attachments, and user files) in native format.
Minimizing the documents significantly reduces the cost of generating a review database (Concordance, Summation, or an online repository), which is generally priced on gigabytes of input as well as pages of output. Also, by focusing on case-specific search requirements, documents not relevant to the discovery request are kept out of the privilege review process, which reduces review time and minimizes the documentation of non-responsive records.
The use of an expert in this process allows the producing party to have a third party resource able to create affidavits and protocols justifying the methodology.
Statistical Sampling
This also supports the use of statistical sampling. Suppose, for example, that one of the requested keywords is “cat”, but “cat” returns 1,000,000 documents that upon closer inspection appear to have no relevance. The legal team can then try different variations of keyword strings, such as (cat W/10 dog), which would limit the resulting documents contextually to a more reasonable 10,000 documents.
Vetting Keyword Lists
The general best-practice process for vetting a keyword list is:
- Client provides first iteration of keyword requests (“Request List”) in Microsoft Word format;
- Expert returns red-line version of Request List and discusses the logic requirements with the Client;
- Client returns comments in red-line format;
- Expert runs test and provides statistics on results;
- Client signs off (via email) on final Request List;
- Processing begins.
This process can also be combined with contextual searches in order to reduce the reviewable population.
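The statistics returned during the test step might look something like the per-term hit counts produced by the sketch below. The function and its whole-word, whitespace-based matching are illustrative assumptions; actual vendor reports vary by tool.

```python
def hit_report(documents, terms):
    """Count how many documents each term hits, and how many any term hits.

    documents: mapping of document id to extracted text (hypothetical input shape).
    terms: single-word keywords from the Request List.
    """
    report = {}
    any_hit = set()
    for term in terms:
        needle = term.lower()
        # Whole-word match on whitespace-separated tokens (a simplification).
        hits = {doc_id for doc_id, text in documents.items()
                if needle in text.lower().split()}
        report[term] = len(hits)
        any_hit |= hits
    report["(any term)"] = len(any_hit)
    return report
```

Comparing the per-term counts with the combined count shows which terms drive the volume and which add few unique documents, which helps focus the red-line discussion.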
Boolean Searching
Many different types of search engines are in use, to varying degrees. Boolean searching is the most common way of querying data. Boolean logic refers to the logical relationships among search terms and is named for the mathematician George Boole.
The most typical Boolean search operators are:
- AND = cat AND dog = the document must have both "cat" and "dog"
- OR = cat OR dog = the document can have either word or both words
- " " = "catalog" = the exact word "catalog" must be in the document. Another example is “cat a log” = the exact term “cat a log” must be in the document, including the spacing.
- NOT = cat NOT dog = the document must have "cat", but if the word "dog" is in the document, it will not be a responsive document
- ( ) = (cat AND dog) NOT (bird OR mouse) = grouping of words = the document must have both words "cat" and "dog", but if the words "bird" or "mouse" are in the document, it will not be responsive.
- Wildcard = * = cat* = the document must have one or more words that start with “cat”, with any ending, e.g. “cat”, “catalog”, “cats”, “catastrophe”, etc.
- W/ = proximity search = the words need to be within a specific number of words of each other, e.g. (cat W/5 dog) = “cat” needs to be within 5 words of “dog”, either before or after = “the cat ate the dog” or “the dog crept up on the cat”.
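The operators above can be sketched in a few lines of Python. This is an illustration only, not how any particular search engine works: the helper names are hypothetical, tokenization is reduced to lowercase letters and digits, and proximity is counted as the difference in word positions, which individual engines may define differently.

```python
import re

def tokens(text):
    """Lowercase word tokens; a simplification of what a real indexer does."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has(text, term):
    """Exact word match, or prefix match when the term ends in '*'."""
    words = tokens(text)
    if term.endswith("*"):
        return any(w.startswith(term[:-1].lower()) for w in words)
    return term.lower() in words

def within(text, a, b, n):
    """True when words a and b occur within n words of each other, in either order."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

def example_query(text):
    # (cat AND dog) NOT (bird OR mouse), built from Python's own Boolean operators
    return (has(text, "cat") and has(text, "dog")) and not (
        has(text, "bird") or has(text, "mouse"))
```

With these helpers, has(text, "cat") does not match a document that only contains “catalog”, while has(text, "cat*") does, which mirrors the difference between the exact-word and wildcard operators.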
Other Search Options
Other search options include:
Stemming
This process compares the root forms of search words and returns documents that contain words derived from a common stem. For example, if a STEM search were applied to the word “instructional”, documents containing the words “instruct”, “instructs”, and “instruction” would be returned. Stemming is significantly different from a wildcard (*) search, since the wildcard search is based on a group of characters, whereas a STEM search relies on linguistic analysis.
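The difference between stemming and a wildcard search can be seen in the rough sketch below. The suffix list is a deliberately naive stand-in for the linguistic analysis a real stemmer performs, and all function names are hypothetical.

```python
# Naive suffix list, longest first; real stemmers use far richer linguistic rules.
SUFFIXES = ("ional", "ions", "ion", "ing", "ed", "s")

def naive_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_match(query, word):
    """STEM-style match: both words reduce to the same stem."""
    return naive_stem(query) == naive_stem(word)

def wildcard_match(pattern, word):
    """Wildcard-style match: the word starts with the characters before '*'."""
    prefix = pattern.rstrip("*").lower()
    return word.lower().startswith(prefix)
```

Here a stem search on “instructional” finds “instruct”, which the wildcard “instructional*” cannot, while the wildcard “cat*” finds “catalog”, which shares only characters with “cat” and not a stem.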
Fuzzy
Fuzzy searching allows users to find documents even if the word being searched is misspelled. A fuzzy search is done by means of fuzzy matching software that returns a list of results based on likely relevance even though the search string does not exactly match. Fuzzy searches generally can be fine-tuned and ranked, depending on the search engine being used. Fuzzy searches are familiar to those who have worked with Optical Character Recognition (“OCR”) text from paper documents, where recognition errors are common.
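As a small illustration, Python's standard difflib module can rank words by similarity to a search term. This is only a sketch of the idea; the vocabulary and the 0.8 cutoff are assumptions, and commercial search engines use their own matching algorithms.

```python
import difflib

def fuzzy_hits(term, vocabulary, cutoff=0.8):
    """Return indexed words similar to the term, best matches first.

    cutoff is the minimum similarity score (0 to 1); raising it tightens the search.
    """
    return difflib.get_close_matches(term.lower(), vocabulary, n=10, cutoff=cutoff)

# Hypothetical index vocabulary containing OCR-style errors.
vocabulary = ["litigation", "1itigation", "litigatlon", "mitigation", "invoice"]
matches = fuzzy_hits("litigation", vocabulary)
```

In this example the search for “litigation” picks up the OCR-style variants “1itigation” and “litigatlon”, but also the unrelated word “mitigation”, which is why the cutoff usually needs tuning against sample results.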
Noise Words
Depending on the search engine being used, Noise Words may become an issue. Noise Words are certain common words that are ignored by indexing or search engines. Typical Noise Words include, but are not limited to: “the”, “and”, “of”, “his”, “my”, “when”, “there”, “is”, “are”, “or”, and “it”. The search strategy must take Noise Words into account before the term list is finalized, especially if opposing counsel is involved in the process. It may be very difficult to renegotiate keywords based on a perceived technological deficiency. Each vendor and software package has a different methodology for dealing with Noise Words. It is important to discuss Noise Words before creating an index of the documents’ words.
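A quick check like the sketch below can flag requested terms that a noise-word list would shorten or remove entirely. The noise-word set is the example list from the text; a real engine's list, and the way it handles phrases, will differ, and the function names are hypothetical.

```python
# Example noise words from the text; each engine ships its own list.
NOISE_WORDS = {"the", "and", "of", "his", "my", "when", "there", "is", "are", "or", "it"}

def index_terms(text):
    """Words that would survive indexing once noise words are dropped."""
    return [w for w in text.lower().split() if w not in NOISE_WORDS]

def affected_terms(terms):
    """Map each requested term that loses words to what remains of it."""
    return {t: index_terms(t) for t in terms if index_terms(t) != t.lower().split()}
```

Running affected_terms on a proposed list before indexing shows, for example, that “statement of work” would be reduced to “statement” and “work”, and that a term such as “it” would not be searchable at all.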
Asian Character Sets
Asian (or Double-Byte) character searches are another challenging piece of the keyword search puzzle.
In general, it is imperative that the environment be set up properly to maximize the searching of international Unicode information. Asian and foreign language projects are generally handled by specialists due to the complexity of the process. Most vendors can index MS Outlook mail as well as the typical user file types (e.g. Office, HTML, PDF, etc.). Once the data has been indexed by the appropriate search engine, the provided keywords can be applied.
The most common litigation support output for Asian language cases is to create a load file with the associated TIFF/PDF and meta-data. Depending on the data set, vendors may be able to provide the OCR text of the English characters and maybe the Asian characters.
If necessary, Summary Translation services can be provided. This is the process in which native speakers review the document and code information, including a summary of the document. This is generally a fairly expensive process, but it works well when the language cannot be indexed or searched via the normal litigation support review tools.
As for keywords, the search terms should be provided in a Microsoft Word document in exactly the form in which they are to be searched (Word supports Unicode characters, which can be exported to the search tool). One of the challenges of dealing with Asian languages is that spaces are often not used between words, so words run together. That makes complex searches such as "exact phrase" or "cat AND dog" very difficult. A general recommendation, depending on the search criteria, is to create single keywords that can be wrapped in wildcards, such as "*johnny*", so that variations like "johnnysmith" or "johnnysaidhelikestosearchdata" will be found. Search experts will work with the legal team and help refine the words once the data is indexed.
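The effect of wrapping a keyword in wildcards can be sketched as a plain substring test, contrasted here with a whole-word test that depends on spaces. Both functions are hypothetical illustrations, and the Japanese sentence is an example added for this sketch rather than taken from any case data.

```python
def contains_keyword(text, keyword):
    """Equivalent of a *keyword* search: match the keyword anywhere in the text."""
    return keyword.casefold() in text.casefold()

def whole_word_match(text, keyword):
    """Whole-word search that relies on spaces separating the words."""
    return keyword.casefold() in text.casefold().split()

# Run-together text defeats the whole-word search but not the wildcard-style search.
samples = ["johnnysmith", "johnnysaidhelikestosearchdata"]
found = [s for s in samples if contains_keyword(s, "johnny")]
```

The same holds for text written without spaces: contains_keyword("私は東京都に住んでいます", "東京") finds the keyword, while the whole-word test does not, because the sentence is a single unbroken token.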
Old Technology/Legacy Systems and Databases
Another critical factor affecting the cost of a collection effort is the currency of the technology involved. Many cases involve legacy databases that use outdated or obsolete technology, including outdated operating systems and hardware. When this is the case, unique, customized solutions are often required to collect potentially relevant data. This normally requires extensive reliance on the company’s IT staff, who would typically have access to the legacy operating systems and hardware.
Even when legacy data can be read and used, a database by definition contains a large quantity of data that is typically unformatted and becomes useful only when it is put into a report. Databases contain entries and complex table structures that appear nonsensical if only the raw data is provided. Usually what is relevant is the reports and queries that are populated by the information in the database. Many database systems do not permit the creation of customized reports containing the information in the form that is deemed potentially relevant in the litigation. Therefore, a third party is often needed to write customized software to extract data from various locations in the database and create a formatted document that can be reviewed.
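To make the idea of a custom extraction concrete, the sketch below joins two raw tables into readable report lines using Python's built-in sqlite3 module. The table names, columns, and sample rows are invented for illustration; a real legacy system would have its own schema and would likely need its own access tooling.

```python
import sqlite3

def build_report(conn):
    """Join the raw tables and return one readable line per review record."""
    rows = conn.execute(
        """
        SELECT e.name, r.review_date, r.rating
        FROM reviews AS r
        JOIN employees AS e ON e.id = r.employee_id
        ORDER BY e.name, r.review_date
        """
    ).fetchall()
    return [f"{name} | {review_date} | {rating}" for name, review_date, rating in rows]

def demo_connection():
    """In-memory database with hypothetical tables standing in for legacy data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE reviews (employee_id INTEGER, review_date TEXT, rating TEXT);
        INSERT INTO employees VALUES (1, 'A. Smith'), (2, 'B. Jones');
        INSERT INTO reviews VALUES (2, '2001-06-30', 'Meets'), (1, '2000-12-15', 'Exceeds');
        """
    )
    return conn
```

The raw reviews table holds only numeric employee ids; the names appear only once the tables are joined, which is the kind of formatting step a reviewer needs before the data makes sense.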
Because of the costs involved in a customized approach to databases, an organization may be tempted to make portions of a database available to the opposing party so that it will be required to pay the costs of securing the data that it feels may be potentially relevant in the litigation. Great caution should be taken in agreeing to turn over unreviewed databases to opposing counsel. Whatever cost may be saved with this approach could be overwhelmed by the ultimate outcome of the case.
When dealing with older technology, legacy systems or databases, it is important to identify the complexity of the collection and production of that information as early as possible. It may be necessary to retain outside experts who can provide testimony regarding the complexity, timeliness or cost of securing legacy information. If the parties to the case cannot agree on a reasonable approach to this problem, a court may need to render judgment on the appropriateness and limits of collecting and producing legacy data. The courts are split over granting the requesting party access to the producing party’s databases to run searches. See, for example, In re Honeywell Int’l, Inc. Securities Litigation, 2003 WL 22722961 (S.D.N.Y.)(for providing access); In re Ford Motor Company, 345 F.3d 1315 (11th Cir. 2003)(against providing access).