EDRM Evergreen/Processing/Analysis and Validation

Comments: Please submit comments to the EDRM Evergreen Processing forum

Reporting

Whether using a services provider or a software solution, corporations and their law firms alike need to maintain control over all of the documents in a discovery matter. Reporting is an important control that can be exercised throughout the processing timeline. Information about processing can be extremely valuable in establishing a downstream review strategy. Valuable questions that can be asked of the service provider include:

  1. What reports and/or information are available from the system in place?
  2. How frequently are these reports produced?
  3. Does the corporation or law firm have immediate (online) access to important project information?

Reports provide both the consumer and the provider with the information necessary to communicate effectively and manage the project.

Typical report types include:

Media Analysis Reports

Media Analysis reporting typically includes information regarding the number of files contained on a given piece of media, the type of files contained on the media, and the size of the data contained on the media. In some cases, directory lists of the file names are available. When tapes are the media type, these reports are called catalogues and contain similar information as well as information about when and how the tape backup was completed. This information is useful because it can be used to get a handle on how much data is available to be processed. Also, media analysis reporting has been used to determine whether it is necessary to restore or process particular pieces of media based on the information provided regarding what information was contained. As discussed previously, these reports help guide both cost and delivery schedule estimations.
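As an illustration of the counts described above, the following minimal sketch summarizes a piece of media by number of files, file type, and total size. The input format (a list of path and size pairs) and the function name `media_analysis` are assumptions made for this example, not part of any particular processing tool.

```python
from collections import Counter
from pathlib import PurePosixPath


def media_analysis(files):
    """Summarize a file listing.

    `files` is an iterable of (path, size_in_bytes) tuples; this input
    format is a hypothetical stand-in for a directory or catalogue listing.
    """
    by_type = Counter()
    total_bytes = 0
    file_count = 0
    for path, size in files:
        # Group by lower-cased extension; files without one go under "(none)".
        ext = PurePosixPath(path).suffix.lower() or "(none)"
        by_type[ext] += 1
        total_bytes += size
        file_count += 1
    return {
        "file_count": file_count,
        "total_bytes": total_bytes,
        "by_type": dict(by_type),
    }


if __name__ == "__main__":
    listing = [("mail/archive.pst", 300), ("docs/memo.DOC", 100), ("readme", 5)]
    print(media_analysis(listing))
```

A summary like this is only a starting point for cost and schedule estimates; a real report would usually also record the media identifier and, for tapes, the backup details.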

Custodian Reports

Cases revolve around people, so the custodian list is critical, as is the ability to associate data with particular custodians. Custodian-level reports provide data volume by custodian. They can also show the percentage and/or volume of data culled by deduplication, searching, or other culling techniques. This information can be used to ensure that all data for a particular custodian has been delivered as the legal team prepares for depositions.
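A custodian-level report of this kind could be assembled roughly as follows. This is a sketch under stated assumptions: each record is a hypothetical (custodian, size in bytes, culled flag) tuple, and the function name `custodian_report` is illustrative only.

```python
from collections import defaultdict


def custodian_report(records):
    """Aggregate data volume and culled volume per custodian.

    `records` is an iterable of (custodian, size_in_bytes, was_culled)
    tuples; this record layout is an assumption for the example.
    """
    totals = defaultdict(lambda: {"total_bytes": 0, "culled_bytes": 0})
    for custodian, size, was_culled in records:
        totals[custodian]["total_bytes"] += size
        if was_culled:
            totals[custodian]["culled_bytes"] += size

    report = {}
    for custodian, t in totals.items():
        # Guard against division by zero for custodians with no data.
        pct = 100.0 * t["culled_bytes"] / t["total_bytes"] if t["total_bytes"] else 0.0
        report[custodian] = {**t, "culled_pct": round(pct, 1)}
    return report


if __name__ == "__main__":
    sample = [("smith", 100, False), ("smith", 300, True), ("jones", 50, False)]
    for name, row in custodian_report(sample).items():
        print(name, row)
```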

Data Culling Reports

Data culling is the process of segregating files in a collection based on specific criteria, prior to processing those files. When files are not being delivered back to the client, reports give the client a means to ensure that all data has been handled properly. To that end, deduplication reports include the name of the file, the location path of the file, and information regarding other instances of the same file. Depending on the size of the data collection, these reports can contain millions of entries and may be best provided in a database format.
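The sketch below shows one way a deduplication report with these fields might be produced. It assumes duplicates are identified by a SHA-256 hash of file content, which is a common approach but is not stated in the text above; the input format and the name `dedup_report` are likewise hypothetical.

```python
import hashlib
from collections import defaultdict


def dedup_report(files):
    """List duplicate files alongside the instance that was kept.

    `files` is an iterable of (path, content_bytes) tuples. Identifying
    duplicates by SHA-256 content hash is an assumption for this sketch.
    """
    groups = defaultdict(list)
    for path, content in files:
        groups[hashlib.sha256(content).hexdigest()].append(path)

    report = []
    for digest, paths in groups.items():
        # The first instance seen is treated as the one kept.
        kept, *duplicates = paths
        for dup in duplicates:
            report.append({
                "name": dup.rsplit("/", 1)[-1],
                "path": dup,
                "duplicate_of": kept,
                "sha256": digest,
            })
    return report


if __name__ == "__main__":
    sample = [
        ("smith/a.doc", b"same content"),
        ("jones/copy.doc", b"same content"),
        ("jones/b.doc", b"other content"),
    ]
    for entry in dedup_report(sample):
        print(entry)
```

For collections with millions of entries, the same rows would more sensibly be written to a database table than held in memory.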

Deduplication/Search/Filter Reports

Search and Filter reports contain similar file-level information, such as name and path location, and also record the reason the file was segregated and the search term or piece of metadata to which the file was responsive. Search reports can also contain analysis of the search terms: the number of files responsive to each term and the number of times each term appears within the files. Search reports can be used to validate search term choices. In some cases, these reports have been used to renegotiate search terms when the agreed-upon terms did not yield an expected result, that is, when either too many or too few files were responsive to particular terms. These reports can be used to substantiate that each and every file contained in the source media has been handled appropriately.
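The per-term analysis described above (responsive files and total occurrences) can be sketched as follows. The matching rule used here, case-insensitive whole-word matching on plain text, is an assumption for illustration; real search engines differ in stemming, proximity, and wildcard handling. The name `search_term_report` is hypothetical.

```python
import re


def search_term_report(documents, terms):
    """Count responsive files and total hits for each search term.

    `documents` maps a file path to its extracted text. Matching is
    case-insensitive and whole-word, an assumption made for this sketch.
    """
    report = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        responsive_files = 0
        total_hits = 0
        for text in documents.values():
            hits = len(pattern.findall(text))
            if hits:
                responsive_files += 1
                total_hits += hits
        report[term] = {"responsive_files": responsive_files, "total_hits": total_hits}
    return report


if __name__ == "__main__":
    docs = {
        "a.txt": "Merger talks. The merger closed.",
        "b.txt": "No relevant text",
        "c.txt": "merger",
    }
    print(search_term_report(docs, ["merger", "fraud"]))
```

A term that returns zero responsive files, or nearly all of them, is a candidate for the kind of renegotiation mentioned above.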

Metrics

One of the biggest challenges in dealing with electronic data is estimating the volume when all that is known is the total GB to process. Since the overall volume will have a significant impact on the project as a whole, it is important to understand the factors that drive that estimate.

Means of measuring include:

Pages

In many cases, the overall review time and cost for a project can be determined by the total number of pages that will be reviewed and eventually produced. The more you know about the collection, the better this can be estimated. If you can break down the total volume into email data, application data, and non-printable data, you can get a more accurate estimate than you would based on volume alone.

Number of Documents

Another important driver of document review effort is the number of documents that will be reviewed, so estimating it yields a valuable statistic. Although there are quick ways to identify the number of documents in the collection, it is more challenging to quickly identify the documents that will be removed by the culling process.

Culling Rate

The amount of deduplication can vary greatly based on the nature of the data (backups, live data, or a combination), the scope of the deduplication (within or across custodians), and the custodians' retention habits. Searching/Filtering is another aspect that is important to consider when estimating the overall volume that will be delivered for review. Depending on the number of terms and the nature of the documents, the results can vary greatly.

Non-Printable Files

Non-printable files are documents that in general will not be delivered or reviewed. Therefore, it is important to exclude them from the document/GB/page estimates in order to yield more accurate results.
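To show how these factors combine, the sketch below applies a non-printable exclusion, a deduplication rate, and a search/filter rate to a starting document count. It assumes each rate is the fraction removed at that stage and that the stages apply one after another in this order; actual workflows may order or scope them differently. The function name and parameters are hypothetical.

```python
def estimate_review_volume(total_docs, nonprintable_rate, dedup_rate, filter_rate):
    """Estimate documents remaining after each culling stage.

    Each rate is the fraction of documents removed at that stage
    (for example, 0.21 for 21%). Applying the stages sequentially and
    independently is an assumption made for this sketch.
    """
    after_nonprintable = total_docs * (1 - nonprintable_rate)
    after_dedup = after_nonprintable * (1 - dedup_rate)
    after_filtering = after_dedup * (1 - filter_rate)
    return {
        "after_nonprintable": after_nonprintable,
        "after_dedup": after_dedup,
        "after_filtering": after_filtering,
    }


if __name__ == "__main__":
    # Illustrative rates only; substitute figures observed on the actual matter.
    stages = estimate_review_volume(1000, nonprintable_rate=0.2, dedup_rate=0.5, filter_rate=0.5)
    for stage, docs in stages.items():
        print(f"{stage}: {docs:.0f} documents")
```

Because the rates multiply, small changes in the deduplication or filtering rate can shift the final review volume substantially, which is why the text stresses understanding the data before estimating.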

Industry Benchmark Survey

The table below lists some industry averages that can be used as guidance when estimating a document collection:

| Benchmark | High value | Median value | Low value |
|---|---|---|---|
| Images[1] per GB | 78,671 | 47,213 | 18,534 |
| Images per file email | 11 | 4 | 2 |
| Images per file app files | 63 | 10 | 3 |
| Files per GB email | 36,530 | 22,572 | 9,934 |
| Files per GB app files | 20,305 | 15,791 | 7,553 |
| GB per custodian email | 5 | 2 | 1 |
| GB per custodian app files | 4 | 1 | 0 |
| Culling Rate Percentages | | | |
| Deduplication | 51% | 21% | 6% |
| Searching/Filtering | 64% | 61% | 23% |
| Non-printable files | 22% | 5% | 2% |
| Processing Speeds | | | |
| Process time per GB native | 117 | 33 | 11 |
| Process time per GB image | 35 | 32 | 23 |
| Process time to first deliverable | 53 | 35 | 21 |
| Process time by file type | 4 | 3 | 2 |
| Process time by file type | 6 | 4 | 3 |
| Process time by file type | 2 | 3 | 2 |
| Quality | | | |
| First pass quality yield %[2] | 57% | 78% | 73% |
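As a rough illustration of how the benchmark figures might be applied, the sketch below converts a volume of email and application data into estimated file and image counts using the median values from the table above. The function name, dictionary keys, and the choice of the median column are assumptions for this example; an estimate for a real matter should use figures appropriate to that collection.

```python
# Median values copied from the benchmark table above.
MEDIAN_FILES_PER_GB = {"email": 22_572, "app": 15_791}
MEDIAN_IMAGES_PER_FILE = {"email": 4, "app": 10}


def estimate_collection(gb_by_type):
    """Estimate file and image counts from GB of email and app data.

    `gb_by_type` maps "email" and/or "app" to a volume in GB. The keys
    and the use of median benchmarks are assumptions for this sketch.
    """
    estimate = {}
    for kind, gigabytes in gb_by_type.items():
        files = gigabytes * MEDIAN_FILES_PER_GB[kind]
        images = files * MEDIAN_IMAGES_PER_FILE[kind]
        estimate[kind] = {"files": files, "images": images}
    return estimate


if __name__ == "__main__":
    # A single custodian at the median: 2 GB of email and 1 GB of app files.
    for kind, counts in estimate_collection({"email": 2, "app": 1}).items():
        print(kind, counts)
```

These counts are pre-culling figures; deduplication, searching/filtering, and non-printable exclusions would reduce them further.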

Paper-to-Electronic Estimate Conversion Table

| Boxes of Documents | Approximate Total Pages | Approximate Data Volume |
|---|---|---|
| 1 | 2,500 | 50 MB |
| 10 | 25,000 | 500 MB |
| 20 | 50,000 | 1 GB |
| 100 | 250,000 | 5 GB |
| 200 | 500,000 | 10 GB |
| 300 | 750,000 | 15 GB |
| 400 | 1,000,000 | 20 GB |
| 500 | 1,250,000 | 25 GB |
| 1,000 | 2,500,000 | 50 GB |
| 2,000 | 5,000,000 | 100 GB |
| 5,000 | 12,500,000 | 250 GB |
| 10,000 | 25,000,000 | 500 GB |
| 20,000 | 50,000,000 | 1 TB |
| 40,000 | 100,000,000 | 2 TB |
| 60,000 | 150,000,000 | 3 TB |
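The conversion table follows a fixed ratio of 2,500 pages and 50 MB per box, with 1,000 MB to the GB and 1,000 GB to the TB. The sketch below reproduces that arithmetic for an arbitrary number of boxes; the function name `paper_to_electronic` is hypothetical, and the ratios are approximations taken from the table rather than measured values.

```python
# Ratios implied by the table above (approximate).
PAGES_PER_BOX = 2_500
MB_PER_BOX = 50


def paper_to_electronic(boxes):
    """Return (approximate pages, approximate size string) for a box count."""
    pages = boxes * PAGES_PER_BOX
    megabytes = boxes * MB_PER_BOX
    # The table uses decimal units: 1,000 MB per GB and 1,000 GB per TB.
    if megabytes >= 1_000_000:
        size = f"{megabytes / 1_000_000:g} TB"
    elif megabytes >= 1_000:
        size = f"{megabytes / 1_000:g} GB"
    else:
        size = f"{megabytes:g} MB"
    return pages, size


if __name__ == "__main__":
    for boxes in (1, 20, 300, 60_000):
        pages, size = paper_to_electronic(boxes)
        print(f"{boxes:,} boxes: about {pages:,} pages, about {size}")
```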

Footnotes

  1. Images are counted one per page, so a 4-page multi-page TIFF would count as 4 images.
  2. The percentage of data that runs through without intervention or exception handling.

[updated Jan. 31, 2008]
