EDRM Evergreen/Processing/Analysis and Validation
Reporting
Whether using a services provider or a software solution, corporations and their law firms alike need to maintain control over all of the documents in a discovery matter. Reporting is an important control that can be exercised throughout the processing timeline. Information about processing can be extremely valuable in establishing a downstream review strategy. Valuable questions that can be asked of the service provider include:
- What reports and/or information are available from the system in place?
- How frequently are these reports produced?
- Does the corporation or law firm have immediate (online) access to important project information?
Reports provide both the consumer and the provider with the information necessary to communicate effectively and manage the project.
Typical report types include:
Media Analysis Reports
Media analysis reporting typically covers the number of files on a given piece of media, the types of files, and the size of the data. In some cases, directory lists of the file names are available. When tapes are the media type, these reports are called catalogues; they contain similar information as well as details about when and how the tape backup was completed. This information helps gauge how much data is available to be processed. Media analysis reporting has also been used to determine whether particular pieces of media need to be restored or processed at all, based on what they are reported to contain. As discussed previously, these reports help guide both cost and delivery schedule estimates.
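A media analysis summary of this kind can be approximated with a short script. The following is a minimal sketch, assuming the media is mounted as an ordinary directory tree; the function name `media_analysis` and the choice to group files by extension are illustrative assumptions, not part of any EDRM specification.

```python
import os
from collections import defaultdict


def media_analysis(root):
    """Walk a mounted media directory and summarize file counts and sizes by extension."""
    by_type = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Unreadable entries are skipped in this sketch; a real
                # report would record them as exceptions.
                continue
            # Group by lowercased extension (an assumption; a production
            # tool would more likely inspect file signatures).
            ext = os.path.splitext(name)[1].lower() or "(none)"
            by_type[ext]["files"] += 1
            by_type[ext]["bytes"] += size
    return {
        "total_files": sum(v["files"] for v in by_type.values()),
        "total_bytes": sum(v["bytes"] for v in by_type.values()),
        "by_type": dict(by_type),
    }
```

Grouping by extension is only a rough stand-in for file type, since extensions can be missing or wrong.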
Custodian Reports
Cases revolve around people, so the custodian list is critical, as is the ability to associate data with particular custodians. Custodian-level reports provide data volume by custodian. They can also show the percentage and/or volume of data culled by deduplication, searching, or other culling techniques. This information can be used to ensure that all data for a particular custodian has been delivered as the legal team prepares for depositions.
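As an illustration, a custodian-level summary could be computed as follows. This is a minimal sketch: the input shape, a list of `(custodian, size_bytes, culled)` tuples, and the function name `custodian_report` are hypothetical and would depend on what the processing system actually exports.

```python
from collections import defaultdict


def custodian_report(files):
    """Summarize total and culled volume per custodian.

    files: iterable of (custodian, size_bytes, culled) tuples, where
    `culled` is True if the file was removed by any culling step.
    """
    totals = defaultdict(lambda: {"total_bytes": 0, "culled_bytes": 0})
    for custodian, size_bytes, culled in files:
        totals[custodian]["total_bytes"] += size_bytes
        if culled:
            totals[custodian]["culled_bytes"] += size_bytes

    report = {}
    for custodian, t in totals.items():
        # Guard against division by zero for custodians with only empty files.
        pct = 100.0 * t["culled_bytes"] / t["total_bytes"] if t["total_bytes"] else 0.0
        report[custodian] = {**t, "culled_pct": round(pct, 1)}
    return report
```

A custodian who appears on the custodian list but not in this report would be a signal that data is missing.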
Data Culling Reports
Data culling is the process of segregating files in a collection based on specific criteria, prior to processing those files. When files are not being delivered back to the client, reports give the client a means to ensure that all data has been handled properly. To that end, deduplication reports include the name of the file, the location path of the file, and information regarding other instances of the same file. Depending on the size of the data collection, these reports can contain millions of entries and may be best provided in a database format.
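The fields of a deduplication report (file name, path, and other instances of the same file) can be sketched as below. The text does not say how duplicates are identified; this example assumes identical SHA-256 content hashes, and assumes the first instance encountered is the one retained. The names `file_digest` and `dedup_report` are illustrative.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedup_report(paths):
    """Build one report row per file: name, path, status, and other instances."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)

    rows = []
    for digest, members in groups.items():
        kept = members[0]  # assumption: first instance encountered is retained
        for path in members:
            rows.append({
                "name": os.path.basename(path),
                "path": path,
                "sha256": digest,
                "status": "kept" if path == kept else "duplicate",
                "other_instances": [p for p in members if p != path],
            })
    return rows
```

For collections with millions of entries, the rows would more sensibly be written to a database table than held in memory.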
Deduplication/Search/Filter Reports
Search and filter reports contain similar file-level information, such as name and path location, and also record the reason the file was segregated and the search term or piece of metadata to which the file was responsive. Search reports can also contain analysis of the search terms: the number of files responsive to each term and the number of times the terms appeared within the files. Search reports can be used to validate search term choices. In some cases, these reports have been used to renegotiate search terms when the agreed-upon terms did not yield an expected result, with either too many or too few files responsive to particular terms. These reports can be used to substantiate that each and every file contained in the source media has been handled appropriately.
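The per-term analysis described above (responsive file count and number of occurrences) might be computed along these lines. This is a sketch only: it assumes text has already been extracted from each file, matches whole words case-insensitively, and uses a hypothetical `search_term_report` function. Real search engines apply their own tokenization and syntax.

```python
import re


def search_term_report(documents, terms):
    """For each term, count responsive files and total occurrences.

    documents: dict mapping file path -> extracted text (hypothetical input shape).
    terms: iterable of literal search terms.
    """
    report = {}
    for term in terms:
        # Whole-word, case-insensitive literal match (an assumption of this sketch).
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits = {path: len(pattern.findall(text)) for path, text in documents.items()}
        responsive = {path: n for path, n in hits.items() if n > 0}
        report[term] = {
            "responsive_files": len(responsive),
            "total_occurrences": sum(responsive.values()),
            "files": sorted(responsive),
        }
    return report
```

A term with zero responsive files, or one that hits nearly every file, is a candidate for the kind of renegotiation mentioned above.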
Metrics
One of the biggest challenges in dealing with electronic data is estimating the volume when all that is known is the total GB to process. Since the overall volume will have a significant impact on the project as a whole, it is important to understand the factors that drive that estimate.
Means of measuring include:
Pages
In many cases, the overall review time and cost for a project can be determined by the total number of pages that will be reviewed and eventually produced. The more you know about the collection, the better this can be estimated. If you can break down the total volume into email data, application data, and non-printable data, you can get a more accurate estimate than you would from volume alone.
Number of Documents
The number of documents to be reviewed is another important driver of review effort, so estimating it can be valuable. Although there are quick ways to identify the number of documents in the collection, it is more challenging to quickly identify the documents that will be removed by the culling process.
Culling Rate
The amount of deduplication can vary greatly based on the nature of the data (backups, live data, or a combination), the scope of the deduplication (within or across custodians), and the custodians' retention habits. Searching/filtering is another important factor to consider when estimating the overall volume that will be delivered for review. Depending on the number of terms and the nature of the documents, the results can vary greatly.
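To see how culling rates compound, the sketch below applies a deduplication rate and then a searching/filtering rate to a starting volume. It assumes the rates are fractions of data removed and that each step applies to whatever survived the previous one; the function name and the example rates are hypothetical.

```python
def estimate_review_volume(total_gb, dedup_rate, filter_rate):
    """Estimate the GB remaining for review after two culling steps.

    Assumption: rates are fractions removed (0.25 means 25% culled) and
    are applied in sequence, deduplication first, then searching/filtering.
    """
    for name, rate in (("dedup_rate", dedup_rate), ("filter_rate", filter_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {rate}")
    after_dedup = total_gb * (1.0 - dedup_rate)
    after_filter = after_dedup * (1.0 - filter_rate)
    return {
        "start_gb": total_gb,
        "after_dedup_gb": after_dedup,
        "after_filter_gb": after_filter,
    }
```

For example, `estimate_review_volume(100, 0.25, 0.60)` leaves 75 GB after deduplication and about 30 GB after filtering. Because the rates multiply, an error in either one carries through to the final figure.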
Non-Printable Files
Non-printable files are documents that in general will not be delivered or reviewed. Therefore, it is important to exclude them from the document/GB/page estimates in order to yield more accurate results.
Industry Benchmark Survey
The table below lists some industry averages that can be used as a tool for guidance for estimating a document collection:
| Benchmark | High | Median | Low |
| --- | --- | --- | --- |
| Images[1] per GB | 78,671 | 47,213 | 18,534 |
| Images per file (email) | 11 | 4 | 2 |
| Images per file (app files) | 63 | 10 | 3 |
| Files per GB (email) | 36,530 | 22,572 | 9,934 |
| Files per GB (app files) | 20,305 | 15,791 | 7,553 |
| GB per custodian (email) | 5 | 2 | 1 |
| GB per custodian (app files) | 4 | 1 | 0 |
| Culling Rate Percentages | | | |
| Deduplication | 51% | 21% | 6% |
| Searching/Filtering | 64% | 61% | 23% |
| Non-printable files | 22% | 5% | 2% |
| Processing Speeds | | | |
| Process time per GB native | 117 | 33 | 11 |
| Process time per GB image | 35 | 32 | 23 |
| Process time to first deliverable | 53 | 35 | 21 |
| Process time by file type | 4 | 3 | 2 |
| Process time by file type | 6 | 4 | 3 |
| Process time by file type | 2 | 3 | 2 |
| Quality | | | |
| First pass quality yield %[2] | 57% | 78% | 73% |
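As a worked example, the median figures in the table above can be combined into a rough file and image estimate from known email and application-file volumes. This is a sketch for guidance only: the function and constant names are hypothetical, and the benchmarks are independent survey figures, so a per-file estimate will not necessarily agree with the "Images per GB" row.

```python
# Median benchmark values copied from the survey table above.
FILES_PER_GB = {"email": 22_572, "app": 15_791}
IMAGES_PER_FILE = {"email": 4, "app": 10}


def estimate_collection(email_gb, app_gb):
    """Estimate file and image counts from GB of email and application files."""
    volumes = {"email": email_gb, "app": app_gb}
    files = {kind: round(gb * FILES_PER_GB[kind]) for kind, gb in volumes.items()}
    images = {kind: files[kind] * IMAGES_PER_FILE[kind] for kind in files}
    return {
        "files": files,
        "images": images,
        "total_files": sum(files.values()),
        "total_images": sum(images.values()),
    }
```

For a custodian at the median (2 GB of email and 1 GB of application files), this gives 60,935 files and 338,486 images before any culling is applied.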
Paper-to-Electronic Estimate Conversion Table
| Boxes of Documents | Approximate Total Pages | Megabytes, Gigabytes, Terabytes |
| --- | --- | --- |
| 1 | 2,500 | 50 Megabytes |
| 10 | 25,000 | 500 Megabytes |
| 20 | 50,000 | 1 Gigabyte |
| 100 | 250,000 | 5 Gigabytes |
| 200 | 500,000 | 10 Gigabytes |
| 300 | 750,000 | 15 Gigabytes |
| 400 | 1,000,000 | 20 Gigabytes |
| 500 | 1,250,000 | 25 Gigabytes |
| 1,000 | 2,500,000 | 50 Gigabytes |
| 2,000 | 5,000,000 | 100 Gigabytes |
| 5,000 | 12,500,000 | 250 Gigabytes |
| 10,000 | 25,000,000 | 500 Gigabytes |
| 20,000 | 50,000,000 | 1 Terabyte |
| 40,000 | 100,000,000 | 2 Terabytes |
| 60,000 | 150,000,000 | 3 Terabytes |
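The conversion table follows a fixed ratio of roughly 2,500 pages and 50 megabytes per box, with 1,000 megabytes counted as a gigabyte and 1,000 gigabytes as a terabyte. A minimal sketch of that arithmetic, with a hypothetical function name, lets you estimate box counts that are not listed:

```python
PAGES_PER_BOX = 2_500  # from the table: 1 box is about 2,500 pages
MB_PER_BOX = 50        # from the table: 1 box is about 50 megabytes


def paper_to_electronic(boxes):
    """Return (pages, megabytes, human-readable size) for a number of boxes."""
    pages = boxes * PAGES_PER_BOX
    megabytes = boxes * MB_PER_BOX
    # The table uses decimal units: 1,000 MB = 1 GB and 1,000 GB = 1 TB.
    if megabytes >= 1_000_000:
        size = f"{megabytes / 1_000_000:g} TB"
    elif megabytes >= 1_000:
        size = f"{megabytes / 1_000:g} GB"
    else:
        size = f"{megabytes:g} MB"
    return pages, megabytes, size
```

These are approximations only; actual scanned size depends on resolution, color depth, and compression.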
Footnotes
- [1] Images are counted one per page, so a 4-page multi-page TIFF would count as 4 images.
- [2] The percentage of data that runs through without intervention or exception handling.
[updated Jan. 31, 2008]

