EDRM Production Standards Version 2

Lead author: Julie Brown (Vorys, Sater, Seymour and Pease LLP)

Updated April 25, 2014

The purpose of this document is to outline standards for production of ESI in discovery. The intent is for these standards to be easily communicated by attorneys at a meet and confer by referring to the category of production. The following definitions are provided regarding the forms of production (See the EDRM Production Guide for further clarification on the forms of production, http://edrm.net/resources/guides/edrm-framework-guides/production):

  • Native Format – Files are produced in the format in which they were originally created (Example: .docx produced in .docx; .pdf produced in .pdf, etc.)
  • Near-Native Format – Files are extracted or converted into another searchable format (Example: e-mails produced in .htm, .mht, or .rtf; Databases produced in .txt or .csv format)
  • Image (Near Paper) Format – Electronic files are converted to image format or paper is scanned to image format
  • Paper – Electronic files are printed to paper or paper files remain in paper format

The categories of production identified below are A, B, C, D and E. The descriptions of the standards are followed by a Quick Guide to Components of Productions A-D, a chart containing the Characteristics of Productions A-D and a chart containing the required metadata and other information fields. In addition to agreeing to one of these standards, the requesting party should tell the producing party which review tool it will be using. This information is needed to identify the components and formats required to load the information into a review tool successfully.

A. Native/Near-Native Production

E-mail, databases and proprietary files are produced in a near native format. Attachments and loose files are produced in native format. Only files requiring redaction are tiffed. Includes searchable text for redacted files.

  1. Each native/near-native file name matches the DocID. (E.g., DocID = ABC0000123; Filename = ABC0000123.doc for an MS Word document.)
  2. Each searchable native/near-native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (E.g., DocID = ABC0000123; Filename = ABC0000123.txt.)
  3. Each file requiring redaction has group IV single page tifs. Each file requiring redaction has a unique Bates number applied to images matching the DocID or Bates number. The same number may be applied to each page within a document, or the numbers can increment by page.
  4. OCR for redacted files is in multipage .txt format. Each file is named the same as the DocID/Bates number of the corresponding document. (E.g., Image Filename = ABC0000123.tif; OCR Filename = ABC0000123.txt.)
  5. Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
  6. Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
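
As an illustration of the naming rules in items 1 and 2, the following Python sketch derives the file names expected for a DocID and lists DocIDs that have no matching text file. The function names (expected_names, missing_text_files) are hypothetical and are not part of the EDRM standard, and the check assumes file names are compared without regard to folder.

```python
from pathlib import PurePath


def expected_names(doc_id, native_ext):
    """Return the file names implied by the DocID naming convention."""
    ext = native_ext.lstrip(".").lower()
    return {"native": f"{doc_id}.{ext}", "text": f"{doc_id}.txt"}


def missing_text_files(doc_ids, produced_files):
    """Return the DocIDs that have no <DocID>.txt file in the production."""
    text_stems = {
        PurePath(name).stem
        for name in produced_files
        if PurePath(name).suffix.lower() == ".txt"
    }
    return [doc_id for doc_id in doc_ids if doc_id not in text_stems]


# Hypothetical production listing: the second document lacks its text file.
produced = ["ABC0000123.doc", "ABC0000123.txt", "ABC0000124.xls"]
print(expected_names("ABC0000123", ".doc"))
print(missing_text_files(["ABC0000123", "ABC0000124"], produced))
```

A check of this kind could be run by either party before loading, since a missing text file usually surfaces later as a document that cannot be searched.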

B. Image (Near-Paper)/Native/Near-Native Production

Most files are converted to image format (tif, pdf, etc.), with the exception of files such as MS Excel spreadsheets that are not usable in image format, and/or paper is scanned to image format and OCR’d. Includes searchable text for redacted files.

  1. Most native/near-native files are converted to group IV single page tif. Each file has a unique Bates number applied to images matching the DocID or Bates number.
  2. Each searchable native/near-native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (E.g., DocID = ABC0000123; Filename = ABC0000123.txt.)
  3. Spreadsheets and files that are not usable in .tif format are produced in native or near-native format and named the same as the DocID. (E.g., DocID = ABC0000123; Filename = ABC0000123.xls for an MS Excel document.)
  4. OCR for redacted files is in multipage .txt format. Each file is named the same as the DocID/Bates number of the corresponding document. (E.g., Image Filename = ABC0000123.tif; OCR Filename = ABC0000123.txt.)
  5. Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
  6. Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
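
Where the data file is produced as a separate delimited text file (item 6), it might be written along the lines of the Python sketch below. The pipe delimiter, the subset of fields, the function name write_data_file and the sample record are all assumptions made for illustration; the actual delimiter and field list should follow whatever the parties and the receiving review tool require.

```python
import csv
import io

# A subset of the field names from the metadata chart, chosen for illustration.
FIELDS = ["DOCID", "CUSTODIAN", "FILENAME", "DOCEXT", "DOCLINK"]


def write_data_file(records, out, delimiter="|"):
    """Write one header row and one row per document to a text stream."""
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDS,
        delimiter=delimiter,
        restval="",             # fields that do not exist are left empty
        extrasaction="ignore",  # fields outside FIELDS are skipped
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)


# Hypothetical record; DOCLINK is omitted to show an empty field.
buf = io.StringIO()
write_data_file(
    [{"DOCID": "ABC0000123", "CUSTODIAN": "Doe, Jane",
      "FILENAME": "budget.doc", "DOCEXT": "doc"}],
    buf,
)
print(buf.getvalue())
```

Leaving a missing field empty rather than dropping the column keeps every row aligned with the header, which matches the "to the extent they exist" wording of the standard.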

C. Image Production

All files are converted to image format (tif, pdf, etc.) and/or paper is scanned to image format and OCR’d. Includes searchable text for redacted files.

  1. All native/near-native files are converted to group IV single page tif. Each file has a unique Bates number applied to images matching the DocID or Bates number.
  2. All images are black & white except for those that require color for interpretation. Color images are produced in .jpg format unless otherwise agreed.
  3. Container files such as .zip or .rar may be converted to .tif format with a table of contents or referenced in the “folder” field containing the path to the original native file as it existed at the time of collection.
  4. Each searchable native/near-native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (E.g., DocID = ABC0000123; Filename = ABC0000123.txt.)
  5. OCR for redacted files is in multipage .txt format. Each file is named the same as the DocID/Bates number of the corresponding document. (E.g., Image Filename = ABC0000123.tif; OCR Filename = ABC0000123.txt.)
  6. Load file(s) for image files, extracted text and OCR in EDRM xml or common format such as that required by Concordance or Summation.
  7. Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
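
Where Bates numbers increment by page, the numbers for a document's single page images can be derived from its first number. The Python sketch below shows one way to do this; the function names (page_bates_numbers, bates_range) are hypothetical, and it assumes a Bates number consists of a prefix followed by zero-padded digits, as in the ABC0000123 example.

```python
import re


def page_bates_numbers(begin_bates, page_count):
    """Return one Bates number per page, incrementing from begin_bates."""
    match = re.fullmatch(r"(.*?)(\d+)", begin_bates)
    if match is None:
        raise ValueError("Bates number must end in digits")
    if page_count < 1:
        raise ValueError("a document has at least one page")
    prefix, digits = match.groups()
    start, width = int(digits), len(digits)
    # Keep the zero padding of the first number for every page.
    return [f"{prefix}{start + i:0{width}d}" for i in range(page_count)]


def bates_range(begin_bates, page_count):
    """Return the (begin, end) pair for the BATES RANGE field."""
    pages = page_bates_numbers(begin_bates, page_count)
    return pages[0], pages[-1]


# A three-page document starting at ABC0000123.
for number in page_bates_numbers("ABC0000123", 3):
    print(number + ".tif")
print(bates_range("ABC0000123", 3))
```

The next document would then begin at the number following the end of this range, so that every page in the production carries a unique number.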

D. Custom

  1. Images, Load File, Data file and no searchable text
  2. Images only
  3. Paper
  4. Other

E. On-line Production

Files presented for production via online review tool. Formats, fields, loads and exports to be negotiated on a case by case basis.

Quick Guide to Components of Productions A-D

Production | Native | Near Native | Images | Extracted Text | OCR Text | Searchable Text for Redacted Files | Load File | Data File
A          | x      | x           | x      | x              | x        | x                                  | x         | x
B          | x      | x           | x      | x              | x        | x                                  | x         | x
C          | -      | -           | x      | x              | x        | x                                  | x         | x
D          | -      | -           | x      | -              | -        | -                                  | x         | x

Characteristics of Productions A-D

Characteristics A B C D
Increase costs for image conversion x x x
Increase turn around time for image conversion of majority of data set x x x
Increase cost and turn around time for OCRing redacted files x x x
Files are not searchable x
Files such as spreadsheets and small databases are not in a format conducive for review x x
Cannot individually number or endorse pages for document control x x
Cannot brand pages with confidentiality endorsements x x
Risk of accidental alteration is greater than with image format x x
Metadata may be hidden and not fully reviewed prior to production x x
May require native application or provision of client’s proprietary software to open files x x
Cost of conversion and printing  x
No link back to native file  x
No database or text for searching  x

Metadata and Other Information Fields

Fields Description
ATTACHMENTIDS DocIDs of attachment(s) to email/edoc. This can also be provided in an attachment range field.
AUTHORS Name of person creating document.
BATES RANGE Beginning and ending Bates number of a document if it differs from DocID; this can be provided in one Bates range field or 2 separate fields for the beginning and ending number.
BCC Names of persons blind copied on an email.
CC Names of persons copied on an email.
CUSTODIAN Name of person from whom the file was obtained.
DATECREATED Date document was created.
DATERECEIVED Date email was received.
DATESAVED Date document was last saved.
DATESENT Date email was sent.
DOCEXT Extension of native document.
DOCID Unique number assigned to each file or first page.
DOCLINK Full relative path to the current location of the native or near-native document used to link metadata to native or near native file.
FILENAME Name of the original native file as it existed at the time of collection.
FOLDER File path/folder structure for the original native file as it existed at the time of collection.
FROM Name of person sending an email.
HASH Identifying value of an electronic record – used for deduplication and authentication; hash value is typically MD5 or SHA1.
PARENTID DocID of the parent document.
RCRDTYPE Indicates document type, e.g., email; attachment; edoc; scanned; etc.
SUBJECT Subject line of an email.
THREAD ID Also known as conversation ID.  A unique number assigned to groups of emails from the same thread.
TIMERECEIVED Time email was received in user’s mailbox.
TIMESENT Time email was sent.
TO Name(s) of person(s) receiving email.
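
The HASH field above is described as a value used for deduplication and authentication, typically MD5 or SHA1. As a minimal sketch of how such values might be computed and compared, the Python below uses the standard hashlib module. The function names (hash_stream, file_hash, group_duplicates) and the sample DocIDs and contents are hypothetical, and a real processing tool may hash email messages differently (for example, on selected fields rather than on the whole file).

```python
import hashlib
import io


def hash_stream(stream, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a binary stream, read in chunks."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def file_hash(path, algorithm="md5"):
    """Return the hex digest of the file at path."""
    with open(path, "rb") as handle:
        return hash_stream(handle, algorithm)


def group_duplicates(contents, algorithm="md5"):
    """Group DocIDs whose content hashes are identical.

    contents maps DocID -> bytes; only groups of two or more are returned.
    """
    groups = {}
    for doc_id, data in contents.items():
        key = hash_stream(io.BytesIO(data), algorithm)
        groups.setdefault(key, []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical documents: the first two have identical content.
sample = {"ABC0000001": b"same", "ABC0000002": b"same", "ABC0000003": b"other"}
print(group_duplicates(sample))
```

For authentication, the receiving party can recompute the value with file_hash on a produced native file and compare it with the HASH field in the data file.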
  • Electronically Stored Information or ESI is information that is stored electronically on innumerable types of media regardless of the original format in which it was created.
  • Electronically Stored Information: this is an all inclusive term referring to conventional electronic documents (e.g. spreadsheets and word processing documents) and in addition the contents of databases, mobile phone messages, digital recordings (e.g. of voicemail) and transcripts of instant messages. All of this material needs to be considered for disclosure.
  • Delivering to others in appropriate forms & using appropriate delivery mechanisms.
  • Delivery of data or information in response to an interrogatory, subpoena or discovery order or a similar legal process.

30 comments to EDRM Production Standards Version 2

  • 13
    Michael Olig says:

    Thanks for response. My suggestions, both above and below, were provided with the assumption that the issues that this group is attempting to provide guidance on are best served by the development of a best practices standard. There will certainly be those who choose not to follow the standard and choices that will need to be addressed on a matter-by-matter basis, but that doesn’t mean that poor practices should be codified.

    1. My concern here is that, because document productions are more likely to involve both paper and electronic documents than not, the standard should address the paper component of productions in an explicit manner. The only mention of paper-based documents occurs in the preamble to the standards (e.g., “paper is scanned to image format” and “paper files remain in paper format”). Paper documents are not even alluded to in the standards themselves. I don’t understand the value in developing a production standard that largely ignores an often substantial component of productions.

    2. Sorry, I wasn’t clear on this point. I was referring to documents being converted to an image format for production purposes that require color for interpretation. You obviously don’t want to encourage the alteration of native files.

    4. Many processing tools have already implemented a “slip-sheet exception” method and I do not see a clear argument against suggesting it as the best practice standard.

    5. “ATTACHMENT” is an inappropriate and redundant RCDTYPE. The information conveyed by the RCDTYPE field serves to distinguish gross formats from one another. The most fundamental difference in format is between documents that were maintained in a paper format (“SCANNED”) and documents that were maintained in an electronic format (“EDOC”). “EMAIL” records have traditionally been considered to be distinct from “EDOC” records due to the unique differences in their relevant metadata information. Contrast that with the information conveyed by the term “ATTACHMENT”, which relates directly to a relationship possessed by the record and obscures the actual format information (i.e., an “ATTACHMENT” could be an “EMAIL” or an “EDOC”). Attachments are already identifiable through their family relationships, regardless of the actual method used to identify family relationships.

    6. In the short term, yes. However, allowing multiple review tools to dictate separate methods is the specific behavior that standards are intended to counter. Assuming a reasonable set of standards is developed, they are likely to see widespread adoption. It is hard for me to imagine a software developer who would choose to implement their own methods in place of a set of well-developed industry standards.

    7a. Numbering each page of an image document with the same Docid is a REALLY BAD IDEA. Docids serve the same purpose as Bates numbering; to provide a unique identifier for each and every component of a production, allowing parties the ability to confidently and easily refer to a specific location within a group of records. I expect most people to understand that it would be a bad idea to endorse each page of a 10,000-page, paper-based production with the same Bates number. The same idea applies to TIFF images.

    I feel strongly about the method I suggest in 7a of my original post, as it conveys an incredible amount of information in and of itself. The method I suggest inherently allows for: i) the unique identification of documents; ii) the unitization of multi-page documents without the requirement of external metadata; iii) the unique identification of pages within a multi-page document; iv) the production format; and, v) the identification of the relationship between a native document and any associated generated images.

    • 13.1
      Julie Brown says:

      Hi Michael!
      Thanks for your response. Once again I am delayed in my response. My apologies. I think one thing we want to keep in mind is we are not trying to set the end all, be all standards at this point but rather a minimum standard. This means that the standards have to offer some guidance but also allow for differences between service providers and software on the market. If we set a standard that half of the population can’t produce, no one will adopt it. The true goal of these standards is to provide attorneys with some guidelines to communicate regarding various formats. I would expect in the future (possibly 6 months-1 year) these standards will be updated.
      That said, I will add a section regarding scanning paper to images under the image production sections. As far as compression, the language for documents requiring colored images specifies only a format and not the compression. Are you suggesting that they should be specific enough to include the actual compression setting instead of just the format? Again, I think the slip sheet exception is optional. If I produce 100,000 documents with gaps for 100,000 duplicates, are you suggesting I have to produce 100,000 slip sheets? As far as the Record type, I will generalize the field definition. I believe there are limitations in some review tools that require this information at this point in time (a year from now may be different). As for 7a, I will remove the option to Bates stamp the pages the same. I will include a full renumber or extensions.

  • 12
    Michael Olig says:

    A few comments:

    1. I think paper-based documents should be considered in the development of these standards. My experience has been that few document productions are composed entirely of native files. As such, the standard’s methods should be as applicable to the production of scanned paper-based documents as to the production of native and imaged native documents.

    2. Regarding color images, I feel that LZW compression should be considered over JPEG compression. LZW compression is lossless and tends to yield smaller files than JPEG (though JPEG will yield higher compression ratios for complex or “noisy” documents, such as photographs).

    3. A standard resolution should be suggested for images (e.g., 300 PPI by 300 PPI).

    4. I agree with Brian Conrad’s suggestion of slip sheets for non-converted native files and exceptions. While eliminating numbering gaps, I feel that it also helps to eliminate the potential for confusion.

    5. I agree with Gillian Glass (and subsequently David Baldwin) that RCDTYPE should be simplified. Perhaps limiting the field to “SCANNED”, “EMAIL”, and “EDOC”?

    6. I agree with David Baldwin (and subsequently Thomas Bonk) that the PARENTID/ATTACHMENTID construct is a poor method. I would prefer a single FAMILY field containing the DocID range of the family group, with the lowest DocID being that of the parent.

    7a. I agree with David Baldwin that externalized document unitization should be eliminated; however, I disagree with his suggestion of the use of multi-page file formats. While single-page format is a relic of paper-based productions, as Thomas Bonk alluded to, multi-page file formats can cause severe performance issues in networked environments. I suggest the inclusion of such unitization information in the DocID itself (e.g., BATES000001_000001). Such a numbering structure is also far more appropriate for use with native documents (i.e., a native file would be identified as BATES000001 and its associated rendered images would be identified as BATES000001_000001, BATES000001_000002, etc.)

    7b. I completely disagree with Jay Spencer’s comments. A data file should always be produced with native files to ensure consistency among the parties. Similarly, parties should be encouraged to supply extracted text and OCR files for native files, which allows for consistent search capabilities in productions involving any combination of scanned documents, non-searchable native documents, and searchable native documents. A near-native (i.e., TIFFed native file) production format is preferable to a pure native format in that i) it allows for the petrifaction of native documents; ii) it allows for Bates and confidentiality endorsement; and iii) it negates the potential for issues caused by proprietary native document formats and software rendering differences. Additionally, production in a petrified (i.e., non-native) format is preferable not because attorneys might intentionally alter such a document, but simply because it is incredibly easy for someone to inadvertently and unintentionally alter such a document.

    • 12.1
      Julie Brown says:

      1. Paper that is scanned to image is covered in section C, image production. Paper produced as paper is a custom format under section D. Does this work?
      2. I think this is a processing standard instead of production. You would produce the files in the native format in which they currently exist be it LZW or JPEG.
      3. This can be added to the “group IV single page tif” language.
      4. I think this is certainly an option but not required.
      5. Is the recordtype “attachment” an issue?
      6. I think this may need to be an either or depending on the requesting party review tool requirements.
      7a. this is an option. See my response to David Baldwin’s comments.
      7b. See my response to Jay Spencer’s comment.

  • 11
    Jay Spencer says:

    First, if you are doing native files you don’t need a load file for Summation, I know for sure. Also, you don’t need to supply txt files for a native, that is just a waste of a client’s money. A near native production should never be considered, that is just a waste; almost all software can now handle native files. The true problem with native productions is that attorneys are worried that someone will alter that document, which is crazy. If an attorney did that they would be disbarred and sanctioned, so there is no need to worry, plus you can always ask the judge to order a hash value screening of the original or a SHA evaluation. Tiffs should be forgotten about. I know that most of the software is set up on them, but multi page pdfs are much more reasonable, especially for smaller firms that cannot afford lit support software. The most important thing is to push for native productions; too many files were never intended to be printed, e.g., Excel docs. This description makes something simple way too complicated.

    • 11.1
      Julie Brown says:

      Thanks for your comments Jay! Unfortunately it is nearly impossible to produce all files in native format, particularly e-mail. If you produce a pst or nsf both parties will have to process to load into a review tool. Because the processing of files varies between tools (embedded objects, container files, etc.), this could result in the files having different document IDs and the parties having different sets of documents. Additionally, it doesn’t make sense for both parties to incur processing costs on the same set of data. The native/near-native production standard eliminates these issues.
