PLEASE COMMENT 
Lead author: Julie Brown (Vorys, Sater, Seymour and Pease LLP)
Updated February 10, 2011
The purpose of this document is to outline standards for production of electronically stored information in discovery. The intent is for these standards to be easily communicated by attorneys at a meet and confer by referring to the category of production. The following definitions are provided regarding the forms of production (See the EDRM Production Guide for further clarification on the forms of production, http://edrm.net/resources/guides/edrm-framework-guides/production):
- Native Format – Files are produced in the format in which they were originally created (Example: .docx produced in .docx; .pdf produced in .pdf, etc.)
- Near-Native Format – Files are extracted or converted into another searchable format (Example: e-mails produced in .htm, .mht, or .rtf; Databases produced in .txt or .csv format)
- Image (Near Paper) Format – Electronic files are converted to image format or paper is scanned to image format
- Paper – Electronic files are printed to paper or paper files remain in paper format
The categories of production identified below include A1, A2, B1, B2, C1, C2, D and E. The descriptions of the standards are followed by a Quick Guide to Components of Productions A-D, a chart containing the Characteristics of Productions A-D and a chart containing the required metadata and other information fields. In addition to agreeing to one of these standards, the requesting party should tell the producing party which review tool they will be using. This information is needed to properly identify the components and formats required to successfully load the information into a review tool.
A. Native/Near-Native Production
E-mail, databases and proprietary files are produced in a near native format. Attachments and loose files are produced in native format. Only files requiring redaction are tiffed.
- Includes searchable text for redacted files:
- Each native/near-native file name matches the DocID. (E.g., DocID = ABC0000123; Filename = ABC0000123.doc for an MS Word document.)
- Each searchable native/near-native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (E.g., DocID = ABC0000123; Filename = ABC0000123.txt.)
- Each file requiring redaction has group IV single page tifs. Each file requiring redaction has a unique bates number applied to images matching the DocID or Bates number. The same number may be applied to each page within a document or the numbers can increment by page.
- OCR for redacted files is provided in multipage .txt format. Each file is named the same as the DocID/Bates number of the corresponding document. (E.g., Image Filename = ABC0000123.tif; OCR Filename = ABC0000123.txt.)
- Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
- Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
- Does not include searchable text for redacted files:
- Each native/near-native file name matches the DocID. (E.g., DocID = ABC0000123; Filename = ABC0000123.doc for an MS Word document.)
- Each searchable native/near-native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (E.g., DocID = ABC0000123; Filename = ABC0000123.txt.)
- Each file requiring redaction has group IV single page tifs. Each file requiring redaction has a unique bates number applied to images matching the DocID or Bates number. The same number may be applied to each page within a document or the numbers can increment by page.
- Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
- Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
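The naming conventions in A1 and A2 lend themselves to a simple check before a production is delivered. The following is a minimal sketch, not part of the standard: it assumes every delivered file is named `<DocID>.<extension>` and that extracted or OCR text is delivered as `<DocID>.txt`. The function name `find_missing_text` and the sample file names are hypothetical.

```python
from pathlib import PurePath


def find_missing_text(filenames):
    """Return the DocIDs of native/near-native files with no matching .txt file.

    Assumption (illustrative only): each name is '<DocID>.<ext>', and any file
    ending in .txt is an extracted-text or OCR file rather than a native file.
    """
    text_ids = set()
    native_ids = set()
    for name in filenames:
        path = PurePath(name)
        if path.suffix.lower() == ".txt":
            text_ids.add(path.stem)
        else:
            native_ids.add(path.stem)
    return sorted(native_ids - text_ids)


# Hypothetical production listing: the spreadsheet has no text file.
production = ["ABC0000123.doc", "ABC0000123.txt", "ABC0000124.xls"]
print(find_missing_text(production))  # ['ABC0000124']
```

A real production would also need to account for single-page images and for native files that are themselves .txt files, which this sketch ignores.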
B. Image (Near-Paper)/Native/Near-Native Production
Most files are converted to image format (tif, pdf, etc.), with the exception of files such as MS Excel spreadsheets that are not usable in image format, and/or paper is scanned to image format and OCR’d.
- Includes searchable text for redacted files:
- Most Native/near native files are converted to group IV single page tif. Each file has a unique bates number applied to images matching the DocID or Bates number.
- Each searchable native/near-native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (E.g., DocID = ABC0000123; Filename = ABC0000123.txt.)
- Spreadsheets and files that are not usable in .tif format are produced in native or near-native format and named the same as the DocID. (E.g., DocID = ABC0000123; Filename = ABC0000123.xls for an MS Excel document.)
- OCR for redacted files is provided in multipage .txt format. Each file is named the same as the DocID/Bates number of the corresponding document. (E.g., Image Filename = ABC0000123.tif; OCR Filename = ABC0000123.txt.)
- Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
- Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
- Does not include searchable text for redacted files:
- Most Native/near native files are converted to group IV single page tif. Each file has a unique bates number applied to images matching the DocID or Bates number.
- Each searchable native/near-native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (E.g., DocID = ABC0000123; Filename = ABC0000123.txt.)
- Spreadsheets and files that are not usable in .tif format are produced in native or near-native format and named the same as the DocID. (E.g., DocID = ABC0000123; Filename = ABC0000123.xls for an MS Excel document.)
- Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
- Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
C. Image Production
All files are converted to image format (tif, pdf, etc.) and/or paper is scanned to image format and OCR’d.
- Includes searchable text for redacted files:
- All Native/near native files are converted to group IV single page tif. Each file has a unique bates number applied to images matching the DocID or Bates number.
- All images are black & white except for those that require color for interpretation. Color images are produced in .jpg format unless otherwise agreed.
- Container files such as .zip or .rar may be converted to .tif format with a table of contents or referenced in the “folder” field containing the path to the original native file as it existed at the time of collection.
- Each searchable native/near-native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (E.g., DocID = ABC0000123; Filename = ABC0000123.txt.)
- OCR for redacted files is provided in multipage .txt format. Each file is named the same as the DocID/Bates number of the corresponding document. (E.g., Image Filename = ABC0000123.tif; OCR Filename = ABC0000123.txt.)
- Load file(s) for image files, extracted text and OCR in EDRM xml or common format such as that required by Concordance or Summation.
- Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
- Does not include searchable text for redacted files:
- All Native/near native files are converted to group IV single page tif. Each file has a unique bates number applied to images matching the DocID or Bates number.
- Each searchable native/near-native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (E.g., DocID = ABC0000123; Filename = ABC0000123.txt.)
- Load file(s) for image files, extracted text and OCR in EDRM xml or common format such as that required by Concordance or Summation.
- Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
D. Custom
- Images, Load File, Data file and no searchable text
- Images only
- Paper
- Other
E. On-line Production
Files are presented for production via an online review tool. Formats, fields, loads and exports are to be negotiated on a case-by-case basis.
Quick Guide to Components of Productions A-D
| Production | Native | Near Native | Images | Extracted Text | OCR Text | Searchable Text for Redacted Files | Load File | Data File |
|---|---|---|---|---|---|---|---|---|
| A1 | x | x | x | x | x | x | x | x |
| A2 | x | x | x | x | x | | x | x |
| B1 | x | x | x | x | x | x | x | x |
| B2 | x | x | x | x | x | | x | x |
| C1 | | | x | x | x | x | x | x |
| C2 | | | x | x | x | | x | x |
| D1 | | | x | | | | x | x |
| D2 | | | x | | | | | |
Characteristics of Productions A-D
| Characteristics | A1 | A2 | B1 | B2 | C1 | C2 | D1 | D2 | D3 |
|---|---|---|---|---|---|---|---|---|---|
| Increase costs for image conversion | x | x | x | x | x | x | x | ||
| Increase turn around time for image conversion of majority of data set | x | x | x | x | x | x | x | ||
| Increase cost and turn around time for OCRing redacted files | x | x | x | ||||||
| Files are not searchable | x | x | x | ||||||
| Files such as spreadsheets and small databases are not in a format conducive for review | x | x | x | x | x | ||||
| Cannot individually number or endorse pages for document control | x | x | x | x | |||||
| Cannot brand pages with confidentiality endorsements | x | x | x | x | |||||
| Risk of accidental alteration is greater than with image format | x | x | x | x | |||||
| Metadata may be hidden and not fully reviewed prior to production | x | x | x | x | |||||
| May require native application or provision of client’s proprietary software to open files | x | x | x | x | |||||
| Cost of conversion and printing | x | ||||||||
| No link back to native file | x | x | |||||||
| No database or text for searching | x | x |
Metadata and Other Information Fields
| Fields for email (Not All Inclusive) | Description |
|---|---|
| ATTACHMENTIDS | Docids of attachment(s) to email/edoc |
| BATES RANGE | Begin and end bates number of a document if it differs from DocID; this can be provided in one bates range field or 2 separate fields for the beginning and ending number |
| BCC | Names of persons blind copied on an email |
| CC | Names of persons copied on an email |
| CUSTODIAN | Name of person from whom the file was obtained |
| DATERECEIVED | Date email was received |
| DATESENT | Date email was sent |
| DOCEXT | Extension of native document |
| DOCID | Unique number assigned to each file or first page |
| DOCLINK | Full relative path to the current location of the native or near-native document used to link metadata to native or near native file |
| FILENAME | Name of the original native file as it existed at the time of collection |
| FOLDER | File path/folder structure for the original native file as it existed at the time of collection |
| FROM | Name of person sending an email |
| HASH | Identifying value of an electronic record – used for deduplication and authentication; hash value is typically MD5 or SHA1 |
| PARENTID | DocId of the parent document |
| RCRDTYPE | Indicates document type, e.g., email; attachment; edoc; scanned; etc. |
| SUBJECT | Subject line of an email |
| TIMERECEIVED | Time email was received in user’s mailbox |
| TIMESENT | Time email was sent |
| TO | Name(s) of person(s) receiving email |
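The HASH field above is described as typically MD5 or SHA1 and as being used for deduplication. As a rough illustration only, the sketch below computes an MD5 value with Python's standard `hashlib` module and groups DocIDs that share a hash. The function names `md5_of_stream` and `group_duplicates` are hypothetical, and hashing the whole file is an assumption; processing tools may hash e-mail differently (for example, from selected fields).

```python
import hashlib
import io
from collections import defaultdict


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(hash_by_docid):
    """Group DocIDs by hash value and keep only hashes seen more than once."""
    groups = defaultdict(list)
    for docid, value in hash_by_docid.items():
        groups[value].append(docid)
    return {value: sorted(ids) for value, ids in groups.items() if len(ids) > 1}


# Hypothetical in-memory documents standing in for collected files.
documents = {
    "ABC0000123": b"quarterly budget",
    "ABC0000124": b"quarterly budget",
    "ABC0000125": b"meeting notes",
}
hashes = {docid: md5_of_stream(io.BytesIO(body)) for docid, body in documents.items()}
print(group_duplicates(hashes))
```

For files on disk, the same function can be called with a handle opened in binary mode, e.g. `open(path, "rb")`.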
| Fields for edocs & Attachments (Not All Inclusive) | Description |
|---|---|
| ATTACHMENTIDS | DocIds of attachment(s) to email/edoc |
| AUTHORS | Name of person creating document |
| BATES RANGE | Begin and end bates number of a document if it differs from DocID; this can be provided in one bates range field or 2 separate fields for the beginning and ending number |
| CUSTODIAN | Name of person from whom the file was obtained |
| DATECREATED | Date document was created |
| DATESAVED | Date document was last saved |
| DOCEXT | Extension of native document |
| DOCID | Unique number assigned to each file or first page |
| DOCLINK | Full relative path to the current location of the native or near-native document used to link metadata to native or near native file |
| DOCTITLE | Title given to native file |
| FILENAME | Name of the original native file as it existed at the time of collection |
| FOLDER | File path/folder structure for the original native file as it existed at the time of collection |
| HASH | Identifying value of an electronic record – used for deduplication and authentication; hash value is typically MD5 or SHA1 |
| PARENTID | DocId of the parent document |
| RCRDTYPE | Indicates document type, e.g., email; attachment; email attachment (email); edoc; scanned; etc. |
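The standard allows the metadata to be produced as a separate delimited text file. The sketch below shows one possible way to write such a file using the edoc field names from the chart above. The pipe delimiter, the function name `write_data_file`, and the sample record (including its custodian and file names) are assumptions made for illustration; the parties would agree on the actual delimiter and field list.

```python
import csv
import io

# Field names taken from the edocs & attachments chart; order is illustrative.
EDOC_FIELDS = [
    "DOCID", "PARENTID", "ATTACHMENTIDS", "BATES RANGE", "CUSTODIAN",
    "AUTHORS", "DOCTITLE", "FILENAME", "FOLDER", "DOCEXT", "DOCLINK",
    "DATECREATED", "DATESAVED", "HASH", "RCRDTYPE",
]


def write_data_file(records, stream, delimiter="|"):
    """Write one header row and one row per record; missing fields are left empty."""
    writer = csv.DictWriter(
        stream,
        fieldnames=EDOC_FIELDS,
        delimiter=delimiter,
        restval="",             # fields "to the extent they exist"
        extrasaction="ignore",  # drop keys that are not in the agreed field list
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Hypothetical record for a single loose file.
records = [{
    "DOCID": "ABC0000123",
    "CUSTODIAN": "Sample Custodian",
    "FILENAME": "budget.doc",
    "DOCEXT": "doc",
    "DOCLINK": "NATIVES/ABC0000123.doc",
    "RCRDTYPE": "edoc",
}]
buffer = io.StringIO()
write_data_file(records, buffer)
print(buffer.getvalue())
```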

I think there needs to be further clarification on exactly where the data is mined for the DATECREATED and DATESAVED fields (i.e., Date Modified and Date Created maintained by the operating system? Created or Modified from the file’s metadata properties for Office files? In Adobe products, the metadata field names are slightly different, etc.).
Hi Thomas! Thanks for your comment and apologies for my delayed response. I think this is something we should address in Processing standards as opposed to production standards. EDRM Evergreen is attempting to develop standards this year for each of the nodes.
Password Protected (PP) Native Files
With respect to password protected (PP) native files and the current draft EDRM production standard A (i.e., a Native Format production), is there a consensus as to whether native files should be produced in their PP state (a literal interpretation of “native” file)? If so, should the password for the native PP file be provided as metadata, e.g., if it was obtained through an automated recovery process? Should decrypted native files be produced if responsive? This consideration will depend, in part, on how PP exception files are handled during processing. If no attempt is made during routine processing to decrypt PP native files, during the review process PP native files would likely not be identified as responsive. In most instances, a reviewer would be unlikely to base a responsiveness determination solely on the metadata of a PP file, such as original path name and/or original file name.
Related question: During routine ESI processing, should a decryption attempt be made on all PP native files?
Confidential Native Files
With respect to production of confidential native files, it should be sufficient to label the media with an appropriate “CONFIDENTIAL” designation and actually treat the information as confidential by applying encryption and using appropriate information security protocols.
Even if a responding party were to assert that the content of encrypted, PP native files is “not reasonably accessible” under FRCP Rule 26(b)(2)(B), by virtue of such ESI being too costly and burdensome to decrypt, a judge may shift to the requesting party the cost to attempt decryption of just certain PP files (e.g., those PP native files that are attributable to “key” custodians) if that party demonstrates good cause. A production standard for PP native files should accommodate this scenario.
Thanks Marcus! I agree passwords should be produced if pp files are being produced and that they would not be produced if they can’t be reviewed prior to production. If a judge gets involved in shifting the attempted decryption it would be an exception to the rule and therefore not addressed in the standards. If we add too many exceptions to the standards I’m concerned they will become unwieldy.
As for confidential designations, I believe this can vary based on the jurisdiction/venue and the requirements in a protective order. I’m not sure we can control how these are handled.
Comments on the source and a couple from other comments:
1. I notice that all the image production formats are specified as single-page TIFF. I would argue for at least the inclusion of multi-page TIFF or multi-page PDF (even searchable). I would prefer that the standard try to move away from single-page, which in my view is a paper-centric anachronism; in the e-world, we should have one document, one file. Externalizing the document unitization is an artifact of the first review products, and is no longer necessary even for those products.
2. Can we move away from the “load file” terminology? What I need is a readable data file that contains all data related to a numbered document entity. I should not need nor should I expect the other side to provide me with a format (like .lfp or .opt or .dii) that is tailored for whatever product I might be using. A “model” should be propounding open, generic formats. EDRM XML qualifies, but is not terribly user friendly, and the fact is we have to inspect and hack these things all the time. A text-based delimited counterpart (this is, in fact, what the Concordance .dat format is) would be a good idea. MS Access works and embodies many of the right qualities (plus more!), but is, yes, a proprietary product.
3. I think it might be a good idea to decide whether “same number to each page” (people do that?) or “increment by page” are allowed in the same production. The way the standard reads now, this would be allowable but probably not desirable.
4. From the data modeling perspective, the parentid/attachmentids construct is not optimal. A single field (can be in addition to or instead of these two) that identifies a family/message unit/attachment group is much more manageable and more correct. Lowest DocID in the family=parent.
5. We probably have to define a set of values for RCRDTYPE; otherwise there is no way to standardize/normalize what people might decide to use for “record type” data values.
6. I think it might be necessary to spell out that archive file contents need to be broken out to individually identifiable entities for purposes of production identification. The .eml and .msg formats are essentially archive formats. I might be keeping them this way on my system, but I am not identifying to you the individual documents they contain, which is the standard.
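The single family field suggested in item 4 can be derived from the existing PARENTID values. The sketch below is only one way to do it, assuming PARENTID is empty for a top-level document and that the family identifier is the DocID of the top-level parent; the function name `assign_family_ids` and the sample DocIDs are hypothetical.

```python
def assign_family_ids(parent_by_docid):
    """Map each DocID to the DocID of its top-level parent (its family ID).

    parent_by_docid maps DocID -> PARENTID, with '' or None for a document
    that has no parent. Nested attachments are followed up to the top level.
    """
    family = {}
    for docid in parent_by_docid:
        root = docid
        seen = {root}
        while parent_by_docid.get(root):
            root = parent_by_docid[root]
            if root in seen:
                raise ValueError(f"circular parent reference at {root}")
            seen.add(root)
        family[docid] = root
    return family


# Hypothetical e-mail with two attachments, followed by a loose file.
records = {
    "ABC0000001": "",
    "ABC0000002": "ABC0000001",
    "ABC0000003": "ABC0000001",
    "ABC0000004": "",
}
print(assign_family_ids(records))
```

When DocIDs are assigned sequentially with the parent first, the family ID produced this way is also the lowest DocID in the family, as the comment proposes.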
Regarding item 1, the problem is that too many litigation software applications either do not support multi-page TIFs at all (e.g., Relativity) or have severe performance limitations when using them.
Regarding item 2, the main problem with promoting an XML standard format is that there are no standard tools available to support its creation and validation, as there are for ASCII delimited fielded files.
Regarding item 4, I whole-heartedly agree with David’s commentary on why a single attachment group field is more desirable.
1. I think single page tifs are still used for performance. When doing a document review, if you land on a 50+ page tif it takes a while to come up. Single page tifs render more quickly than multi-page.
2. Unfortunately many of the review tools still require some form of load file so I don’t think we can get rid of the terminology just yet. The EDRM xml is a step in the right direction but many people still have tools or older versions of tools that don’t support xml. I hope someday the xml file will become the standard.
3. If native files are converted to tif for production purposes you can renumber the production set with new sequential numbers for each page. The downside to doing this is the attachment information no longer matches the new docids. To avoid this, the strategy may be to use the original docid. If so, incrementing each page can be a challenge. You could use the Docid followed by a page number (e.g., ABC00001-001, ABC00001-002, etc.). An alternative would be to number each page with the same docid.
4. I’m not sure on this but would like to hear from others on how this should be structured. Maybe this depends on the review tool being used? Should it be either or?
5. Recordtype values currently include: email; attachment; edoc
6. I’m not sure I understand this one. It sounds like it might be a processing standard? In the production standard e-mail is produced in near-native format.
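The two page-numbering options described in item 3 can be sketched as follows. This is an illustration only: the function name `page_numbers`, the hyphen separator, and the three-digit page suffix are assumptions drawn from the ABC00001-001 example rather than requirements of the standard.

```python
def page_numbers(docid, page_count, increment=True, width=3):
    """Return the number to endorse on each page of one document.

    increment=True  -> DocID plus a zero-padded page suffix (ABC00001-001, ...)
    increment=False -> the same DocID repeated on every page
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if increment:
        return [f"{docid}-{page:0{width}d}" for page in range(1, page_count + 1)]
    return [docid] * page_count


print(page_numbers("ABC00001", 3))
# ['ABC00001-001', 'ABC00001-002', 'ABC00001-003']
print(page_numbers("ABC00001", 2, increment=False))
# ['ABC00001', 'ABC00001']
```

Either option keeps the original DocID intact, so the PARENTID and ATTACHMENTIDS values continue to match.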
Seems to me that the Pros and Cons chart is not appropriately named. Does the “x” indicate a Pro, or a Con? Perhaps, rather than “Pros and Cons” it should be simply “Characteristics.” Especially if the “intent is for these standards to be easily communicated.”
Agreed! I will update. Thanks for the suggestion.
For those who are interested, the LiST group in the UK published a draft Data Exchange Protocol several years ago that went a little further but is, I suggest, still relevant. Be aware that it is in two parts, one of which is deliberately (and necessarily) technical. I have been lucky to be part of various of LiST’s working groups.
http://www.listgroup.org/publications.htm
Thanks Jonathan! I wasn’t aware of these. I plan on reviewing them in detail and will get back with my comments.
Will there be a standard for confidentiality designations in native productions? It’s common to have protective orders requiring the level of confidentiality be attached to each doc. Sometimes it’s added to the file name, but we see cases where files with different confidentiality levels are produced in different folders or where the producing party sends a list. A standard would be helpful.
Nitpicking the metadata fields:
1. It appears that the default is to have one field for Bates range and one field for attachment ranges. There are several programs that don’t deal well with having beginning and ending IDs in the same field. We see beginning and ending Bates and attachments in separate fields most of the time.
2. Might help to clarify the descriptions of the doclink, filename and folder fields. I presume the doclink field is the path to the native file as produced, but is the filename field the produced doc (with the docid name) or the original filename?
3. If the goal is to make a standard that’s easy for attorneys to understand perhaps field names such as RCRDTYPE could be simplified.
Hi Gillian!
Agreed it would be helpful to develop standards in this area. I’ve never put the confidential designation in the title of the document but it sounds like others have. Our current standard would be to convert files requiring confidential stamps to image or produce them on media containing a confidentiality label. One of the biggest reasons attorneys tell me they need stamps (instead of labeling media) is that if the document is printed out it won’t contain the confidential designation. I don’t think changing the title will fix this either unless the document contains a header/footer and the doctitle is set to print.
The bottom line is that native/near native productions are not stamp friendly. I would guess there are numerous work-arounds people may use to try and do this without converting to image but I don’t know which method would be right for the standard.
My hope is that someone will develop a way for us to stamp native files in the future. I would think then we would have a standard. I would like to hear others’ thoughts on whether we can add a standard at this time.
As for the metadata fields.
1. I agree and can add that the standard should be one (Bates Range) or two fields (Begbates, Endbates).
2. We can certainly clean these up. How’s this?:
DOCLINK: Full relative path to the current location of the native document used to link metadata to native or near native file.
FILENAME: Name of original native file.
FOLDER: File path for the original native file as it existed at the time of collection.
3. Agreed. We can certainly make the field names more descriptive.
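Supporting both the one-field (Bates Range) and two-field (Begbates, Endbates) layouts from item 1 is mostly a matter of splitting and joining a string. The sketch below is a hypothetical helper pair, assuming the single field separates the two numbers with " - " (space, hyphen, space) so that hyphens inside a Bates number are not mistaken for the separator; the actual separator would be whatever the parties agree on.

```python
def split_bates_range(bates_range, separator=" - "):
    """Split a one-field Bates range into (BEGBATES, ENDBATES).

    A value with no separator is treated as a one-page range.
    """
    begin, found, end = bates_range.strip().partition(separator)
    begin = begin.strip()
    end = end.strip() if found else begin
    return begin, end or begin


def join_bates_range(begbates, endbates, separator=" - "):
    """Combine separate begin/end fields into a single Bates range field."""
    if begbates == endbates:
        return begbates
    return f"{begbates}{separator}{endbates}"


print(split_bates_range("ABC0000123 - ABC0000125"))  # ('ABC0000123', 'ABC0000125')
print(split_bates_range("ABC0000123"))                # ('ABC0000123', 'ABC0000123')
print(join_bates_range("ABC0000123", "ABC0000125"))   # ABC0000123 - ABC0000125
```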
Thanks so much Gillian! I really appreciate your comments.
Well Done! Some additional topics:
*Standard use of Substitution Pages or Slip Sheets for files that are not converted to TIFF intentionally or as exceptions.
*TIFF representation of container files such as Zips. Some apps attempt to create a TIFF page with a Table of Contents listing as a placeholder. Not having a TIFF representation of a zip may be troublesome especially as an email attachment where parent/child relationships are maintained.
Thanks Brian! I think I need to understand a little bit more about the pages/slip sheets for files not tiffed and exceptions. Is the reason for doing this to avoid gaps in numbering? If so, should this be a standard or an option/consideration?
With the container files, I think one option would be to tif the container files with a table of contents as you suggest or to maintain the name of the zip file in the doclink/filepath field. Does that sound ok? If so, I can add a section in the Image Production section for container files.
What about PDFs which are part of the native production? PDFs are a challenging case because they may contain both image, vector and textual information, layers, multimedia, etc. If PDFs are part of the original production, are they intended to stay in that format? Converting richer PDFs to TIFF will result in data loss.
Hi Rick!
In a native production, PDFs would be produced in .pdf format with all layers, etc. Any conversion or alteration of a file would not be consistent with a native production. In an image production, PDFs are converted to single page tif. Does this help clarify?
Thanks for your comments! Much appreciated!
I believe Near Native format is meant for productions in which preview files are used rather than files that have attachments (msg, eml) or file containers (zip, rar). Since the client is already receiving the attachments as a separate record, many of them prefer to receive an html version of the email rather than an msg, which could double or triple the total size of their deliverable.
Hi Fausto!
On the EDRM Production website, near-native is defined as: Files are extracted or converted into another searchable format. It goes on to say:
“Some files, including most e-mail, cannot be reviewed for production and/or produced without some form of conversion. Most e-mail files must be extracted and converted into individual files for document review and production. As a result, the original format is altered and they are no longer in native format. There is no standard format for near-native file productions. Files are typically converted to a structured text format such as .html or xml. These formats do not require special software for viewing. Other common e-mail formats include .msg and .eml.”
I agree that we typically see (and I personally prefer) e-mail converted to a structured text format such as .htm or .mht. I think there is still a debate whether .msg is native or near-native. The question I have is should .msg format be a standard for production or should the standard be limited to structured text files.
In the above comment Mark mentioned that we didn’t define near native in the document. Would that help clarify?
Thanks for your comment Fausto! I really appreciate it!
Excellent work Julie! Just a few comments, and let us know if you could use any help.
1. Consider addressing color in Image production. Files containing color are usually converted to JPG. This, however, can be problematic for some applications where image viewer does not support JPG. In those cases, color TIFFs are used. Of course, color tiffs are very large and undesirable.
2. Near Native is not defined here (although it may be somewhere else). I presume you are speaking of email being produced in MSG format where a native would be the entire PST. Consider clarification.
3. I wonder if the standard should not go a bit further and address use of native. As you know, even files produced in native with doc id in file name are not useful when printed for use. You show that as a disadvantage. Would it be possible to write a standard process for Native “inspection” and the “Tiff Production” for use in proceeding?
4. This may be wishful thinking, but there are of course new lit sup apps hitting the market every day. Consider writing a single load file format into the standard rather than mentioning Concordance, Summation, etc. The EDRM XML, for example. Much time and effort is spent by service providers, software companies and law firms being proficient in multiple load file formats. The receiving party then only need build an XML conversion for their specific application.
Thanks for your comments Mark! I have outlined my thoughts on each item below:
1. I will add color images as an additional consideration to the standards.
2. I will add a near native definition to the standards.
3. I think what you are asking for here are standards for presenting ESI. I agree we need standards in this area and would like to see them developed as “Presentation Standards”. If I got this wrong and you meant something different please let me know.
4. I included EDRM XML format in the description. We could remove the reference to Concordance and Summation but they are still consistently used in the industry. In fact, I am asked to deliver one of these formats on a regular basis and haven’t been asked to deliver an EDRM XML file yet. I do agree that in the future I would hope we would have one standard load file but I don’t think we are there yet. I would love to hear what others think about this issue.
Thanks again for your comments. I will revise the standards over the next couple of weeks to include your suggestions.
Julie,
I concur with all of the comments applauding you for your tremendous work here! You have organized and written about a complex subject matter in a very logical and concise manner, it’s excellent!
Following Mark Walker’s earlier comments and your response below, I too am one of the masses who would love to see a standard litigation database or production load file format be encouraged, mandated and/or enforced, but agree that we’re probably not there yet. That being said, Mark’s idea, Item # 4, is a great one, and along those same lines, perhaps we could consider outlining the requirements for what I would consider to be three main production load file format categories: (1) flat file database load file formats (e.g., Concordance, Summation, IPRO); (2) relational database load file formats (e.g., Ringtail, Relativity); and (3) web-based or web-compatible production load file formats, such as the EDRM .xml or similar web supported file formats, since each of the three contains significant structural differences from the others.
Again, you are doing a fantastic job with this much needed initiative, so thank you again.
Steph