EDRM Evergreen/Processing/Output Preparation

Comments: Please submit comments to the EDRM Evergreen Processing forum

Once the target dataset is selected, electronic data can be transformed into a variety of formats for review and subsequent production, or for transfer to another service provider in the review cycle. The formats may include various “flavors” of native documents; renderings of native documents in standard formats such as html, xhtml or xml; or converted image formats.

Native Documents

We mention “flavors” of native documents because a native document can in fact come in a variety of formats. For example, a typical e-mail can be encountered in MSG format, which is a single e-mail container that holds the message and its attachment(s), or in PST or NSF format, which are collections of e-mails and their associated attachments.

During output preparation, decisions need to be made regarding the format of exported native documents. If e-mail is exported in a container format, it will likely be necessary to provide cross-reference files with item-level metadata and tracking information for the items within the containers (see metadata formatting below), because a single exported PST file may represent thousands of exported items. It may also be important to provide detailed information regarding parent-child relationships between e-mails and attachments in the e-mail population. One-to-one metadata and tracking information is more straightforward for loose office files (as opposed to e-mail attachments) such as Word documents, Excel spreadsheets, PowerPoint presentations, etc.
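
As an illustration of the cross-reference idea, here is a minimal sketch in Python that writes one row per exported item and records which container it came from and which e-mail, if any, is its parent. The field names, the `ITEM-` identifiers and the `ExportedItem` type are hypothetical and chosen only for this example; actual cross-reference layouts vary by provider and review platform.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExportedItem:
    item_id: str              # hypothetical tracking ID assigned during processing
    container: str            # container file the item was exported in, e.g. a PST
    parent_id: Optional[str]  # item_id of the parent e-mail; None for a top-level item
    kind: str                 # "email" or "attachment"
    name: str                 # subject line or attachment file name


def build_cross_reference(items):
    """Return CSV text with one row per exported item, including its parent."""
    known_ids = {item.item_id for item in items}
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ItemID", "Container", "ParentID", "Kind", "Name"])
    for item in items:
        # Refuse to emit a child row whose parent is missing from the export.
        if item.parent_id is not None and item.parent_id not in known_ids:
            raise ValueError(f"{item.item_id} refers to unknown parent {item.parent_id}")
        writer.writerow(
            [item.item_id, item.container, item.parent_id or "", item.kind, item.name]
        )
    return buffer.getvalue()


items = [
    ExportedItem("ITEM-0001", "custodian1.pst", None, "email", "Q3 forecast"),
    ExportedItem("ITEM-0002", "custodian1.pst", "ITEM-0001", "attachment", "forecast.xls"),
]
print(build_cross_reference(items))
```

The parent check is a design choice for the sketch: a broken parent-child link is reported at export time rather than discovered later in review.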

“Rendered” E-mail

NOTE: Some service providers will render e-mails in html or xml format, which can give the appearance of a native view of an e-mail. Such a rendering is arguably more akin to an imaged version of an office document, in that it is typically an artifact of litigation processing. However, unlike an imaged version of a native document, an html rendering of an e-mail is editable.

If e-mail is exported in a “rendered” format (html, xml), cross-referencing of metadata and tracking information is more straightforward than with an exported container format. However, it is still important to cross-reference parent-child information in this type of e-mail export (often referred to as an “unpacked” export).

Image Formats

Often electronic documents are processed to an image format such as TIFF (Tagged Image File Format, usually represented by a file extension of .tif) or PDF (Portable Document Format, usually represented by a file extension of .pdf). These image formats have been fairly common because they can be used as both a review format and a production format that cannot be altered (pdf files may be searchable, depending on the form used). In addition, imaged versions of native documents can be redacted for attorney-client privilege, privacy issues or non-relevance.

To accommodate an efficient review, these image formats are often accompanied by associated files or database records containing the text of the document. When used in a document review system, a link can tie together the document’s image files, text, metadata information and native document.

[COMMENT by Gene Eames: perhaps we move the discussion (set apart below by begin and end comments) of native review and production, and the discussion of “native application” vs. “native viewer”, to the review node?]

An alternative to converting electronic documents to images at the outset of a review project is to perform an initial review of documents in their native format. It has been estimated that 80% or more of the reviewed material is deemed irrelevant to the legal matter, so converting everything up front wastes conversion fees. If a converted format is preferred for production, this approach enables the review team to convert only what is relevant, privileged or otherwise to be produced. (See Review Node.) To accommodate native file review, many service and software processing providers have developed technologies that let reviewers review native files after the metadata has been preserved and linked to the document. Some allow native files to be opened and viewed in their native application, while others allow documents to be viewed using viewer technology. The choice of technology must be determined by the requirements of the case and its processing constraints (scope and schedule).

In addition to native file review, there has been recent interest in native file productions. Some regulatory bodies have been requiring productions to be made in native format. Under the new FRCP, production formats will be determined during the meet and confer session. Determining the production format up front may influence the review format needed. Additionally, the new requirement that files be made available consistent with the manner in which they were maintained can be interpreted to support a native file production, or at least to require a mechanism that allows a party to trace back to the native document as it was originally maintained. Review and production formats will be decided by an analysis of the pros and cons of different methods in response to the needs of the specific matter. (See Review and Production Nodes.)

When converting native documents to image formats, it is important to understand how the rendering is performed. The goal of creating an image of an electronic document is to render the document in a non-modifiable form that allows all document contents to be reviewed. In processing documents to an image format, some software and service providers use viewer technology to determine the rendering of the file. Viewer technology allows a variety of application files to be viewed without using the native applications. This can be useful in avoiding significant application license fees and in increasing the speed with which the contents of a file can be viewed. These efficiencies are gained at the expense of completeness: no viewer renders all of the underlying application data. Other software and service providers use the native applications to render the information contained within the file. However, even when native applications are used to render images, care must be taken to configure the rendering properly to ensure that the content is rendered completely. User-created information can be nested within native files in ways that are not immediately apparent to the reader.

For example, comments can be stored in a Word document, but the print mechanism can be set to include or exclude them. Similarly, comments in Excel spreadsheets may not easily be seen without specifically configuring the print mechanism to include those items. Also, in spreadsheets, entire worksheets of a workbook may be hidden or protected. It is crucial to unhide and unprotect this information to reveal all the contents within the file for review purposes. Frequently, users protect files or components of files (e.g., sheets or cells in a spreadsheet). It is important to unprotect such files by cracking passwords. This process must occur prior to the application of any culling strategies, including any textual search, if the responsive dataset is to be complete. Files that are protected and are not successfully cracked should be segregated and reported.

Once the image format has been created, the images can be delivered along with the text that has been extracted for each file and its metadata information. This information is usually provided with load files to allow the data to be loaded into a document review system for review. Each image in the collection must have a unique identifier, typically a Bates Number or other document ID. This information can also be packaged for production by a processing software or service provider. Production sets can include images with Bates Numbers for tracking purposes, various endorsements based on the specific case matter, or native files with their associated metadata and extracted text preserved.
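
The unique-identifier requirement can be sketched as follows. This Python example assigns a consecutive range of Bates numbers to each document based on its page count. The prefix `ABC`, the seven-digit width and the one-number-per-page convention are assumptions made for illustration; numbering schemes are set per matter.

```python
def bates_numbers(prefix, start, page_counts, width=7):
    """Yield (first, last) Bates numbers for each document, given its page count."""
    next_number = start
    for pages in page_counts:
        if pages < 1:
            raise ValueError("every imaged document needs at least one page")
        first = f"{prefix}{next_number:0{width}d}"
        last = f"{prefix}{next_number + pages - 1:0{width}d}"
        yield first, last
        # The next document starts right after this document's last page.
        next_number += pages


# Three documents of 3, 1 and 12 pages, numbered from ABC0000001.
ranges = list(bates_numbers("ABC", 1, [3, 1, 12]))
for first, last in ranges:
    print(first, last)
```

Because each range begins where the previous one ended, no two pages share a number and gaps in a production can be detected.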

Metadata / Extracted Text

Whether the output format of the original data is native or image, there is usually a need to also export extracted metadata and extracted text. Metadata is usually extracted and recorded from the native files and e-mail during the cataloging and extraction process. Text may be extracted during that process, or during the output preparation stage. Extracted text may be important in a number of scenarios. For example, in a native file output with associated metadata, the review application may require that extracted text accompany the metadata to enable full-text searching of the native documents. For image-formatted output, it is very likely that extracted text will be required for the downstream review application.

Metadata

Typical metadata accompanying output documents or images includes information about the original native documents (modified date, original file and path names, file size, e-mail participants, e-mail subjects, etc.). Metadata may also include process tracking information to ensure the ability to trace back to original files and to record the process history of documents.

Metadata may also include information related to parent-child relationships amongst review items.

Extracted Text

Textual information extracted from processed documents can be of various types, including unformatted text extracted directly from native documents; formatted text extracted directly from native documents; or text derived from print-image renderings of the native documents.

Note: When extracting text from native documents or images, it is important to consider the configuration of the extraction process, and whether it has accounted for hidden or protected parts of the original documents.

Note: If any redaction of images takes place after the creation of metadata records and extracted text records, it may be necessary to also redact the pertinent portions of the metadata and text, or it may be prudent to re-extract text from the redacted image version of the document prior to production.

Format of Metadata and Extracted Text Output

Once the metadata and text are captured and accumulated for output, the information needs to be written in a specific output format. Those formats may include textual CSV files, XML, or proprietary database load formats.
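
To make the CSV and XML options concrete, here is a minimal Python sketch that writes the same metadata record in both formats using only the standard library. The field names (`DocID`, `ParentID`, `FileName`, `ModifiedDate`, `TextPath`) and the XML element names are hypothetical; real load-file layouts are dictated by the receiving review application.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical field list; TextPath points at the extracted-text file.
FIELDS = ["DocID", "ParentID", "FileName", "ModifiedDate", "TextPath"]


def to_csv(records):
    """Return the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records):
    """Return the records as an XML string, one <document> element each."""
    root = ET.Element("documents")
    for record in records:
        doc = ET.SubElement(root, "document", id=record["DocID"])
        for field in FIELDS[1:]:
            ET.SubElement(doc, field).text = record[field]
    return ET.tostring(root, encoding="unicode")


records = [
    {
        "DocID": "DOC-0001",
        "ParentID": "",
        "FileName": "Budget, final.doc",
        "ModifiedDate": "2007-11-30",
        "TextPath": "text/DOC-0001.txt",
    },
]
print(to_csv(records))
print(to_xml(records))
```

Using a CSV writer rather than joining strings by hand matters here: the comma inside the file name is quoted automatically, so the row still parses into five fields.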

[Updated Jan. 9, 2008]
