EDRM Processing Standards Guide, version 0.1

Last updated March 10, 2015

DRAFT – This is a draft document. Please comment.

EDRM’s Processing Standards Guide continues to be open for public comment.

Comments can be added at the bottom of this page, sent to us via email at mail@edrm.net, or sent to us by filling out our comment form.

Introduction

This guide addresses considerations and concerns that arise when one processes data from an electronic storage device into an e-discovery database. This guide is a resource for anyone who would like to use the processing stage of e-discovery to streamline review and improve analysis of information in the database. From it you should learn more about the process of processing data for e-discovery and better be able to ask questions whose answers will help you improve your approach to this challenging aspect of e-discovery.

There are various tools available to process data for e-discovery, such as Ipro’s eCapture and eScan-IT, LexisNexis’s LAW PreDiscovery and Early Data Analyzer, kCura’s Relativity Processing, and Nuix’s eDiscovery Workstation and Enterprise Collection Center, to name a few. While this guide is meant to be software-agnostic, it does draw heavily on examples from kCura’s system.

Contributors

Virus Protection

When processing data, it is important to remember that you are opening your system to outside files. This creates a level of vulnerability to viruses. Anti-virus programs that monitor new files exist, but they can conflict with processing tools. There are various methods to keep files virus-free while still allowing processing to occur.

One approach is to keep your systems separate from each other. Process data on one system, then work with the processed data on a second, separate system.

Because anti-virus programs generally alter data as they remove or disable viruses, using anti-virus software during processing can compromise the integrity of the data being processed. Therefore a second approach is to avoid using virus protection before or during processing. Instead, use virus protection only when moving files from processing servers to other locations. If you are unable to turn off anti-virus programs make sure to perform backups before copying or moving data.

Container Files

A container file is a single file that contains one or more other files. Container files are used for various purposes, including as a way to transport multiple files in one package. Often it is simpler to use a container file than to attach many files to an email message or to transport multiple files individually. Examples of container files include ZIP, RAR and PST files.

When considering how to process container files, it is useful to keep the following considerations in mind:

  • Numbering – While container files may be important for day-to-day processes, they might not be used when producing documents and therefore they do not need to be included as part of a family group.
  • Placeholders – Container files such as ZIP or RAR files are a way to package groups of other files. Often it is not necessary to know that the files were grouped together, and there is no data associated with the container file itself, only with the files within it. For that reason the database often holds no reference to the container file, just the files it contained. If you want a reference to the container file, to record the grouping or original storage format of the other files, a placeholder can be created.
  • Size of data – Costs associated with processing data can quickly become substantial; generally this happens when data sizes increase after container files are uncompressed during processing. Therefore it is important to know whether processing fees are calculated based on the sizes of files before they are uncompressed or after.
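
The gap between compressed and uncompressed size can be checked before processing starts. The following is a minimal sketch using Python’s standard zipfile module; the sample archive and its file names are made up for illustration, and other container types such as RAR or PST would need different tooling.

```python
import io
import zipfile


def container_sizes(zip_bytes: bytes) -> tuple[int, int]:
    """Return (compressed_total, uncompressed_total) in bytes for a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        entries = archive.infolist()
        compressed = sum(entry.compress_size for entry in entries)
        uncompressed = sum(entry.file_size for entry in entries)
    return compressed, uncompressed


def build_sample_zip() -> bytes:
    """Build a small in-memory ZIP with highly compressible, made-up content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("memo.txt", "quarterly report " * 5000)
        archive.writestr("notes.txt", "meeting notes " * 5000)
    return buffer.getvalue()


if __name__ == "__main__":
    packed, expanded = container_sizes(build_sample_zip())
    print(f"compressed: {packed} bytes, uncompressed: {expanded} bytes")
```

Comparing the two totals for a sample of the collection gives a rough idea of how much the data will grow, which is useful when fees are based on expanded size.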

Metadata Fields

Microsoft Office files have hundreds of associated metadata fields. Different file types contain different fields of metadata. Determining which fields to process can be a daunting task. There is a list of suggested fields later in this document. It is common to see standard fields processed but there are many others that might be helpful. Only by actually processing files can the range of available metadata fields be determined, suggesting that some form of sampling be used to identify which metadata fields ought to be processed for a particular project or matter. To the extent a meet-and-confer process is provided for, as part of that process it can be beneficial to discuss what metadata will be exchanged.

Numbering or Item Identification

During processing, individual files are extracted from container files, and attachments and embedded objects are extracted from individual files. Each extracted item is saved as a separate file.

Each extracted file is assigned a unique numeric or alpha-numeric identifier (referred to as DocId). Document identifiers typically consist of an alphabetical prefix followed by a unique number. The prefix might be the client name, custodian name or initials, but other protocols are used as well.

Although attachments and embedded objects are extracted as separate files, those separate files link back to their parent files via the parents’ DocIds.

Original filenames are copied to a filename field, and native files are renamed with their unique DocIds.

When TIFF images are generated from native or near-native files during processing, information needs to be generated and saved that can be used to maintain the relationships between the TIFF images and the native or near-native files.
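
The numbering scheme described in this section can be sketched as follows. This is an illustration only: the prefix "ABC", the seven-digit width, the Item structure and the file names are all hypothetical, and real processing tools apply their own conventions.

```python
import os
from dataclasses import dataclass
from itertools import count
from typing import Iterator, Optional


@dataclass
class Item:
    doc_id: str                    # unique identifier, e.g. "ABC0000001"
    original_filename: str         # preserved in a filename field
    parent_doc_id: Optional[str]   # None for top-level items


def doc_id_generator(prefix: str, width: int = 7) -> Iterator[str]:
    """Yield DocIds made of an alphabetical prefix and a zero-padded number."""
    for number in count(1):
        yield f"{prefix}{number:0{width}d}"


def native_name(item: Item) -> str:
    """Name of the native file after it is renamed with its DocId."""
    _, extension = os.path.splitext(item.original_filename)
    return item.doc_id + extension


ids = doc_id_generator("ABC")
parent = Item(next(ids), "status update.msg", None)
child = Item(next(ids), "budget.xlsx", parent.doc_id)  # attachment links to its parent

print(parent.doc_id, child.doc_id, child.parent_doc_id, native_name(child))
```

Storing the parent’s DocId on each child is what allows family groups to be rebuilt later, even though every item lives in the database as a separate file.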

Deduplication

It is common to have multiple copies of a file in a dataset. To minimize duplicative work, secondary copies can be removed so that only one copy remains. Selected features of a file are used to compute an identifier that serves as the file’s fingerprint. Any variation between files, down to a single character, results in a different identifier. This identifier is the file hash. For further information on file hashes, see the sections on Hash Values and File Hash Analysis Basics, below.

The NIST List

Many system files are standard files installed by programs such as Microsoft Windows or Office. Because of this, these files have known hash values and can be easily identified and removed from any given data set based on those values.

The National Institute of Standards and Technology (NIST) (www.nist.gov) compiles a list of hash values for files of these types. NIST does this with a sub-project called the NSRL or National Software Reference Library (http://www.nsrl.nist.gov), generally referred to by e-discovery practitioners as “the NIST List”:

The National Software Reference Library (NSRL) is designed to collect software from various sources and incorporate file profiles computed from this software into a Reference Data Set (RDS) of information. The RDS can be used by law enforcement, government, and industry organizations to review files on a computer by matching file profiles in the RDS. This will help alleviate much of the effort involved in determining which files are important as evidence on computers or file systems that have been seized as part of criminal investigations.

The RDS is a collection of digital signatures of known, traceable software applications. 1

This list can be used to remove common system files presumed to be irrelevant to almost all litigation matters.

DeNISTing decreases the amount of time needed to process ESI, removes irrelevant files from the set of usable documents, and can dramatically reduce the size of any given collection.

How The NIST List Is Used in E-Discovery Processing

Systems that process data for e-discovery use the NIST List’s digital signatures and known hash values to identify system and application files and segregate them from user-generated files.

The NIST List, in conjunction with the file signatures, is typically used with eDiscovery program databases to compare file signatures of collected data for discovery purposes. Any file matching one in the NIST List is “de-NISTed” – that is, excluded and not processed or analyzed any further.
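
The comparison step can be pictured as a simple set-membership test. The sketch below is a simplified illustration, not a description of how any particular tool works: the known-hash set is a hypothetical stand-in for a reference list, the file names and contents are made up, and the choice of SHA-1 as the digest is an assumption.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as uppercase hexadecimal."""
    return hashlib.sha1(data).hexdigest().upper()


def de_nist(files: dict[str, bytes], known_hashes: set[str]):
    """Split files into (kept, excluded) by membership in known_hashes."""
    kept, excluded = {}, {}
    for name, content in files.items():
        target = excluded if sha1_hex(content) in known_hashes else kept
        target[name] = content
    return kept, excluded


# Hypothetical stand-in for a reference list of known system-file hashes.
SYSTEM_FILE = b"bytes of a stock system library"
KNOWN_HASHES = {sha1_hex(SYSTEM_FILE)}

COLLECTED = {
    "kernel32.dll": SYSTEM_FILE,
    "contract_draft.docx": b"bytes of a user-authored document",
}

kept, excluded = de_nist(COLLECTED, KNOWN_HASHES)
print(sorted(kept), sorted(excluded))
```

Because the match is on the hash rather than the file name or extension, a renamed system file is still excluded, while a user file that merely carries a .dll extension is kept.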

Difficulties

Though using the NIST List will remove many non-usable files, it is not comprehensive. Because of the large number of applications already in existence and the ever-growing number of new ones being developed, many program files are not yet included in the NIST List. This can mean that a de-NISTed data set will still contain non-usable files.

The number of overlooked files also is a function of each individual system as individual systems can be configured with many different versions of operating systems and program files.

Hash Values

A computer file’s digital signature can be viewed as a digital fingerprint, also known as a hash value. For practical purposes, every distinct file has a unique hash value. If two files have the same hash value, they are considered to be duplicates.

In today’s digital world, most software applications contain hundreds, often thousands, of files known as system files. These are common files and are easily identified by their consistent and known hash values. Typical examples of system files are:

  • Dynamic Link Library; Microsoft shared library – .dll
  • Executable files – .exe
  • Command files containing commands to be issued to the operating system – .com

By contrast, common examples of user-generated files – files often of potential interest – include:

  • Microsoft Word documents – .doc, .docx
  • Microsoft Excel spreadsheets – .xls, .xlsx
  • Microsoft PowerPoint presentations – .ppt, .pptx
  • Microsoft Outlook e-mail files – .pst
  • Lotus Notes e-mail files – .nsf

File Hash Analysis Basics

MD5 HASH

MD5 (message-digest algorithm 5) is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, typically expressed as a 32-digit hexadecimal number. It was designed by Ronald Rivest in 1991. It is often called an electronic fingerprint because, for practical purposes, it identifies any stream of data or file. The odds of any two given files having the same MD5 value by chance are 1 in 2^128, which is, more graphically, roughly 1 in 340,282,366,920,938,000,000,000,000,000,000,000,000. When two files have matching MD5 values, there is therefore an extremely high degree of confidence that the contents of the two files are identical.

The idea behind this algorithm is to take arbitrary data (text or binary 2) as input and generate a fixed-size “hash value” as output. The input data can be of any size or length, but the size of the output “hash value” is always fixed.

An MD5 hash is a 32-digit hexadecimal number, such as the following:

A Sample MD5 Hash:

e4d909c290d0fb1ca068ffaddf22cbd0

For practical purposes this hash is unique to each file’s content, irrespective of the file’s size and type. That means two different .exe files will not have the same MD5 hash even though they are of the same type and size. The MD5 hash can therefore be used to identify a file.

Characteristics of a hash value are:

  • It is deterministic; the hash value which is generated for a given message remains the same no matter how many times it is calculated
  • It returns a bit string of specific size (the hash value)
  • It is easy to compute the hash value for any given message
  • It is not feasible to generate a message that has a given hash value
  • It is not feasible to change a message without changing the hash value
  • It is not feasible to find two different messages with the same hash value

What Affects the MD5 Hash?

If a single character were to change and the data were fed back through the MD5 hash algorithm, the resulting hash value would change as well. Any change to the content of the document has this effect.
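
The effect of a one-character change is easy to demonstrate with Python’s standard hashlib module. The sketch below hashes a well-known test sentence with and without its final period; the sample MD5 hash shown earlier in this section is the hash of the version with the period.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of text (encoded as UTF-8) as 32 hex digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


without_period = md5_hex("The quick brown fox jumps over the lazy dog")
with_period = md5_hex("The quick brown fox jumps over the lazy dog.")

print(without_period)  # 9e107d9d372bb6826bd81d3542a419d6
print(with_period)     # e4d909c290d0fb1ca068ffaddf22cbd0
```

The two inputs differ by a single character, yet the two hash values have nothing visibly in common, which is why a hash comparison is a reliable test of whether content has changed.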

Virtually any non-malicious change to a file will cause its MD5 hash value to change; therefore the MD5 hash is used to verify the integrity of files. Typically, MD5 is used to verify that a file has not been changed as a result of a faulty file transfer, a disk error or any other type of change to the file. The following example illustrates this point.

File Name: File Hash Analysis.docx
File Path: C:\Users\rrostas\Documents\Documentation\File Hash Analysis.docx
Created Date: 12/3/2013 8:41:55 AM
Last Accessed: 12/3/2013 8:41:55 AM
Last Modified: 12/3/2013 8:41:55 AM
File Size: 19932
CRC32 Digest: 3B6B26C2
MD5 Hash: E6A0A941658254D152AE405BAEA9EA1C

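The hash value above was computed for the document in its draft phase, as saved at the Last Modified time shown. After the document was edited and saved again, the file size, CRC32 digest and MD5 hash all changed: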

File Name: File Hash Analysis.docx
File Path: C:\Users\rrostas\Documents\Documentation\File Hash Analysis.docx
Created Date: 12/3/2013 8:41:55 AM
Last Accessed: 12/3/2013 8:48:43 AM
Last Modified: 12/3/2013 8:48:43 AM
File Size: 19957
CRC32 Digest: 0C348099
MD5 Hash: 7248BEB15FA633E1A8524DA062D45F73

External/Internal Metadata

It is a common misconception that moving a file by drag and drop or cut and paste will change the file’s hash value. The hash value does not change because the file itself has not been internally manipulated; that is, the file was not opened or internally accessed and altered. What changes with cut and paste or drag and drop is the system metadata. System metadata fields include date and time accessed, date and time modified, and date and time created. These can change without affecting the MD5 hash, because they are maintained by the file system rather than stored inside the file.
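
The independence of the hash from system metadata can be checked directly. The sketch below, which uses only the Python standard library, writes a temporary file, changes its accessed and modified timestamps, and compares the MD5 hash before and after; the file name and contents are made up for illustration.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str) -> str:
    """Return the MD5 digest of the file's content as hexadecimal."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def timestamp_change_demo() -> tuple[bool, bool]:
    """Return (hash_unchanged, modified_time_changed) for a temporary file."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")
        with open(path, "wb") as handle:
            handle.write(b"contents that stay the same")
        hash_before = md5_of_file(path)
        mtime_before = os.stat(path).st_mtime
        # Move the accessed and modified times back by one day (system metadata only).
        one_day_earlier = mtime_before - 86400
        os.utime(path, (one_day_earlier, one_day_earlier))
        hash_after = md5_of_file(path)
        mtime_after = os.stat(path).st_mtime
    return hash_before == hash_after, mtime_before != mtime_after


if __name__ == "__main__":
    print(timestamp_change_demo())
```

The modified time reported by the file system changes while the hash does not, because the hash is computed only from the bytes inside the file.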

We have to be careful with other metadata fields that are internal to certain file types. An MS Word document is one example. Its internal metadata fields are stored within the document itself, so changes to them can change the MD5 hash. These fields include “Modified” and “Accessed” dates, which typically should duplicate the metadata maintained by the operating system. The “Creation” date can be and often is different, because a Word document will keep its internal creation date even when the file is copied to a new name. Other internal fields include revisions, versioning, template utilization, “Printed”, “Last saved by”, “Revision number” and “Total editing time”. These are just a sample; they are listed here to show the numerous internal fields that can change and affect the MD5 hash value of the file. The hash value will not change just from opening the file. An edit of some sort must be made and saved to change the hash value.

Hash Value Creation

Loose files

Relativity calculates the SHA256 hash in a standard way—all the bits and bytes that make the content of the file are involved in hash calculation. Metadata is excluded from the hash value for loose files. Relativity then compares this hash to other loose files to identify duplicates.

The following is the standard method for computing a checksum for large and small files:

  1. Open the file.
  2. Read 8k blocks from the file.
  3. Pass each block into an MD5/SHA1/SHA256 collator, which uses the corresponding standard algorithm to accumulate the values until the final block of the file is read. The final checksum is derived.
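
The three steps above can be sketched with Python’s standard hashlib module. This is a minimal illustration, not any vendor’s implementation; the function name and the uppercase hexadecimal output are assumptions.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 8 * 1024  # read the file in 8k blocks, as in step 2


def file_checksum(path: str, algorithm: str = "sha256") -> str:
    """Accumulate a digest block by block and return it as uppercase hex."""
    digest = hashlib.new(algorithm)  # e.g. "md5", "sha1" or "sha256"
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(BLOCK_SIZE), b""):
            digest.update(block)
    return digest.hexdigest().upper()


def checksum_demo() -> tuple[str, str]:
    """Hash a temporary 20,000-byte file in blocks and in one pass."""
    data = bytes(range(256)) * 78 + b"x" * 32  # 20,000 bytes, spans three blocks
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "loose_file.bin")
        with open(path, "wb") as handle:
            handle.write(data)
        blockwise = file_checksum(path)
    return blockwise, hashlib.sha256(data).hexdigest().upper()


if __name__ == "__main__":
    print(checksum_demo())
```

Reading in fixed-size blocks keeps memory use constant regardless of file size, and the result is identical to hashing the whole file at once.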

Emails

The Processing engine generates four different SHA256 hashes:

  • Body hash – takes the text of the body of the e-mail and generates a hash
  • Header hash – takes the message time, subject, author’s name and e-mail, and generates a hash
  • Recipient hash – takes the recipient’s name and emails and generates a hash
  • Attachment hash – takes each SHA256 hash of each attachment and hashes the SHA256 hashes together

The following is the process for computing Email HeaderHash:

  • A Unicode string containing Subject<crlf>SenderName<crlf>SenderEMail<crlf>ClientSubmitTime is constructed
  • A SHA256 hash is derived from the above
  • ClientSubmitTime is formatted with: m/d/yyyy hh:mm:ss AM/PM

The following is a constructed string: RE: Your last email<crlf>Robert Simpson<crlf>robert@kcura.com<crlf>10/4/2010 05:42:01 PM

The following is the process for computing Email RecipientHash:

  • A Unicode string is constructed by looping through each recipient in the email and inserting each recipient into the string
  • Once the loop completes, the SHA256 hash is computed from the string RecipientName<space>RecipientEMail<crlf>

The following is an example of a constructed recipient string of two recipients: Russell Scarcella rscarcella@kcura.com<crlf>Kristen Vercellino kvercellino@kcura.com<crlf>

The following is the process for computing Email MessageBodyHash:

  • If the PR_BODY tag is present in the MSG, capture it into a Unicode string
  • If the PR_BODY tag is not present, get the native body from the PR_RTF_COMPRESSED tag and either convert the HTML or the RTF to Unicode text
  • Construct a SHA256 hash from the above string

The following is the process for computing Email AttachmentHash:

  • Compute the loose file standard SHA256 hash from each attachment
  • Encode the hash in a Unicode string as a string of hexadecimal numbers without <crlf> separators
  • Construct a SHA256 hash from the composed string

The following is an example of constructed string of two attachments: 80D03318867DB05E40E20CE10B7C8F511B1D0B9F336EF2C787CC3D51B9E26BC9974C9D2C0EEC0F515C770B8282C87C1E8F957FAF34654504520A7ADC2E0E23EA

In all email scenarios, the following is the process for deriving a SHA256 from a Unicode string:

  • The string is converted to a byte array of UTF8 values
  • The resulting array of bytes is fed to a standard SHA256 subroutine which computes the SHA256 hash of the UTF8 byte array
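
The email hashing steps described in this section can be approximated as follows. This is a hedged sketch only: the function names are hypothetical, the header string is assumed to begin with the subject (as the example string suggests), and the exact byte layout, hexadecimal casing and separators used by any real processing engine may differ, so these values should not be expected to match a vendor’s output.

```python
import hashlib
from datetime import datetime

CRLF = "\r\n"


def sha256_of_string(text: str) -> str:
    """Convert the string to UTF-8 bytes and return its SHA256 as uppercase hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest().upper()


def format_submit_time(moment: datetime) -> str:
    """Format a time as m/d/yyyy hh:mm:ss AM/PM."""
    return f"{moment.month}/{moment.day}/{moment.year} {moment:%I:%M:%S %p}"


def header_hash(subject: str, sender_name: str, sender_email: str, submit_time: datetime) -> str:
    parts = [subject, sender_name, sender_email, format_submit_time(submit_time)]
    return sha256_of_string(CRLF.join(parts))


def recipient_hash(recipients: list[tuple[str, str]]) -> str:
    # One "RecipientName<space>RecipientEMail<crlf>" entry per recipient.
    return sha256_of_string("".join(f"{name} {email}{CRLF}" for name, email in recipients))


def attachment_hash(attachments: list[bytes]) -> str:
    # Hash each attachment, join the hex digests without separators, hash again.
    digests = [hashlib.sha256(content).hexdigest().upper() for content in attachments]
    return sha256_of_string("".join(digests))


if __name__ == "__main__":
    sent = datetime(2010, 10, 4, 17, 42, 1)
    print(format_submit_time(sent))
    print(header_hash("RE: Your last email", "Robert Simpson", "robert@kcura.com", sent))
    print(recipient_hash([("Russell Scarcella", "rscarcella@kcura.com")]))
    print(attachment_hash([b"first attachment", b"second attachment"]))
```

Because each component is hashed separately, a tool can decide which components must match for two emails to count as duplicates, for example ignoring recipients while still requiring identical bodies and attachments.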

Global

Global deduplication involves comparing hash values for incoming documents against all other documents present in a database. Different software tools use different fields for hashing files; therefore results can vary depending on the tool used and the settings selected.

The advantage of this approach is that no duplicates remain anywhere in the database.

Custodial

Custodial deduplication differs in that it involves comparing hash values for incoming documents only against documents in the database for the selected custodian. Different software tools use different fields for hashing files; therefore results can vary depending on the tool used and the settings selected.

This method can leave one copy of a document with each custodian who held it, but it will not leave multiple copies within any single custodian’s data.
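
The difference between the two scopes comes down to the key used to decide whether a document has been seen before. The sketch below is a simplified illustration; the custodian names, DocIds and hash values are made up, and real tools track considerably more information.

```python
def deduplicate(documents: list[tuple[str, str, str]], scope: str = "global") -> list[str]:
    """Return the DocIds kept after deduplication.

    documents holds (custodian, doc_id, hash_value) tuples in load order.
    scope is "global" (hash alone) or "custodial" (hash within a custodian).
    """
    seen = set()
    kept = []
    for custodian, doc_id, hash_value in documents:
        key = hash_value if scope == "global" else (custodian, hash_value)
        if key in seen:
            continue  # a copy with this key is already in the database
        seen.add(key)
        kept.append(doc_id)
    return kept


# Hypothetical load: the same file (hash "h1") held twice by Alice and once by Bob.
LOAD = [
    ("Alice", "D1", "h1"),
    ("Alice", "D2", "h1"),
    ("Bob", "D3", "h1"),
    ("Bob", "D4", "h2"),
]

print(deduplicate(LOAD, "global"))     # ['D1', 'D4']
print(deduplicate(LOAD, "custodial"))  # ['D1', 'D3', 'D4']
```

Under global deduplication Bob’s copy (D3) is dropped because Alice’s copy was loaded first; under custodial deduplication it is kept, and only Alice’s second copy (D2) is removed.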

Time Zone Considerations

One of the fundamental characteristics of Electronically Stored Information (ESI) is time zone. Most electronic data 3 is stored in UTC (Coordinated Universal Time 4). The user’s operating system uses regional settings on the user’s system to convert the UTC time to the user’s local time zone. In order to avoid discrepancies caused by custodians who travel between multiple time zones, or projects with custodians in multiple time zones, normalization 5 is needed.

Consider for a moment what would happen if we were to process data under different time zones.

Two employees (Custodian A and Custodian B) are key subjects in a lawsuit. Custodian A resides in New York, and Custodian B resides in Los Angeles. Their laptops are forensically imaged 6, and their data is processed for Relativity hosting. The Houston, TX based attorney instructs the processing team to handle the data in the “time zone for the custodian”. Without normalization, this instruction will cause significant problems in determining timelines of communications for emails sent to and from the custodians, which in turn may affect the review and production of the processed data.

During deduplication, date and time metadata are key fields. In the above example, if Custodian A and Custodian B have copies of the same email sent from a third party on Sunday, March 9, 2014 at 7am UTC, the two copies would not deduplicate. The metadata for Custodian A’s email was extracted using Eastern time zone settings (3am EDT, or UTC -4). The metadata for Custodian B’s email was offset to Pacific time zone settings (Saturday, March 8 at 11pm PST, or UTC -8). If date and time metadata are used to identify duplicates, the two copies would be seen as unique. The result is two copies of the same email showing different received dates and times. To complicate matters further, the email was sent at the time of year when the change to Daylight Saving Time occurs in most of the US. If the daylight saving offset were not included in the offset calculation, an additional mismatch would occur. Because not all states and countries observe daylight saving time, another layer of complexity exists.
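
The example above can be reproduced with Python’s standard datetime module. The sketch uses the fixed offsets given in the text (UTC -4 and UTC -8) rather than a time zone database, so it does not model the daylight saving transition itself; the helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets taken from the example in the text.
EDT = timezone(timedelta(hours=-4), "EDT")
PST = timezone(timedelta(hours=-8), "PST")

sent_utc = datetime(2014, 3, 9, 7, 0, tzinfo=timezone.utc)
copy_a = sent_utc.astimezone(EDT)  # as extracted for Custodian A
copy_b = sent_utc.astimezone(PST)  # as extracted for Custodian B


def as_displayed(moment: datetime) -> str:
    """Local date and time with the offset dropped, as stored in a plain date field."""
    return moment.strftime("%Y-%m-%d %H:%M")


def normalized(moment: datetime) -> str:
    """Date and time converted back to UTC before comparison."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")


print(as_displayed(copy_a), "|", as_displayed(copy_b))  # 2014-03-09 03:00 | 2014-03-08 23:00
print(normalized(copy_a), "|", normalized(copy_b))      # 2014-03-09 07:00 | 2014-03-09 07:00
```

If the local values feed into the duplicate hash, the two copies look different, even falling on different calendar days; once both are normalized to a single time zone they compare as equal.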

Next, consider the impact on the review. Not only do you have multiple instances of the same document that survived deduplication, but without a standard or normalized date/time field the review team cannot reliably run searches or sort documents for the purposes of creating a chronology. Finally, perhaps the most critical but often overlooked consideration is uniformity between parties. If, for example, one side decided to process and subsequently produce data in PST (Pacific Standard Time), and the other side decided to process data in EST (Eastern Standard Time), the production sets would have a three-hour discrepancy, leading to confusion and possible discovery protocol disputes.

Having a normalized and standard time zone for all data processed is a critical aspect of data processing, but that is not to suggest that other time zones cannot also be displayed in the review environment or that another time zone cannot be used as the base time zone for processing. Both of these options can and should be explored depending on the matter; however, there are a few things to consider in both situations.

  • If additional time zone offsets are displayed during the review, it is important that the review team understand which time zone is/will be displayed on any images for production. It is important that a single time zone is selected so that a chronology can be created across custodians/time zones easily.
  • In some cases, both sides agree to process and produce data in a time zone other than UTC, and that is perfectly acceptable. Remember that date/time information is stored in UTC; it is simply the workstation settings that offset the files to a particular time zone. Consider a case where the subject company and all of its employees are based in New York, NY, counsel is in New York, NY and the case is filed in a New York court. Does it make sense to process data using UTC? Probably not. It may make more sense in this case to process everything using Eastern Time.

In both instances, transparency and normalization are key. Some processing applications allow the technician to include the time zone in the date/time displayed on any images. While this setting is not always available, this simple inclusion can answer many questions when reviewing data. In addition, some processing applications will provide the time zone (for example, EDT) as a field value, which can be used by the review team to determine chronology for emails and can be requested as a field to include in production deliveries.

Reporting

Document sets often contain many different types of files. Not all of them can be reviewed, nor do they need to be. Creating reports based on the processed data can help streamline review and increase productivity by identifying files that do not need to be included in the review set. In many cases, reports can allow the first-pass review to eliminate many files missed by deNISTing and deduplication.

Reports can be used as part of the culling process. For example, summarizing date ranges or tallying custodians can yield information that can help identify missing data, or data that does not need to be passed on to reviewers. Reports can also provide insight into the number of documents that are responsive to a certain search term, making them an important step in creating and revising search term lists.

Another way reports can be implemented is to summarize data to send back to the client for input. An example would be to provide the client with a list of the files and file types that need proprietary software for review. Often the client can provide different versions of those files which can easily be reviewed in standard programs, such as Adobe.

Passwords

When processing files, you will often encounter password-protected items. The ideal situation is to receive unprotected files. If you do encounter protected files during the processing stage, many types of software offer the option to enter a password and retry the file. Once the tool has identified the password-protected files, a list of those items can be provided to the client so that passwords can be supplied.

Other options include password-cracking software. Depending on the native file type and the encryption method, some off-the-shelf products exist for password cracking. If files still cannot be opened, a list identifying the protected files should be provided. This list can be reviewed together with the file locations to determine whether it is necessary to take the matter further and have an expert work to open the files.

Extraction of Embedded Images

Emails often contain images that are incorporated into the signature line. This is often a corporate logo or sometimes a design element around a signature. Processing software can mistake these items for embedded objects that need to be extracted as separate documents. Some processing software can detect this and extract only images that are true attachments. Pictures that were photographed or created separately from the email and then attached to it can be extracted as their own files. The key is not to extract items added as part of a signature block or stationery. Doing so creates separate documents and pollutes the database with extra documents that should not be separated.

Processing and Problem Files

What are Exceptions?

e-Discovery exceptions are documents that cannot be correctly processed by the processing platform.

Types of Exceptions

Corrupt Files

Corrupt files are files that have structural problems which prevent them from being opened or manipulated in even their native application. File corruption can be caused by numerous factors such as network transmission errors, errors in the medium where files were stored (e.g. bad sectors on a hard drive) or unexpected termination of the software that was being used to edit the file (e.g. a power failure).

When handling corrupt file exceptions, the first course of action usually is to investigate the possibility of obtaining a replacement. If a replacement copy is not available, depending on the nature of the case and how critical the corrupt file is, attempting to repair the file may be a viable option (e.g. recovering a corrupt mailbox). Alternatively, the corrupt file can be excluded from processing and delivered in native format. In any case, the exception should be logged and all steps taken should be thoroughly documented.

Unprocessable Files

Unprocessable files are files that do not support the common e-Discovery actions such as text and metadata extraction. For example, system files such as executables and dynamic link libraries are typically unprocessable file types.

Encrypted Files

Encrypted files are files that were protected by a password, via digital rights management (DRM) or other encryption schemes. Encrypted files can be single documents such as MS Office files or PDFs, or encrypted containers such as TrueCrypt volumes.

Attorneys may occasionally be able to obtain the passwords for the encrypted files in the data set. If passwords are not available, they can often be discovered by strategically reviewing neighbor documents or by attempting to crack the passwords.

Exception Handling: How Should Exceptions be Tracked, Handled and Reported?

The processing software should provide the following mechanisms for exception tracking, handling and reporting:

  • All encountered exceptions should be logged. The log files should contain detailed information about the exceptions such as the full file path, file name, hash value and a description of the exception. These logs should be sent to the sponsoring attorney to decide whether or not to pursue replacements, passwords or static images from the software where the file originated.
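
An exception log with the fields listed above could be produced as a simple delimited file. The sketch below is one possible layout, not a standard: the column names, the exception categories and the sample paths are all hypothetical.

```python
import csv
import hashlib
import io
import posixpath

FIELDS = ["full_path", "file_name", "hash_value", "exception_type", "description"]


def write_exception_log(exceptions: list[tuple[str, bytes, str, str]]) -> str:
    """Return CSV text with one row per (path, content, type, description)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for full_path, content, exception_type, description in exceptions:
        writer.writerow({
            "full_path": full_path,
            "file_name": posixpath.basename(full_path),
            "hash_value": hashlib.md5(content).hexdigest().upper(),
            "exception_type": exception_type,
            "description": description,
        })
    return buffer.getvalue()


# Made-up sample entries for illustration.
SAMPLE = [
    ("/evidence/custodian_a/budget.xlsx", b"encrypted bytes", "Encrypted", "Password required"),
    ("/evidence/custodian_b/archive.pst", b"damaged bytes", "Corrupt", "Mailbox could not be opened"),
]

log_text = write_exception_log(SAMPLE)
print(log_text)
```

Including the hash value in each row lets the recipient confirm that a replacement file supplied later is, or is not, the same file that originally failed.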

Extracted Text (part of Fields from Processing)

Files should be processed to include extracted text. For any files from which the processing system is unable to extract text, e.g., non-searchable PDF files, the files should be imaged and then processed through an OCR generator. Any images created for generating OCR should be loaded to and maintained in the review platform.

Note: See Numbering or Item Identification for more information on numbering images created for OCR generation.

Fields from Processing

There are hundreds of available fields on a simple Office document. The important fields are those which can be used for searching, sorting and production purposes. The list below contains the key fields that are most useful; it originates from sources such as government production requirements.

Key Fields necessary to process data

Processing Field Name | Field Type | Description
Container Extension | Fixed-Length Text | Document extension of the container file in which the document originated.
Container ID | Fixed-Length Text | Unique identifier of the container file in which the document originated. This is used to identify or group files that came from the same container.
Container Name | Fixed-Length Text | Name of the container file in which the document originated.
Custodian | Single Object | Custodian associated with (or assigned to) the processing set during processing.
Common Custodians | Choice | The list of all custodians who have this email.
Extracted Text | Long Text | Complete text extracted from content of electronic files or OCR data field. This field holds the hidden comments of MS Office files.
Last Published On | Date | Date on which the document was last updated via re-publish.
Level | Whole Number | Numeric value indicating how deeply nested the document is within the family. The higher the number, the deeper the document is nested.
Originating Processing Set | Single Object | The processing set in which the document was processed.
Originating Processing Data Source | Single Object | A single object field that refers to the processing data source.
Processing Duplicate Hash | Fixed-Length Text | Identifying value of an electronic record that is used for de-duplication during processing.
Processing File Id | Fixed-Length Text | Unique identifier of the document in the processing engine database.
Processing Errors | Multiple Object | Any associated errors that occurred on the document during processing. This field is a link to the associated Processing Errors record.
Relativity Native Time Zone Offset | Whole Number | The hour offset based on the Time Zone ID. Numeric field that controls how header dates and times appear for email messages in the viewer or on redacted or highlighted images. This does not modify actual metadata associated with the displayed values.
Relativity Native Type | Fixed-Length Text | The type of native file loaded into the system.
Supported By Viewer | Boolean | Yes/No field that indicates whether the native document is supported by the viewer.
Time Zone Field | Single Object | Indicates which time zone is used to display dates and times on a document image.
Virtual Path | Long Text | Folder structure and path to file from the original location identified during processing.

Optional Fields

Processing Field Name Field Type Description
Attachment Document IDs Long Text Attachment document IDs of all child items in family group, delimited by semicolon, only present on parent items.
Attachment List Long Text Attachment file names of all child items in a family group, delimited by semicolon, only present on parent items.
Author Fixed-Length Text Original composer of document or sender of email message.
BCC Address Long Text The full SMTP value for the email address entered as a recipient of the Blind Carbon Copy of an email message.
CC Address Long Text The full SMTP value for the email address entered as a recipient of the Carbon Copy of an email message.
Child MD5 Hash Value Long Text Attachment MD5 hash value of all child items in a family group, only present on parent items. Note: This value is not populated if your processing server is FIPS compliant.
Child SHA1 Hash Value Long Text Attachment SHA1 hash value of all child items in a family group, only present on parent items.
Child SHA256 Hash Value Long Text Attachment SHA256 hash value of all child items in a family group, only present on parent items.
Comments Long Text Comments extracted from the metadata of the native file.
Company Fixed-Length Text The internal value entered for the company associated with a Microsoft Office document.
Contains Embedded Files Yes/No The yes/no indicator of whether a file such as a Microsoft Word document has additional files embedded in it.
Control Number Beg Attach Fixed-Length Text The identifier of the first page of the first document in a family group. This is used for page-level numbering schemes.
Control Number End Fixed-Length Text The unique identifier of the last page of a document. This is used for page-level numbering schemes.
Control Number End Attach Fixed-Length Text The identifier of the last page of the last document in a family group. This is used for page-level numbering schemes.
Conversation Long Text Normalized subject of email messages. This is the subject line of the email after removing the RE and FW that are added by the system when emails are forwarded or replied to.
Conversation Family Fixed-Length Text Relational field for conversation threads. This is a 44-character string of numbers and letters that is created in the initial email.
Conversation Index Long Text Email thread created by the email system. This is a 44-character string of numbers and letters that is created in the initial email and has 10 characters added for each reply or forward of an email.
Date Created Date Date and time from the Date Created property extracted from the original file or email message.
Date Last Modified Date Date and time from the Modified property of a document, representing the date and time that changes to the document were last saved.
Date Last Printed Date Date and time that the document was last printed.
Date Received Date Date and time that the email message was received (according to original time zones). This applies to emails only; this field is not populated for loose files.
Date Sent Date Date and time that the email message was sent (according to original time zones). This applies to emails only; this field is not populated for loose files.
Delivery Receipt Yes/No Indicates whether a delivery receipt was requested for an email.
Document Class Single Choice This field can be one of Email, Edoc, or Attach.
Document Extension Fixed-Length Text Character extension of the document that represents the file type to the Windows Operating System. Examples are PDF, DOC, or DOCX.
Document Subject Long Text Subject of the document extracted from the properties of the native file.
Domains (Email BCC) Multiple Object Domains of ‘Blind Carbon Copy’ recipients of the email message. See the Note below.
Domains (Email CC) Multiple Object Domains of ‘Carbon Copy’ recipients of the email message. See the Note below.
Domains (Email From) Multiple Object Domains of Originator of the email message. See the Note below.
Domains (Email To) Multiple Object Domains of ‘To’ recipients of the email message. See the Note below.
Email BCC Long Text Recipients of ‘Blind Carbon Copies’ of the email message.
Email Categories Long Text Category/categories assigned to an email message.
Email CC Long Text Recipients of ‘Carbon Copies’ of the email message.
Email From Fixed-Length Text Originator of the email message.
Email In Reply To ID Long Text The internal metadata value within an email for the reply-to ID.
Email Store Name Fixed-Length Text The identifier of the top-level container of an email message. For example, “jdoe.nsf.” If a document comes from a rar/zip file attached to the email, the container is referred to in that file.
Email Subject Long Text Subject of the email message.
Email To Long Text List of recipients or addressees of the email message.
File Name Fixed-Length Text The original name of the file.
File Size Decimal Generally a decimal number indicating the size in bytes of a file.
File Type Fixed-Length Text Description that represents the file type to the Windows Operating System. Examples are Adobe Portable Document Format, Microsoft Word 97 – 2003 Document, or Microsoft Office Word Open XML Format.
From Address Long Text The full SMTP value for the sender of an email message.
Group Identifier Fixed-Length Text Group the file belongs to (used to identify the group if attachment fields are not used).
Has Hidden Data Yes/No Indication of the existence of hidden document data such as hidden text in a Word document, hidden columns, rows, or worksheets in Excel, or slide notes in PowerPoint.
Importance Single Choice Notation added by the email originator to mark a message as having a higher level of importance than other email messages.
Keywords Long Text The internal value entered for keywords associated with a Microsoft Office document.
Last Accessed Date/Time Date The date and time at which the loose file was last accessed.
Last Saved By Fixed-Length Text The internal value indicating the last user to save a document.
Last Saved Date/Time Date/Time The internal value entered for the date and time at which a document was last saved.
Lotus Notes Other Folders Long Text A semicolon-delimited listing of all non-primary folders in which a Lotus Notes message or document was included.
MD5 Hash Fixed-Length Text Identifying value of an electronic record that can be used for de-duplication and authentication generated using the MD5 hash algorithm.
Meeting End Date/Time Date/Time The date and time at which a meeting item in Outlook or Lotus Notes ended.
Meeting Start Date/Time Date/Time The date and time at which a meeting item in Outlook or Lotus Notes began.
Message Header Long Text The full string of values contained in an email message header.
Message ID Fixed-Length Text The message number created by an email application and extracted from the email’s metadata.
Message Type Single Choice Indicates the email system message type. Possible values include Appointment, Contact, Distribution List, Delivery Report, Message, or Task. The value may be appended with ‘(Encrypted)’ or ‘Digitally Signed’ where appropriate.
Native File Fixed-Length Text The path to a copy of a file for loading into Relativity.
Number of Attachments Whole Number Number of files attached to a parent document.
OCR Text Yes/No The yes/no indicator of whether the extracted text field contains OCR text.
Office Document Manager Fixed-Length Text The internal value entered for the manager of a document.
Office Revision Number Fixed-Length Text The internal value for the revision number within a Microsoft Office document.
Other Props Long Text Metadata extracted during processing for additional fields beyond the list of processing fields available for mapping. This includes TrackChanges, HiddenText, HasOCR, and dates of calendar items. Field names and their corresponding values are delimited by a semicolon.
Parent Document ID Fixed-Length Text Document ID of the parent document. This field is only available on child items.
Password Protected Single Choice Indicates the documents that were password protected. It contains the value ‘Decrypted’ if the password was identified, ‘Encrypted’ if the password was not identified, or no value if the file was not password protected.
Primary Date Date Date taken from Email Sent Date, Email Received Date, or Last Modified Date, in that order of precedence.
Read Receipt Yes/No Indicates whether a read receipt was requested for an email.
Sensitivity Single Choice The indicator set on an email to denote the email’s level of privacy.
SHA1 Hash Fixed-Length Text Identifying value of an electronic record that can be used for de-duplication and authentication generated using the SHA1 hash algorithm.
SHA256 Hash Fixed-Length Text Identifying value of an electronic record that can be used for de-duplication and authentication generated using the SHA256 hash algorithm.
Sort Date Date Date taken from the Date Sent field on email messages repeated for the parent document and all child items to allow for date sorting. For loose files (non-emails) that do not contain a specific Date Sent property, Relativity populates this field with the value that appears in that file’s Modified property.
Speaker Notes Yes/No The yes/no indicator of whether a PowerPoint file has speaker notes associated with its slides.
To Address Long Text The full SMTP value for the recipient of an email message, for example, “bob@example.com”
Track Changes Yes/No The yes/no indicator of whether tracked changes exist in the document.
Unified Title Long Text Subject of the document. If the document is an email, this field contains the email subject. If the document is not an email, this field contains the document’s file name. This field in tandem with Group Identifier helps alleviate the problem of non-sequential control numbers within families.
Unprocessable Yes/No The yes/no value indicating whether a file could not be processed. If the file could not be processed, this field is set to Yes.
Unread Flag Yes/No Indicates whether an email was not read.
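
The Conversation, Conversation Family, and Conversation Index descriptions above imply a simple relationship between the three fields. The following is a minimal Python sketch of that relationship, not the method used by any particular processing tool. It assumes the layout stated in the table (a 44-character base plus 10 characters for each reply or forward). The function names and the list of reply and forward prefixes are hypothetical.

```python
import re

# Assumed layout, taken from the field descriptions above: a 44-character
# base created with the initial email, plus 10 characters per reply/forward.
BASE_LEN = 44
CHILD_LEN = 10

# Prefixes treated as reply/forward markers (an assumption; mail systems
# and locales may add others).
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip leading RE:/FW: markers to approximate the Conversation field."""
    return _PREFIX.sub("", subject).strip()


def conversation_family(index: str) -> str:
    """Return the base portion shared by every message in a thread."""
    if len(index) < BASE_LEN:
        raise ValueError("conversation index shorter than 44 characters")
    return index[:BASE_LEN]


def reply_depth(index: str) -> int:
    """Count the replies/forwards recorded after the base portion."""
    extra = len(index) - BASE_LEN
    if extra < 0 or extra % CHILD_LEN != 0:
        raise ValueError("unexpected conversation index length")
    return extra // CHILD_LEN
```

Under these assumptions, two messages belong to the same thread when `conversation_family` returns the same value for both, and `reply_depth` gives each message's position within that thread.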

Glossary

Term Definition
Processing The intake of file information and links to files for use in a database that provides a collaboration environment.
File system The area of the computer system that provides organization and storage of information and programmatic functions.
Virus protection An active application that guards against infection from a computer virus and alters files it finds problematic.
Cleansing Removal of data from a file.
Network A central linked group of computers which [xx]
Custodian A person who owns data.
Container file File that holds multiple other files generally for compression or security.
Normalization Equalization across a dataset to make all things consistent.
Uncompressed Files extracted from container files are uncompressed and the file size is consistent with the original size before adding to the container file and compressing.
Unprocessable Files that cannot be opened or read for extracting the metadata and inserting into a database.
Metadata File information related to various aspects of a file generated by the file system and not user created.
Alpha-numeric identifier Unique identifier that contains letters and numbers combined.
Doc id Document identifier is a unique name for each file in the system.
Tiffing Slang term for the creation of a TIFF image file from the native version of a file.
Tifs A file format that is a static image of a file.
deNIST Refers to the National Institute of Standards and Technology (NIST), which publishes a list of system files that belong to a standard Windows installation or other common software. These files are not client generated and are removed during the processing phase.
Hash values Algorithm-generated identifiers, based on file information, that are used for duplicate identification.
MD5 Hash An algorithm used to calculate a hash value.
Compression The saving of a file in a reduced file size using a container file so that it can be smaller for purposes of easier transport among devices. It also often includes encryption of files.
Loose File A standalone file that is not part of something else, such as a group of files or an email attachment.
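
The glossary entries for hash values and deNIST describe two culling steps that both rest on comparing hashes. The following is a minimal Python sketch of how the two could be combined, using the standard `hashlib` module. The function names, the sample file names, and the `known_system_hashes` set (a stand-in for a NIST-style reference list) are hypothetical. The sketch uses SHA256 for the comparison; a given tool may use MD5, SHA1, or a hash computed over selected metadata fields instead.

```python
import hashlib


def file_hashes(data: bytes) -> dict:
    """Return MD5, SHA1, and SHA256 hex digests for a file's bytes.

    Note: MD5 may be unavailable on FIPS-compliant systems.
    """
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def cull(files: dict, known_system_hashes: set) -> dict:
    """Drop known system files, then keep the first copy of each duplicate.

    `files` maps a (hypothetical) file name to its raw bytes.
    `known_system_hashes` stands in for a NIST-style reference list.
    """
    seen = set()
    kept = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in known_system_hashes or digest in seen:
            continue  # system file or duplicate of a file already kept
        seen.add(digest)
        kept[name] = data
    return kept


files = {
    "memo.docx": b"quarterly memo",
    "copy of memo.docx": b"quarterly memo",
    "kernel32.dll": b"system file bytes",
}
known_system_hashes = {hashlib.sha256(b"system file bytes").hexdigest()}
remaining = cull(files, known_system_hashes)
```

In this sketch only `memo.docx` remains: the second copy is dropped as a duplicate, and the system file is dropped because its hash appears in the reference set.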

Potential Future Topics

  • Language Identification
  • Special Considerations: Parallel Processing and Extraction
  • Project Details Form
  • Native Support does not use MAPI
  • EML Files
  • Mime vs Text File ID and Extraction
  • Processing Lotus Notes email
  • Support for Non-Email Databases Output of Lotus Notes
  • MHT, Rich Text and HTML
  • VCF, ICX Formats
  • Other Folders Field


Notes

  1. http://www.nsrl.nist.gov/.
  2. A binary file is a computer file that is not a text file; it may contain any type of data, encoded in binary form for computer storage and processing purposes. Binary files are usually thought of as being a sequence of bytes, which means the binary digits (bits) are grouped in eights.
  3. Some applications store the time zone/location of the user (e.g., Bloomberg). In these instances, special processing and/or conversion may be required.
  4. Coordinated Universal Time (UTC): Primary time standard by which the world regulates clocks and time. Time zones around the world are expressed as positive or negative offsets from UTC. For example, Mountain Standard Time is UTC – 7, so 3:00 a.m. Mountain Standard Time is 10:00 UTC.
  5. Normalization: The process of reformatting data so that it is stored in a standardized form, such as setting the date and time stamp of all ESI data for a matter to a specific zone, often UTC, to be used for de-duplication.
  6. Forensic Image: An exact bit-stream copy of all electronic data on a device, performed in a manner that ensures that the information is not altered. (NIST IR 7298 Revision 2, Glossary of Key Information Security Terms)
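
The time zone normalization described in notes 4 and 5 can be illustrated with Python's standard `datetime` module. This is a minimal sketch, not a description of how any processing tool does it. Mountain Standard Time is modeled as a fixed offset of UTC – 7, which ignores daylight saving time. The function name `to_utc` and the sample date are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Mountain Standard Time as a fixed offset of UTC-7 (daylight saving ignored).
MST = timezone(timedelta(hours=-7), name="MST")


def to_utc(local: datetime) -> datetime:
    """Convert a time-zone-aware timestamp to UTC."""
    if local.tzinfo is None:
        raise ValueError("timestamp must carry a time zone before normalizing")
    return local.astimezone(timezone.utc)


# 3:00 a.m. Mountain Standard Time on an arbitrary winter date.
local = datetime(2015, 1, 15, 3, 0, tzinfo=MST)
normalized = to_utc(local)
print(normalized.isoformat())  # 2015-01-15T10:00:00+00:00
```

Normalizing every timestamp in a matter this way lets the same moment compare as equal regardless of the time zone in which it was recorded, which is what makes the normalized value usable for de-duplication.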

3 comments to EDRM Processing Standards Guide, version 0.1

  • Very fine first effort. Kudos to all.

    Respecting application metadata, you state, “An example can be any MS Word document. Internal metadata fields can change the MD5 hash. These internal metadata fields are stored internally within the Microsoft Word document. These fields include “Modified” and “Accessed” dates.”

    I question the statement that “Accessed” dates are stored internally with respect to Word documents. I learn something new every day; but on what authority is this statement based please?

    Also, this comes across as unnecessarily Relativity-centric. It might be wiser to make it more product agnostic. The various references to a single product struck me as unnecessary.

    Thank you.

  • jpolichak

    Language Identification should not be difficult, especially if you limit the scope to the most common languages in a relatively short list. You will find much more Japanese than Arabic, and much more Arabic than Laotian.

    Most major languages have either a distinct alphabet/syllabary set/ideograph set, one or more unique characters, or at the minimum distinctive co-occurrence frequencies for character pairs.

    The more interesting problem would be to determine the native language of an author based on the patterns that appear when they are using English.

  • kingkillian

    Nice start. Add “Hidden Content” to your future topics list.
