EDRM Evergreen/Processing/Preparation

Comments: Please submit comments to the EDRM Evergreen Processing forum

Categories

add introduction


Restoration

Tape restoration can be an important element of a case, particularly in cases that involve legacy data. Data that is no longer in use is typically found on backup tapes. Backup tapes are intended for disaster recovery, not data retention, so access to this data can be difficult, which may affect the timeframes and costs of the project.
Due in large part to its portability, durability, and low cost per MB, magnetic tape is the most widely used medium for archiving data, and has been for approximately four decades. Given these and numerous other benefits, it is easy to fall into the trap of believing tape is a fail-safe technology, where one simply has to place a tape into a drive to retrieve the contents. Nothing could be further from the truth. Nothing can be written to backup tape without software controlling the process, so retrieving the contents generally depends on that software as well.
As the volume of data being archived has grown tremendously over the past decade, archiving software vendors concentrated on writing data to tape more rapidly. Because tape was perceived as an aid to disaster recovery, no special emphasis was placed on restoring data from tape: restore times were perceived as acceptable, and it was presumed that the data would be restored to the same system from which it was archived.
Fast-forward to the 21st century, with newly implemented SEC/NASD, Sarbanes-Oxley, HIPAA, and other regulatory requirements. Backup software manufacturers are now playing catch-up, trying to help their customers respond to these requirements in a timely manner. Failure to respond can mean fines, economic sanctions, or even litigation. As a result, backup software customers are demanding improvements to the technology and are even exploring other means of archiving data, such as more costly, but more rapidly accessible, hard disk-based systems.
Tapes are usually restored with one of two results in mind: restoring a complete system or restoring a subset of files. Each has its own method.
Restoring a complete system requires a full backup of the system and, depending on how backups were performed, either the last differential tape set or all of the incremental tape sets made since that full backup. Depending on the application that was used, preliminary work may also be needed on the target system before any data is recovered. Once that work is complete, the data from the full backup is restored first, followed by the differential set or by all of the incremental sets in chronological order. Some software packages may need the media inserted in a different order to gather media information before restoring, but the restoration itself typically follows this order.
Restoring a subset of files can vary. The method depends on the modification dates of the files in the subset and the backup method used to archive the data. If the modification and/or creation dates fall within the period covered by a single backup job, the files should be part of the same backup set, whether it is a full, differential, or incremental set. If the files fall outside the range of one backup job, the backup set for each necessary file will be needed. This could involve a combination of tape sets, much like the first method.
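
The full-system restore order described above can be sketched in a few lines of Python. This is only an illustration, not any backup product's logic: the `BackupSet` record, its `kind` labels, and the `restore_order` function are hypothetical names, and the sketch assumes a rotation that uses either differentials or incrementals after each full backup, not a mixture of both.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BackupSet:
    """Hypothetical record describing one tape set."""
    label: str
    kind: str    # "full", "differential", or "incremental"
    taken: date


def restore_order(sets, target):
    """Return the tape sets needed to rebuild a system as of `target`, in restore order."""
    # Only sets written on or before the target date are relevant.
    usable = sorted((s for s in sets if s.taken <= target), key=lambda s: s.taken)
    fulls = [s for s in usable if s.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target date")
    base = fulls[-1]  # most recent full backup
    later = [s for s in usable if s.taken > base.taken]
    differentials = [s for s in later if s.kind == "differential"]
    if differentials:
        # A differential holds every change since the full backup,
        # so only the last one is needed.
        return [base, differentials[-1]]
    # Otherwise every incremental since the full backup, oldest first.
    return [base] + [s for s in later if s.kind == "incremental"]
```

Given a full backup followed by daily incrementals, the function returns the full set and then each incremental up to the target date. With differentials it returns only the full set and the most recent differential.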

Conversion

to be added

File Type Verification

to be added

Cataloging and Itemization

Just as physical media containing electronic documents must be treated as evidence, so must each individual file. One of the benefits of an automatic, technical process is the ability to substantiate exactly what process a file went through prior to admission in a case. This benefit is even greater when the case consists of millions of files: through automation, every file is subject to the same processes. Additionally, the ability to account for every file that was processed, even files that are segregated based on data culling criteria or exceptions, is critical to ensure that each piece of evidence was handled properly.
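
One way to picture this accounting is a catalog that records a disposition for every item and can report anything that was collected but never logged. The sketch below is a minimal illustration under assumed names: the `Catalog` class, its three disposition labels, and the item identifiers are all hypothetical, not part of any EDRM specification.

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumed disposition labels for this sketch.
DISPOSITIONS = {"processed", "culled", "exception"}


@dataclass
class Catalog:
    """Hypothetical catalog: item id -> (path, disposition, note)."""
    entries: dict = field(default_factory=dict)

    def record(self, item_id, path, disposition, note=""):
        if disposition not in DISPOSITIONS:
            raise ValueError(f"unknown disposition: {disposition}")
        if item_id in self.entries:
            raise ValueError(f"item already cataloged: {item_id}")
        self.entries[item_id] = (path, disposition, note)

    def summary(self):
        """Count items per disposition."""
        return Counter(disposition for _, disposition, _ in self.entries.values())

    def unaccounted(self, collected_ids):
        """Return ids that were collected but never cataloged."""
        return sorted(set(collected_ids) - set(self.entries))
```

Because culled files and exceptions are recorded alongside processed files, the per-disposition counts always add up to the number of cataloged items, and `unaccounted` exposes any gap against the collection list.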

Container Extraction

Once the data has been restored from the media on which it exists, the next step to consider is the relationships between the files or documents contained on that media. Electronic data processing becomes more complex when files have different relationships to one another. For example, an email message is a file, but it may also have attachments that contain embedded or linked data. The attachments can give the email message a different legal significance and context for processing. The email message is said to be the “parent” file, and the attachments are each a “child.” The terminology is inherited from computer science, where the original concepts of directory structures emerged. All file systems have directory structures, and most of them support nested parent/child relationships among the files contained within them. The parent/child relationship between messages and attachments is an attribute that is generally required to be captured, preserved, and available as metadata.
Files can also be packed inside other files. Messages and attachments are held within containers called archives (.PST for Outlook mail files, .NSF for Lotus Notes), and archives can be nested within other archives. Again, those relationships between files must be maintained and available as the files are reviewed and produced. The requirements of a case will drive the technologies that must be employed to extract and process this data, as well as to report the parent/child relationships of the data for use by the review team.
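
The idea of recording parent/child relationships while unpacking nested containers can be sketched as follows. Reading .PST or .NSF files requires specialized tools, so this illustration substitutes ZIP archives, which Python's standard library can open. The `walk_container` function and the record layout are assumptions made for the example.

```python
import io
import zipfile


def walk_container(data, name, parent_id=None, records=None):
    """Recursively list a ZIP container's members, recording each item's parent."""
    if records is None:
        records = []
    item_id = len(records) + 1  # ids are assigned in discovery order
    records.append({"id": item_id, "parent": parent_id, "name": name})
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # directory entries carry no content
                # Each member becomes a child of the current container;
                # nested archives are unpacked by the recursive call.
                walk_container(archive.read(info), info.filename, item_id, records)
    return records
```

Every record keeps the id of the container it came from, so a file found inside an archive that was itself inside another archive can still be traced back to the top-level item.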

Metadata Extraction

Metadata refers to the digital attributes of electronic documents that are attached to those documents either during their creation or during use in their native application. Metadata is created and exists in its natural state before the electronic discovery process is initiated. Metadata should be considered the file attribute fields that are not present when the document is printed in its default format, or fields that are not the "body" of the document.
Metadata is generated from two sources: the operating system and the software application itself. For all files, the operating system records name, dates (created, modified, and accessed), file type, and size. For email files, the metadata also includes sent, received, and modified dates, along with subject and recipient information. Some of that same data, such as dates and file type, is also recorded in the file itself. The file itself also captures other important data: authorship, revision information, and comments. This information can be an important component of a legal strategy.
Metadata provides search criteria and contextual information that may not be in the body text. The ability to search for and correlate data during the review process is dependent on how well the metadata was processed. (Example: if metadata has been altered, a search for files modified between two dates may be compromised. If this happens, you may end up with more responsive data than if all metadata for all files were pristine.) Whether or not a legal team decides to use the metadata, it is a commonly provided component in electronic discovery deliverables. Metadata is extracted and archived as part of processing the source data so that it is available during review. Although metadata may not be used during processing, it is still critical that it be maintained for purposes of electronic discovery. If not, the integrity and authenticity of the data can be brought into question.
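
As a rough illustration of the two kinds of metadata described above, the sketch below reads file-system attributes with `os.stat` and email header fields with Python's standard `email` package. The function names and the choice of fields are assumptions for the example. Creation date is left out because its availability through `os.stat` differs between operating systems.

```python
import os
from datetime import datetime, timezone
from email import message_from_string


def _iso(timestamp):
    """Format a POSIX timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def filesystem_metadata(path):
    """Collect attributes recorded by the operating system for one file."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "extension": os.path.splitext(path)[1].lower(),
        "size_bytes": st.st_size,
        "modified": _iso(st.st_mtime),
        "accessed": _iso(st.st_atime),
    }


def email_metadata(raw_message):
    """Collect selected header fields from a raw RFC 822 style message."""
    msg = message_from_string(raw_message)
    # Missing headers come back as None rather than raising an error.
    return {key: msg.get(key) for key in ("From", "To", "Subject", "Date")}
```

In practice this information would be captured before any step that might alter it, since opening or copying a file can change its access or modification dates on some systems.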

De-Dupe Hashing

The amount of electronic data in corporations has grown, particularly for those corporations that have had overly broad or non-existent record retention policies. Out of this data growth, data culling has become a vital element of electronic discovery. A common data culling technique is deduplication. Hardware, software, and service companies have developed technical solutions to reduce duplicate data. Deduplication is the process of identifying and segregating those files that are exact duplicates of one another. The goal is to provide a deliverable that contains one copy of each original document, while maintaining the information associated with each instance of that document within the collection. There are several ways duplicates are identified. A combination of metadata information can be compared to match files. An electronic fingerprint of each file can be taken and compared using a mathematical hashing algorithm such as MD5, SHA-1, or another member of the SHA family specified in FIPS 180. In some cases, a hashing algorithm is used in combination with metadata.
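
Hash-based deduplication can be sketched with Python's standard `hashlib` module. This is a minimal illustration, not a description of any vendor's process: the function names are hypothetical, and file contents are passed in as bytes to keep the example self-contained. Files with the same digest are grouped together, which keeps one fingerprint per unique document while still listing every location where a copy was found.

```python
import hashlib
from collections import defaultdict


def file_digest(data, algorithm="sha1"):
    """Return the hex fingerprint of a file's bytes."""
    return hashlib.new(algorithm, data).hexdigest()


def deduplicate(items, algorithm="sha1"):
    """Group (location, bytes) pairs by fingerprint.

    Returns a dict mapping each digest to the list of locations whose
    content produced it; a list longer than one marks exact duplicates.
    """
    groups = defaultdict(list)
    for location, data in items:
        groups[file_digest(data, algorithm)].append(location)
    return dict(groups)
```

Only byte-for-byte identical content produces the same digest, so two documents that differ by a single character will land in separate groups.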

Indexing

to be added

[updated Jan. 29, 2008]
