Data Set

EDRM Data Set Project


The EDRM Data Set Project provides industry-standard, reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of e-discovery software and services, through three initiatives:

  • EDRM ESI Reference Data Sets
  • EDRM Software Reference Data Set
  • EDRM Probabilistic Hash Data Set

PLEASE NOTE: These files may contain viruses, as can be the case with any set of files collected during discovery. Appropriate caution should be used when handling the files.

EDRM ESI Reference Data Sets

This initiative collects, evaluates, and publishes ESI data sets for use in testing e-discovery software and services. There are currently four data sets available:

EDRM Enron Email Data Set v2: An updated set of Enron e-mail messages and attachments:

  • More custodians (150), more email
  • 153 zipped .pst and 159 zipped .xml files
  • Approximately 107 GB zipped
  • Email now organized by custodian folder rather than by collection + custodian folder, to remove duplicates that occurred in the collection process and to make the set look more like users’ standard mailboxes
  • Email now fixed to handle multi-line MIME headers
  • Now with corresponding xml files in EDRM XML format
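The multi-line MIME header fix mentioned above concerns "folded" headers, where a long header value continues on following lines that begin with a space or tab. The sketch below is not the EDRM project's own code; it is a minimal illustration of unfolding as described in RFC 5322, and the function name `unfold_headers` and the sample message are hypothetical.

```python
import re

# A line break followed by a space or tab marks a folded continuation line.
# Unfolding removes the line break and keeps the whitespace that follows it.
_FOLD = re.compile(r"\r?\n(?=[ \t])")
_LINE_BREAK = re.compile(r"\r?\n")


def unfold_headers(header_block: str) -> list[tuple[str, str]]:
    """Return (name, value) pairs from a raw header block, joining folded lines."""
    unfolded = _FOLD.sub("", header_block)
    headers = []
    for line in _LINE_BREAK.split(unfolded):
        if not line.strip():
            continue  # skip the blank line that ends the header block
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return headers


# Hypothetical sample: the Subject header is folded across two lines.
sample = (
    "From: alice@example.com\r\n"
    "Subject: Quarterly results\r\n"
    " and follow-up items\r\n"
)
parsed = unfold_headers(sample)
```

A parser that treats every line as a separate header would lose or misfile the second half of the Subject value here, which is the kind of problem the fix addresses.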

The EDRM Enron Email Data Set v2 and EDRM Enron PST Data Set are now public data sets on Amazon Web Services. AWS hosts these public data sets at no charge to the community in order to enable faster innovation by researchers across a variety of disciplines and industries. For more information about AWS public data sets, go to aws.amazon.com/publicdatasets.

EDRM Enron PST Data Set: Enron e-mail messages and attachments organized in 32 zipped files, each less than 700 MB in size, containing 168 .pst files.

The data in the EDRM Enron PST Data Set files is sourced from the FERC Enron Investigation release made available by Lockheed Martin Corporation, and has been reconstituted as PST files with attachments by ZL Technologies for the EDRM Data Set Project. It is our understanding that Lockheed Martin has not placed any restrictions on any of the Enron material that it has released to the public.

EDRM File Format Data Set: 381 files covering 200 file formats.

EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.

EDRM Software Reference Data Set

With the EDRM Software Reference Data Set initiative, EDRM seeks to augment the NIST Reference Data Set hashes used in e-discovery with additional hashes of known software files that can be further culled for review purposes.

While the NIST list focuses on a selection of software applications, and only as the software exists on installation media (e.g., DVDs and CDs), this initiative will provide the hashes for the software after it has been extracted from compressed media containers and installed on a system. It will also cover software not currently handled by NIST, e.g., software that is downloaded from the Internet rather than received on DVD or CD media.

This initiative will modernize and enhance the list of hashes available for culling software files to reduce e-discovery costs.
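Culling by known-file hashes can be sketched as follows. This is an illustrative assumption, not the EDRM or NIST tooling: it supposes the reference list is available as a plain collection of SHA-1 hex strings, and the function names `sha1_of_file` and `cull_known_files` are hypothetical.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def cull_known_files(paths, known_hashes):
    """Split paths into (kept, culled) by membership in a known-hash list.

    known_hashes is assumed to be an iterable of SHA-1 hex strings;
    comparison is case-insensitive.
    """
    known = {h.upper() for h in known_hashes}
    kept, culled = [], []
    for path in paths:
        if sha1_of_file(path) in known:
            culled.append(path)  # matches a known software file
        else:
            kept.append(path)    # unknown file, stays in the review set
    return kept, culled
```

Files in `culled` match a known software hash and can be set aside before review; everything in `kept` still needs attention.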

EDRM Probabilistic Hash Data Set

To further improve the culling process, the Probabilistic Hash Data Set initiative seeks to collect as many anonymous hashes as possible of files encountered in real-world e-discovery.

The frequency of the appearance of hashes can then be used to determine the likelihood that a particular file could be classified as probably not relevant. This initiative seeks to significantly improve the performance of automated culling of non-ESI files for e-discovery, resulting in both more reliable results and lower cost.
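One way to picture the frequency idea is sketched below. The source does not specify a scoring rule, so everything here is an assumption: each collection is modeled as a list of anonymous hash strings, a hash's frequency is the fraction of collections it appears in, and the 0.8 threshold and the function names are hypothetical.

```python
from collections import Counter


def hash_frequencies(collections):
    """Map each hash to the fraction of collections in which it appears."""
    total = len(collections)
    if total == 0:
        return {}
    counts = Counter()
    for hashes in collections:
        counts.update(set(hashes))  # count each hash once per collection
    return {h: count / total for h, count in counts.items()}


def probably_not_relevant(collections, threshold=0.8):
    """Return hashes seen in at least `threshold` of all collections.

    The threshold is an illustrative assumption, not a published value.
    """
    frequencies = hash_frequencies(collections)
    return {h for h, freq in frequencies.items() if freq >= threshold}
```

The intuition is that a file turning up in most unrelated collections is more likely a common system or application file than a document specific to any one matter.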
