
Data Set

EDRM Data Set Project


The EDRM Data Set Project provides industry-standard reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of e-discovery software and services, through three initiatives:

  • EDRM ESI Reference Data Sets
  • EDRM Software Reference Data Set
  • EDRM Probabilistic Hash Data Set

PLEASE NOTE: These files may contain viruses, as can be the case with any set of files collected during discovery. Appropriate caution should be used when handling the files.

EDRM ESI Reference Data Sets

This initiative collects, evaluates, and publishes ESI data sets for use in testing e-discovery software and services. There are currently four data sets available:

EDRM Enron Email Data Set v2: An updated set of Enron e-mail messages and attachments:

  • More custodians (150), more email
  • 153 zipped .pst and 159 zipped .xml files
  • Approximately 107 GB zipped
  • Email now organized by custodian folder rather than by collection and custodian folder, to remove duplicates that occurred in the collection process and to make the set look more like users’ standard mailboxes
  • Email now fixed to handle multi-line MIME headers
  • Now with corresponding xml files in EDRM XML format
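
For readers unfamiliar with the multi-line header item above: a MIME header may be folded across several lines, with each continuation line starting with a space or tab. The sketch below is not EDRM's fix; it only illustrates, with a made-up message, how Python's standard `email` package unfolds such a header when parsing. The message text and the `read_subject` helper are assumptions for illustration.

```python
from email import message_from_string
from email.policy import default

# Made-up message: the Subject header is folded across two lines.
# A continuation line begins with a space or tab.
RAW_MESSAGE = (
    "From: sender@example.com\n"
    "Subject: Quarterly results and\n"
    " follow-up questions\n"
    "\n"
    "Body text.\n"
)


def read_subject(raw):
    """Parse a raw message and return its Subject as one logical line."""
    message = message_from_string(raw, policy=default)
    return str(message["Subject"])
```

A tool that reads only the first physical line of a header would truncate the subject here, which is the kind of problem that multi-line header handling addresses.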

The EDRM Enron Email Data Set v2 is now a public data set on Amazon Web Services. AWS hosts these public data sets at no charge to the community in order to enable faster innovation by researchers across a variety of disciplines and industries. For more information about AWS public data sets, go to aws.amazon.com/publicdatasets.

EDRM Enron PST Data Set: Enron e-mail messages and attachments organized in 32 zipped files, each less than 700 MB in size, containing 168 .pst files.

The data in the EDRM Enron PST Data Set files is sourced from the FERC Enron Investigation release made available by Lockheed Martin Corporation, and has been reconstituted as PST files with attachments by ZL Technologies for the EDRM Data Set Project. It is our understanding that Lockheed Martin has not placed any restrictions on any of the Enron material that it has released to the public.

EDRM File Format Data Set: 381 files covering 200 file formats.

EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.

EDRM Software Reference Data Set

With the EDRM Software Reference Data Set initiative, EDRM seeks to augment the NIST Reference Data Set hashes used in e-discovery with additional hashes of known software files that can be further culled for review purposes.

While the NIST list covers only a selection of software applications, and only as the software exists on installation media (e.g. DVDs and CDs), this initiative will provide hashes for the software after it has been extracted from compressed media containers and installed on a system. It will also cover software not currently handled by NIST, e.g. software that is downloaded from the Internet rather than received on DVD or CD media.

This initiative will modernize and enhance the list of hashes available for culling software files to reduce e-discovery costs.
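
As a rough illustration of how a list of known-software hashes can be used for culling, the following minimal sketch hashes each file and sets aside those whose digest appears in a reference set. It is an assumption-laden example, not the project's tooling: the choice of SHA-1, the function names, and the demo file names are all hypothetical, and the "known" hash set is built on the spot rather than loaded from any real reference list.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha1(path, chunk_size=1 << 20):
    """Return the uppercase hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def cull_known_files(paths, known_hashes):
    """Split paths into (kept, culled) by whether their hash is in known_hashes."""
    kept, culled = [], []
    for path in paths:
        if file_sha1(path) in known_hashes:
            culled.append(path)
        else:
            kept.append(path)
    return kept, culled


def demo_cull():
    """Build two throwaway files and cull the one whose hash is 'known'."""
    with tempfile.TemporaryDirectory() as tmp:
        app_file = Path(tmp) / "installed_app.dll"  # stands in for a known software file
        user_file = Path(tmp) / "memo.txt"          # stands in for user-created ESI
        app_file.write_bytes(b"pretend application binary")
        user_file.write_bytes(b"pretend custodian document")
        known_hashes = {file_sha1(app_file)}        # hypothetical reference hash list
        kept, culled = cull_known_files([app_file, user_file], known_hashes)
        return [p.name for p in kept], [p.name for p in culled]
```

Calling `demo_cull()` keeps `memo.txt` for review and culls `installed_app.dll`. Any digest algorithm works for this purpose as long as the reference list and the files under review are hashed the same way.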

EDRM Probabilistic Hash Data Set

To further improve the culling process, the Probabilistic Hash Data Set initiative seeks to collect as many anonymous hashes as possible of files encountered in real world e-discovery.

The frequency of the appearance of hashes can then be used to determine the likelihood that a particular file could be classified as probably not relevant. This initiative seeks to significantly improve the performance of automated culling of non-ESI files for e-discovery, resulting in both more reliable results and lower cost.
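
One way to picture the frequency idea is sketched below. The page does not say how frequency would be turned into a likelihood, so the scoring used here (the fraction of collections in which a hash appears) and the cutoff threshold are assumptions, as are the function names and the placeholder hash strings.

```python
from collections import Counter


def hash_collection_frequency(collections):
    """Return, for each hash, the fraction of collections that contain it."""
    if not collections:
        return {}
    counts = Counter()
    for hashes in collections:
        counts.update(set(hashes))  # count each hash once per collection
    total = len(collections)
    return {file_hash: n / total for file_hash, n in counts.items()}


def probably_not_relevant(frequencies, threshold=0.5):
    """Return hashes seen in at least `threshold` of all collections."""
    return {h for h, freq in frequencies.items() if freq >= threshold}


# Placeholder hashes from three hypothetical, anonymized collections.
EXAMPLE_COLLECTIONS = [
    ["hash_a", "hash_b"],
    ["hash_a", "hash_c"],
    ["hash_a", "hash_b", "hash_d"],
]
```

In this toy data, `hash_a` appears in every collection and `hash_b` in two of three, so both pass a threshold of 0.6, while `hash_c` and `hash_d` do not. The intuition is that a file turning up across many unrelated matters is more likely to be a system or application file than case-specific content.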

25 comments to Data Set



  • 18

    [...] messages along with PST files and EDRM XML files. Some projects I’ve been involved with are EDRM Data Sets open data project and the NIST-sponsored TREC Legal Track email relevance project. In all of these situations, [...]

  • 17
    David Kovar says:

    Thank you very much for making the file format collection available. I’d like to expand on the collection as there are some specific file types that I need that are missing. Is there a recommended method for producing new files for the collection, and how does one submit new files?

    Thank you.

    -David

  • 16

    [...] the EDRM VI Kickoff Meeting in Minneapolis and wanted to provide everyone with an update for the Data Set Project. The Data Set Project’s goals have expanded to cover projects that will not only make testing [...]

  • 15
    Dima Diall says:

    This is really a great resource – I want to congratulate and thank the entire team for the tremendous effort to put all of this together.

    I am curious about one thing, though – is there a particular reason why mailboxes for key executives in the Enron case are missing, e.g. Jeffrey Skilling (former CEO), Andrew Fastow (CFO), Richard Causey (CAO)? These appear not to have been disclosed originally as part of the FERC investigation; however, these individuals were involved in the trials, etc.

    • 15.1
      George Socha says:

      The Enron data set currently available through the EDRM site is approximately 40 GB. We are awaiting delivery of a more complete 100 GB version. Once we receive that version, we will make it available for downloading. At this point, I do not know what additional custodians will be available.

  • 14
    Julie Garrett says:

    From the Posse List above “a file list with over 13,000 extensions”. Is that available?

  • 13

    Has anyone converted the Enron PSTs into other email formats…specifically, Notes or Notes XML format (.NSF and .DXL)?

    • 13.1
      George Socha says:

      While conversion to other email formats is on the list of action items for the EDRM Data Set group to consider, it is not yet something we have done. As far as I know, no one else has performed these conversions.

  • 12
    Amit S. Tolmare says:

    Hello,

    We would like to use this Enron data for our internal testing purposes only. However, before we use this data, we would like to understand how you got access to it and whether you have the appropriate rights to distribute it through this website. Your response would be greatly appreciated.

  • 11
    Sid Newby says:

    Thanks, George, for making this available to the public again. It's a great reference set.

    Cheers!

    Sid Newby
    PLATINUM|IDS
