Data Set

The EDRM Data Set Project provides industry-standard reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of e-discovery software and services.

These files may contain viruses, as can any set of files collected during discovery. Handle them with appropriate caution.

These files may contain personally identifiable information (PII), despite efforts to remove it. If you find PII that you think should be removed, please notify us at

EDRM ESI Reference Data Sets

This initiative collects, evaluates, and publishes ESI data sets for use in testing e-discovery software and services. There are currently four data sets available:

EDRM Enron Email v1 Data Set: An updated set of Enron email messages and attachments.

EDRM File Format Data Set: 381 files covering 200 file formats.

EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.

EDRM Micro Datasets: EDRM has begun publishing a series of “Micro Datasets,” some available to the general public and some to EDRM members only. These datasets are designed for e-discovery testing and process validation. Software vendors, litigation support organizations, law firms, and others may use these smaller sets to qualify support, test the speed and accuracy of indexing and search, and conduct more forensically oriented analytics exercises throughout the e-discovery workflow.

26 comments to Data Set

  • David Kovar

    Thank you very much for making the file format collection available. I’d like to expand on the collection as there are some specific file types that I need that are missing. Is there a recommended method for producing new files for the collection, and how does one submit new files?

    Thank you.

