The EDRM Data Set Project provides industry-standard reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of e-discovery software and services.
These files may contain viruses, as can be the case with any set of files collected during discovery. Appropriate caution should be used when handling the files.
These files may contain personally identifiable information, in spite of efforts to remove that information. If you find PII that you think should be removed, please notify us at firstname.lastname@example.org.
EDRM ESI Reference Data Sets
This initiative collects, evaluates, and publishes ESI data sets for use in testing e-discovery software and services. There are currently four data sets available:
EDRM Enron Email v1 Data Set: An updated set of Enron email messages and attachments.
EDRM File Format Data Set: 381 files covering 200 file formats.
EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.
EDRM Micro Datasets: EDRM has begun to publish what will be a series of “Micro Datasets,” some available to the general public and some for EDRM members only. These datasets are designed for e-discovery testing and process validation. Software vendors, litigation support organizations, law firms, and others may use these smaller sets to qualify support, test speed and accuracy in indexing and search, and conduct more forensically oriented analytics exercises throughout the e-discovery workflow.
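As an illustration of how a downloaded data set might be inventoried before testing, the following Python sketch tallies files by extension and records a SHA-256 digest for each file. It is a minimal example, not an EDRM tool: the function name `inventory` and the directory layout are assumptions made for illustration. Files are read only as raw bytes and never opened in their native applications, in keeping with the caution about viruses noted above.

```python
import hashlib
from collections import Counter
from pathlib import Path


def inventory(root):
    """Return (extension counts, {relative path: SHA-256 hex digest}) for root.

    Hypothetical helper for illustration; not part of any EDRM distribution.
    """
    root = Path(root)
    counts = Counter()
    digests = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Group by lowercase extension; files without one are counted as "(none)".
        counts[path.suffix.lower() or "(none)"] += 1
        # Hash in 1 MiB chunks so large files are not loaded into memory at once.
        hasher = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
        digests[path.relative_to(root).as_posix()] = hasher.hexdigest()
    return counts, digests
```

Calling `inventory` on the directory where a data set was unpacked returns the per-extension counts, which can be compared against published figures such as the file count of the File Format Data Set, and the digests, which can be compared before and after processing to confirm that files were not altered.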