The EDRM Data Set Project provides industry-standard, reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of e-discovery software and services, through three initiatives:
- EDRM ESI Reference Data Sets
- EDRM Software Reference Data Set
- EDRM Probabilistic Hash Data Set
PLEASE NOTE: These files may contain viruses, as can be the case with any set of files collected during discovery. Appropriate caution should be used when handling the files. 
EDRM ESI Reference Data Sets
This initiative collects, evaluates, and publishes ESI data sets for use in testing e-discovery software and services. There are currently four data sets available:
EDRM Enron Email Data Set v2: An updated set of Enron e-mail messages and attachments:
- More custodians (150), more email
- 153 zipped .pst and 159 zipped .xml files
- Approximately 107 GB zipped
- Email now organized by custodian folder rather than by collection and custodian folder, to remove duplicates that occurred in the collection process and to make the set appear more like users’ standard mailboxes
- Email now fixed to handle multi-line MIME headers
- Now with corresponding xml files in EDRM XML format
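To illustrate what handling multi-line MIME headers involves, here is a minimal Python sketch of header unfolding as described in RFC 5322, where a header value continues on the next line when that line starts with a space or tab. The function name `unfold_headers` is hypothetical, and this is not the code used to prepare the data set.

```python
import re

# RFC 5322 unfolding: drop a line break that is followed by a space or tab.
_FOLD = re.compile(r"\r?\n(?=[ \t])")


def unfold_headers(header_block):
    """Return a dict mapping header names to unfolded values.

    Illustrative only: repeated header names (e.g. Received) overwrite
    earlier ones, which a real parser would need to preserve.
    """
    headers = {}
    for line in _FOLD.sub("", header_block).splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers
```

A parser that treats each physical line as a separate header would truncate a folded Subject or To field; unfolding first avoids that.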
The EDRM Enron Email Data Set v2 and EDRM Enron PST Data Set are now available as a public data set on Amazon Web Services. AWS hosts these public data sets at no charge to the community in order to enable faster innovation by researchers across a variety of disciplines and industries. For more information about AWS public data sets, go to aws.amazon.com/publicdatasets.
EDRM Enron PST Data Set: Enron e-mail messages and attachments organized in 32 zipped files, each less than 700 MB in size, containing 168 .pst files.
The data in the EDRM Enron PST Data Set files is sourced from the FERC Enron Investigation release made available by Lockheed Martin Corporation, and has been reconstituted as PST files with attachments by ZL Technologies for the EDRM Data Set Project. It is our understanding that Lockheed Martin has not placed any restrictions on any of the Enron material that it has released to the public.
EDRM File Format Data Set: 381 files covering 200 file formats.
EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.
EDRM Software Reference Data Set
With the EDRM Software Reference Data Set initiative, EDRM seeks to augment the NIST Reference Data Set hashes used in e-discovery with additional hashes of known software files that can be further culled for review purposes.
While the NIST list focuses on a selection of software applications, and only as the software exists on installation media (e.g. DVDs and CDs), this initiative will provide hashes for the software after it has been extracted from compressed media containers and installed on a system. It will also cover software not currently handled by NIST, e.g. software that is downloaded from the Internet rather than received on DVD or CD media.
This initiative will modernize and enhance the list of hashes available for culling software files to reduce e-discovery costs.
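As a rough illustration of hash-based culling, the Python sketch below hashes each file and sets aside any file whose digest appears in a list of known software hashes. The function names `file_digest` and `cull_known_files` and the choice of SHA-1 are assumptions made for this example; they are not part of any EDRM or NIST tool.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cull_known_files(paths, known_hashes):
    """Split paths into (kept, culled) by membership in known_hashes.

    known_hashes is any iterable of hex digests (hypothetical input,
    e.g. loaded from a reference hash list); case is ignored.
    """
    known = {h.lower() for h in known_hashes}
    kept, culled = [], []
    for p in paths:
        (culled if file_digest(p) in known else kept).append(p)
    return kept, culled
```

Only the files in `kept` would move on to review. Because the match is on exact content, a hash list built from installation media will miss the same software once it has been unpacked and installed, which is the gap this initiative describes.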
EDRM Probabilistic Hash Data Set
To further improve the culling process, the Probabilistic Hash Data Set initiative seeks to collect as many anonymous hashes as possible of files encountered in real world e-discovery.
The frequency of the appearance of hashes can then be used to determine the likelihood that a particular file could be classified as probably not relevant. This initiative seeks to significantly improve the performance of automated culling of non-ESI files for e-discovery, resulting in both more reliable results and lower cost.
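One way the frequency idea could work is sketched below in Python: count how many separate collections each hash appears in, then flag hashes that occur in at least a chosen share of them. The page does not say how frequency would be turned into a likelihood, so the function names, the per-collection counting, and the `threshold` value are all assumptions for illustration.

```python
from collections import Counter


def hash_frequencies(collections):
    """Count in how many collections each hash appears.

    collections is an iterable of iterables of hex digests, one inner
    iterable per matter or collection (hypothetical input format).
    """
    counts = Counter()
    total = 0
    for hashes in collections:
        total += 1
        counts.update(set(hashes))  # count each hash once per collection
    return counts, total


def likely_not_relevant(counts, total, threshold=0.5):
    """Return hashes seen in at least `threshold` of all collections."""
    if total == 0:
        return set()
    return {h for h, n in counts.items() if n / total >= threshold}
```

A file that turns up in most unrelated collections is more likely to be a common system or application file than a document specific to one matter; a real scheme would need to validate any threshold against reviewed data.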

[...] @edrm: The EDRM Enron PST files are now available on the EDRM website, and can be found at http://edrm.net/activities/projects/data-set [...]
[...] an organisation called EDRM (Electronic Discovery Reference Model) has made a version of the Enron email corpus available for download that includes attachments, which were missing from the widely used versions [...]
Does anyone have a set of MD5 Checksums for the files?
[...] EDRM Enron PST files are now available on the EDRM Data Set website thanks to George Socha, EDRM, and ZL Technologies. I am co-lead of the EDRM Data Set [...]
Sounds like a good use case for BitTorrent.
What are the copyright and privacy issues regarding this dataset?
Uh, is no one going to answer this question? It seems important.
– Paul D. Bain, Esq.
Paul, as noted above,
Thanks
We are in the process of evaluating options for distributing the test data sets. We hope to be able to start making them available soon.
Our goal is to make the data sets available at no charge to anyone who would like copies. Whether we are able to achieve this will depend on the specific distribution mechanism, or mechanisms, we are able to put into place.
How can I gain access to the sample data set? Is there a cost?
I am a member
[...] Data Set (2008): The EDRM Data Set Project has compiled more than 60 gigabytes of data that can be used to test various aspects of electronic discovery software and services. The data set includes foreign language data from 23 different countries, emails with attachments (including .pst files), file format data from 200 different file types, and a file list with over 13,000 extensions. The group is currently testing the compiled data as well as distribution processes. More information on these efforts can be found at: http://edrm.net/activities/projects/data-set [...]