Questions
- How do I get the EDRM Data Set files?
- How much do I have to pay to get the EDRM Data Sets? How much do I have to pay to use them?
- Where does the data in the EDRM Enron PST Data Set come from? Do you have the rights to redistribute it?
- Does the EDRM PST Data Set contain viruses?
- Is there a set of MD5 checksums for the EDRM PST Data Set files?
- Why does the EDRM Enron PST Data Set contain duplicates?
- How do the EDRM Enron PST Data Set and the CMU Data Set differ?
- How do the EDRM Enron PST Data Set and the Berkeley ANLP Categorization differ?
- I am interested specifically in Unicode data (foreign language content). I see the listed info about Ubuntu - If I download the international data set, I expect it contains email and Office Docs? thanks
- Hi there, I downloaded the Enron dataset. Thank you very much, it is truly a great resource for all of us. Is it complete? I was wondering why the dataset doesn't contain emails from some key custodians such as Andrew Fastow or Jeff Skilling?
Answers
-
How do I get the EDRM Data Set files?
EDRM Data Sets can be downloaded from the EDRM Data Set page (edrm.net/21). Select the "Downloads" tab, choose the desired data set, and follow any additional instructions.
-
How much do I have to pay to get the EDRM Data Sets? How much do I have to pay to use them?
We do not charge for access to the EDRM Data Sets, nor do we charge for use of the data sets.
We have made this content available under a Creative Commons Attribution 3.0 United States License. To provide the attribution required under that license, when sharing or remixing the content please cite "EDRM (edrm.net)".
-
Where does the data in the EDRM Enron PST Data Set come from? Do you have the rights to redistribute it?
The data in the EDRM Enron PST Data Set files is sourced from the FERC Enron Investigation release made available by Lockheed Martin Corporation, and has been reconstituted as PST files with attachments by ZL Technologies for the EDRM Data Set Project. It is our understanding that Lockheed Martin has not placed any restrictions on any of the Enron material that it has released to the public.
-
Does the EDRM PST Data Set contain viruses?
We have been told that some of the files in the EDRM PST Data Set contain viruses. We view the task of addressing possible viruses as a responsibility that rests with the entity processing or otherwise working with the files, as is the case in real-world e-discovery undertakings.
-
Is there a set of MD5 checksums for the EDRM PST Data Set files?
Yes. A txt file containing MD5 hash values is available at EDRM-Enron-PST-MD5.txt
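As an illustration of how the published hash values might be used, the following is a minimal sketch that compares downloaded files against a checksum list. It assumes each line of the checksum file holds one 32-character hex digest and one file name; the actual layout of EDRM-Enron-PST-MD5.txt may differ, and the function names here are hypothetical.

```python
import hashlib
import re
from pathlib import Path

# Assumption: each line holds one 32-character hex MD5 digest and one file
# name, in either order. The real checksum file may be laid out differently.
MD5_RE = re.compile(r"\b[0-9a-fA-F]{32}\b")


def parse_md5_list(text):
    """Return {file_name: md5_hex} for every line that contains a digest."""
    expected = {}
    for line in text.splitlines():
        match = MD5_RE.search(line)
        if not match:
            continue
        # Whatever remains on the line after removing the digest is the name.
        name = (line[:match.start()] + line[match.end():]).strip(" \t*:,")
        if name:
            expected[name] = match.group(0).lower()
    return expected


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PST files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(download_dir, checksum_text):
    """Return {file_name: "ok" | "mismatch" | "missing"}."""
    results = {}
    for name, want in parse_md5_list(checksum_text).items():
        path = Path(download_dir) / name
        if not path.is_file():
            results[name] = "missing"
        else:
            results[name] = "ok" if md5_of_file(path) == want else "mismatch"
    return results
```

A "mismatch" result usually points to an incomplete or corrupted download, so re-downloading that file is a reasonable first step.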
-
Why does the EDRM Enron PST Data Set contain duplicates?
We are attempting to match, as best we can, a real-world e-discovery situation. With the Enron data, multiple collections were made of many custodians’ email over a period of several months. Because the same email was often collected several times, the set contains duplicates. In the PST set, each collection can sometimes be seen as a top-level folder in the PST file. De-duplication and near-de-duplication can be used to reduce these duplicates. Alternatively, it may be beneficial to separate each of those top-level folders into its own PST file, which would at least limit the duplicates to those within a collection.
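For readers who want to see what exact de-duplication can look like, here is a minimal sketch. It assumes the messages have already been extracted from the PST files into dictionaries; the field names ("from", "to", "date", "subject", "body", "folder") are hypothetical, and the hashing scheme is one common choice, not the method used by any particular e-discovery tool. Near-de-duplication needs a similarity measure and is not covered here.

```python
import hashlib


def message_key(msg):
    """Hash the fields that identify an email, ignoring the folder it sits in."""
    parts = [
        msg.get("from", "").strip().lower(),
        ",".join(sorted(addr.strip().lower() for addr in msg.get("to", []))),
        msg.get("date", "").strip(),
        msg.get("subject", "").strip(),
        " ".join(msg.get("body", "").split()),  # collapse runs of whitespace
    ]
    # Join with a separator that is unlikely to occur inside the fields.
    return hashlib.sha1("\x1f".join(parts).encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Keep the first copy of each message; return (unique, duplicate_count)."""
    seen = set()
    unique = []
    for msg in messages:
        key = message_key(msg)
        if key in seen:
            continue
        seen.add(key)
        unique.append(msg)
    return unique, len(messages) - len(unique)
```

Because the key ignores the folder, the same email collected in two different collections counts as a single message.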
-
How do the EDRM Enron PST Data Set and the CMU Data Set differ?
There is some overlap between the email in the PST files and the email in the CMU corpus, but the overlap is not complete. We have discussed creating a mapping between the EDRM PST email and the CMU corpus but have not completed that project yet.
-
How do the EDRM Enron PST Data Set and the Berkeley ANLP Categorization differ?
Our participants have looked at this set and mapped it to some other Enron email data sets, but have not mapped it to the EDRM data set yet. We agree it would be useful to incorporate this into a data set offering for use as a training set and for other purposes.
-
I am interested specifically in Unicode data (foreign language content). I see the listed info about Ubuntu - If I download the international data set, I expect it contains email and Office Docs? thanks
At this point, all the Ubuntu files in this data set are email messages. The messages may have attachments; however, we have not yet checked for the presence or absence of attachments.
-
Hi there, I downloaded the Enron dataset. Thank you very much, it is truly a great resource for all of us. Is it complete? I was wondering why the dataset doesn't contain emails from some key custodians such as Andrew Fastow or Jeff Skilling?
The Enron data set currently available through the EDRM site is approximately 40 GB. We are awaiting delivery of a more complete 100 GB version. Once we receive that version, we will make it available for download.