Data Set

EDRM Data Set Project

EDRM Enron Email Data Set v2

The EDRM Enron Email Data Set v2 consists of Enron e-mail messages and attachments, available as two sets of downloadable compressed files: XML and PST.

Now an Amazon Web Services Public Data Set

The EDRM Enron Email Data Set v2 and the EDRM Enron PST Data Set are now public data sets on Amazon Web Services. AWS hosts these public data sets at no charge to the community to enable faster innovation by researchers across a variety of disciplines and industries. For more information about AWS public data sets, go to aws.amazon.com/publicdatasets.

Accessing the EDRM Enron Data Set

To access the EDRM Enron Data Set, you need to create an AWS instance and mount the EDRM Data EBS volume. See:

Once you have mounted the EBS volume, you can use the IP address for the instance to access the system both to use the files locally and to download them. See How do I access my systems?

To work with the files on the instance itself, SSH into the machine and use them there.

To download the files, you can use SFTP.

AWS pricing is on a per-hour basis. For a small instance, the fee is currently $0.08 per hour for Linux and $0.115 per hour for Windows. See Amazon EC2 Pricing.
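
As a rough illustration of the per-hour pricing above, the following sketch estimates the instance charge for a given number of hours. The rates are the ones quoted on this page and may be out of date; the function name `estimate_cost` is made up for this example, and storage or data-transfer charges are not included.

```python
# Hourly rates for a small instance as quoted on this page (USD).
# Assumption: these figures may no longer match current Amazon EC2 pricing.
HOURLY_RATE_USD = {"linux": 0.08, "windows": 0.115}


def estimate_cost(hours, platform="linux"):
    """Return the estimated instance charge in USD, rounded to cents."""
    return round(HOURLY_RATE_USD[platform] * hours, 2)


if __name__ == "__main__":
    # Example: keep an instance running for one day while downloading.
    for platform in HOURLY_RATE_USD:
        print(platform, estimate_cost(24, platform))
```

At these rates, a full day on a small Linux instance comes to under two dollars.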

For additional information about AWS, go to http://aws.amazon.com.

Join EDRM

If you find these files useful, please consider joining EDRM!

32 comments to EDRM Enron Email Data Set v2

  • 20
    rajesh singh says:

    Can we have some other way, such as an FTP address, to download all the PST files at once?

  • 19
    rajesh says:

    I don’t see the file list or the download button.

  • 18
    Mark says:

    In the dataset “EDRM Enron Email Data Set v2”, for each custodian, there are subdirs like \text_000, \text_001, etc. They consist entirely of .txt files. Some of these files are the .eml files without the embedded attachments and the others are the actual attachments to those .eml files (but converted to text files, which is really useful).
    However, for some custodians this conversion does not appear to have been performed (they have no \text_nnn subdirs). Do you plan to finish those custodians? These are the ones I ran into:
    edrm-enron-v2_dasovich-j_xml.zip
    edrm-enron-v2_mann-k_xml.zip

    Thank you.
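
A check like the one Mark describes can be sketched as follows. This is only an illustration: it assumes the downloaded XML zip archives store the text files under directories named text_ followed by three digits, as in the comment above, and the helper names `has_text_subdirs` and `archives_missing_text` are invented for this example.

```python
import re
import zipfile

# Matches archive members stored inside a text_NNN directory,
# e.g. "text_000/123.txt" (either slash style is accepted).
TEXT_DIR_PATTERN = re.compile(r"(^|[\\/])text_\d{3}[\\/]")


def has_text_subdirs(member_names):
    """Return True if any member path lies inside a text_NNN directory."""
    return any(TEXT_DIR_PATTERN.search(name) for name in member_names)


def archives_missing_text(zip_paths):
    """Return the archives that contain no text_NNN directory at all."""
    missing = []
    for path in zip_paths:
        with zipfile.ZipFile(path) as archive:
            if not has_text_subdirs(archive.namelist()):
                missing.append(path)
    return missing
```

Calling `archives_missing_text` with the list of downloaded custodian zip files returns the ones that would need the text conversion.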

  • 17
    Ryan says:

    First, thank you for posting this data, it is extremely helpful to use in testing.

    There seems to be an issue with many of the recipient fields in the PST version. In cases where the Exchange contact address format was used rather than the SMTP address (e.g., <Tag TagName="#To" TagDataType="Text" TagValue="Williams III, Bill "), the PST version separates this out into two different recipients split by the comma. So instead of getting back the correct recipient you end up with two incorrect recipients:
    1) Williams III
    2) Bill
    Which makes it very difficult to effectively map the recipients across the dataset.

    It seems to be correct in the XML version. Any chance of an update on these PSTs?

    Thanks again for making this data available.
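
The effect Ryan describes is what a naive comma split does to a "Last, First" display name. The sketch below contrasts that with Python's standard `email.utils.getaddresses`, which respects quoting. It does not reproduce how the PST files were actually generated, and the header value, including the e-mail address, is invented for illustration.

```python
from email.utils import getaddresses

# Hypothetical header value: a quoted "Last, First" display name.
# The address is made up for this example.
HEADER = '"Williams III, Bill" <bill.williams@example.com>'


def naive_split(value):
    """Split on every comma, ignoring quoting (the faulty behaviour)."""
    return [part.strip() for part in value.split(",")]


def parse_recipients(value):
    """Parse with the standard library, which honours quoted names."""
    return getaddresses([value])


if __name__ == "__main__":
    print(naive_split(HEADER))       # two fragments, neither a valid recipient
    print(parse_recipients(HEADER))  # one (display name, address) pair
```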

    • 17.1
      Aaron says:

      This same problem was present in the v1 set and has never been corrected.

      This problem is present in the XML zip archive in the EML version of the messages. That is, there is no native-file version of these messages that is correct.

      In both the EML and the PST files, the headers (on many messages) look like this:
      To: ,"Anna"
      which is clearly incorrect.

      It almost appears as though these files were rendered to text and then incorrectly reinterpreted to create these native files, although this is not the case on every message.

      Because a significant portion of the messages have these incorrect headers, these files are effectively useless for testing processing through the EDRM. No processing of native files that uses meta-data will produce useful or meaningful test results.

  • 16
    Robert Lauriston says:

    Are there versions of these files with the formatting intact?

  • 15
    John Wang says:

    @Theresa: For the EDRM Enron Data Set v2, there’s no mapping needed between the EDRM Message-IDs and the original Message-IDs because the EDRM data set uses the original Message-IDs where available. In cases where no original Message-ID is present, a Message-ID was generated such as the @PMZL04 example you mentioned. The EDRM Message-IDs can be considered authoritative because of this.

    Additionally, it appears that none of the @thyme Message-IDs are original, but were created for the CALO Enron Email Data Set. In that data set (and its derivatives), it appears that all messages have an @thyme Message-ID.

    If you want to use the original Message-IDs, I recommend using the EDRM Message-IDs.

    If you want to correlate the EDRM and CALO data sets, a mapping file would be useful but I’m not aware of one yet. However, if a mapping was made available, we would be happy to link to it or host it.

    Hope this helps.
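
Based only on the two examples in this thread, Message-IDs could be sorted by their suffix along these lines. This is a heuristic sketch: it assumes that generated EDRM IDs end in @PMZL04 and that CALO IDs end in @thyme, which the thread suggests but does not guarantee for every message. The function name and the returned labels are invented for this example.

```python
def message_id_origin(message_id):
    """Guess where a Message-ID came from using its part after the last '@'."""
    # Drop surrounding whitespace and optional angle brackets.
    cleaned = message_id.strip().strip("<>")
    suffix = cleaned.rpartition("@")[2]
    if suffix.upper() == "PMZL04":
        return "edrm-generated"  # assumption: suffix seen in the EDRM example
    if suffix.lower() == "thyme":
        return "calo-generated"  # assumption: suffix used by the CALO set
    return "other"  # possibly an original Message-ID


if __name__ == "__main__":
    print(message_id_origin(
        "00000000DC16F437217B604BB0C11906781A15A0046D2200@PMZL04"))
    print(message_id_origin("11972760.1075842947758.JavaMail.evans@thyme"))
```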

    • 15.1
      Theresa Wilson says:

      John – Thank you very much for your reply and the helpful information! I didn’t realize that the @thyme Message-IDs weren’t the original ones. I guess the next step for me would be to ask the creators of the CALO data set if they have a mapping. Thanks again!

  • 14
    Theresa Wilson says:

    I’ve just started looking at this dataset, and I noticed that the Message-IDs (Example: 00000000DC16F437217B604BB0C11906781A15A0046D2200@PMZL04) in the .eml file do not match at all the Message-IDs in other Enron data sets that have been made available (Example: 11972760.1075842947758.JavaMail.evans@thyme). Is there a mapping somewhere of EDRM Message-IDs to the original Message-IDs?

  • 12
    John Wang says:

    The data set was prepared using ZL Unified Archive. A ZL system was set up to archive the Enron email files made available from FERC via Lockheed Martin. The email messages were archived into individual custodian accounts on the Unified Archive system. Once in the system, two sets of emails were created. First, a PST set was created by exporting each custodian mailbox using that format. Then the EDRM XML set was created by importing the email into ZL Discovery Manager and using the EDRM XML export capability to export the email for each custodian.

  • 11
    Chung says:

    Is there any documentation on how these files were prepared?

    Thanks!
