Jaccard Index

Definition(s)

  • A measure of the consistency between two sets (e.g., Documents Coded as Relevant by two different reviewers). Defined mathematically as the size of the intersection of the two sets, divided by the size of their union (e.g., the number of Documents Coded as Relevant by both reviewers, divided by the number of Documents Coded as Relevant by one reviewer, the other, or both). It is typically used as a measure of consistency among review efforts, but may also be used as a measure of similarity between two Documents, each represented as a Bag of Words. Jaccard Index is also referred to as Overlap or Mutual F1. Empirical studies have shown that expert reviewers commonly achieve Jaccard Index scores of about 50%, and that scores exceeding 60% are rare. 1
  • A measure of agreement or efficacy. The Jaccard index is the number of documents selected as responsive by both assessors, divided by the number of documents selected as responsive by either assessor. If assessor A identifies 20 documents as responsive, assessor B identifies 25 documents as responsive, and they agree on 10 of those documents, then the numerator is 10 and the denominator is 20 + 25 – 10 = 35, giving 10/35, or 28.6%. 2
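
The arithmetic in the definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from either glossary: the function names jaccard_index and jaccard_from_counts and the document IDs are hypothetical, chosen only so that assessor A selects 20 documents, assessor B selects 25, and 10 are shared, as in the second definition.

```python
def jaccard_index(a, b):
    """Size of the intersection of two sets divided by the size of their union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: treat two empty sets as undefined rather than picking 0 or 1.
        raise ValueError("Jaccard index is undefined for two empty sets")
    return len(a & b) / len(union)


def jaccard_from_counts(n_a, n_b, n_both):
    """Same measure from counts alone: |A and B| / (|A| + |B| - |A and B|)."""
    union = n_a + n_b - n_both
    if union <= 0:
        raise ValueError("Jaccard index is undefined when the union is empty")
    return n_both / union


# Hypothetical document IDs reproducing the worked example.
assessor_a = set(range(1, 21))   # 20 documents: IDs 1..20
assessor_b = set(range(11, 36))  # 25 documents: IDs 11..35 (IDs 11..20 are shared)

print(round(jaccard_index(assessor_a, assessor_b), 3))  # 0.286
print(round(jaccard_from_counts(20, 25, 10), 3))        # 0.286
```

The same jaccard_index function could be applied to two documents by passing in the sets of distinct words each contains, which is one simple reading of the Bag of Words comparison mentioned in the first definition.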

Notes

  1. Maura R. Grossman and Gordon V. Cormack, EDRM page & The Grossman-Cormack Glossary of Technology-Assisted Review, with Foreword by John M. Facciola, U.S. Magistrate Judge, 2013 Fed. Cts. L. Rev. 7 (January 2013).
  2. Herb Roitblat, Predictive Coding Glossary.