EDRM Evergreen/Review/Review Technologies
Modern-day electronic discovery is a time-consuming and costly endeavor. Every additional hour of reviewer time spent culling large data sets down to the small percentage of ultimately responsive documents is an additional cost to the responding party. Utilizing technologies that reduce the number of documents requiring review or increase the speed of review can translate into significant cost savings to the responding party. Technology can also be used to increase the quality of review by making it easier to discern key facts or relationships.
Document review platforms have undergone many technological enhancements over the past several years. New technologies to improve the efficiency and accuracy of the review process are appearing every day, from fuzzy searching to concept searching, near-duplicate detection, visualization and social network analysis. What are all of these tools, and how can they assist in the review process?
Sample Technologies
Keyword Search
Keyword searching was the first significant enhancement to the efficiency of the electronic document review process, reducing data sets that can be terabytes in volume to far more manageable sizes. Search effectiveness can be measured by recall, the number of responsive documents retrieved divided by the total number of responsive documents, and precision, the number of responsive documents retrieved divided by the total number of documents retrieved. A variety of technologies have been introduced to improve both the recall and the precision of keyword search. Boolean search, which allows the use of AND, OR and NOT operators in search queries, and proximity search, which finds documents that contain terms within a specified distance of each other, have been used to improve precision by reducing false positives. Stemming, wildcard and fuzzy search, which find documents with different variations of the specified terms, such as differences in case, conjugation and spelling, have been used to improve recall by finding variations of the search word that have the same or similar meaning. In recent years, web search engines have popularized new methods of handling misspellings, such as Google’s “Did you mean” feature. Search performance and scalability are also critical to search effectiveness, because effective keyword searching depends on iteration. Search technology needs to be able to search millions of documents and return results in seconds in order to enable interactive, iterative searching and exploration of information.
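The recall and precision definitions above can be worked through with a small example. This is a minimal sketch, not taken from any review platform: the function name and the sample document ids are illustrative assumptions, and it presumes the full set of responsive documents is already known, which in practice is only estimated through sampling.

```python
def recall_and_precision(retrieved, responsive):
    """Return (recall, precision) for one search result.

    retrieved  -- ids of the documents returned by the search
    responsive -- ids of the documents that are actually responsive
    """
    retrieved = set(retrieved)
    responsive = set(responsive)
    hits = retrieved & responsive  # responsive documents that were retrieved
    recall = len(hits) / len(responsive) if responsive else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical data: the search returns 4 documents, 3 of which are
# responsive, out of 6 responsive documents in the whole collection.
retrieved_ids = {"d1", "d2", "d3", "d9"}
responsive_ids = {"d1", "d2", "d3", "d4", "d5", "d6"}
recall, precision = recall_and_precision(retrieved_ids, responsive_ids)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this example the search finds half of the responsive documents (recall 0.50), and three quarters of what it returns is responsive (precision 0.75).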
Relevance Ranking
Relevance ranking is a way of scoring documents within a search result based on how well each document may match the need of the person who ran the search query. There are many ways of calculating relevance for a document given a keyword search query. Two of the most common measures tracked by search engine software are term frequency and inverse document frequency. Term frequency measures the number of times the keyword appears in a document, typically adjusted for the length of the document. Inverse document frequency measures the importance of a term within a set of documents, or corpus, by comparing the total number of documents in the corpus with the number of documents that contain the term: the fewer documents that contain the term, the higher its weight. Documents are scored higher if they have a high term frequency but are scored lower if the term appears in many documents within the corpus. Relevance ranking helps reviewers focus on the most important documents first, improving both the quality of the review and the speed at which reviewers can find those documents.
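The interaction of term frequency and inverse document frequency can be illustrated with a short sketch. It assumes whitespace tokenization, a single query term and a logarithmic inverse document frequency, which is one common formulation among several; the function name and sample documents are hypothetical, and commercial search engines typically use more elaborate scoring.

```python
import math
from collections import Counter


def tf_idf_scores(query_term, documents):
    """Score each document for one query term with a simple TF-IDF.

    documents -- dict mapping a document id to its text
    """
    query_term = query_term.lower()
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    # Document frequency: how many documents contain the term at all.
    doc_freq = sum(1 for words in tokenized.values() if query_term in words)
    if doc_freq == 0:
        return {doc_id: 0.0 for doc_id in documents}
    # Inverse document frequency: rarer terms receive a larger weight.
    idf = math.log(len(documents) / doc_freq)
    scores = {}
    for doc_id, words in tokenized.items():
        # Term frequency, adjusted for the length of the document.
        tf = Counter(words)[query_term] / len(words) if words else 0.0
        scores[doc_id] = tf * idf
    return scores


# Hypothetical three-document corpus.
corpus = {
    "a": "merger merger talks",
    "b": "lunch plans today",
    "c": "merger rumor denied again",
}
scores = tf_idf_scores("merger", corpus)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Document "a" ranks first because the term makes up a larger share of its text, while "b" scores zero because it does not contain the term.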
Concept and Context Searching
‘Concept’ and ‘context’ searching are technologies which offer users the ability to increase the efficiency and effectiveness of electronic discovery searching, organizing and review. Concept search technology may be based on neural networks, Bayesian methods, latent semantic indexing, or other high-level mathematical algorithms designed to learn the underlying associations among the words within the document collection. Most methods rely on sophisticated linguistic analysis to identify sentence structure, part of speech information and noun phrases. The result of this analysis maps the language use, word patterns, concepts and ideas of a document much like a human. This ‘black box’ facilitates concept searching and allows the reviewer to search the documents for like ideas or similar concepts without having to match an exact keyword or phrase. These tools can also be used for categorization and clustering of documents. Concept-based tools may also use customized thesauri and semantic networks although these typically require human intervention and administration to build.
Context searching allows the user to define a search through keywords or phrases and then direct the system to find ‘similar’ or ‘like’ documents. There are many systems available that provide “Find more like this” functionality; even Yahoo offers Y!Q, a context-based search tool. This tool is particularly useful when a reviewer stumbles upon a key issue that was previously unidentified.
A concern with both technologies is the precision of the results. They are very useful in increasing recall (the share of responsive documents that are retrieved), but precision (the share of retrieved documents that are responsive) may suffer. Tools that rank the results by their overall relevancy to the concept submitted are most useful.
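A “Find more like this” feature can be approximated, for illustration only, by comparing word-count vectors with cosine similarity. This is a deliberately simple stand-in and not the method of any particular product: real concept and context engines rely on techniques such as latent semantic indexing, and the function names and sample documents below are assumptions.

```python
import math
from collections import Counter


def word_vector(text):
    """Represent a document as a bag of lower-cased word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(example_id, documents, top_n=2):
    """Return the ids of the documents most similar to the example."""
    example = word_vector(documents[example_id])
    scored = [
        (cosine_similarity(example, word_vector(text)), doc_id)
        for doc_id, text in documents.items()
        if doc_id != example_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _score, doc_id in scored[:top_n]]


# Hypothetical collection; "d1" is the key document a reviewer found.
collection = {
    "d1": "pricing agreement with distributor signed",
    "d2": "distributor pricing agreement draft attached",
    "d3": "team lunch on friday",
    "d4": "signed agreement returned",
}
similar = more_like_this("d1", collection)
print(similar)
```

Because the results are ordered by similarity score, the sketch also shows the kind of ranking that helps offset the loss of precision noted above.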
Auto-Coding or Clustering
Auto-coding or clustering utilizes search technology to automatically identify ‘like’ documents and form them into groups for review purposes. The underlying technology that performs this sorting typically utilizes some type of linguistic analysis, thesauri or concept searching.
Today, two basic approaches to grouping (sometimes referred to as “clustering”) like documents exist: rules-based and example-based. In a rules-based model, the review team establishes criteria (rules) that help determine relevancy rates in the overall document collection. A rules-based approach is similar to keyword searching but often provides higher recall than keyword searching as the search engine may use proximity, word patterns, co-occurrence of key concepts and/or thesauri to determine search “hits” and relevancy.
In an example-based approach, documents programmatically describe themselves based on the concepts that are identified within each document. The system then groups documents that are contextually similar together for review. An additional benefit of most example-based approaches is that reviewers can employ discovered material or their knowledge of the matter to more narrowly group potentially relevant content. For example, when relevant documents are identified during review, an example-based system can regroup the collection of documents based on the reviewer-provided example (or examples) of the relevant document(s).
Filtering
Filtering refers to searching documents by metadata, such as custodian, date range, file type, sender, recipient, etc. Filtering can be useful for removing, or filtering out, documents that don’t match specified metadata, and for identifying potentially relevant information. Filter values can be automatically generated after a search or manually specified as part of a search query. Automatically generated filter values can be very useful because they show all the possible values for a metadata category, allowing users to choose how to filter their documents rather than having to make error-prone guesses. Automatically generated filters can also improve review by allowing the user to quickly learn key facts about their documents, such as who the most frequent senders or recipients of emails are in a given set of documents. For more information on filtering for data culling purposes, go to the processing node (hyperlink).
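The idea of automatically generated filter values can be sketched as a count of every value that appears for a metadata field in the current result set. The field names, function names and sample records here are hypothetical; a review platform would compute these counts from its own index.

```python
from collections import Counter


def filter_values(documents, field):
    """Count how many documents carry each value of a metadata field."""
    return Counter(doc[field] for doc in documents if field in doc)


def apply_filter(documents, field, value):
    """Keep only the documents whose metadata field equals the value."""
    return [doc for doc in documents if doc.get(field) == value]


# Hypothetical search results with a few metadata fields each.
results = [
    {"id": 1, "custodian": "alice", "file_type": "email"},
    {"id": 2, "custodian": "bob", "file_type": "email"},
    {"id": 3, "custodian": "alice", "file_type": "spreadsheet"},
]

# The user sees every available custodian with a document count...
custodian_counts = filter_values(results, "custodian")
print(custodian_counts.most_common())

# ...and picks one of the offered values instead of guessing a name.
alice_docs = apply_filter(results, "custodian", "alice")
print([doc["id"] for doc in alice_docs])
```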
Near-Duplicate Detection and Review
In addition to creating clusters of similar documents, content analytics can be used to identify near-duplicates. Near-duplicates are emails or files that are not identical but have only small differences in content and/or metadata. Near-duplicates occur frequently in corporate environments. A word processing document that has been edited by a team of people is a typical example of a near-duplicate file. This document may exist in multiple versions on different custodians' hard drives and may also be attached to multiple emails. Software can be used to detect near-duplicate documents and group them together for review. It is also possible to show only the differences in text between versions of the file. This speeds review by reducing the number of documents and the amount of text users need to review. While methods used for clustering can also be used to detect similar documents, shingle or n-gram methods can provide better results. These techniques take into account not only the number of words shared among documents but also word order.
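The sensitivity of shingle methods to word order can be seen in a small sketch. It assumes word-level shingles of three words compared with Jaccard similarity, which is one common way to apply the technique; the shingle size, function names and sample sentences are illustrative choices, and production tools usually hash the shingles for scale.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles in either set."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical versions of one sentence.
original = "the quarterly report will be sent to the board on friday"
edited = "the quarterly report will be sent to the board on monday"
shuffled = "friday on board the to sent be will report quarterly the"

edit_score = jaccard(shingles(original), shingles(edited))
shuffle_score = jaccard(shingles(original), shingles(shuffled))
print(edit_score, shuffle_score)
```

The edited version differs in one word and scores 0.8, so it would be grouped as a near-duplicate. The shuffled version contains exactly the same words as the original but scores 0.0, because no three-word sequence survives the reordering.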
Discussion Threading
Software that recreates discussion threads aids the review process by making it easier for reviewers to follow conversations, understand the context of emails, identify who said what and when, and tag all emails in a thread at one time. There are two primary methods by which software can identify individual emails as being part of a thread: metadata-based and content-based. Metadata-based discussion threading relies on discussion identifiers that email applications associate with individual emails, or on grouping emails by their subject. In a document review environment, metadata-based discussion threading has limitations. First, grouping emails by subject will miss situations in which a sender has changed the subject line, and will also falsely thread emails that share the same subject but are not part of the same thread. Second, discussion-related metadata is frequently “lost” when an email conversation crosses different email systems. Thus, metadata-based threading may skip emails that should be part of a thread, resulting in missing or incomplete threads. Content-based threading can alleviate these limitations by recreating emails that are contained within other emails and by using deep content analysis to identify which emails are part of a thread, even when metadata is missing or there are gaps in the data. More complete threads can significantly increase the efficiency of review.
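Subject-based grouping, the simplest form of metadata-based threading, can be sketched as follows. The list of reply and forward prefixes (re, fw, fwd), the function names and the sample messages are assumptions made for illustration; real tools also use discussion identifiers from the email headers where they are available.

```python
import re
from collections import defaultdict

# One or more leading reply/forward markers such as "RE:" or "Fwd:".
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes and ignore case."""
    return _PREFIX.sub("", subject).strip().lower()


def thread_by_subject(emails):
    """Group email ids by their normalized subject line."""
    threads = defaultdict(list)
    for email in emails:
        threads[normalize_subject(email["subject"])].append(email["id"])
    return dict(threads)


# Hypothetical messages from a single conversation.
messages = [
    {"id": "m1", "subject": "Budget review"},
    {"id": "m2", "subject": "RE: Budget review"},
    {"id": "m3", "subject": "Fwd: RE: budget review"},
    {"id": "m4", "subject": "New numbers (was: Budget review)"},
]
threads = thread_by_subject(messages)
print(threads)
```

The first three messages are threaded together, but "m4" lands in a thread of its own because its sender changed the subject line, which is the first limitation described above.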
Visualization
Tools that allow for a visual map of the content of the dataset can be very useful in organizing key documents and prioritizing the review process. These technologies are typically based on word counts (nouns and noun-phrases) and allow the reviewer to navigate through a visual representation of these groups and their relationships. The groupings allow for bulk identification of key documents as well as isolation of non-responsive concepts.
Social Networks
Software can also be used to map social networks, or email conversations. Quite often, who knew what, when they knew it and who communicated it to them are key considerations of a case. Social network or people analysis technology allows the reviewer to determine whom a custodian communicates with about certain topics, trace a custodian's email conversations and easily see the history and direction of their email exchanges both within and outside of the organization.
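At its core, this kind of analysis counts who sends email to whom. The sketch below builds directed sender-to-recipient counts and then lists a custodian's contacts by volume; the record layout, function names and sample names are hypothetical, and it leaves out the topic and date dimensions that commercial tools add.

```python
from collections import Counter


def communication_counts(emails):
    """Count emails for each directed (sender, recipient) pair."""
    edges = Counter()
    for email in emails:
        for recipient in email["recipients"]:
            edges[(email["sender"], recipient)] += 1
    return edges


def contacts_of(edges, person):
    """List the people a custodian exchanged email with, busiest first."""
    contacts = Counter()
    for (sender, recipient), count in edges.items():
        if sender == person:
            contacts[recipient] += count
        elif recipient == person:
            contacts[sender] += count
    return contacts.most_common()


# Hypothetical email metadata.
mailbox = [
    {"sender": "alice", "recipients": ["bob", "carol"]},
    {"sender": "bob", "recipients": ["alice"]},
    {"sender": "alice", "recipients": ["bob"]},
]
edges = communication_counts(mailbox)
alice_contacts = contacts_of(edges, "alice")
print(alice_contacts)
```

Keeping the pairs directed preserves the direction of each exchange, so a reviewer can distinguish messages a custodian sent from messages the custodian received.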
Benefits of These Technologies to the Review Process
Review Less
The first way in which these technologies can benefit the review process is by reducing the number of documents that need to be individually reviewed. This can have a dramatic impact on the cost and time of review. It can also improve the quality of review by removing clearly irrelevant information allowing reviewers to focus on more relevant documents. The two principal ways of reducing the number of documents to be reviewed are to search for potentially relevant documents and only review those documents, or to cull out irrelevant documents. These two approaches are not mutually exclusive and can be used in conjunction with each other. Go to [link to processing node sections on search and culling] for additional information on searching and culling.
Several of these technologies can be applied to searching for relevant documents. Keyword searches have been the most common way to find potentially relevant documents for review. Increasingly, practitioners are supplementing basic keyword searching with more sophisticated search functionality such as wildcard, Boolean, proximity and concept search in order to increase both recall and precision.
These technologies can also be applied to culling out irrelevant documents, or grouping data that needs to be reviewed separately. For instance, automatically generated sender domain filters or clustering can be used to identify junk or spam email that can be culled in bulk from a matter. Similarly, a combination of keyword searching, filtering and social network analysis can be used to identify relevant custodians, irrelevant custodians and people engaging in potentially privileged communications. Duplicate and near-duplicate detection can also reduce the number of documents to be reviewed by removing duplicates and by making it easy to perform batch analysis and coding, saving reviewer time and costs. Searching for or grouping foreign language documents can be used to set these documents aside for analysis by reviewers with foreign language expertise.
Review Faster
Review technology can also be applied to increase the number of document decisions made by reviewers within a given time period. Discussion threading technology speeds review by making it easier for a reviewer to understand the context of emails within a thread and to tag all the emails in a thread at one time. Near-duplicate detection saves time by allowing a user to review all non-culled near-duplicates once and bulk code them. Auto-coding or clustering can be used to group similar documents together allowing a reviewer to more quickly tag these documents. In some instances, auto-coding of documents is also being considered as a replacement for human review. Faster review is attractive because it can have a direct impact on the overall cost of review by reducing the number of hours expensed by reviewers, and because it can help when there are tight deadlines. Faster review also helps improve the quality of review by making it faster to learn key facts about the case.
Improved Review Quality
In addition to reducing the costs for review by reviewing less or reviewing faster, review technologies can be used to improve the quality of review. Relevance ranking allows reviewers to identify and examine the most important documents first, which can be critical to rapidly understanding the nature of a case. Social network or people analysis, automatic filters, and discussion threading can make it easier to identify key people, custodians, discussions and documents. Concept search, clustering and visualization can help reviewers identify additional words or concepts that are relevant to a case of which they might not otherwise be aware. The ultimate benefit of all these technologies is that they enable reviewers to assess the nature of the case and develop a legal strategy faster and more thoroughly than before thus improving decision-making and the eventual outcome of a case.
Questions to Vendors
Some relevant questions to pose to vendors offering these technologies are:
- Can the solution return search results fast enough at the desired scale so that it is easy to iteratively refine search queries?
- Does the system rank the search results by relevance and what criteria does it use in that ranking?
- How do conceptual search systems determine the “concepts”? Does the user participate in the creation of a thesaurus or are the concepts automatically identified by the technology?
- Does the auto-categorization tool perform a first-cut categorization automatically or require reviewers to submit criteria?
- Does the system allow the review team to further tailor the categories to its review requirements through the use of rules and/or examples?
- How does the solution allow reviewers to filter their keyword search results?
- What tools and/or algorithms are used to identify email threads, duplicates or near-duplicates?
- How does the solution determine the various people involved, and how does it map variants, such as multiple email aliases or multiple email addresses, to the same individual?
[updated Jan. 30, 2008]

