Talk:Analysis Node
From Working EDRM
Ssong 14:02, 3 March 2006 (PST) Sandra Song and Stannie Holt on behalf of H5, Inc.
Analysis – Techniques and Tools (Section 1: Search)
General comment. The major problem with the existing discussion of collection analysis is its focus on ad hoc queries as the preferred approach to analyzing a document collection. Such approaches fail to provide the user with a stable partitioning of the collection into well-defined categories that can be used throughout the litigation process (deposition prep, motion practice, trial prep, etc.). In specifying industry standards for analytical techniques and tools, the focus should be on approaches that (a) ensure a close correspondence between the results of analysis and the requirements of the litigation team, (b) ensure that the analysis of the collection meets acceptable performance standards for both precision and recall, and (c) provide the litigation team with a stable partitioning of the collection that is robust enough to serve the full range of needs the team will have throughout the litigation process.
1.3 Relevance Ranking
The primary use case for relevance ranking is the situation in which an ad hoc query returns a very large set of documents, many of which are not relevant. While such result sets are the norm for the sorts of queries which the EDRM puts at the center of its approach to collection analysis, they are not the norm for approaches to collection analysis that give due attention to understanding the requirements of the litigation team and to measuring performance on the retrieval of target documents. When requirements are well-defined and performance well-tested, the problem that relevance ranking addresses (large amounts of irrelevant material in a result set) diminishes considerably – and therefore so does the value of relevance ranking.
1.5 Sorting
Random selection functionality. An effective document management and analysis tool should offer random selection, i.e., give the user the ability to draw a random sample of a user-specified number of documents from either the full population or the result set of a particular query. Random selection enables users to collect the data needed for statistically valid testing of query performance.
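To illustrate the kind of random selection we have in mind, here is a minimal Python sketch. The function name sample_documents, the document ID format, and the seed parameter are our own assumptions for illustration and are not drawn from any particular review tool. It draws a uniform random sample, without replacement, from whatever set of document identifiers it is given, whether that is the full population or the result set of one query.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_documents(doc_ids: Sequence[T], n: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return n distinct items drawn uniformly at random from doc_ids.

    doc_ids may be the full population or the result set of one query.
    A fixed seed makes the draw repeatable, so a sample can be re-created
    later if the testing procedure is questioned (hypothetical requirement).
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n > len(doc_ids):
        raise ValueError("cannot sample more documents than the set contains")
    rng = random.Random(seed)        # private generator; global state untouched
    return rng.sample(doc_ids, n)    # sampling without replacement


# Hypothetical population of 1,000 document identifiers.
population = [f"DOC-{i:05d}" for i in range(1000)]
review_sample = sample_documents(population, 50, seed=2006)
```

Sampling without replacement means no document is reviewed twice, and recording the seed alongside the sample is one simple way to make the selection auditable.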
1.8 Benchmarking [a new section we propose adding]
To ensure that analysis is on target, search performance needs to be measured and benchmarked on both precision and recall, using open standards such as NIST protocols. Ideally, the search technique should provide its own statistically valid benchmarking during the document review process, so that those supervising the review can get feedback in time to adapt their queries to evolving case issues.
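As a rough illustration of what such benchmarking could involve, the Python sketch below estimates precision and recall from two reviewed random samples: one drawn from the documents a query retrieved and one drawn from the documents it did not retrieve. This is one common sampling approach and is not a reproduction of any NIST protocol. The function names, the use of a Wilson score interval, and all the figures in the example are our own assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def estimate_precision(sample_relevant: int, sample_size: int) -> float:
    """Share of a random sample of retrieved documents judged relevant."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_relevant / sample_size


def estimate_recall(retrieved_total: int, retrieved_sample_size: int,
                    retrieved_sample_relevant: int,
                    unretrieved_total: int, unretrieved_sample_size: int,
                    unretrieved_sample_relevant: int) -> float:
    """Scale each sample up to its set, then compute found / (found + missed)."""
    if retrieved_sample_size <= 0 or unretrieved_sample_size <= 0:
        raise ValueError("sample sizes must be positive")
    found = retrieved_total * retrieved_sample_relevant / retrieved_sample_size
    missed = unretrieved_total * unretrieved_sample_relevant / unretrieved_sample_size
    if found + missed == 0:
        raise ValueError("no relevant documents observed in either sample")
    return found / (found + missed)


# Hypothetical figures: a query retrieves 10,000 of 100,000 documents.
# Reviewers judge 120 of 200 sampled retrieved documents relevant and
# 20 of 300 sampled unretrieved documents relevant.
precision = estimate_precision(120, 200)                      # 0.6
recall = estimate_recall(10_000, 200, 120, 90_000, 300, 20)   # 0.5
precision_low, precision_high = wilson_interval(120, 200)
```

In this made-up example the query looks reasonably precise, yet it finds only about half of the relevant documents, because the low rate of relevant material in the unretrieved sample applies to a much larger set. The interval indicates how far the sample-based precision estimate might be from the true value; a comparable interval for recall needs more care, since recall combines two separate samples.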
Analysis – Pitfalls to Avoid (This section is not numbered, but the following heading is 3rd on the bulleted list of sub-sections)
Incorrect Understanding of Discovery Tools
Operators also need to understand that search tools can be highly effective in tasks such as reducing extremely large document sets to more manageable sizes, or classifying results by desired dimensions (such as topic, date, sender, or the frequency with which search strings appear). However, they are limited in their capacity to make sound judgments of relevance. The documented performance limitations of these technologies, like the context and constraints within which they operate, are often non-technological in origin and must be considered whatever search technology is used.
First, even expert users reach only moderate levels of accuracy with currently available search tools. This is in large part because keywords and “concepts” are generally poor indicators of relevance. Numerous academic studies show that, at practicable levels of usage, even the most advanced search tools, used by expert searchers on a relatively small collection, miss 50% or more of the documents that human review would have assessed to be relevant.
Second, search software can’t tell reviewers whether or not they clearly understand what the lead litigator is looking for, notify them when they are inconsistent in their assessment of documents, or apprise attorneys of the actual performance of the overall review.
As a result, too heavy a reliance on search software may only increase the yield of documents that ultimately are not relevant. This opens the door to higher costs, for analyzing irrelevant documents; exposure to risk, from failure to retain, find, or produce needed documents; and delays in time-to-completion.

