The EDRM Search Glossary is a list of terms related to searching.
Single logical query or the progression of single logical queries performed interactively in an effort to accumulate intelligence.
A Bayesian classifier is a process for identifying concepts using a set of representative documents in a particular category. The classifier can discern other responsive documents in the larger collection and place them in a category. Typically, a category is represented by a collection of words and their frequency of occurrence within the documents. The probability that a document belongs to a category is based on the product of the probabilities of each word of the document appearing in that category across all documents. Thus, the learning classifier is able to take the words present in a sample category and apply that knowledge to other new documents. In the e-discovery context, a Bayesian classifier can quickly place documents into confidential, privileged, responsive, and other well-known categories.
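As an illustration only, the word-probability product described above can be sketched as a small naive Bayes classifier. The sample documents, the function names `train` and `classify`, and the add-one smoothing are assumptions made for this sketch and are not part of the glossary definition. Logarithms are summed instead of multiplying raw probabilities, which is equivalent but avoids very small numbers.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count words and documents per category from (category, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the category with the highest (log) probability for the text."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        # Prior: share of training documents that fall in this category.
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing (an assumption) keeps unseen words from zeroing the product.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical training samples for two well-known categories.
samples = [
    ("privileged", "attorney client advice regarding litigation"),
    ("privileged", "legal advice from counsel"),
    ("responsive", "quarterly pricing agreement with distributor"),
    ("responsive", "pricing schedule and distributor contract"),
]
model = train(samples)
print(classify("advice from our attorney", *model))
```

In practice the representative documents would be far more numerous than four, and the quality of the categories depends heavily on how representative they are.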
Objective information, often manually recorded from documents such as the document date, the authors or recipients of the documents, or the title of a document. Bibliographic coding usually takes place against documents originating as paper with no electronically stored information.
A search technique that uses Boolean logic operators, such as AND, OR, NOT, within N (w/5), and NOT within N (not w/5), to connect individual keywords or phrases within a single query.
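One way to picture the AND, OR, and NOT operators is as set operations on the documents that contain each keyword. The sketch below assumes a tiny hypothetical collection and a helper named `docs_with`; real search engines use their own query syntax and indexes.

```python
# Hypothetical three-document collection keyed by document number.
docs = {
    1: "merger agreement draft",
    2: "merger press release",
    3: "lunch agreement",
}

def docs_with(term, docs):
    """Return the set of document numbers whose text contains the term as a word."""
    return {doc_id for doc_id, text in docs.items() if term.lower() in text.lower().split()}

merger = docs_with("merger", docs)
agreement = docs_with("agreement", docs)
draft = docs_with("draft", docs)

and_hits = merger & agreement   # merger AND agreement
or_hits = merger | agreement    # merger OR agreement
not_hits = merger - draft       # merger NOT draft

print(and_hits, or_hits, not_hits)
```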
Specifying that the search must be case sensitive will match the exact case for all letters in the keyword and in the documents. For example, a case-sensitive search on Rose will match the name “Rose Jones” but it will not match the phrase “rose garden”.
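The Rose example can be reproduced with a short sketch. The function name `matches` and the use of simple substring comparison are assumptions for illustration; a search engine would normally apply the option at the index or query level.

```python
def matches(keyword, text, case_sensitive=False):
    """Return True if the keyword appears in the text, honoring the case option."""
    if case_sensitive:
        return keyword in text
    # casefold() normalizes case so that "Rose" and "rose" compare as equal.
    return keyword.casefold() in text.casefold()

print(matches("Rose", "Rose Jones", case_sensitive=True))
print(matches("Rose", "rose garden", case_sensitive=True))
print(matches("Rose", "rose garden", case_sensitive=False))
```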
Electronic data is represented as sequences of bits, or numbers. Each alphabet or script used in a language is mapped to a unique numeric value. This is referred to as character encoding. See also Unicode
To arrange or designate according to categorization such as potentially responsive or privileged versus non-responsive or not-privileged.
Searching for the purposes of identification of specified relevant information in response to a discovery request. A compliance search should be paired with a search methodology such as Ad-Hoc or Iterative searching.
A search technique that provides words which are similar in concept to a query word. A concept search will return documents that relate to the same concept as the query word, regardless of whether the query word exists in the search results documents. Concept searches can be implemented as a simple thesaurus match, or by using sophisticated statistical analysis methods. Effectiveness of concept search in an e-discovery project depends greatly on the type of algorithm used and its implementation.
Coverage Bias can occur if the samples are not representative of the population due to the methodology used. In e-discovery, such coverage bias occurs when large portions of ESI are excluded from sampling based on metadata or type of ESI. As an example, Patent Litigation may require sampling technical documents in their source form, and care should be taken to include these documents in the sample selection process.
Custodian search is a common form of constraining search results. To search based on a custodian, the metadata search using the metadata name “Custodian” can be used. Custodian search may rely on assigning custodians to collected data during the Identification Phase so that searching doesn’t miss out on custodians. For example, instant messages with buddy-names may be missed if the search term is specified as last-name/first-name or as email addresses.
Date range search utilizes a document’s metadata to find search results where the creation dates, access dates, or modification dates of documents fall within a specified range of dates. Refer to specific technology utilized to process ESI to determine the available dates based on file types and consider the handling of time zones during ESI processing.
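A minimal sketch of a date range filter follows. The document records, the `created` field name, and the choice to store every timestamp in UTC are assumptions for illustration; they stand in for whatever dates and time zone handling the processing technology actually provides.

```python
from datetime import datetime, timezone

def date_range_search(documents, field, start, end):
    """Return IDs of documents whose timestamp in `field` lies within [start, end]."""
    return [doc["id"] for doc in documents if start <= doc[field] <= end]

# Hypothetical metadata records, normalized to UTC during processing.
documents = [
    {"id": "DOC-1", "created": datetime(2019, 12, 31, 23, 30, tzinfo=timezone.utc)},
    {"id": "DOC-2", "created": datetime(2020, 1, 15, 9, 0, tzinfo=timezone.utc)},
    {"id": "DOC-3", "created": datetime(2020, 3, 1, 0, 0, tzinfo=timezone.utc)},
]
start = datetime(2020, 1, 1, tzinfo=timezone.utc)
end = datetime(2020, 2, 29, 23, 59, 59, tzinfo=timezone.utc)

print(date_range_search(documents, "created", start, end))
```

Note that DOC-1 falls outside the range in UTC but could fall inside it if timestamps were interpreted in a time zone ahead of UTC, which is why time zone handling deserves attention.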
A diacritic specification is a phonetic marker added to a letter (above or below) indicating a change in the way it is to be pronounced or stressed. For languages that include diacritic marks on certain characters (such as vowels), specifying whether the diacritics should match is a search option.
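The diacritic matching option can be sketched with Python's standard `unicodedata` module, which can decompose an accented letter into a base letter plus combining marks. The function names `strip_diacritics` and `diacritic_match` and the `match_diacritics` flag are hypothetical names chosen for this example.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after decomposing characters (NFD normalization)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def diacritic_match(keyword, word, match_diacritics=True):
    """Compare two words, optionally ignoring diacritics."""
    if match_diacritics:
        # Normalize both sides so equivalent encodings of the same letter compare equal.
        return unicodedata.normalize("NFC", keyword) == unicodedata.normalize("NFC", word)
    return strip_diacritics(keyword) == strip_diacritics(word)

print(strip_diacritics("résumé"))
print(diacritic_match("resume", "résumé", match_diacritics=False))
```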
Documents may be split into multiple segments (such as Abstract, Body, Title, References, Citation, etc.). The Boolean operators may be limited to a specific document segment. In these situations, you may need to specify the search scope of the document.
Another form of validation utilized to ensure that Responsive items are not being inadvertently omitted through changes to the search criteria. As the search criteria set is being updated and modified during the initial investigation and analysis, a comparison would sample documents that were originally results of one search criteria set but are no longer results of the modified search criteria set. If Responsive documents are found upon review of dropped items, special attention should be paid to determine whether additional terms need to be created to capture these items or if modifications made to the criteria should be changed so these or similar items would be included in the results.
To search for a keyword that contains a wildcard character, such as a question mark, an escaping mechanism is needed. Availability of multi-character wildcards may be limited in some systems. Some search engines require a certain number of leading characters and do not support search terms that start with a wildcard.
Electronically Stored Information or ESI is information that is stored electronically on enumerable types of media regardless of the original format in which it was created.
While the evaluation order should be immaterial, some search engines produce different results if the order is specified differently. In other implementations, the performance of search is impacted by the order of specification.
Ad hoc or single logical query, likely to be employed in a knowledge management effort on the left side, or as an ad hoc search as part of case assessment, review, or post-review witness prep.
Fielded searches are based on values stored as metadata rather than actual content of an electronic asset. Searches can be refined using metadata information extracted during processing, such as sender or receiver, creation date, modified date, author, file type and title, as well as subjective user-defined values that may be ascribed to a document as part of downstream review. See also Parametric Search
Formal search includes executing, tracking, reporting on, and measuring the impact of sets of multiple logical queries, and iterating through them. See also, Iterative Search
Utilized iteratively throughout the life cycle of a project as search criteria are modified, frequency analysis may be used to evaluate the effectiveness of the initial search criteria. The search terms are tested to determine whether they effectively discriminate between potentially relevant and clearly non-relevant data. Frequency analysis is a reality check on the search results versus the overall collection size and the reasonably expected proportion of relevant results. It does not address the recall or completeness of relevant items out of the collection.
Fuzzy search allows searching for word variations such as in the case of misspellings. Typically, such searching includes some form of distance and score computations between the specified word and the words in the corpus.
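One common distance measure for this purpose is the edit (Levenshtein) distance, the number of single-character insertions, deletions, or substitutions needed to turn one word into another. The sketch below is one possible approach, not the method of any particular search engine; the function names and the `max_distance` threshold are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def fuzzy_matches(term, corpus_words, max_distance=1):
    """Return corpus words within max_distance edits of the search term."""
    return [w for w in corpus_words if edit_distance(term.lower(), w.lower()) <= max_distance]

print(fuzzy_matches("agreement", ["agreemnt", "agreements", "argument"]))
```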
When a search identifies a document, the search operation has scored a hit. Search results at the hit level therefore capture the keyword and its potential hit position within the document. For a search query based on a single search term, that term may be found multiple times within the same document; for a search query that contains multiple keywords, there may be one or more search hits for some number of the terms.
An index that maps a keyword to the list of documents that contain the keyword.
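A minimal sketch of such an index, assuming a hypothetical two-document collection and whitespace-separated, lowercased words:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical documents keyed by an illustrative ID.
documents = {
    "DOC-1": "merger agreement draft",
    "DOC-2": "merger press release",
}
index = build_inverted_index(documents)

print(sorted(index["merger"]))
print(sorted(index["draft"]))
```

Looking up a keyword then becomes a single dictionary access instead of a scan over every document.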
Formal search that includes executing, tracking, reporting on, and measuring the impact of sets of multiple logical queries, and iterating through them. See also, Formal Search
Indexing is a process that inventories the total content of a file and builds a searchable electronic index. This index typically maps from a keyword to all the documents that contain the keyword. Search indexes are tools designed to facilitate and expedite the retrieval of information. Search engines use both common and proprietary technology to build indexes and service search queries.
Keyword occurrences are the counts of keywords that appear within the entire search results. When a search query involves multiple keywords or when one or more of the queries produces stemming, wildcard or fuzzy-based variations, a complete count of total occurrences for each keyword is useful for evaluating the value of searching using certain keywords. In some instances, the keyword counts both at an aggregate level (totaled over all the variations) as well as counts based on an individual variation level would each be helpful.
A common search technique that uses query words (“keywords”) and looks for them in ESI, using an index. A keyword search is a basic search technique that involves searching for one or more words within a collection of documents and returns only those documents that contain the search terms entered. The documents returned by the search engine are called the search results. Keywords often form a basic building block for constructing other more complex compound searches. Such compound searches use other search elements such as Boolean logic.
Latent semantic indexing (sometimes also referred to as Latent Semantic Analysis) is a technology that analyzes co-occurrence of keyword terms in the document collection.
In textual documents, keywords exhibit polysemy as well as synonymy. Latent Semantic Indexing refers to the additional factor that certain keywords are related to the concept in that they appear together. These relationships can be “is-a” relationship such as “motorcycle is a vehicle” or a containment relationship such as “wheels of a motorcycle”.
Support Vector Machines, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, and others.
Reach comfort level that reasonable steps were taken to find document(s), allowing for reasonable determination that document does not exist.
Measurement Bias occurs when the act of sampling causes the measurement to be impacted. In e-discovery, measurement bias could occur if the content of the sample is known before the sampling is done. For example, if one were to sample for responsive documents and during the sampling stage, content is reviewed, there is potential for higher-level litigation strategy to impact the responsive documents. If a project manager has communicated the cost of reviewing responsive documents, and it is understood that responsive documents should somehow be as small as possible, that could impact your sample selection. To overcome this, the person implementing the sample selection should not be provided access to the content.
Metadata search allows searching to be constrained based on certain metadata elements of a document. A general search specification allows for naming the metadata fields, specifying the inherent type of that metadata, and the value to search for.
To avoid creating an overly inclusive index, most indices utilize a noise word filter. A noise word filter includes a customized list of terms that are ignored during indexing. Some common noise words include ‘a’, ‘and’, ‘the’, ‘from’, and ‘because’.
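As a rough sketch, a noise word filter can be applied before terms are added to an index. The list below reuses the example noise words from the definition; the function name `indexable_terms` is hypothetical.

```python
# Example noise words taken from the definition; real lists are customized per project.
NOISE_WORDS = {"a", "and", "the", "from", "because"}

def indexable_terms(text, noise_words=NOISE_WORDS):
    """Return the lowercased words of the text that are not noise words."""
    return [word for word in text.lower().split() if word not in noise_words]

print(indexable_terms("The memo from legal and the draft"))
```

A practical consequence is that a query containing only noise words may return nothing, so it helps to know which list an index was built with.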
Non-Response Bias occurs when a portion of potential samples is not available for sampling. As an example, if an e-discovery effort is identifying potentially responsive engineering documents, and if the documents are in a document format and/or programming language that could not be sampled or understood, there could be a significant Non-Response Bias. See also, Response Bias
Occurrence count search allows a legal professional to specify that a word appear a certain number of times for the document to be selected.
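A minimal sketch of an occurrence count test, assuming whitespace-separated words and a hypothetical function name `has_min_occurrences`:

```python
def has_min_occurrences(text, keyword, minimum):
    """Return True if the keyword appears at least `minimum` times in the text."""
    return text.lower().split().count(keyword.lower()) >= minimum

memo = "Rebate terms changed and the rebate schedule follows the old rebate plan"
print(has_min_occurrences(memo, "rebate", 3))
print(has_min_occurrences(memo, "schedule", 2))
```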
Parameterized search allows searching to be based not on keywords but on certain parameters, such as a document’s metadata. Parameterized search is also known as fielded search, because it is frequently performed on data stored within the fields of a database table. Examples include Date Range, Metadata, Custodian, restrictions or promotions based on document tags/review calls.
A single word or expression having multiple meanings.
Precision measures the proportion of documents in the retrieved set that are truly responsive. See also, Recall
A list of a set of documents that a Producing Party did not produce on account of Privilege such as Attorney-Client Privilege.
A set of documents that a Producing Party is not required to provide, since they fall into Privilege such as Attorney-Client Privilege. The existence of such documents should be recorded in the Privilege Log.
A party that owns the complete collection of ESI, and is responsible for producing a portion of the ESI that is deemed to be relevant for a legal case or legal enquiry.
A Proximity Search searches for multiple keywords. The matching documents must contain all the keywords, with the keywords occurring within a specified number of words from each other.
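The sketch below shows one way a two-keyword proximity test could work, by comparing word positions. The function name `within` and the whitespace-based word splitting are assumptions; search engines differ in how they count words and treat punctuation.

```python
def within(text, first, second, max_gap):
    """Return True if both keywords occur within max_gap words of each other."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in first_positions for j in second_positions)

text = "the merger agreement was signed after the long board meeting"
print(within(text, "merger", "signed", 5))   # comparable to: merger w/5 signed
print(within(text, "merger", "meeting", 5))  # comparable to: merger w/5 meeting
```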
Process of validation after data selection, throughout review and pre-production, to identify inconsistencies in document productions and to test for conflicting review calls.
Relational Database Management System. This is a technical term for the class of software programs that manage data using a relational schema, such as Microsoft SQL Server or Oracle.
Recall measures the number of responsive documents retrieved as a proportion of the total number of responsive documents in the corpus. Recall cannot be absolute unless all documents have been searched and all have been reviewed. Because Recall is a ratio against all responsive documents in the full corpus, it depends on a total that is difficult to determine. See the EDRM Search Guide regarding precision, recall, and sampling for more information. See also, Precision
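The two ratios can be illustrated with a small calculation. The document numbers below are made up, and the sketch assumes the full set of responsive documents is known, which in a real matter it usually is not.

```python
def precision_recall(retrieved, responsive):
    """Return (precision, recall) for a retrieved set against the truly responsive set."""
    retrieved, responsive = set(retrieved), set(responsive)
    true_hits = retrieved & responsive
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(responsive) if responsive else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 5 responsive in the corpus, 2 in common.
retrieved = {1, 2, 3, 4}
responsive = {2, 3, 5, 6, 7}
print(precision_recall(retrieved, responsive))
```

Here half of what was retrieved is responsive (precision 0.5), while only two of the five responsive documents were found (recall 0.4).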
A pattern that describes what the search should return based on special characters added to the keyword. For example, car* uses the character * as a wildcard, and the resulting documents should contain words that begin with the characters “car”, such as car, cartoon, or cartography.
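Pattern syntax varies between tools. As a hedged illustration, in the regular expression syntax of Python's standard `re` module the "words that begin with car" example would be written with a word boundary and a word-character repetition rather than a bare asterisk. The sample sentence is made up.

```python
import re

# \b anchors the match at the start of a word; \w* allows any number of further word characters.
pattern = re.compile(r"\bcar\w*", re.IGNORECASE)

text = "The car drove past a cartoon about cartography and a scar."
print(pattern.findall(text))
```

The word "scar" is not returned because the pattern requires the match to start at the beginning of a word.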
Related words search allows a legal professional to specify a word and other words that are deemed to be related to it. Typically, such related words are determined as either part of concept search or by statistical co-occurrence with other words.
A measurement of relevancy of a document, so that the Search Hits within a Search Results can be ordered. Relevancy measurements often involve counting the number of occurrences of a keyword within a document, as well as number of documents a keyword is found in.
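One widely used weighting that combines these two counts is TF-IDF (term frequency multiplied by inverse document frequency). It is shown here as an example of the idea only; the glossary does not name a specific formula, and the function name and sample documents are assumptions.

```python
import math

def tf_idf(term, doc_words, all_docs):
    """Score a document for a term: occurrences in the document, discounted by how common the term is."""
    term_frequency = doc_words.count(term)
    document_frequency = sum(1 for words in all_docs if term in words)
    if document_frequency == 0:
        return 0.0
    return term_frequency * math.log(len(all_docs) / document_frequency)

# Hypothetical collection, already split into words.
all_docs = [
    ["merger", "merger", "agreement"],
    ["merger", "lunch"],
    ["lunch", "menu"],
]
scores = [tf_idf("merger", words, all_docs) for words in all_docs]
ranking = sorted(range(len(all_docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```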
A subset of ESI that potentially matches the desired set of documents for the case.
A party that does not own the ESI and is requesting that the Producing Party, which owns the ESI, provide some subset of the ESI based on a Search Request.
One type of Response Bias can occur if the sampling process considers the content of the documents. See also, Non-Response Bias
A subset of ESI that potentially matches the desired set of documents for the case.
Review feedback validation involves cross-referencing the results of search with the calls made by attorneys during document review. The document-level classification as relevant or privileged provides keen insight into refining the search and selection criteria or in identifying gaps that require additional analysis. This feedback will be used for additional analysis and to refine the Search Criteria sets. The feedback may identify categories of documents that are not yielding responsive documents and/or could identify documents to be excluded from the review set. Also, the feedback may identify new categories of documents that should be included, and the criteria will be broadened to include those documents in the review set.
Sampling is a method of reviewing statistical ratios of all or portions of a classified corpus for the purposes of validation.
A search component that implements the actual process of interpreting a search request and identifying subsets of documents. For example, a database management system such as Microsoft SQL Server contains a component that manages searches of the data stored in its databases.
A document in ESI that is considered to match the requested Search Query.
A well-formulated Search request that an automated search engine can interpret in order to produce matching results.
A collection of Search Hits that match the intended documents of a Search Request.
A type of relational database management system (RDBMS). Relationships in a relational database are represented by linkages that exist between two or more pieces of data. The final defining feature of SQL is its ability to return data from one data field based on its relationship with another data field. See also Relational Database Management Systems
A search option that returns matches for all variations of the root word of the initial query word. For example, if the query word was sing, a search that used stemming would match singing, sang, sung, song, and songs as well as sing.
A synonym search returns documents that contain terms similar in meaning to the query words, usually using a thesaurus to determine which terms would match the query words.
Having the equivalence of meaning; having the same definition without having the same expression.
Text clustering is a technology that analyzes a document collection and organizes the documents into groups based on finding documents that are similar to each other based on the words contained within them (such as noun phrases). Text clustering establishes a notion of “distance between documents” and attempts to select enough documents into the cluster so as to minimize the overall pair-wise distance among all pairs of documents.
An operation that examines a document or block of text and breaks the text into words. Typically, a space is used to separate words, but special characters such as a hyphen, period, or quotation mark can also be used.
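A simple tokenizer can be sketched with a regular expression that treats any run of letters or digits as a word, so that spaces, hyphens, periods, and quotation marks all act as separators. This is only one possible rule set; the function name `tokenize` is an assumption, and real indexing engines often let these separators be configured.

```python
import re

def tokenize(text):
    """Split text into words, treating anything other than ASCII letters and digits as a separator."""
    return re.findall(r"[A-Za-z0-9]+", text)

print(tokenize('State-of-the-art e-discovery, "quoted" text.'))
```

Whether a hyphenated term such as e-discovery becomes one token or two directly affects which keyword searches will find it.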
A Search Specification that indicates that matching documents must contain words that begin with the letters entered, but that the matching words can end with any combination of letters.
All electronic data is represented as sequences of bits, or numbers. Each alphabet or script used in a language is mapped to a unique numeric value, or ‘encoded’ for use on a computer using a standard known as Unicode. Within Unicode, each letter or character has been assigned its own unique value in the Unicode encoding schemes, known as the Unicode Transformation Format (UTF). The UTF utilizes multiple encoding schemes, of which the most commonly used are known as UTF-8 and UTF-16. For example, the English alphabet and the more common punctuation marks have been assigned values between 0 and 255, while Tibetan characters have been assigned the values between 3,840 (written as x0F00) and 4,095 (written as x0FFF). All modern (and many historical) scripts are supported by the Unicode Standard. Unicode provides a unique number for every character, regardless of the platform, program, or language. The Unicode Standard is described in detail at the website http://www.unicode.org. See also, Character Encoding
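The numeric values and encoding schemes described above can be inspected directly in Python, whose strings are sequences of Unicode code points. The two sample characters (Latin capital A and the Tibetan letter KA, U+0F40) are chosen for illustration.

```python
latin_a = "A"
tibetan_ka = "\u0f40"  # TIBETAN LETTER KA, inside the x0F00 to x0FFF range

# ord() returns the unique Unicode number assigned to a character.
print(ord(latin_a), hex(ord(tibetan_ka)))

# The same character occupies a different number of bytes in each encoding scheme.
print(len(latin_a.encode("utf-8")), len(tibetan_ka.encode("utf-8")))
print(len(latin_a.encode("utf-16-be")), len(tibetan_ka.encode("utf-16-be")))
```

The character's number stays the same regardless of scheme; only the bytes stored on disk differ, which is why text decoded with the wrong encoding turns into unreadable characters.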
Validation methodologies involve the case team in reviewing samples of documents to determine litigation relevance, classifying documents as Responsive or Not Responsive to the issues of the case, and thereby increasing the precision of the search results. Results of a keyword or iterative search may be validated by observing the frequency of hits, validating dropped items, sampling non-hits, and review call feedback analysis.
Symbols such as * or ? included within a Keyword to indicate that the location where the symbols are used may match a single letter or multiple letters.
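As an illustration, Python's standard `fnmatch` module implements one common convention, in which * matches any run of characters (including none) and ? matches exactly one character. Other search tools may define these symbols differently, and the word list and function name below are made up.

```python
import fnmatch

def wildcard_hits(pattern, words):
    """Return the words that match a pattern containing * or ? wildcards."""
    return [word for word in words if fnmatch.fnmatchcase(word, pattern)]

words = ["car", "cartoon", "scar", "test", "text", "tet"]
print(wildcard_hits("car*", words))
print(wildcard_hits("te?t", words))

# To search for a literal question mark in this convention, it is escaped with brackets.
print(wildcard_hits("why[?]", ["why?", "whys"]))
```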
- Electronically Stored Information: this is an all inclusive term referring to conventional electronic documents (e.g. spreadsheets and word processing documents) and in addition the contents of databases, mobile phone messages, digital recordings (e.g. of voicemail) and transcripts of instant messages. All of this material needs to be considered for disclosure.