Part One: Underreporting Pitfalls
Many agreements governing electronically stored information (ESI) require the parties to specify search terms before technology-assisted review (TAR) is put in place. Used alone, search terms have proven to be either underinclusive (they find less of what you need) or overinclusive (they return a lot of documents you don’t want or need). So what do you do? Here are a few suggestions:
- Do not agree to search-term culling ahead of TAR. In that arrangement, TAR is employed only to further narrow a small, arbitrarily defined universe of documents. It is not in a requesting party’s interest to agree to a method that inherently limits the universe of potentially relevant documents.
- An alternative proposal that includes both search terms and TAR is to run all of the terms as a single bulk search so that the results form one combined set of documents. Use that set to build a TAR seed set and model, then apply the model to the full collection before any culling takes place. After a couple of runs, the model should return a reasonable set of relevant documents.
- Another option is to ignore the search terms and use either TAR 1 (passive learning from a larger random sample) or TAR 2 (continuous active learning) to develop a model of relevance. TAR 2 is faster, provides more bites at the apple and requires less work. Combining TAR 1’s large random sample with TAR 2 is favored for building classifiers that will be reused constantly. Deep learning, a newer form of machine learning, employs very large seed sets and can provide extraordinary results. With either TAR alternative, the parties should agree on the target accuracy and recall scores.
- If you are stuck and can’t get an agreement to use TAR instead of search terms, insist on validating the search term results with a random document sample. Then review a random sample of the documents the search terms did not return and see how many of them are in fact relevant. That rate tells you how much the search is missing and, therefore, how accurate it is.
- Culling may be useful, but it should be limited to eliminating program files and other technical operating system documentation that has no bearing on content. Date range culling may not be prejudicial, but it is subject to the limitations of bad OCR. Culling complete email domains can be very risky because custodians may use aliases and other servers. Culling purely to reduce the size of a production may once have been important, but with today’s low storage costs and TAR methods there is no justification for cutting down a collection without a very good evidentiary reason.
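The validation suggestion above (sample the search term results, then sample what the terms left behind) comes down to simple arithmetic. Here is a minimal sketch of that arithmetic in Python. The function name, the parameter names and every number in the usage lines are hypothetical. The labels "null set" (documents the terms did not return) and "elusion" (the rate of relevant documents found in a null set sample) are common e-discovery terms rather than wording from this post, and your agreement may define its measures differently.

```python
def validate_search_terms(hit_count, null_count,
                          hit_sample_size, hit_sample_relevant,
                          null_sample_size, null_sample_relevant):
    """Estimate precision, elusion and recall from two random samples.

    hit_count / null_count: documents returned / not returned by the terms.
    *_sample_size: documents reviewed in each random sample.
    *_sample_relevant: documents in each sample coded relevant on review.
    All names are illustrative assumptions, not a standard API.
    """
    if hit_sample_size <= 0 or null_sample_size <= 0:
        raise ValueError("both samples must contain at least one document")

    # Share of the search term hits that reviewers coded relevant.
    precision = hit_sample_relevant / hit_sample_size
    # Share of the left-behind documents that turned out to be relevant.
    elusion = null_sample_relevant / null_sample_size

    # Scale the sample rates up to the full hit set and null set.
    estimated_found = precision * hit_count
    estimated_missed = elusion * null_count
    estimated_total = estimated_found + estimated_missed

    # Recall: the estimated share of all relevant documents the terms found.
    recall = estimated_found / estimated_total if estimated_total > 0 else 0.0

    return {
        "precision": precision,
        "elusion": elusion,
        "estimated_found": estimated_found,
        "estimated_missed": estimated_missed,
        "recall": recall,
    }


# Hypothetical numbers: 20,000 hits, 80,000 documents left behind,
# and 500 documents reviewed from each group.
result = validate_search_terms(
    hit_count=20_000, null_count=80_000,
    hit_sample_size=500, hit_sample_relevant=300,
    null_sample_size=500, null_sample_relevant=25,
)
print(f"precision={result['precision']:.2f}  "
      f"elusion={result['elusion']:.2f}  "
      f"recall={result['recall']:.2f}")
```

With these made-up numbers, a 5 percent elusion rate across 80,000 left-behind documents points to roughly 4,000 relevant documents the search terms never reached, or about a quarter of everything relevant. These are point estimates only. A real validation protocol would also settle the sample sizes and the margin of error the parties will accept.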
So, if you can get TAR into the ESI agreement before you sign, chances are good that the documents you end up with will be the ones you need, and no more than you need, saving both time and money.
For more information on TAR for culling collections, read part two of this blog post series: “Why TAR Is Important to Narrowing Your Search.”