Measuring Distances in Legal Texts — A Work-In-Progress, Part 2
Why my first attempt at measuring distances between legal texts with cosine similarity is way off the mark. How I was wrong and what I’m thinking about next in natural language processing (“NLP”).
I took a super simple approach to a very complex problem and thought I was quite clever — oh how wrong I was.
Background
In a previous post on math and law in Python, I shared preliminary findings on a method for measuring the distance between legal texts. For the long story, go here. For the short story, the project boils down to whether an algorithm can pick up the difference in meaning between “may” and “shall not.”
In these experiments, I have a known baseline: D.C. has a rule that affects its lawyers in a way that no other state's rules do. But getting an algorithm to detect that is tricky. Textually, the difference comes down to a few words and letters and is almost insignificant. In a legal context, however, the gap between permitting something with “may” and prohibiting it with “shall not” is substantial.
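To make the problem concrete, here is a minimal sketch of how a plain word-count cosine similarity treats a "may" versus "shall not" swap. It is not the exact code from my earlier post: the two sentences are invented for illustration and are not quoted from any actual rule, and the helper names `bag_of_words` and `cosine_similarity` are my own.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical rule text, invented for illustration only.
permissive = "A lawyer may share legal fees with a nonlawyer."
prohibitive = "A lawyer shall not share legal fees with a nonlawyer."

score = cosine_similarity(bag_of_words(permissive), bag_of_words(prohibitive))
print(round(score, 3))
```

Because the two sentences share almost every word, the score lands close to 1, which is exactly the issue: a measure built on word overlap reads a permission and a prohibition as nearly the same text.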
What I Got Wrong
I created a first mini-experiment to see how the same algorithm works if I change “shall not” to “may” and determined the method I used returned the right answer for the…