Extracting Regulatory Citations from Textual Content: A Comparison of Regular Expressions, spaCy, and a Combination of Both Approaches
This article explores three different approaches to extracting regulatory citations from textual content that can be found in a legal document of an Enforcement Action.
Regulatory citations play a crucial role in legal and compliance-related domains, as they indicate the specific regulations or laws that govern certain actions or behaviors. Extracting these citations from textual content is a non-trivial task, however, because they appear in a variety of formats and may be written in ways that make them difficult to identify automatically. In this blog post, we will explore three approaches to extracting regulatory citations from the text of an Enforcement Action: regular expressions, the spaCy NLP library, and a combination of both.
Approach 1: Regular Expressions
Regular expressions are a powerful tool for pattern matching and text manipulation. They can be used to extract specific strings of text that match a particular pattern, which makes them a natural choice for extracting regulatory citations from textual content.
The following code provides an example of how to use a regular expression to extract regulatory citations from a piece of text:
import re
text = "The Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., is a federal law that governs electronic fund transfers."
# Regular expression pattern for U.S.C. and C.F.R. citations:
# a title number, the code abbreviation, a section (§ or §§) or part (pt.) marker,
# and one or more comma-separated numbers
pattern = re.compile(r"\b\d{1,2}\s(?:U\.S\.C\.|C\.F\.R\.)\s(?:§{1,2}|pt\.)\s\d+(?:,\s\d+)*")
# Extract regulatory citations from the text
regulatory_citations = pattern.findall(text)
# Print the extracted regulatory citations
print("Regulatory Citations:", regulatory_citations)
Output: Regulatory Citations: ['15 U.S.C. § 1693']
In this example, the regular expression pattern identifies strings that look like a regulatory citation: a one- or two-digit title number, a space, the abbreviation "U.S.C." or "C.F.R.", and then a section marker ("§" or "§§") or part marker ("pt.") followed by one or more numbers. The pattern.findall method returns every match in the text, and the results are stored in the regulatory_citations list. Note that this pattern does not capture the trailing "et seq.", and it only finds citation formats it was written to anticipate.
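Citations in enforcement documents often carry more structure than a single matched string conveys: a title number, a code, one or more sections, and sometimes an "et seq." suffix. As a rough sketch of how a pattern might expose those parts, the following version uses named groups. The group names, the `CITATION_RE` constant, and the `parse_citations` helper are illustrative assumptions, and the pattern only covers the U.S.C. and C.F.R. formats shown in this article.

```python
import re

# Hypothetical pattern: title number, code abbreviation, marker (§, §§ or pt.),
# one or more comma-separated numbers, and an optional "et seq." suffix.
CITATION_RE = re.compile(
    r"\b(?P<title>\d{1,2})\s"
    r"(?P<code>U\.S\.C\.|C\.F\.R\.)\s"
    r"(?P<marker>§{1,2}|pt\.)\s"
    r"(?P<sections>\d+(?:,\s\d+)*)"
    r"(?P<et_seq>\set\sseq\.)?"
)


def parse_citations(text):
    """Return one dict per citation found in text."""
    results = []
    for m in CITATION_RE.finditer(text):
        results.append({
            "citation": m.group(0),
            "title": int(m.group("title")),
            "code": m.group("code"),
            "sections": [s.strip() for s in m.group("sections").split(",")],
            "et_seq": m.group("et_seq") is not None,
        })
    return results


sample = (
    "violations of the Electronic Fund Transfer Act (EFTA), "
    "15 U.S.C. § 1693 et seq., Regulation E, 12 C.F.R. pt. 1005, and the "
    "Consumer Financial Protection Act of 2010 (CFPA), 12 U.S.C. §§ 5531, 5536."
)
parsed = parse_citations(sample)
for item in parsed:
    print(item["citation"], "->", item["title"], item["code"], item["sections"])
```

Named groups keep the calling code readable when the pattern grows, since fields are looked up by name rather than by position.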
Approach 2: spaCy
spaCy is a popular Python library for natural language processing tasks. It provides a number of tools for text processing, including named entity recognition, part-of-speech tagging, and sentence segmentation. These tools can be used to extract specific types of information from text, including regulatory citations: the pretrained English pipelines tag named legal documents with the LAW entity label.
The following code provides an example of how to use spaCy to extract regulatory citations from a piece of text:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "The Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., is a federal law that governs electronic fund transfers."
# Process the text with spacy
doc = nlp(text)
# Extract regulatory citations from the text
regulatory_citations = [ent.text for ent in doc.ents if ent.label_ == "LAW"]
# Print the extracted regulatory citations
print("Regulatory Citations:", regulatory_citations)
Output (the exact entities may vary with the spaCy model version): Regulatory Citations: ['Electronic Fund Transfer Act (EFTA)', '15 U.S.C. § 1693 et seq.']
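As the output shows, LAW entities can include both statute names and formal citations. A minimal sketch of how the two might be separated is shown below, assuming the entity strings have already been collected from doc.ents. The `split_law_entities` helper, the `FORMAL_CITATION_RE` pattern, and the sample list are hypothetical and only cover U.S.C. and C.F.R. citations.

```python
import re

# Hypothetical check: a formal citation starts with a title number followed by
# a code abbreviation such as U.S.C. or C.F.R.
FORMAL_CITATION_RE = re.compile(r"^\d{1,2}\s(?:U\.S\.C\.|C\.F\.R\.)")


def split_law_entities(entities):
    """Split LAW entity strings into (formal citations, statute names)."""
    citations, names = [], []
    for ent in entities:
        ent = ent.strip()
        if FORMAL_CITATION_RE.match(ent):
            citations.append(ent)
        else:
            names.append(ent)
    return citations, names


# Illustrative entity strings; a real run would build this list from doc.ents.
law_entities = [
    "Electronic Fund Transfer Act (EFTA)",
    "15 U.S.C. § 1693 et seq.",
    "Regulation E",
    "12 C.F.R. pt. 1005",
]
citations, names = split_law_entities(law_entities)
print("Citations:", citations)
print("Names:", names)
```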
Approach 3: Combination of Both Approaches
In some cases, a regular expression alone may miss citations written in formats it does not anticipate, while spaCy alone may produce false positives. Combining the two approaches can draw on the strengths of both and give a more precise result. The combined approach works as follows:
- First, we use spaCy to identify potential citations in the text. This approach can handle a variety of citation formats and variations, so it's a good starting point.
- Then, we use regular expressions to refine the results and extract specific information about the citations, such as the title number and the section number.
Here's a code example of how this approach can be implemented in Python using both the spaCy library and the re (regular expression) module:
import spacy
import re
nlp = spacy.load("en_core_web_sm")
text = "The Bureau of Consumer Financial Protection (Bureau) has reviewed the stop payment, error resolution, and deposit account re-opening practices of USAA Federal Savings Bank (Respondent, USAA, or the Bank, as defined below) and has identified violations of the Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., Regulation E, 12 C.F.R. pt. 1005, and the Consumer Financial Protection Act of 2010 (CFPA), 12 U.S.C. §§ 5531, 5536."
# Use spacy to identify potential citations
doc = nlp(text)
for ent in doc.ents:
    if ent.label_ == "LAW":
        print(ent.text)
# Use a regular expression to extract specific information about the citations:
# group 1 is the title number, group 2 the code, group 3 the section or part numbers
reg_ex = re.compile(r"\b(\d{1,2})\s(U\.S\.C\.|C\.F\.R\.)\s(?:§{1,2}|pt\.)\s(\d+(?:,\s\d+)*)")
for match in reg_ex.finditer(text):
    title, code, sections = match.groups()
    # Print the title number and code; sections holds the section or part numbers
    print(title, code)
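This code outputs the following. The first six lines come from spaCy (and may vary with the model version), and the last three come from the regular expression: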
Electronic Fund Transfer Act (EFTA)
15 U.S.C. § 1693 et seq.
Regulation E
12 C.F.R. pt. 1005
Consumer Financial Protection Act of 2010 (CFPA)
12 U.S.C. §§ 5531, 5536
15 U.S.C.
12 C.F.R.
12 U.S.C.
The last three lines repeat information already present in the spaCy results, so this example can be refined further to clean the noise in the output.
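One possible refinement, sketched here under stated assumptions, is to drop lines that are duplicates or fragments of a longer line and then pair each statute name with the citation that follows it. The `drop_redundant` and `pair_names_with_citations` helpers are hypothetical, and the pairing assumes that a name immediately precedes its citation, as it does in this excerpt.

```python
import re

# Hypothetical check for a formal U.S.C. or C.F.R. citation.
CITATION_START_RE = re.compile(r"^\d{1,2}\s(?:U\.S\.C\.|C\.F\.R\.)")


def drop_redundant(lines):
    """Remove duplicates and entries fully contained in a longer entry."""
    unique = list(dict.fromkeys(line.strip() for line in lines if line.strip()))
    return [
        line for line in unique
        if not any(line != other and line in other for other in unique)
    ]


def pair_names_with_citations(entries):
    """Pair each citation with the name that directly precedes it, if any."""
    pairs = []
    pending_name = None
    for entry in entries:
        if CITATION_START_RE.match(entry):
            pairs.append((pending_name, entry))
            pending_name = None
        else:
            pending_name = entry
    return pairs


# The output lines listed above, copied in as plain strings.
output_lines = [
    "Electronic Fund Transfer Act (EFTA)",
    "15 U.S.C. § 1693 et seq.",
    "Regulation E",
    "12 C.F.R. pt. 1005",
    "Consumer Financial Protection Act of 2010 (CFPA)",
    "12 U.S.C. §§ 5531, 5536",
    "15 U.S.C.",
    "12 C.F.R.",
    "12 U.S.C.",
]
cleaned = drop_redundant(output_lines)
pairs = pair_names_with_citations(cleaned)
for name, citation in pairs:
    print(name, "=>", citation)
```

The substring check is quadratic in the number of lines, which is acceptable for the handful of citations in a single document but would need rethinking for large batches.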
As you can see, combining both approaches can provide a more precise result than using either method alone. This is because it pairs spaCy's ability to handle varied citation formats with the precise information extraction capabilities of regular expressions.