Extracting Regulatory Citations from Textual Content: A Comparison of Regular Expressions, spaCy, and a Combination of Both Approaches

This article explores three approaches to extracting regulatory citations from the text of a legal document such as an Enforcement Action.

lokesh vijay kumar

Feb. 21, 23 · Tutorial

Regulatory citations play a crucial role in legal and compliance work because they identify the specific laws or regulations that govern an action or behavior. Extracting them from text is non-trivial: citations appear in many formats and are often embedded in prose in ways that are hard to detect automatically. In this blog post, we will explore three approaches to extracting regulatory citations from the text of a legal document such as an Enforcement Action: regular expressions, the spaCy NLP library, and a combination of both.

Approach 1: Regular Expressions

Regular expressions are a powerful tool for pattern matching and text manipulation. They can be used to extract specific strings of text that match a particular pattern, which makes them a natural choice for extracting regulatory citations from textual content.

The following code provides an example of how to use a regular expression to extract regulatory citations from a piece of text:

Python

import re

text = "The Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., is a federal law that governs electronic fund transfers."

# Regular expression pattern for U.S.C. and C.F.R. citations:
# title number, code abbreviation, section marker, section or part number
pattern = re.compile(r"\b\d{1,2}\s(?:U\.S\.C\.|C\.F\.R\.)\s(?:§{1,2}|pt\.)\s\d+")

# Extract regulatory citations from the text
regulatory_citations = pattern.findall(text)

# Print the extracted regulatory citations
print("Regulatory Citations:", regulatory_citations)

Output: Regulatory Citations: ['15 U.S.C. § 1693']

In this example, the pattern matches a one- or two-digit title number, a space, the abbreviation "U.S.C." or "C.F.R.", a section marker ("§", "§§", or "pt."), and a section or part number. The findall method returns every non-overlapping match, and the resulting citations are stored in the regulatory_citations list. Because the sample text contains a single citation, the list has one entry; the trailing "et seq." is not captured.
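When the individual parts of a citation are needed rather than the matched string, named groups are one option. The sketch below is illustrative only: the CITATION pattern, the parse_citations helper, and the sample sentence are assumptions introduced here, and the pattern covers only the "U.S.C." and "C.F.R." formats discussed in this post.

```python
import re

# Hypothetical pattern: title number, code abbreviation, section marker,
# then one or more comma-separated section or part numbers.
CITATION = re.compile(
    r"\b(?P<title>\d{1,2})\s"
    r"(?P<code>U\.S\.C\.|C\.F\.R\.)\s"
    r"(?:§{1,2}|pt\.)\s"
    r"(?P<sections>\d+(?:,\s\d+)*)"
)


def parse_citations(text):
    """Return one dict per citation found in text."""
    results = []
    for match in CITATION.finditer(text):
        results.append({
            "title": int(match.group("title")),
            "code": match.group("code"),
            "sections": [s.strip() for s in match.group("sections").split(",")],
        })
    return results


# Assumed sample sentence, shortened from an enforcement-action style text.
sample = (
    "violations of the EFTA, 15 U.S.C. § 1693 et seq., Regulation E, "
    "12 C.F.R. pt. 1005, and the CFPA, 12 U.S.C. §§ 5531, 5536."
)

for citation in parse_citations(sample):
    print(citation)
```

Returning dictionaries keeps the title, code, and section numbers separate, which makes later steps such as sorting or de-duplicating citations simpler than working with raw strings.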

Approach 2: spaCy

spaCy is a popular Python library for natural language processing. It provides tools for text processing, including named entity recognition, part-of-speech tagging, and sentence segmentation. Its pretrained English pipelines include a LAW entity label, which makes named entity recognition a candidate for finding regulatory citations.

The following code provides an example of how to use spaCy to extract regulatory citations from a piece of text:

Python

import spacy

# Load the small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = "The Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., is a federal law that governs electronic fund transfers."

# Process the text with spaCy
doc = nlp(text)

# Keep only the entities that the model labeled as laws
regulatory_citations = [ent.text for ent in doc.ents if ent.label_ == "LAW"]

# Print the extracted regulatory citations
print("Regulatory Citations:", regulatory_citations)

Output: Regulatory Citations: ['Electronic Fund Transfer Act (EFTA)', '15 U.S.C. § 1693 et seq.']

The entities a statistical model returns depend on the model and its version, so the output on your machine may differ from the one shown here.
  

Approach 3: Combination of Both Approaches

In some cases, the regular expression-based approach may miss citations whose format the pattern does not anticipate, while the spaCy-based approach may produce false positives. Combining the two can give a more precise result by drawing on the strengths of each. The combined approach works as follows:

First, we use the spaCy-based approach to identify potential citations in the text. This approach can handle a variety of citation formats and variations, so it's a good starting point.
Then, we use regular expressions to extract specific information about the citations, such as the title number and the section number.

Here's an example of how this approach can be implemented in Python using both the spaCy library and the re (regular expression) module:

Python

import re
import spacy

nlp = spacy.load("en_core_web_sm")

text = "The Bureau of Consumer Financial Protection (Bureau) has reviewed the stop payment, error resolution, and deposit account re-opening practices of USAA Federal Savings Bank (Respondent, USAA, or the Bank, as defined below) and has identified violations of the Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., Regulation E, 12 C.F.R. pt. 1005, and the Consumer Financial Protection Act of 2010 (CFPA), 12 U.S.C. §§ 5531, 5536."

# Use spaCy to identify potential citations
doc = nlp(text)
for ent in doc.ents:
    if ent.label_ == "LAW":
        print(ent.text)

# Use a regular expression to extract the title number, the code
# abbreviation, and the section or part numbers of each citation
reg_ex = re.compile(r"\b(\d{1,2})\s(U\.S\.C\.|C\.F\.R\.)\s(?:§{1,2}|pt\.)\s(\d+(?:,\s\d+)*)")
for match in reg_ex.finditer(text):
    print(match.group(0))

This code outputs the following. The first six lines are the LAW entities reported for spaCy, which can vary with the model version; the last three lines are the regular expression matches:

Electronic Fund Transfer Act (EFTA)
15 U.S.C. § 1693 et seq.
Regulation E
12 C.F.R. pt. 1005
Consumer Financial Protection Act of 2010 (CFPA)
12 U.S.C. §§ 5531, 5536
15 U.S.C. § 1693
12 C.F.R. pt. 1005
12 U.S.C. §§ 5531, 5536
  

The output still contains noise, such as citations reported by both methods, so this example can be refined further.
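One way to clean that noise is to pass every candidate string through a stricter pattern and drop duplicates. The sketch below is a minimal illustration rather than a definitive implementation: the CITATION pattern, the clean_candidates helper, and the hard-coded candidates list (which stands in for spaCy entities and regular expression matches so that the example runs without a model) are all assumptions introduced here.

```python
import re

# Assumed shape of a complete citation: title number, code abbreviation,
# section marker, and at least one section or part number.
CITATION = re.compile(r"^\d{1,2}\s(?:U\.S\.C\.|C\.F\.R\.)\s(?:§{1,2}|pt\.)\s\d+")


def clean_candidates(candidates):
    """Keep candidates that look like full citations, without duplicates."""
    seen = set()
    cleaned = []
    for candidate in candidates:
        # Collapse repeated whitespace so near-identical strings compare equal.
        normalized = " ".join(candidate.split())
        if CITATION.match(normalized) and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


# Hypothetical candidates standing in for entity and regex output.
candidates = [
    "Electronic Fund Transfer Act (EFTA)",
    "15 U.S.C. § 1693 et seq.",
    "Regulation E",
    "12 C.F.R. pt. 1005",
    "Consumer Financial Protection Act of 2010 (CFPA)",
    "12 U.S.C. §§ 5531, 5536",
    "15 U.S.C.",
    "12 C.F.R.  pt. 1005",
    "12 U.S.C.",
]

print(clean_candidates(candidates))
```

Note that this filter also discards act names such as "Regulation E". If those names are needed, they could be kept in a separate list instead of being dropped.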

As this example suggests, combining both approaches can give a more precise result than using either method alone, because it pairs spaCy's ability to handle varied citation formats with the precise information extraction of regular expressions.

Opinions expressed by DZone contributors are their own.
