Improving the Capabilities of LLM-Based Analytics Copilots With Semantic Search and Fine-Tuning
Learn how LLMs can be deployed for critical analytics tasks like domain-specific question answering, SQL generation for data retrieval, and more.
Picture this: You're an analyst drowning in a sea of data, trying to make sense of complex attribution models and customer journeys. Wouldn't it be great if you had a super-smart AI assistant that could instantly answer your questions, generate SQL queries on the fly, and break down complex tabular data? Well, that's exactly what we're working on with Large Language Model (LLM)-based analytics copilots. But as with any cutting-edge tech, it's not all smooth sailing. Let's dive into the challenges we faced and the cool solutions we came up with to make these AI assistants truly shine.
The LLM Conundrum: Brilliant, but Flawed
First things first: let's talk about why we're so excited about using LLMs in analytics. These language models are like the Swiss Army knives of the AI world – they can tackle a wide range of tasks, from answering questions to generating code. For us analysts, that means:
- Less time spent digging through dashboards and reports
- More flexible insights that go beyond static visualizations
- Quicker problem-solving and decision-making
Sounds great, right? But here's the catch: LLMs aren't perfect. They've got some quirks that can make them a bit tricky to work with:
- They've got memory limits (imagine trying to read "War and Peace," but forgetting the beginning by the time you reach the end).
- Sometimes they confidently spout nonsense (we call this "hallucination" – it's less fun than it sounds).
- They're not great with numbers (which is kind of important in analytics).
- It can be hard to understand why they give certain outputs.
- They can be biased (just like us humans, unfortunately).
So, we set out on a mission to overcome these challenges and create analytics copilots that are actually useful in the real world. Our secret weapons? Semantic search and fine-tuning. Let's break it down.
Semantic Search: Teaching Our AI To Find the Right Context
Imagine you're at a huge library, trying to find the answer to a specific question. You could read every book, or you could ask a librarian who knows exactly where to look. Semantic search is like giving our LLM its own super-librarian.
Here's how we did it:
- We built a knowledge base by scraping relevant websites and documents.
- We chopped up this info into bite-sized chunks (see the quick sketch after this list).
- We used fancy math (okay, it's called "embedding") to turn these chunks into numbers that represent their meaning.
- We stored all this in a special database that can quickly find similar chunks.
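For the chunking step, here's a minimal sketch of one simple, character-based approach, assuming the scraped pages are already available as plain-text strings (`scraped_documents` is a hypothetical variable); in practice, sentence- or token-based chunking works just as well:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    # Split raw text into overlapping, roughly fixed-size pieces so each chunk
    # fits comfortably within the embedding model's input limit
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Build the list of chunks that gets embedded and indexed below
document_chunks = []
for doc in scraped_documents:  # scraped_documents: raw text of scraped pages (assumed)
    document_chunks.extend(chunk_text(doc))
```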
When someone asks a question, we use the same embedding magic on their query, find the most relevant chunks in our database, and feed that context to the LLM. It's like giving the AI a cheat sheet before it answers the question.
Here's a simplified Python code snippet to give you an idea of how this works:
```python
from sentence_transformers import SentenceTransformer
from faiss import IndexFlatL2
import numpy as np

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create a FAISS index
index = IndexFlatL2(384)  # 384 is the embedding dimension for this model

# Embed and index our document chunks (document_chunks is the list built earlier)
for chunk in document_chunks:
    embedding = model.encode(chunk)
    index.add(np.array([embedding]))

# When we get a query, embed it and find the most similar chunks
# (user_query holds the analyst's natural language question)
query_embedding = model.encode(user_query)
distances, indices = index.search(np.array([query_embedding]), k=5)

# Use the top 5 most relevant chunks as context for the LLM
relevant_context = [document_chunks[i] for i in indices[0]]
```
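The final step, handing the retrieved context to the LLM, can be as simple as stitching the chunks into the prompt. Here's a minimal sketch assuming an OpenAI chat model as the backend (any of the models we compared could be swapped in); the prompt template is illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stitch the retrieved chunks into a single context block for the prompt
context_block = "\n\n".join(relevant_context)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"Context:\n{context_block}\n\nQuestion: {user_query}"},
    ],
)
print(response.choices[0].message.content)
```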
We tested this setup with different LLMs (GPT-4, Falcon-40B, and Llama-2-70b) and different embedding models. The results were pretty exciting:
- GPT-4 with semantic search was the top performer.
- Llama-2-70b was nipping at its heels (and it's open-source, which is cool).
- Some open-source embedding models held their own against the fancy proprietary ones.
Fine-Tuning: Teaching Old LLMs New Tricks
While semantic search helped with question-answering, we still had two big problems to solve: generating SQL queries and analyzing tabular data. This is where fine-tuning came to the rescue.
Fine-tuning is like sending your LLM to a specialized training camp. We take a pre-trained model and give it additional training on specific tasks. It's like teaching a chess champion how to play poker – they already understand game strategy, but now they're learning the specific rules and tactics of a new game.
SQL Query Generation: From Natural Language to Database Speak
For SQL generation, we used a dataset called b-mc2/sql-create-context from Hugging Face. It's got a bunch of examples that pair natural language questions with SQL queries. Here's what a typical example looks like:
Question: How many heads of the departments are older than 56?
Context: CREATE TABLE head (age INTEGER)
Answer: SELECT COUNT(*) FROM head WHERE age > 56
We fine-tuned our models on thousands of examples like this. The results were mind-blowing:
- GPT-4 (which we couldn't fine-tune) got about 64.5% accuracy with few-shot learning
- Our fine-tuned open-source models jumped from less than 30% accuracy to over 80%!
- The big Llama-2-70b model showed tons of potential even without fine-tuning
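Before the training loop can run, each record has to be flattened into a single training string, and a held-out split is needed for evaluation. Here's a minimal sketch of that preparation step, assuming the dataset's `question`, `context`, and `answer` fields as shown above; the prompt template and split size are illustrative choices, not necessarily the exact ones we used:

```python
from datasets import load_dataset

# Load the SQL generation dataset from Hugging Face
raw = load_dataset("b-mc2/sql-create-context")

def format_example(example):
    # Flatten question/context/answer into one "text" field for causal LM training
    example["text"] = (
        f"Question: {example['question']}\n"
        f"Context: {example['context']}\n"
        f"Answer: {example['answer']}"
    )
    return example

# Create a small held-out test split and apply the formatting
datasets = raw["train"].train_test_split(test_size=0.1).map(format_example)
```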
Here's a simplified look at how we did the fine-tuning:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer

# Load the base model and tokenizer to be fine-tuned
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

def tokenize_function(examples):
    # Tokenize the formatted "text" field prepared earlier (512-token cap is illustrative)
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# 'datasets' is the formatted train/test DatasetDict built above
tokenized_datasets = datasets.map(tokenize_function, batched=True)

# Causal LM collator copies input_ids into labels so the Trainer can compute a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
)

trainer.train()
```
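Once training finishes, generating SQL for a new question is a standard decode step. A quick sketch, reusing the fine-tuned model and tokenizer from above and the same prompt format (the generation settings are illustrative):

```python
# Ask the fine-tuned model to complete a prompt in the training format
prompt = (
    "Question: How many heads of the departments are older than 56?\n"
    "Context: CREATE TABLE head (age INTEGER)\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```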
Tabular Data Analysis: Making Sense of the Numbers
For tabular analysis, we created a dataset specifically for attribution. We wanted our models to explain changes in attribution credit for different touchpoints. Here's an example of what our data looked like:
```
model_name: lead
channel: display
absolute_change: -82
targeting_quality: 63
contact_frequency: -4
ad_cannibalization: -33
```
We then fine-tuned our models to generate explanations like this:
"The display channel for the lead model has seen a significant decrease in attribution credit (-82%). This is primarily due to improved targeting quality (63%), which suggests that while fewer impressions are needed, they are more effective. However, this positive effect is partially offset by increased ad cannibalization (-33%), indicating some overlap in the audience reached by different channels. The contact frequency had a minimal impact (-4%)."
The results were fascinating:
- GPT-4 with basic prompt engineering hit about 70% accuracy.
- A special setup called "pandas agent" with GPT-4 nailed 100% accuracy.
- Our fine-tuned open-source models went from less than 10% accuracy to the 80-90% range!
Here's a peek at how we used the pandas agent:
```python
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
import pandas as pd
# Load your data into a pandas DataFrame
df = pd.read_csv("attribution_data.csv")
# Create the pandas DataFrame agent
agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)
# Ask the agent to analyze the data
response = agent.run("Explain the changes in attribution credit for the display channel in the lead model, considering targeting quality, contact frequency, and ad cannibalization.")
print(response)
```
Conclusion
By combining semantic search and fine-tuning, we've managed to supercharge our analytics copilots, making them more accurate, reliable, and useful. The journey wasn't easy, but the results speak for themselves. With these advanced techniques, we’re paving the way for smarter, more efficient analytics tools.
More details can be found in the original paper by my team.