OpenAI Evals Using Phoenix
OpenAI Evals are used to evaluate LLM models and measure their accuracy, which helps you compare custom models against existing ones and figure out how well your custom model performs.
OpenAI Evals are used to evaluate LLM models and measure their accuracy. That lets you compare your custom model against existing models, figure out how well it performs, and make the necessary modifications and refinements.
If you are new to OpenAI Evals, I recommend going through the OpenAI Evals repo first to get a taste of what evals actually look like. It echoes what Greg Brockman has said about the importance of evals.
Role of Evals
From the OpenAI Evals repo: "Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs."
Can you imagine writing an evaluation program for a complex model by yourself? You may spend hours building an LLM-based model and have no room left to work on an evaluation program, since that can take more effort than building the model itself. That is where the Evals framework helps: it lets you test your LLM models and verify their accuracy. You can grade with GPT-3.x or GPT-4, depending on your needs and on what your LLM model is targeting.
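To make the idea concrete, here is a minimal sketch of a hand-rolled eval loop, independent of any framework: a few question/expected-answer pairs, a call to the model for each one, and an accuracy score at the end. The model name, the tiny dataset, and the exact-match grading are placeholders, not anyone's recommended setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Hypothetical mini dataset of (question, expected answer) pairs.
samples = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]
correct = 0
for question, expected in samples:
    response = client.chat.completions.create(
        model="gpt-4",  # or whichever model you want to evaluate
        messages=[{"role": "user", "content": question}],
        temperature=0.0,
    )
    answer = response.choices[0].message.content
    # Naive exact-match grading; real eval frameworks use smarter matching.
    correct += int(expected.lower() in answer.lower())
print(f"Accuracy: {correct / len(samples):.2f}")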
Building Evals
The OpenAI Evals repo has a good introduction and detailed steps for creating a custom eval, using a simple arithmetic model as the example.
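To give a feel for what that tutorial builds, the samples for such a custom eval live in a JSONL file where each line carries chat-style "input" messages and an "ideal" answer. Here is a rough sketch of generating one in Python; the file name and the two arithmetic questions are just illustrative.
import json

# Illustrative arithmetic samples in the JSONL shape OpenAI Evals expects:
# chat-style "input" messages plus an "ideal" (expected) answer.
samples = [
    {
        "input": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2+2?"},
        ],
        "ideal": "4",
    },
    {
        "input": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 4*4?"},
        ],
        "ideal": "16",
    },
]
with open("arithmetic_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")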
The OpenAI resources above pretty much cover what you need for running existing or custom evals. My take here is to help you use the Phoenix framework, which seems a bit easier than running OpenAI Evals directly; Phoenix is actually built on top of the OpenAI Evals framework.
- Phoenix Home
- Repo
- LLM evals documentation
- How to
- This explanation of LLM evals will help you understand them better.
Building Custom Evals
Building your own evals is the go-to way to compare your custom model with GPT-3.5 or GPT-4. Below are the steps that I followed and tested to evaluate my models:
1. Install Phoenix and related modules:
!pip install -qq "arize-phoenix-evals" "openai>=1" ipython matplotlib pycm scikit-learn tiktoken
2. Make sure you have all imports covered:
import os
from getpass import getpass
import matplotlib.pyplot as plt
import openai
import pandas as pd
# import phoenix libs
import phoenix.evals.templates.default_templates as templates
from phoenix.evals import (
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report
3. Prepare the data (or download a benchmark dataset), e.g.:
df = download_benchmark_dataset(task="qa-classification", dataset_name="qa_generated_dataset")
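If you want a quick look at what was downloaded before wiring it into the eval, something like the snippet below works. The column names noted in the comment are the ones the later steps rely on.
# Inspect the benchmark dataset before running the eval.
print(df.shape)
print(df.columns.tolist())  # expect question, context, sampled_answer, answer_true
df.head()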
4. Set your OpenAI key.
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass(" Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
5. Prepare the dataset in the correct format as per the evaluation prompt:
N_EVAL_SAMPLE_SIZE = 100  # number of rows to sample for the eval; adjust as needed

df_sample = (
    df.sample(n=N_EVAL_SAMPLE_SIZE)
    .reset_index(drop=True)
    .rename(
        columns={
            "question": "input",
            "context": "reference",
            "sampled_answer": "output",
        }
    )
)
6. Set and load the model for running the evals.
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)
7. Run your custom evals.
rails = list(templates.QA_PROMPT_RAILS_MAP.values())
Q_and_A_classifications = llm_classify(
    dataframe=df_sample,
    template=templates.QA_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()
8. Evaluate the above predictions against the predefined (true) labels.
true_labels = df_sample["answer_true"].map(templates.QA_PROMPT_RAILS_MAP).tolist()
print(classification_report(true_labels, Q_and_A_classifications, labels=rails))
9. Create a confusion matrix and plot it to get a better picture.
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=Q_and_A_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
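If you also want a single headline number to go with the plot, pycm's ConfusionMatrix exposes an overall accuracy attribute; this is a small optional extra, not part of the original steps.
# Overall accuracy across all rails, straight from the pycm ConfusionMatrix.
print("Overall accuracy:", confusion_matrix.Overall_ACC)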
Note: You can set the model to "gpt-3.5-turbo" in step 6 and run the evals against GPT-3.5, or against any other model you want to compare your custom model with.
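For example, a second run against GPT-3.5 could look like the sketch below, which simply repeats steps 6-8 with a different model name so you can compare the two classification reports; the gpt35_* variable names are just for illustration.
# Hypothetical comparison run: the same eval, graded with gpt-3.5-turbo.
gpt35_model = OpenAIModel(
    model_name="gpt-3.5-turbo",
    temperature=0.0,
)
gpt35_classifications = llm_classify(
    dataframe=df_sample,
    template=templates.QA_PROMPT_TEMPLATE,
    model=gpt35_model,
    rails=rails,
    concurrency=20,
)["label"].tolist()
print(classification_report(true_labels, gpt35_classifications, labels=rails))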
Here is a helpful Google Colab notebook that I followed, with good step-by-step instructions.
PS: The code and steps mentioned in this article are based on that Colab notebook.
Here is a good article by Aparna Dhinakaran (co-founder and CPO of Arize AI and a Phoenix contributor) about Evals and Phoenix.
Conclusion
I hope this article helped you understand how evals can be implemented for custom models. I would be happy if you took away at least some insight about evals and some motivation to create your own! All the best with your experiments.