Introduction to ML Engineering and LLMOps With OpenAI and LangChain
Understand how to work with OpenAI LLMs and use the popular LangChain toolkit in Python. Extract from the book Machine Learning Engineering with Python, Packt, 2023.
The following article is based on an extract from the Second Edition of Machine Learning Engineering with Python, Packt, 2023, by Andy McMahon.
Living It Large With LLMs
At the time of writing, GPT-4 had been released only a few months previously, in March 2023, by OpenAI. This model is potentially the largest machine learning model ever developed, with a reported one trillion parameters, although OpenAI has not confirmed this number. Since then, Microsoft and Google have announced advanced chat capabilities using similarly large models in their product suites, and a raft of open-source packages and toolkits has been released; it feels like everyone is trying to understand and apply them. All of these solutions leverage some of the largest neural network models ever developed: Large Language Models (LLMs). LLMs are part of an even wider class of models known as Foundation Models, which span not just text applications but video and audio as well. These models are roughly classified by the author as being too large for most organizations to consider training, or potentially even hosting, themselves, and therefore, they will usually be consumed as a third-party service. Solving this integration challenge in a safe and reliable way represents one of the main challenges in modern machine learning engineering. There is no time to lose, as new models and capabilities seem to be released every day. Let’s go!
Understanding LLMs
The main focus of LLM-based systems is to create human-like responses to a wide range of text-based inputs. LLMs are based on transformer architectures, which enable these models to process input in parallel, significantly reducing the amount of time required to train them.
The transformer-based architecture of LLMs, as for any transformer, consists of a series of encoders and decoders that leverage self-attention and feed-forward neural networks. At a high level, you can think of the encoders as being responsible for processing the input, transforming it into an appropriate numerical representation, and then feeding this into the decoders, from which the output can be generated. The magic of transformers comes from the use of self-attention, which is a mechanism for capturing the contextual relationships between the words in a sentence. Self-attention produces attention vectors that represent these relationships numerically, and when several sets of them are calculated in parallel, this is called “multi-headed attention.” Both the encoder and decoder use self-attention mechanisms to capture the contextual dependencies of the input and output sequences.
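To make the idea a little more concrete, here is a minimal sketch of scaled dot-product self-attention using NumPy. The matrix sizes and random inputs are purely illustrative assumptions and do not reflect any particular LLM; real models also add learned projection weights and multiple heads.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return a context-aware representation of each token."""
    d_k = Q.shape[-1]
    # Score how relevant every key is to every query, scaled for numerical stability
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns the scores into attention weights per query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mixture of the value vectors
    return weights @ V

# Toy example: 4 tokens, each embedded in 8 dimensions (sizes are illustrative)
rng = np.random.default_rng(42)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8); self-attention uses Q = K = V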
One of the most popular transformer-based models used in LLMs is the Bidirectional Encoder Representations from Transformers (BERT) model. BERT was developed by Google and is a pre-trained model that can be fine-tuned for various natural language tasks. Another popular architecture is the Generative Pre-trained Transformer (GPT), created by OpenAI.
The ChatGPT system, released by OpenAI in Nov 2022, apparently utilized a 3rd-generation GPT model when it took the world by storm. At the time of writing in March 2023, these models are up to their fourth generation and are incredibly powerful. Although as I type this, GPT-4 is only available to certain users across the globe, it is already sparking heated debate about the future of AI and whether or not we have reached artificial general intelligence (AGI). The author does not believe we have, but what an exciting time to be in this space anyway!
The thing that makes LLMs infeasible to train anew in every new business context or organization is that they are trained on colossal datasets. GPT-3, which was released in 2020, was trained on almost 500 billion tokens of text. A token, in this instance, is a small fragment of a word used in the training and inference process of LLMs, roughly four characters of English text. So that is a lot of text! The costs of training these models are, therefore, concomitantly large, and even inference can be hugely costly. This means that organizations whose sole focus is not producing these models will likely fail to see the economies of scale and the returns required to justify investing in them at this scale. This is before you even consider the need for specialized skill sets, optimized infrastructure, and the ability to gather all of that data. There are a lot of parallels with the advent of the public Cloud several years ago, after which organizations no longer had to invest as much in on-premises infrastructure or expertise and could instead pay on a ‘what you use’ basis. The same thing is now happening with the most sophisticated machine learning models. This is not to say that smaller, more domain-specific models have been ruled out. In fact, I think that this will remain one of the ways that organizations can leverage their own unique datasets to drive advantage over competitors and build out better products. The most successful teams will be those that combine this approach with the capabilities of the largest models in a robust way.
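To get a feel for what a token is in practice, the short sketch below uses OpenAI’s tiktoken library to count the tokens in a sample string; the example text is an illustrative assumption, and the model name simply selects the corresponding tokenizer.

# pip install tiktoken
import tiktoken

# Look up the tokenizer used by a given OpenAI chat model
encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')

text = "Machine learning engineering is about getting models into production."
tokens = encoding.encode(text)
print(f"{len(text)} characters -> {len(tokens)} tokens")  # roughly four characters per token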
Scale is not the only important component, though. ChatGPT and GPT-4 were not only trained on huge amounts of data, but they were also fine-tuned using a technique called Reinforcement Learning from Human Feedback (RLHF). During this process, the model is presented with a prompt, such as a conversational question, and generates a series of potential responses. The responses are then presented to a human evaluator who provides feedback on their quality, usually by ranking them, and this feedback is used to train a reward model. The reward model is then used to fine-tune the underlying language model through techniques like Proximal Policy Optimisation (PPO). The details of all of this are well beyond the scope of this book, but hopefully, you are gaining an intuition for how this is not run-of-the-mill data science that any team can quickly scale up. Since this is the case, we have to learn how to work with these tools as more of a ‘black box’ and consume them as third-party solutions. We will cover this in the next section.
Consuming LLMs via API
As discussed in the previous section, the main change in our way of thinking as ML engineers who want to interact with LLMs and Foundation Models, in general, is that we can no longer assume we have access to the model artifact, the training data, or testing data. We have to instead treat the model as a third-party service that we should call out to for consumption. Luckily, there are many tools and techniques for implementing this.
The next example will show you how to build a pipeline that leverages LLMs by using the popular LangChain package. The name comes from the fact that, to leverage the power of LLMs, we often have to chain together many interactions with them, as well as calls to other systems and sources of information.
First, we walk through a basic example of calling the OpenAI API.
1. Install LangChain and OpenAI Python bindings:
pip install langchain
pip install openai
2. We assume the user has set up an OpenAI account and has access to an API key. You can set this as an environment variable or use a secrets manager for storage, like the one that GitHub provides. We will assume the key is accessible as an environment variable.
import os
openai_key = os.getenv('OPENAI_API_KEY')
3. Now, in our Python script or module, we can define the model we will be calling using the OpenAI API as accessed via the LangChain wrapper. Here, we will work with the gpt-3.5-turbo model, which is the most advanced of the GPT-3.5 chat models:
from langchain.chat_models import ChatOpenAI

gpt = ChatOpenAI(model_name='gpt-3.5-turbo')
4. LangChain then facilitates the building up of pipelines using LLMs via PromptTemplates, which allow you to standardize how you prompt the models and parse their responses:
from langchain.prompts import PromptTemplate

template = """Question: {question}
Answer: """

prompt = PromptTemplate(
    template=template,
    input_variables=['question']
)
5. We can then create our first “chain," which is the mechanism for pulling together related steps in LangChain. This first chain is a simple one that takes a prompt template and the input to create an appropriate prompt to the LLM API before returning an appropriately formatted response:
from langchain.chains import LLMChain

# user question
question = '''Where does Andrew McMahon, author of 'Machine Learning \
Engineering with Python' work?'''

# create prompt template > LLM chain
llm_chain = LLMChain(
    prompt=prompt,
    llm=gpt
)
You can then run this question and print the result to the terminal as a test:
print(llm_chain.run(question))
This returns:
As an AI language model, I do not have access to real-time information. However, Andrew McMahon is a freelance data scientist and software engineer based in Bristol, United Kingdom.
Given that I am an ML engineer employed by a large bank and am based in Glasgow, United Kingdom, you can see that even the most sophisticated LLMs will get things wrong. This is an example of what we term a hallucination, where an LLM gives an incorrect but plausible answer. This is still a good example of building a basic mechanism through which we can programmatically interact with LLMs in a standardized way.
LangChain also provides the ability to pull multiple prompts together using a method in the chain called ‘generate’:
questions = [
    {'question': '''Where does Andrew McMahon, author of 'Machine Learning \
Engineering with Python' work?'''},
    {'question': "What is MLOps?"},
    {'question': "What is ML engineering?"},
    {'question': "What's your favourite flavour of ice cream?"}
]

print(llm_chain.generate(questions))
The response from this series of questions is rather verbose, but here is the first element of the returned object:
generations=[[ChatGeneration(text='As an AI modeler and a data scientist, Andrew McMahon works at Cisco Meraki, a subsidiary of networking giant Cisco, in San Francisco Bay Area, USA.', generation_info=None, message=AIMessage(content='As an AI modeler and a data scientist, Andrew McMahon works at Cisco Meraki, a subsidiary of networking giant Cisco, in San Francisco Bay Area, USA.', additional_kwargs={}))], …]
Again, not quite right. You get the idea, though! With some prompt engineering and better conversation design, this could quite easily be a lot better. I’ll leave you to play around and have some fun with it.
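As a hedged illustration of what that prompt engineering might look like, the sketch below extends the earlier PromptTemplate with a context variable so that known facts can be injected alongside the question. It reuses the prompt, gpt, and LLMChain objects from the previous steps; the context string, instructions, and variable names are illustrative assumptions rather than a recommended design.

# A template that lets us inject known context alongside the user's question
context_template = """Use only the context below to answer the question.
If the answer is not in the context, say that you do not know.

Context: {context}

Question: {question}
Answer: """

context_prompt = PromptTemplate(
    template=context_template,
    input_variables=['context', 'question']
)

grounded_chain = LLMChain(prompt=context_prompt, llm=gpt)

print(grounded_chain.run({
    'context': "Andrew McMahon is an ML engineer at a large bank, based in Glasgow, UK.",
    'question': "Where does Andrew McMahon, author of 'Machine Learning Engineering with Python', work?"
}))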
This quick introduction to LangChain and LLMs only scratches the surface but hopefully gives you enough to fold in calls to these models into your ML workflows. Let’s move on to discuss another important way that LLMs are becoming part of this workflow as we explore software development using AI assistants.
Building the Future With LLMOps
Given the recent rise in interest in LLMs, there has been no shortage of people expressing the desire to integrate these models into all sorts of software systems. For us as ML engineers, this should immediately trigger the question, “What will that mean operationally?” As discussed throughout this book, the marrying together of the operations and development of ML systems is termed MLOps. Working with LLMs is likely to bring its own interesting challenges, however, and so a new term, LLMOps, has arisen to give this area some good marketing. Is this really any different? I don’t think so, but it should be viewed as a sub-field of MLOps with its own additional challenges. Some of the main challenges that I see in this area are:
- Larger infrastructure, even for fine-tuning: As discussed previously, these models are far too large for typical organizations or teams to consider training their own, so instead, teams will have to leverage third-party models, be they open-source or proprietary, and fine-tune them. Fine-tuning models of this scale will still be very expensive, and so there will be a higher premium on building very efficient data ingestion, preparation, and training pipelines.
- Model management is different: When you train your own models, effective ML engineering requires us to define good practices for versioning our models and storing metadata that provides the lineage of the experiments and training runs we have gone through to produce them. In a world where models are more often hosted externally, this is harder to do, as we do not have access to the training data or the core model artifacts, and probably not even to the detailed model architecture. Versioning metadata will then likely default to the publicly available metadata for the model; think along the lines of gpt-4-v1.3 and similar-sounding names. That is not a lot of information to go on, and so you will likely have to think of ways to enrich this metadata, perhaps with your own example runs and test results, in order to understand how that model behaved in certain scenarios. This then also links to the next point.
- Rollbacks become more challenging: If your model is hosted externally by a third party, you do not control the roadmap of that service. This means that if there is an issue with version 5 of a model and you want to roll back to version 4, that option might not be available to you. This is a different kind of “drift” from the model performance drift we’ve discussed at length in this book, but it is going to become increasingly important. This will mean that you should have your own model, perhaps with nowhere near the same level of functionality or scale, ready as a last resort default to switch to in case of issues.
- Model performance is more of a challenge: As mentioned in the previous point, with foundation models being served as externally hosted services, you are no longer in as much control as you were. This means that if you do detect any issues with the model you are consuming, be they drift or some other bugs, you are very limited in what you can do, and you will need to consider that default rollback we just discussed.
- Applying your own guardrails will be key: LLMs hallucinate, they get things wrong, they can regurgitate training data, and they might even inadvertently offend the person interacting with them. All of this means that as these models are adopted by more organizations, there will be a growing need to develop methods for applying bespoke guardrails to systems utilizing them. As an example, if an LLM was being used to power a next-generation chatbot, you could envisage that between the LLM service and the chat interface, you could have a system layer that checked for abrupt sentiment changes and important keywords or data that should be obfuscated. This layer could utilize simpler ML models and a variety of other techniques. At its most sophisticated, it could try to ensure that the chatbot did not lead to a violation of ethical or other norms established by the organization. If your organization has made the climate crisis an area of focus, you may want to screen the conversation in real-time for information that goes against critical scientific findings in this area, as an example.
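To make the guardrail idea more concrete, here is a minimal sketch of a screening layer that could sit between an LLM service and a chat interface. The keyword list, the placeholder sentiment function, and the thresholds are illustrative assumptions only; a production system would use richer models and organization-specific policies.

import re

# Illustrative lists only; a real system would use richer models and policies
BLOCKED_KEYWORDS = ['password', 'credit card number']
REFUSAL_MESSAGE = "Sorry, I can't help with that request."

def naive_sentiment(text: str) -> float:
    """Crude placeholder sentiment score in [-1, 0]; swap in a real sentiment model."""
    negative_words = {'hate', 'terrible', 'awful'}
    words = re.findall(r'[a-z]+', text.lower())
    if not words:
        return 0.0
    return -len([w for w in words if w in negative_words]) / len(words)

def screen_response(llm_response: str) -> str:
    """Return the response if it passes the checks, otherwise a safe fallback."""
    lowered = llm_response.lower()
    if any(keyword in lowered for keyword in BLOCKED_KEYWORDS):
        return REFUSAL_MESSAGE
    if naive_sentiment(llm_response) < -0.2:
        return REFUSAL_MESSAGE
    return llm_response

print(screen_response("Here is the summary you asked for."))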
Since the era of foundation models has only just begun, it is likely that more and more complex challenges will arise to keep us busy as ML engineers for a long time to come. To me, this is one of the most exciting challenges we face as a community: how we harness one of the most sophisticated and cutting-edge capabilities ever developed by the ML community in a way that still allows the software to run safely, efficiently, and robustly for users day in and day out. Are you ready to take on that challenge?
Let’s dive into some of these topics in a bit more detail, first with a discussion of LLM validation.
Validating LLMs
The validation of generative AI models is inherently different from, and seemingly more complex than, the validation of other ML models. The main reason for this is that when you are generating content, your results are often complex data that has never existed before! If an LLM returns a paragraph of generated text when asked to help summarize and analyze some document, how do you determine if the answer is “good”? If you ask an LLM to reformat some data into a table, how can you build a suitable metric that captures whether it has done this correctly? In a generative context, what do “model performance” and “drift” really mean, and how do I calculate them? Other questions may be more use-case-dependent. For example, if you are building an information retrieval or Retrieval-Augmented Generation (see Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks) solution, how do you evaluate the truthfulness of the text generated by the LLM?
There are also important considerations around how we screen the LLM-generated outputs for any potential biased or toxic outputs that may cause harm or reputational damage to the organization running the model. The world of LLM validation is complex!
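There is no single answer, but one pragmatic starting point is a small, hand-written test harness. The sketch below checks generated answers against a set of required facts and flagged terms, reusing the llm_chain from earlier; the questions, reference facts, and flagged-term list are all illustrative assumptions rather than a recognized benchmark.

# A naive evaluation harness: does the generated answer mention the facts we
# expect, and does it avoid terms we have flagged as unacceptable?
test_cases = [
    {
        'question': "What is MLOps?",
        'required_facts': ['machine learning', 'operations'],
        'flagged_terms': ['guaranteed', 'always correct'],
    },
]

def evaluate_answer(answer: str, required_facts, flagged_terms) -> dict:
    lowered = answer.lower()
    return {
        'facts_covered': sum(fact in lowered for fact in required_facts) / len(required_facts),
        'flagged': any(term in lowered for term in flagged_terms),
    }

for case in test_cases:
    answer = llm_chain.run(case['question'])
    print(case['question'], evaluate_answer(answer, case['required_facts'], case['flagged_terms']))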
What can we do? Thankfully, this has not all happened in a vacuum, and there have been several benchmarking tools and datasets released that can help us on our journey. Things are so young that there are not many worked examples of these tools yet, but we will discuss the key points so that you are aware of the landscape and can keep on top of how things are evolving. Let’s list some of the higher-profile evaluation frameworks and datasets for LLMs:
- OpenAI Evals: This is a framework whereby OpenAI allows for the crowdsourced development of tests against proposed text completions generated by LLMs. The core concept at the heart of evals is the “Completion Function Protocol,” which is a mechanism for standardizing the testing of the strings returned when interacting with an LLM. The framework is available on GitHub.
- Holistic Evaluation of Language Models (HELM): This project from Stanford University styles itself as a “living benchmark” for LLM performance. It gives you a wide variety of datasets, models, and metrics and shows the performance across these different combinations. It is a very powerful resource that you can use to base your own test scenarios on, or indeed just to use the information directly to understand the risks and potential benefits of using any specific LLM for your use case. The HELM benchmarks are publicly available online.
- Guardrails AI: This is a Python package that allows you to do validation on LLM outputs in the same style as Pydantic, which is a very powerful idea! You can also use it to build control flows with the LLM for when issues arise, like a response to a prompt not meeting your set criteria; in this case, you can use Guardrails AI to re-prompt the LLM in the hope of getting a different response. To use Guardrails AI, you specify a Reliable AI Markup Language (RAIL) file that defines the prompt format and expected behavior in an XML-like file. Guardrails AI is available on GitHub.
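To illustrate the style of validation that tools like Guardrails AI enable, here is a minimal sketch using plain Pydantic to validate a structured LLM response. It deliberately does not use the Guardrails API itself; the schema, the sample output string, and the re-prompting comment are illustrative assumptions.

import json
from typing import List

from pydantic import BaseModel, ValidationError

class BookSummary(BaseModel):
    """The structure we expect the LLM to return as JSON."""
    title: str
    author: str
    key_points: List[str]

# Imagine this string came back from an LLM asked to respond with JSON only
llm_output = '{"title": "Machine Learning Engineering with Python", "author": "Andy McMahon", "key_points": ["MLOps", "LLMOps"]}'

try:
    summary = BookSummary(**json.loads(llm_output))
    print(summary)
except (json.JSONDecodeError, ValidationError) as err:
    # In a real system, this is where you might re-prompt the LLM for a better-formed answer
    print("LLM output failed validation:", err)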
There are several more of these frameworks being created all the time, but getting familiar with the core concepts and datasets out there will become increasingly important as more organizations want to take LLM-based systems from fun proofs-of-concept to production solutions. In the penultimate section of this chapter, we will briefly discuss some specific challenges I see around the management of “prompts” when building LLM applications.
PromptOps
When working with generative AI that takes text inputs, the data we input is often referred to as “prompts” to capture the conversational origin of working with these models and the concept that an input demands a response, the same way a prompt from a person would. For simplicity, we will call any input data that we feed to an LLM a prompt, whether this is in a user interface or via an API call, irrespective of the nature of the content we provide to the LLM.
Prompts are often quite different beasts from the data we typically feed into an ML model. They can be effectively freeform, have a variety of lengths, and, in most cases, express the intent for how we want the model to act. In other ML modeling problems, we can certainly feed in unstructured textual data, but this intent piece is missing. This all leads to some important considerations for us as ML engineers working with these models.
First, the shaping of prompts is important. The term prompt engineering has become popular in the data community recently and refers to the fact that there is often a lot of thought that goes into designing the content and format of these prompts. This is something we need to bear in mind when designing our ML systems with these models. We should be asking questions like “Can I standardize the prompt formats for my application or use case?” and “Can I provide appropriate additional formatting or content on top of what a user or input system provides to get a better outcome?” I will stick with calling this prompt engineering.
Secondly, prompts are not your typical ML input, and tracking and managing them is a new and interesting challenge. This challenge is compounded by the fact that the same prompt may give very different outputs for different models, or even for different versions of the same model. We should think carefully about tracking the lineage of our prompts and the outputs they generate. I term this challenge prompt management.
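As a minimal sketch of what prompt management might look like in practice, the snippet below appends each prompt, the model identifier, and the response to a JSON Lines log so that outputs can later be traced back to their inputs. It reuses the question and llm_chain from the earlier example; the file path, field names, and hashing choice are illustrative assumptions rather than a standard.

import json
import hashlib
from datetime import datetime, timezone

def log_prompt_run(prompt_text: str, model_name: str, response_text: str,
                   log_path: str = 'prompt_runs.jsonl') -> None:
    """Append a single prompt/response record for later lineage analysis."""
    record = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'model_name': model_name,
        'prompt': prompt_text,
        'prompt_hash': hashlib.sha256(prompt_text.encode('utf-8')).hexdigest(),
        'response': response_text,
    }
    with open(log_path, 'a', encoding='utf-8') as log_file:
        log_file.write(json.dumps(record) + '\n')

response = llm_chain.run(question)
log_prompt_run(question, 'gpt-3.5-turbo', response)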
Finally, we have a challenge that is not necessarily unique to prompts but definitely becomes a more pertinent one if we allow users of a system to feed in their own prompts, for example, in chat interfaces. In this case, we need to apply some sort of screening and obfuscation rules to data coming in and coming out of the model to ensure that the model is not “jailbroken” in some way to evade any guardrails. We would also want to guard against adversarial attacks that may be designed to extract training data from these systems, thereby gaining personally identifiable or other critical information that we do not wish to share.
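For the input side, here is a minimal sketch of a pre-processing step that obfuscates obvious personally identifiable information before a user prompt ever reaches the model. The regular expressions cover only email addresses and simple phone numbers and are illustrative assumptions, not a complete PII or jailbreak-detection solution.

import re

# Very simple patterns for illustration; real systems need far more coverage
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_PATTERN = re.compile(r'\+?\d[\d\s()-]{7,}\d')

def obfuscate_prompt(user_prompt: str) -> str:
    """Mask obvious PII in a user-supplied prompt before sending it to the LLM."""
    cleaned = EMAIL_PATTERN.sub('[EMAIL REDACTED]', user_prompt)
    cleaned = PHONE_PATTERN.sub('[PHONE REDACTED]', cleaned)
    return cleaned

safe_prompt = obfuscate_prompt("Email me at jane.doe@example.com about MLOps.")
print(safe_prompt)  # Email me at [EMAIL REDACTED] about MLOps.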
As you begin to explore this brave new world of LLMOps with the rest of the world, it will be important to keep these prompt-related challenges in mind.
Take on the Challenge!
Hopefully, the preceding paragraphs have convinced you that there are some unique areas to explore when it comes to LLMOps and that this area is ripe for innovation. These points barely scratch the surface of this rich new world, but I personally think they highlight that we do not have the answers yet. Are you ready to help build the future?