Running LLMs Locally: A Step-by-Step Guide
In this article, take a closer look at LocalAI, an open-source alternative to OpenAI that allows you to run LLMs on your local machine.
In this post, you will take a closer look at LocalAI, an open-source alternative to OpenAI that allows you to run LLMs on your local machine. No GPU is needed: consumer-grade hardware will suffice. Enjoy!
Introduction
OpenAI is a great tool. However, you may not be allowed to use it due to company policies, because you might send sensitive information to OpenAI. Besides that, you might want to experiment with different kinds of LLMs (Large Language Models). Wouldn’t it be great if you could run models locally using the same REST API as OpenAI? Well, that is exactly what LocalAI has to offer! LocalAI is an open-source alternative to OpenAI with a REST API that is compatible with the OpenAI API specifications. Moreover, no GPU is needed: you can run it on consumer-grade hardware. It is advised, however, to use a GPU, because it will be approximately 20 times faster.
Prerequisites
Actually, there are no prerequisites for reading this blog. As I am at the beginning of learning more about AI applications, this blog is written at an entry level. There is no need to know how LLMs work internally: we will just make use of them.
You do need the following tools:
- Git
- Docker Compose
- curl or an equivalent tool (Postman, for example)
Installation
The installation of LocalAI for the CPU is described here. This section contains the steps and the changes I made in order to install LocalAI.
Clone the LocalAI git repository.
$ git clone https://github.com/go-skynet/LocalAI
Navigate into the repository directory.
$ cd LocalAI
The repository contains a .env file that you need to customize.
- Uncomment THREADS and adjust the number to the number of physical cores you have (12 in my case; see the command after the file excerpt below for a quick way to check this).
- Uncomment GALLERIES and adjust it to the galleries as described in the installation guide.
The top of the file looks as follows:
## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=12
## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080
## Default models context size
# CONTEXT_SIZE=512
#
## Define galleries.
## models to install will be visible in `/models/available`
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]
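If you are not sure how many physical cores your machine has, on Linux you can check with lscpu and multiply the number of sockets by the cores per socket. A quick sketch (the grep pattern is my own; the exact labels may differ per distribution):
$ lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket)'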
Start the Docker container. The Docker image refers to the latest tag (at the time of writing, v2.0.0 of LocalAI is the latest version). You can find out which version the latest tag points to by navigating to the image repository, searching for the latest tag, copying its manifest hash, and then searching for that manifest hash among the versioned tags.
$ docker compose up -d --pull always
Be patient: this takes some time. The image is about 70GB. The previous version v1.40.0 was approximately 14GB.
When the container has started successfully, you should be able to retrieve the available models:
$ curl http://localhost:8080/models/available
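The list of available models is quite long. To check whether the Luna model used below is present, you can filter the output, for example with jq and grep (a sketch, assuming jq is installed and that each entry in the response has a name field):
$ curl -s http://localhost:8080/models/available | jq '.[].name' | grep -i luna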
Install a Model
First, you need to install a model. You can do so through the model gallery API, but at the time of writing, this is still experimental; I prefer to add the model manually. The instructions can be found here, but do know that they might change over time, so do not rely solely on the contents of this section.
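For reference, the experimental gallery installation goes through the models/apply endpoint and looks roughly like the request below; the model identifier is a placeholder, and since this API may still change, check the LocalAI documentation for the current form.
$ curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
 "id": "model-gallery@<model-name>"
}'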
Create a file lunademo.yaml in the models directory. Change the threads setting to the number of physical cores on your machine.
name: lunademo
parameters:
  model: luna-ai-llama2-uncensored.Q5_K_M.gguf
  top_k: 80
  temperature: 0.2
  top_p: 0.7
context_size: 1024
threads: 12
backend: llama
roles:
  assistant: 'ASSISTANT:'
  system: 'SYSTEM:'
  user: 'USER:'
template:
  chat: lunademo-chat
  completion: lunademo-completion
The model parameter refers to a file containing the model. Download this file from HuggingFace to the models directory, for example with the command shown below. HuggingFace hosts many open-source models that you can use; in this example, you will use a model based on Llama 2, the AI model created by Meta. Note that the Model Card lists the models together with their use cases and states which models are recommended. Be sure to use only GGUF models: GGML is no longer supported for Llama 2.
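For example, assuming you use TheBloke's GGUF conversion of the Luna model (check the Model Card for the exact download URL), the download can be done with curl from the LocalAI directory:
$ curl -L -o models/luna-ai-llama2-uncensored.Q5_K_M.gguf \
  https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q5_K_M.gguf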
Also, note that two templates are defined in the model configuration file: a chat template and a completion template.
Create a file lunademo-chat.tmpl in the models directory. The template is derived from the Model Card at HuggingFace (search for Prompt template).
USER: {{.Input}}
ASSISTANT:
Create a file lunademo-completion.tmpl in the models directory.
Complete the following sentence: {{.Input}}
Restart the Docker container in order to load the model.
$ docker compose restart
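To verify that the model configuration is picked up after the restart, you can follow the container logs:
$ docker compose logs -f --tail=100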
Ask Questions
Now that a model has been loaded, you can start asking questions. You can take a look at the OpenAPI specification for the full API; below, some examples are shown in order to verify how the local model responds and how accurate it is.
1. How Are You?
As a first simple example, you ask the model how it is feeling. In the request, you specify the model to be used and the message, and you can set the temperature. A high temperature allows the model to be more creative. The model answers that it is doing well.
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "lunademo",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9
}'
{
"created":1700993538,
"object":"chat.completion",
"id":"2fe33052-f4be-4724-8b53-fdade80b49de",
"model":"lunademo",
"choices":[
{
"index":0,
"finish_reason":"stop",
"message":{
"role":"assistant",
"content":"I'm doing well, thank you. How about yourself?"
}
}
],
"usage":{
"prompt_tokens":0,
"completion_tokens":0,
"total_tokens":0
}
}
2. A Fact About a Famous Actor
Let’s ask the model whether it knows who Leonardo di Caprio is. You set the temperature to zero because you only want facts. The answer is short but correct. Also, note that the model corrected the spelling of the name in the response.
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "lunademo",
"messages": [{"role": "user", "content": "who is leonardo di caprio?"}],
"temperature": 0
}'
{
"created":1700993538,
"object":"chat.completion",
"id":"2fe33052-f4be-4724-8b53-fdade80b49de",
"model":"lunademo",
"choices":[
{
"index":0,
"finish_reason":"stop",
"message":{
"role":"assistant",
"content":"Leonardo DiCaprio is an American actor and film producer. He has appeared in numerous films, including \"Titanic,\" \"The Revenant,\" and \"The Wolf of Wall Street.\""
}
}
],
"usage":{
"prompt_tokens":0,
"completion_tokens":0,
"total_tokens":0
}
}
3. Facts About a Famous Soccer Player
Let’s verify whether it also knows the famous Dutch soccer player Johan Cruijff. In this case, the answer is also correct.
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "lunademo",
"messages": [{"role": "user", "content": "who is Johan Cruijff?"}],
"temperature": 0
}'
{
"created":1700993538,
"object":"chat.completion",
"id":"2fe33052-f4be-4724-8b53-fdade80b49de",
"model":"lunademo",
"choices":[
{
"index":0,
"finish_reason":"stop",
"message":{
"role":"assistant",
"content":"Johan Cruyff was a Dutch professional football player and coach. He played as a forward for Ajax, Barcelona, and the Netherlands national team. He is widely regarded as one of the greatest players of all time and was known for his creativity, skill, and ability to score goals from any position on the field."
}
}
],
"usage":{
"prompt_tokens":0,
"completion_tokens":0,
"total_tokens":0
}
}
4. A Fairy Tale
Ask the model to write a short fairy tale, and set the temperature to a high value again. The answer is a nice fairy tale including goblins and fairies.
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "lunademo",
"messages": [{"role": "user", "content": "write a short fairy tale including goblins and fairies"}],
"temperature": 0.9
}'
{
"created":1700993538,
"object":"chat.completion",
"id":"2fe33052-f4be-4724-8b53-fdade80b49de",
"model":"lunademo",
"choices":[
{
"index":0,
"finish_reason":"stop",
"message":{
"role":"assistant",
"content":"Once upon a time, in a faraway forest lived a group of mischievous goblins. They loved to cause trouble for the creatures living in the woods. One day, they decided to prank the fairies who lived on the other side of the forest.\n\nThe goblins snuck into the fairy's garden and stole all their flowers. The fairies were devastated and knew they had to get their flowers back. They searched high and low but couldn't find any trace of the goblins.\n\nJust when they thought all hope was lost, a wise old fairy appeared before them. She told them that she knew where the goblins had taken the flowers and gave them a clue to find them.\n\nThe fairies followed the clue and found themselves in front of a cave. They were hesitant to enter but the wise fairy encouraged them to be brave. Inside the cave, they saw the goblins playing with the stolen flowers.\n\nThe fairies were furious and ready to punish the goblins for their mischief. But the wise old fairy stopped them and suggested a different approach. She told them that instead of getting angry, they should try to understand why the goblins had done this.\n\nThe fairies listened to her and decided to talk to the goblins. They explained to them how much the flowers meant to them and how they were used for healing and magic spells. The goblins were sorry for their prank and promised never to do it again.\n\nFrom that day on, the fairies and goblins became friends and would often play together in the forest. They learned to appreciate each other's differences and lived happily ever after."
}
}
],
"usage":{
"prompt_tokens":0,
"completion_tokens":0,
"total_tokens":0
}
}
5. Other Languages
Up until now, you used English as a language to interact. But what if you need to use a different language - Dutch, for example? Is the model able to understand and answer in Dutch? Let’s find out!
Ask the model ‘Do you understand Dutch?’. The answer is yes.
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "lunademo",
"messages": [{"role": "user", "content": "begrijp je nederlands?"}],
"temperature": 0
}'
{
"created":1700993538,
"object":"chat.completion",
"id":"2fe33052-f4be-4724-8b53-fdade80b49de",
"model":"lunademo",
"choices":[
{
"index":0,
"finish_reason":"stop",
"message":{
"role":"assistant",
"content":"Ja, ik kan Nederlands begrijpen."
}
}
],
"usage":{
"prompt_tokens":0,
"completion_tokens":0,
"total_tokens":0
}
}
Ask the model in Dutch who Johan Cruijff is. The model answers correctly, but it answers in English.
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "lunademo",
"messages": [{"role": "user", "content": "wie is johan cruijff?"}],
"temperature": 0
}'
{
"created":1700993538,
"object":"chat.completion",
"id":"2fe33052-f4be-4724-8b53-fdade80b49de",
"model":"lunademo",
"choices":[
{
"index":0,
"finish_reason":"stop",
"message":{
"role":"assistant",
"content":"Johan Cruyff was a Dutch professional football player and manager. He played as a forward for Ajax, Barcelona, and the Netherlands national team. He is widely regarded as one of the greatest players of all time and is known for his innovative playing style."
}
}
],
"usage":{
"prompt_tokens":0,
"completion_tokens":0,
"total_tokens":0
}
}
You can fix this by instructing the assistant to always answer in Dutch. You do so by adding a system message with this instruction to the request. The answer is in Dutch and correct. This is quite amazing, isn’t it? Dutch is not a widely spoken language, and you are running the model locally!
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "lunademo",
"messages": [
{"role": "system", "content": "You are a helpful assistant. Antwoord altijd in het Nederlands."},
{"role": "user", "content": "wie is Johan Cruijff?"}],
"temperature": 0
}'
{
"created":1700993538,
"object":"chat.completion",
"id":"2fe33052-f4be-4724-8b53-fdade80b49de",
"model":"lunademo",
"choices":[
{
"index":0,
"finish_reason":"stop",
"message":{
"role":"assistant",
"content":"Johan Cruijff was een Nederlandse voetballer die bekendstond om zijn technische vaardigheden en zijn snelle, creatieve spel. Hij speelde als middenvelder voor onder andere Ajax, Barcelona en het Nederlands elftal."
}
}
],
"usage":{
"prompt_tokens":0,
"completion_tokens":0,
"total_tokens":0
}
}
6. Stream the Response
Sometimes, the answer will take some time. However, by adding the stream parameter to the request, you do not have to wait for the complete response: you receive the answer piece by piece, so you can display it to the user while it is being generated. This gives a better user experience.
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "lunademo",
"messages": [{"role": "user", "content": "who is Johan Cruijff?"}],
"temperature": 0,
"stream": true
}'
data: {"created":1700993538,"object":"chat.completion.chunk","id":"2fe33052-f4be-4724-8b53-fdade80b49de","model":"lunademo","choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1700993538,"object":"chat.completion.chunk","id":"2fe33052-f4be-4724-8b53-fdade80b49de","model":"lunademo","choices":[{"index":0,"delta":{"content":"J"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
...
data: {"created":1700993538,"object":"chat.completion.chunk","id":"2fe33052-f4be-4724-8b53-fdade80b49de","model":"lunademo","choices":[{"index":0,"finish_reason":"stop","delta":{"content":""}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: [DONE]
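The chunks follow the server-sent events format shown above: each line starts with data: and contains a JSON chunk with a delta. A rough way to turn this into readable streaming text on the command line is the pipeline below (a sketch, assuming jq is installed; a real client library would handle this for you):
$ curl -sN http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
 "model": "lunademo",
 "messages": [{"role": "user", "content": "who is Johan Cruijff?"}],
 "temperature": 0,
 "stream": true
}' | sed -u 's/^data: //' | grep --line-buffered -v '^\[DONE\]' | jq -rj --unbuffered '.choices[0].delta.content // empty'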
7. Format Response as JSON
Verify whether the answer can be formatted as a JSON object.
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "lunademo",
"messages": [{"role": "user", "content": "who is Johan Cruijff? Format the response as a JSON object containing firstName, lastName and clubs"}],
"temperature": 0
}'
{
"created":1700993538,
"object":"chat.completion",
"id":"2fe33052-f4be-4724-8b53-fdade80b49de",
"model":"lunademo",
"choices":[
{
"index":0,
"finish_reason":"stop",
"message":{
"role":"assistant",
"content":"{\n \"firstName\": \"Johan\",\n \"lastName\": \"Cruijff\",\n \"clubs\": [\n {\n \"name\": \"Ajax Amsterdam\",\n \"startYear\": 1957,\n \"endYear\": 1968\n },\n {\n \"name\": \"Barcelona\",\n \"startYear\": 1968,\n \"endYear\": 1973\n },\n {\n \"name\": \"Manchester United\",\n \"startYear\": 1973,\n \"endYear\": 1974\n }\n ]\n"
}
}
],
"usage":{
"prompt_tokens":0,
"completion_tokens":0,
"total_tokens":0
}
}
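Instead of copying the content field by hand, you can extract it directly from the response with jq (assuming jq is installed):
$ curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
 "model": "lunademo",
 "messages": [{"role": "user", "content": "who is Johan Cruijff? Format the response as a JSON object containing firstName, lastName and clubs"}],
 "temperature": 0
}' | jq -r '.choices[0].message.content'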
The content is a JSON object, and it is formatted just as you asked.
{
"firstName":"Johan",
"lastName":"Cruijff",
"clubs":[
{
"name":"Ajax Amsterdam",
"startYear":1957,
"endYear":1968
},
{
"name":"Barcelona",
"startYear":1968,
"endYear":1973
},
{
"name":"Manchester United",
"startYear":1973,
"endYear":1974
}
]
}
However, do note that you also asked the model to list the clubs Johan Cruijff played for. Although the output looks plausible, Johan Cruijff never played for Manchester United. Also, the start and end years for Ajax and Barcelona are not correct. The model is hallucinating here, even with the temperature set to 0. See Wikipedia for the details.
Conclusion
Running an LLM locally is possible by means of LocalAI. You can run it even if you do not have a GPU. This is very promising and opens the door to using LLMs even if your company policies do not allow you to use cloud-hosted LLMs.