Install Llama-Cpp-Python With GPU Support
This article walks through installing the llama-cpp-python package with GPU capability (cuBLAS) so that models can be loaded onto the GPU easily.
If you are looking for a step-wise approach to installing the llama-cpp-python package, you are in the right place. This guide summarizes the steps required for installation.
Before we install, are you wondering why we need to install this package separately with GPU capability?
This package gives us a class (LlamaCpp) to create a model instance or object, primarily for pre-trained LLM models.
By default, even if you have an Nvidia GPU in your system with all the CUDA compilers and packages installed, this package installs with CPU-only capability.
Installing with GPU capability enabled speeds up the computation of LLMs (Large Language Models) by automatically transferring the model onto the GPU.
In this guide, detailed steps are provided to install this package using cuBLAS (GPU-accelerated library) provided by Nvidia.
Tested System Configuration
- System — Azure VM
- OS — Ubuntu 20.04
- LLM model used — Mistral-7B
Prerequisites
1. Ensure the Nvidia CUDA toolkit is installed; the minimum required version is 12.2.
- Download the required package from Nvidia's official website and install it.
- Verify the successful installation of the toolkit by running the nvidia-smi command; it should detect your GPU.
- Also, verify the installation in the /usr/local/ directory: a cuda-12.2 directory should have been created there, containing all the required files.
2. Install GCC and G++ compilers to compile and install packages
- Add the gcc repository using the below command.
- sudo add-apt-repository ppa:ubuntu-toolchain-r/test
- Install gcc and g++ compilers using the command below.
- sudo apt install gcc-11 g++-11 (minimum required version is 11 for gcc and g++ compilers)
- Update alternatives using the below command to make version 11 the default.
- sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 60 --slave /usr/bin/g++ g++ /usr/bin/g++-11
- Check the installed versions of GCC and G++ for correct installation.
- gcc --version # This should print the gcc version as 11.4.0
- g++ --version # This should print the g++ version as 11.4.0
3. Install the LangChain and cmake packages using the below command.
pip install langchain cmake
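The prerequisite checks above can be run together as a short shell session; this is a sketch assuming the paths and versions used in this guide (cuda-12.2, gcc-11), and the output will differ on other systems:

```shell
# Verify the CUDA toolkit: nvidia-smi should detect the GPU,
# and nvcc should report the toolkit version (12.2 here).
nvidia-smi
nvcc --version

# Confirm the toolkit files landed under /usr/local/cuda-12.2
ls /usr/local/cuda-12.2

# Confirm the compilers default to version 11
gcc --version
g++ --version
```

If any of these commands fails or reports an older version, revisit the corresponding prerequisite step before installing llama-cpp-python.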
Llama-CPP Installation
- By default, the llama-cpp-python build picks up the default CUDA version available on the VM. If multiple CUDA versions are installed, a specific version needs to be specified.
- Use the below command for the installation of the package.
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.2 -DCUDAToolkit_ROOT=/usr/local/cuda-12.2 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12.2/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.2/lib64" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
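Note that the build flag names have changed across llama.cpp releases: in newer versions, LLAMA_CUBLAS was deprecated in favor of a CUDA flag (GGML_CUDA in recent releases). A hedged sketch for a recent release, assuming the same toolkit path as above:

```shell
# Newer llama-cpp-python releases renamed the cuBLAS flag;
# try GGML_CUDA=on if the LLAMA_CUBLAS flag is rejected by the build.
CMAKE_ARGS="-DGGML_CUDA=on -DCUDAToolkit_ROOT=/usr/local/cuda-12.2" \
FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
```

Check the flag names against the llama-cpp-python README for the exact version you are installing.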
Verifying Installation
Verify the installation by creating an instance of the LLM model with the verbose = True parameter enabled.
from langchain.llms import LlamaCpp
model = LlamaCpp(model_path=model_path, n_gpu_layers=-1, verbose=True)
n_gpu_layers = -1 is the main parameter that transfers the available computation layers onto the GPU. Alternatively, you can set the exact number of layers you want to transfer, but -1 automatically calculates and transfers all of them.
verbose = True prints the model's details and parameters.
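To make the n_gpu_layers semantics concrete, here is a small illustrative helper. The function is our own sketch, not part of llama-cpp-python, but it mirrors how the offload count resolves:

```python
def layers_offloaded(n_gpu_layers: int, total_layers: int) -> int:
    """Illustrative sketch (not a llama-cpp-python API): how many
    transformer layers end up on the GPU for a given n_gpu_layers."""
    if n_gpu_layers < 0:
        # -1 (or any negative value) means "offload every layer"
        return total_layers
    # Otherwise offload at most the requested number of layers
    return min(n_gpu_layers, total_layers)

print(layers_offloaded(-1, 32))   # all 32 Mistral-7B layers on the GPU
print(layers_offloaded(10, 32))   # partial offload: 10 layers
```

Partial offload (a positive n_gpu_layers) is useful when the model does not fit entirely in GPU memory.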
On the terminal console, when the model is loaded, check for the following lines.
Device: <your-gpu-name> (Ex: Device 0: Tesla T4)
BLAS = 1 (indicates that the model is loaded onto the GPU)
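If you want to check the verbose output programmatically rather than by eye, a minimal sketch (assuming the BLAS = 1 line format shown above, which may vary between llama-cpp-python versions) could look like:

```python
def gpu_offload_confirmed(log_text: str) -> bool:
    """Return True if a llama.cpp verbose load log reports BLAS = 1.

    Illustrative helper; the exact log format may differ between
    llama-cpp-python versions.
    """
    return any("BLAS = 1" in line for line in log_text.splitlines())

sample_log = "Device 0: Tesla T4\nsystem_info: BLAS = 1"
print(gpu_offload_confirmed(sample_log))  # True
```

Capturing the verbose output (for example, by redirecting stderr) lets you fail fast in scripts when the model silently falls back to the CPU.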
Comparison
LlamaCPP With CPU
Time taken to load Mistral-7B model: 1 min (approx)
Time taken to generate a response to a query: 20 min (approx)
LlamaCPP With GPU
Time taken to load Mistral-7B model: 30 sec (approx)
Time taken to generate a response to a query: 30 sec (approx)
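The approximate timings above translate to the following speedups (simple arithmetic on the reported numbers; these are rough, single-run figures, not a benchmark):

```python
# Approximate timings reported above, in seconds
cpu_load, cpu_generate = 1 * 60, 20 * 60
gpu_load, gpu_generate = 30, 30

print(f"Load speedup: {cpu_load / gpu_load:.0f}x")                 # 2x
print(f"Generation speedup: {cpu_generate / gpu_generate:.0f}x")   # 40x
```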
Conclusion
Based on the load time and response generation, there is a significant performance difference when we use the llama-cpp-python package with GPU support. If you have one or more GPUs attached to your system, consider installing this package with GPU support for better performance.
Published at DZone with permission of Manish Kovelamudi. See the original article here.