Self Hosting OpenAI Chat Endpoint with GPU-accelerated MistralOrca 7B 8K (GGUF) and Llama CPP Python Server
Serving models as an emulated OpenAI Endpoint enables a few important benefits:
Application start-up is decoupled from model initialization (faster code iteration)
Served model can be swapped without changing your code (faster model swap)
Any GGUF formatted model should work, but I am using the new MistralOrca 7B.
I actually did this on Windows (so I could still play Rocket League), but will provide details for both Linux and Windows. I will assume in both cases you already installed Nvidia drivers. Reach out if you get stuck on Nvidia stuff.
You will probably want to create and enter a working directory for this. I created a folder called llama-server.
Step 1: Download the GGUF model file
In general, you can google “TheBloke <MODEL NAME> GGUF” and get the GGUF files for popular models. Here is a direct link to TheBloke’s MistralOrca 7B GGUF.
Navigate to the Files and Versions tab, then download the appropriate quantization for your available GPU VRAM. I have an RTX 3090 with 24GB of VRAM so I can fit any of these. I chose the Q6_K quantization to keep the best quality.
If you have less VRAM, try out different quantizations to see what works. Your CPU will pick up whatever can’t fit into GPU VRAM, but it will run slower, so it is up to you to make the quality/speed tradeoff if you have a smaller card.
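As a rough rule of thumb when picking a quantization, file size (and the VRAM needed to fully offload it) is about parameter count times bits per weight. A back-of-envelope sketch; the bits-per-weight figures below are approximations, not exact values:

```python
def approx_gguf_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in GB: parameters x bits per weight, ignoring overhead."""
    return n_params_billion * bits_per_weight / 8

# Approximate bits per weight for common llama.cpp quantizations (assumed figures)
QUANT_BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

for name, bpw in QUANT_BPW.items():
    # Mistral 7B has roughly 7.24B parameters
    print(f"{name}: ~{approx_gguf_size_gb(7.24, bpw):.1f} GB")
```

Leave a few GB of headroom on top of that for the KV cache, which grows with context length.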
Save the GGUF file in your working directory.
Step 2: Set up a Virtual Environment (Optional)
You don’t necessarily need to do all these venv steps, but if you are doing any other python stuff it will prevent a lot of conflict nonsense between applications.
Install Python 3 and the Python 3 venv module, then create a new virtual environment in your working directory:
python3 -m venv llamaserver
Then source that virtual environment:
Windows Powershell:
.\llamaserver\Scripts\Activate.ps1
Linux BASH:
source llamaserver/bin/activate
Step 3: Install Llama CPP Python with Server
The llama-cpp-python project is all you need to get up and running with your GGUF model. Install it by following the hardware acceleration steps in the README. Be sure to include the [server] extra! You should absolutely read the full README anyway, but I have included the necessary commands for Nvidia users.
Windows Powershell:
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python[server] --force-reinstall --upgrade --no-cache-dir
Linux BASH:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python[server] --force-reinstall --upgrade --no-cache-dir
You probably don’t need the --force-reinstall --upgrade --no-cache-dir flags at the end of the command, but I included them in case you need to run it again or have a conflicting install; they shouldn’t cause trouble on a fresh install anyway.
Step 4: Run the OpenAI Endpoint server!
Create a run script in your working directory. For this model, the HuggingFace page says it uses the ChatML chat format, and the context is 8K. I think it actually only has around 35 layers, so setting --n_gpu_layers higher than the real layer count doesn’t hurt. Play with that setting.
Windows Powershell (run.ps1):
deactivate
.\llamaserver\Scripts\Activate.ps1
python -m llama_cpp.server `
--model .\mistral-7b-openorca.Q6_K.gguf `
--n_ctx 8192 `
--chat_format chatml `
--n_gpu_layers 43 `
--model_alias gpt-3.5-turbo `
--host 0.0.0.0 --port 8180
Linux BASH (run.sh):
deactivate
source llamaserver/bin/activate
python -m llama_cpp.server \
--model ./mistral-7b-openorca.Q6_K.gguf \
--n_ctx 8192 \
--chat_format chatml \
--n_gpu_layers 43 \
--model_alias gpt-3.5-turbo \
--host 0.0.0.0 --port 8180
Then run the script. You should see your GPU mentioned with the GPU layer offload mentioned (if not, you might have missed the GPU build specification in Step 3). Here is part of my output so you know how it should start (notice the GPU is mentioned):
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from .\mistral-7b-openorca.Q6_K.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q6_K [ 4096, 32002, 1, 1 ]
...
Once it is up, you can visit http://127.0.0.1:8180/docs for the interactive API docs, or use Langchain’s ChatOpenAI wrapper against your new endpoint, overriding the default OpenAI API base with your hosted one:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(
    temperature=0,
    openai_api_key="use llama-cpp-python.server lol!",
    max_tokens=512,
)
Set the base URL before running your script:
export OPENAI_API_BASE=http://127.0.0.1:8180/v1
python3 example.py
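If you would rather skip LangChain entirely, any OpenAI-compatible client works, since the server speaks the standard chat completions protocol. Here is a minimal stdlib-only sketch; the URL and model alias match the run script above, and build_chat_request / chat are hypothetical helper names, not part of any library:

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8180/v1"  # your llama_cpp.server endpoint


def build_chat_request(prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    # Standard OpenAI chat-completions payload; the server accepts the same schema.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(prompt: str) -> str:
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Hello!") should return a completion from your local model.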
Conclusion
I have looked through several projects and determined this to be the most straightforward for at least my hobby use case. Single-command installation. Works well on Windows. TheBloke delivers consistent GGUF models almost as soon as the official ones are released.
Being able to architect my LangChain projects or even Autogen agents to use a simple OpenAI endpoint has dramatically simplified integration and made upgrading to the latest model as easy as pulling the latest GGUF and updating my run script to use the new model. No code changes. No library modifications for the server. I am up and running with a new GGUF model in only a couple minutes after the download.
Did you find this helpful? Did you get stuck? Did I miss something? Reach out on X.