8BitMiscreant

Up and Running with llama.cpp

Just my notes on configuring llama.cpp.

TL;DR Use llama-setup.sh and llama-run.sh from my script-stash!

Table of Contents

  • Build
  • llama-cli
  • llama-server
  • Remote Access (using Tailscale)
  • Tools
  • FAQ/Errors

Build

CPU Build

cmake -B build
cmake --build build --config Release -j

NOTE: Better to specify -jN, where N is the number of parallel build jobs, to keep the build from running out of memory.
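For example, a sketch that matches the job count to the machine (nproc reports the available CPU count; the cmake line is left commented so the snippet stands on its own, and you can halve N if the build still runs out of memory):

```shell
# Use the CPU count as the parallel job count.
N=$(nproc)
echo "building with $N parallel jobs"
# cmake --build build --config Release -j"$N"
```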

GPU Build

  1. Prerequisite: CUDA Toolkit

    wget https://developer.download.nvidia.com/compute/cuda/13.1.1/local_installers/cuda-repo-fedora42-13-1-local-13.1.1_590.48.01-1.x86_64.rpm
    sudo rpm -i cuda-repo-fedora42-13-1-local-13.1.1_590.48.01-1.x86_64.rpm
    sudo dnf clean all
    sudo dnf -y install cuda-toolkit-13-1
    

    Takes a lifetime to download!

  2. Build llama.cpp

    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j
    

    See bottom for FAQ/Error fixes!

llama-cli

./build/bin/llama-cli --model models/<MODEL_NAME>
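A few commonly useful flags (a template, not a full reference; <MODEL_NAME> is a placeholder and defaults may change between llama.cpp releases):

```shell
# -ngl: number of layers to offload to the GPU (CUDA build); 99 = effectively all
# -c:   context size in tokens
# -p:   prompt, for a non-interactive run
./build/bin/llama-cli --model models/<MODEL_NAME> -ngl 99 -c 4096 -p "Hello"
```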

llama-server

Launch the server at localhost:8080:

./build/bin/llama-server --model models/<MODEL_NAME>

Accessing on the local network

./build/bin/llama-server --model models/<MODEL_NAME> --host 0.0.0.0 --port 8080
ip addr   # or ifconfig if net-tools is installed; note your machine's <LOCAL_IP>

Access using:

http://<LOCAL_IP>:<PORT>

Preferably, protect the server with an API key:

./build/bin/llama-server --model models/<MODEL_NAME> --host 0.0.0.0 --port 8080 --api-key <SECRET_KEY>

Generate a base64 secret key using openssl!
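Putting it together, a sketch: generate the key, then pass it to the server and to clients. The launch and request lines are templates (hence commented); <MODEL_NAME> and <LOCAL_IP> are placeholders, and the /v1/chat/completions path assumes llama-server's OpenAI-compatible API.

```shell
# 32 random bytes, base64-encoded (44 characters)
KEY=$(openssl rand -base64 32)
echo "API key: $KEY"

# Templates: start the server with the key, then call it with a Bearer token.
# ./build/bin/llama-server --model models/<MODEL_NAME> --host 0.0.0.0 --port 8080 --api-key "$KEY"
# curl http://<LOCAL_IP>:8080/v1/chat/completions \
#   -H "Authorization: Bearer $KEY" \
#   -H "Content-Type: application/json" \
#   -d '{"messages":[{"role":"user","content":"Hello!"}]}'
```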

Remote Access (using Tailscale)

  1. Setting up Tailscale (https://tailscale.com/download/linux/)

    sudo dnf config-manager addrepo --from-repofile=https://pkgs.tailscale.com/stable/fedora/tailscale.repo
    sudo dnf install tailscale
    sudo systemctl enable --now tailscaled
    
  2. Connect the machine to Tailscale and authenticate:

    sudo tailscale up
    
  3. Find the device's Tailscale IP:

    tailscale ip -4
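With the tailnet up, the server is reachable from any of your Tailscale devices. A sketch (the llama-server line is a template; binding to the Tailscale IP rather than 0.0.0.0 keeps the server off the local LAN):

```shell
# Bind llama-server to this machine's Tailscale IPv4 address only.
TS_IP=$(tailscale ip -4)
./build/bin/llama-server --model models/<MODEL_NAME> --host "$TS_IP" --port 8080
```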
    

Tools

  1. vmtouch
    • Examines and controls which parts of files or directories are resident in the system's RAM; useful for analyzing how much of a model the page cache holds (if using a CPU build).
    • vmtouch <model.gguf> - check how much of a model is cached
  2. nvidia-smi
    • Monitors, manages, and queries NVIDIA GPU devices. Provides real-time information about GPU utilization, memory usage, temperature, power consumption, and running processes.
    • watch -n 1 nvidia-smi - continuously monitor GPU stats (refresh every 1s)
    • nvidia-smi pmon - show running processes using GPU
  3. openssl
    • To generate base64 API key: openssl rand -base64 32

FAQ/Errors

Nvidia CUDA build error (rsqrt, rsqrtf)

/usr/include/bits/mathcalls.h(206): error: exception specification is
incompatible with that of previous function "rsqrt" (declared at line 629
of
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h)
extern double rsqrt (double __x) noexcept (true); extern double __rsqrt (double __x) noexcept (true);
^
/usr/include/bits/mathcalls.h(206): error: exception specification is
incompatible with that of previous function "rsqrtf" (declared at line 653
of
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h)
extern float rsqrtf (float __x) noexcept (true); extern float __rsqrtf (float __x) noexcept (true);
^
2 errors detected in the compilation of "CMakeCUDACompilerId.cu".
# --error 0x2 --
Call Stack (most recent call first):
/usr/share/cmake/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
/usr/share/cmake/Modules/CMakeDetermineCompilerId.cmake:53 (__determine_compiler_id_test)
/usr/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:139 (CMAKE_DETERMINE_COMPILER_ID)
ggml/src/ggml-cuda/CMakeLists.txt:58 (enable_language)

Fix

Add noexcept (true) to the declarations of the offending functions (rsqrt, rsqrtf) in /usr/local/cuda/targets/x86_64-linux/include/crt/math_functions.h so they match the glibc declarations in /usr/include/bits/mathcalls.h.
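A hedged sketch of that edit, demonstrated on a sample snippet rather than the real header (the declaration text below is illustrative and may differ by CUDA version; back up math_functions.h before touching it):

```shell
# Sample standing in for the rsqrt/rsqrtf declarations in math_functions.h
cat > /tmp/math_functions_sample.h <<'EOF'
extern __device__ double rsqrt(double x);
extern __device__ float rsqrtf(float x);
EOF

# Append noexcept(true) to each declaration so it matches glibc's mathcalls.h
sed -i -E 's/(rsqrtf? *\([^)]*\));/\1 noexcept(true);/' /tmp/math_functions_sample.h
cat /tmp/math_functions_sample.h
```

On the real file, the same sed (run as root) would target /usr/local/cuda/targets/x86_64-linux/include/crt/math_functions.h after copying it aside first.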