# Up and Running with llama.cpp

Just my notes on configuring llama.cpp.

> TL;DR: Use `llama-setup.sh` and `llama-run.sh` from my script-stash!

## Table of Contents

- [Build](#build)
- [llama-cli](#llama-cli)
- [llama-server](#llama-server)
- [Tools](#tools)
- [FAQ/Errors](#faqerrors)
## Build

### CPU Build

```sh
cmake -B build
cmake --build build --config Release -j
```

> NOTE: Prefer `-jN`, where `N` is the number of threads to use for building, to prevent memory overloading.
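The `-jN` advice can be scripted. A small sketch, assuming GNU `nproc` is available; the "half the cores" heuristic is my own rule of thumb, so tune it to your machine:

```shell
# Pick a build job count that won't exhaust RAM: half the logical
# cores, but at least 1. Heavy translation units (especially CUDA
# kernels) can each use well over 1 GB during compilation.
JOBS=$(( $(nproc) / 2 ))
[ "$JOBS" -ge 1 ] || JOBS=1
echo "Using $JOBS build jobs"
```

Then build with `cmake --build build --config Release -j "$JOBS"`.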
### GPU Build

Prerequisite: CUDA Toolkit (the commands below are for Fedora 42):

```sh
wget https://developer.download.nvidia.com/compute/cuda/13.1.1/local_installers/cuda-repo-fedora42-13-1-local-13.1.1_590.48.01-1.x86_64.rpm
sudo rpm -i cuda-repo-fedora42-13-1-local-13.1.1_590.48.01-1.x86_64.rpm
sudo dnf clean all
sudo dnf -y install cuda-toolkit-13-1
```

Takes a lifetime to download!

Build llama.cpp with CUDA enabled:

```sh
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```
See [FAQ/Errors](#faqerrors) at the bottom for build error fixes!
## llama-cli

```sh
./build/bin/llama-cli --model models/<MODEL_NAME>
```
## llama-server

Launch the server at `localhost:8080`:

```sh
./build/bin/llama-server --model models/<MODEL_NAME>
```
### Accessing on the local network

```sh
./build/bin/llama-server --model models/<MODEL_NAME> --host 0.0.0.0 --port 8080
```

Find your local IP with `ifconfig`, then access the server at:

```
http://<LOCAL_IP>:<PORT>
```
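To sanity-check the server from another machine on the network, you can hit its OpenAI-compatible chat endpoint. A sketch; `<LOCAL_IP>` is a placeholder for your server's address:

```shell
# Minimal chat request to llama-server's OpenAI-compatible API.
# Replace <LOCAL_IP> with the address found via ifconfig.
curl http://<LOCAL_IP>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```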
Preferably add a secret key:

```sh
./build/bin/llama-server --model models/<MODEL_NAME> --host 0.0.0.0 --port 8080 --api-key <SECRET_KEY>
```

Generate a base64 secret key using `openssl`!
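For example:

```shell
# 32 random bytes, base64-encoded -> a 44-character key.
API_KEY=$(openssl rand -base64 32)
echo "$API_KEY"
```

Pass it to the server via `--api-key "$API_KEY"`; clients then authenticate by sending an `Authorization: Bearer <SECRET_KEY>` header.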
### Remote Access (using Tailscale)

Set up Tailscale (https://tailscale.com/download/linux/):

```sh
sudo dnf config-manager addrepo --from-repofile=https://pkgs.tailscale.com/stable/fedora/tailscale.repo
sudo dnf install tailscale
sudo systemctl enable --now tailscaled
```

Connect the machine to Tailscale and authenticate:

```sh
sudo tailscale up
```

Find the device IP:

```sh
tailscale ip -4
```
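Once connected, the server is reachable from any device on your tailnet via that IP. A sketch; `100.x.y.z` is a placeholder for the address printed by `tailscale ip -4`:

```shell
# Quick reachability check against llama-server's /health endpoint.
curl http://100.x.y.z:8080/health
```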
## Tools

- **vmtouch**
  - Examines and controls which parts of files or directories are resident in the system's RAM. Useful for analyzing RAM usage of models (if using the CPU build).
  - `vmtouch <model.gguf>` - check how much of a model is cached
- **nvidia-smi**
  - Monitors, manages, and queries NVIDIA GPU devices. Provides real-time information about GPU utilization, memory usage, temperature, power consumption, and running processes.
  - `watch -n 1 nvidia-smi` - continuously monitor GPU stats (refresh every 1s)
  - `nvidia-smi pmon` - show running processes using the GPU
- **openssl**
  - Generate a base64 API key:

    ```sh
    openssl rand -base64 32
    ```
## FAQ/Errors

### Nvidia CUDA build error (rsqrt, rsqrtf)
```
/usr/include/bits/mathcalls.h(206): error: exception specification is
incompatible with that of previous function "rsqrt" (declared at line 629
of
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h)
extern double rsqrt (double __x) noexcept (true); extern double __rsqrt (double __x) noexcept (true);
^

/usr/include/bits/mathcalls.h(206): error: exception specification is
incompatible with that of previous function "rsqrtf" (declared at line 653
of
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h)
extern float rsqrtf (float __x) noexcept (true); extern float __rsqrtf (float __x) noexcept (true);
^

2 errors detected in the compilation of "CMakeCUDACompilerId.cu".
# --error 0x2 --
Call Stack (most recent call first):
  /usr/share/cmake/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
  /usr/share/cmake/Modules/CMakeDetermineCompilerId.cmake:53 (__determine_compiler_id_test)
  /usr/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:139 (CMAKE_DETERMINE_COMPILER_ID)
  ggml/src/ggml-cuda/CMakeLists.txt:58 (enable_language)
```
#### Fix

Add `noexcept (true)` to the respective function declarations (`rsqrt`, `rsqrtf`) in `/usr/local/cuda/targets/x86_64-linux/include/crt/math_functions.h`, so they match the glibc declarations in `/usr/include/bits/mathcalls.h`.
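The edited declarations might look like the diff below. This is illustrative only: the exact attributes and signatures in `math_functions.h` differ between CUDA versions, so match whatever your header already has and only append the `noexcept (true)` specifier (back up the file first).

```diff
-extern double rsqrt(double x);
-extern float  rsqrtf(float x);
+extern double rsqrt(double x) noexcept (true);
+extern float  rsqrtf(float x) noexcept (true);
```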