8BitMiscreant

Up and Running with llama.cpp

Just my notes on configuring llama.cpp.

TL;DR Use llama-setup.sh and llama-run.sh from my script-stash!

Table of Contents

  • Build
  • llama-cli
  • llama-server
  • Remote Access (using Tailscale)
  • Tools
  • FAQ/Errors

Build

CPU Build

cmake -B build
cmake --build build --config Release -j

NOTE: Better to specify -jN, where N is the number of parallel build jobs, to keep the build from running out of memory.
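For example, a sketch that matches the job count to the machine (nproc reports the available CPU count; the cmake line is left commented so the snippet stands on its own, and you can halve N if the build still runs out of memory):

```shell
# Use the CPU count as the parallel job count.
N=$(nproc)
echo "building with $N parallel jobs"
# cmake --build build --config Release -j"$N"
```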

GPU Build

  1. Prerequisite: CUDA Toolkit

    wget https://developer.download.nvidia.com/compute/cuda/13.1.1/local_installers/cuda-repo-fedora42-13-1-local-13.1.1_590.48.01-1.x86_64.rpm
    sudo rpm -i cuda-repo-fedora42-13-1-local-13.1.1_590.48.01-1.x86_64.rpm
    sudo dnf clean all
    sudo dnf -y install cuda-toolkit-13-1
    

    Takes a lifetime to download!

  2. Build llama.cpp

    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j
    

    See bottom for FAQ/Error fixes!

llama-cli

./build/bin/llama-cli --model models/<MODEL_NAME>
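A few commonly useful flags (a template, not a full reference; <MODEL_NAME> is a placeholder and defaults may change between llama.cpp releases):

```shell
# -ngl: number of layers to offload to the GPU (CUDA build); 99 = effectively all
# -c:   context size in tokens
# -p:   prompt, for a non-interactive run
./build/bin/llama-cli --model models/<MODEL_NAME> -ngl 99 -c 4096 -p "Hello"
```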

llama-server

Launch the server at localhost:8080:

./build/bin/llama-server --model models/<MODEL_NAME>

Accessing on the local network

./build/bin/llama-server --model models/<MODEL_NAME> --host 0.0.0.0 --port 8080
ip addr   # or ifconfig if net-tools is installed; note your machine's <LOCAL_IP>

Access using:

http://<LOCAL_IP>:<PORT>

Preferably, protect the server with an API key:

./build/bin/llama-server --model models/<MODEL_NAME> --host 0.0.0.0 --port 8080 --api-key <SECRET_KEY>

Generate a base64 secret key using openssl!
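Putting it together, a sketch: generate the key, then pass it to the server and to clients. The launch and request lines are templates (hence commented); <MODEL_NAME> and <LOCAL_IP> are placeholders, and the /v1/chat/completions path assumes llama-server's OpenAI-compatible API.

```shell
# 32 random bytes, base64-encoded (44 characters)
KEY=$(openssl rand -base64 32)
echo "API key: $KEY"

# Templates: start the server with the key, then call it with a Bearer token.
# ./build/bin/llama-server --model models/<MODEL_NAME> --host 0.0.0.0 --port 8080 --api-key "$KEY"
# curl http://<LOCAL_IP>:8080/v1/chat/completions \
#   -H "Authorization: Bearer $KEY" \
#   -H "Content-Type: application/json" \
#   -d '{"messages":[{"role":"user","content":"Hello!"}]}'
```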

Remote Access (using Tailscale)

  1. Setting up Tailscale (https://tailscale.com/download/linux/)

    sudo dnf config-manager addrepo --from-repofile=https://pkgs.tailscale.com/stable/fedora/tailscale.repo
    sudo dnf install tailscale
    sudo systemctl enable --now tailscaled
    
  2. Connect the machine to Tailscale and authenticate:

    sudo tailscale up
    
  3. Find the device's Tailscale IP:

    tailscale ip -4
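With the tailnet up, the server is reachable from any of your Tailscale devices. A sketch (the llama-server line is a template; binding to the Tailscale IP rather than 0.0.0.0 keeps the server off the local LAN):

```shell
# Bind llama-server to this machine's Tailscale IPv4 address only.
TS_IP=$(tailscale ip -4)
./build/bin/llama-server --model models/<MODEL_NAME> --host "$TS_IP" --port 8080
```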
    

Tools

  1. vmtouch
    • Examines and controls which parts of files or directories are resident in the system's RAM; useful for analyzing how much of a model the page cache holds (if using a CPU build).
    • vmtouch <model.gguf> - check how much of a model is cached
  2. nvidia-smi
    • Monitors, manages, and queries NVIDIA GPU devices. Provides real-time information about GPU utilization, memory usage, temperature, power consumption, and running processes.
    • watch -n 1 nvidia-smi - continuously monitor GPU stats (refresh every 1s)
    • nvidia-smi pmon - show running processes using GPU
  3. openssl
    • To generate base64 API key: openssl rand -base64 32

FAQ/Errors

Nvidia CUDA build error (rsqrt, rsqrtf)

/usr/include/bits/mathcalls.h(206): error: exception specification is
incompatible with that of previous function "rsqrt" (declared at line 629
of
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h)
extern double rsqrt (double __x) noexcept (true); extern double __rsqrt (double __x) noexcept (true);
^
/usr/include/bits/mathcalls.h(206): error: exception specification is
incompatible with that of previous function "rsqrtf" (declared at line 653
of
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h)
extern float rsqrtf (float __x) noexcept (true); extern float __rsqrtf (float __x) noexcept (true);
^
2 errors detected in the compilation of "CMakeCUDACompilerId.cu".
# --error 0x2 --
Call Stack (most recent call first):
/usr/share/cmake/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
/usr/share/cmake/Modules/CMakeDetermineCompilerId.cmake:53 (__determine_compiler_id_test)
/usr/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:139 (CMAKE_DETERMINE_COMPILER_ID)
ggml/src/ggml-cuda/CMakeLists.txt:58 (enable_language)

Fix

Add noexcept (true) to the declarations of the offending functions (rsqrt, rsqrtf) in /usr/local/cuda/targets/x86_64-linux/include/crt/math_functions.h so they match the glibc declarations in /usr/include/bits/mathcalls.h.
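A hedged sketch of that edit, demonstrated on a sample snippet rather than the real header (the declaration text below is illustrative and may differ by CUDA version; back up math_functions.h before touching it):

```shell
# Sample standing in for the rsqrt/rsqrtf declarations in math_functions.h
cat > /tmp/math_functions_sample.h <<'EOF'
extern __device__ double rsqrt(double x);
extern __device__ float rsqrtf(float x);
EOF

# Append noexcept(true) to each declaration so it matches glibc's mathcalls.h
sed -i -E 's/(rsqrtf? *\([^)]*\));/\1 noexcept(true);/' /tmp/math_functions_sample.h
cat /tmp/math_functions_sample.h
```

On the real file, the same sed (run as root) would target /usr/local/cuda/targets/x86_64-linux/include/crt/math_functions.h after copying it aside first.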