Self-Hosting AI Models
Run your own AI inference server by renting GPU hardware and deploying open-source models with a web interface.
This guide covers renting a GPU server on Vast.ai, deploying models with Ollama, and setting up Open WebUI with TLS.
Overview
- Rent a server with GPUs on Vast.ai
- Download AI models with Ollama
- Set up a web interface for interacting with the models
- Add TLS so access can be shared securely
Key Concepts
Inference
The process of using live data with a trained AI model to make predictions or solve tasks. This occurs after training, when the model is used for chatting or querying.
GGUF
GPT-Generated Unified Format (GGUF) is a file format designed for compact storage and fast loading of large language models (LLMs). Models are converted from their original weights into GGUF so they can be loaded and run more efficiently during inference.
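As an illustration, converter scripts in the llama.cpp project produce GGUF files from Hugging Face checkpoints. The sketch below assumes a local llama.cpp checkout and an already-downloaded model directory (paths are placeholders, and the script name may differ between llama.cpp versions); Ollama pulls models that are already packaged as GGUF, so this step is not required for the rest of this guide.
# fetch llama.cpp, which ships the GGUF converter script and its Python requirements
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && pip install -r requirements.txt
# convert a Hugging Face model directory into a single GGUF file with 16-bit weights
python convert_hf_to_gguf.py /path/to/hf-model --outfile model.gguf --outtype f16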
Inference Framework
A software system optimized to execute trained AI models for real-world tasks. It handles:
- Loading models (e.g., GGUF or custom formats)
- Efficiently processing inputs/outputs
- Leveraging hardware (CPU/GPU) for speed
Unlike training frameworks, it focuses purely on running models (not building them), prioritizing speed, compatibility, and ease of use.
Tools Used
| Tool | Purpose |
|---|---|
| Ollama | Lightweight inference framework for downloading and running AI models |
| Open WebUI | ChatGPT-like web interface that integrates with Ollama's API |
| Caddy | Lightweight reverse proxy to handle TLS |
| Cloudflare | Domain registration and DNS API for TLS verification |
Step 1 - Rent a Server
- Go to Vast.ai
- Fund the account
- Select the template NVIDIA CUDA (Ubuntu)
- Modify the template:
  - Open port 3000
  - Select launch mode Interactive shell server, SSH
  - Check Use direct SSH connection
  - Set at least 200 GB of disk space
  - Save the template
- Go to Search:
  - Select the template
  - In Machine Options check Secure Cloud and Static IP Address
  - Select a server with your desired GPU configuration
Step 2 - SSH into the Server
Go to Instances and click on the server IP. Note the IP and port mapping:
Public IP Address: <SERVER_IP>
Open Ports:
<SERVER_IP>:<SSH_PORT> -> 22/tcp
<SERVER_IP>:<WEB_PORT> -> 3000/tcp
The ports are automatically proxied by Vast.ai.
Create an SSH keypair:
ssh-keygen -t ed25519
Back on your instance, click the Key icon and add your SSH public key.
Connect to the server:
ssh -i ~/.ssh/id_ed25519 root@<SERVER_IP> -p <SSH_PORT>
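Optionally, add a host entry to your local ~/.ssh/config so you don't have to retype the IP and port each time (the alias name below is just an example):
Host vast-gpu
    HostName <SERVER_IP>
    Port <SSH_PORT>
    User root
    IdentityFile ~/.ssh/id_ed25519
After that, ssh vast-gpu is enough to connect.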
Disable the preset tmux (optional):
touch ~/.no_auto_tmux
Step 3 - Setup Ollama
Download and install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
ollama --version
Since we're inside a container without systemd, the Ollama server doesn't start automatically; use tmux to keep it running:
tmux new-session -s 'ollama'
ollama serve
Press Ctrl+b d to detach. Ollama is now running and exposes its API on port 11434.
Download the models you want:
ollama pull deepseek-r1:7b
Browse available models at ollama.com/search
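You can verify the download and the running server from the same shell; the generate endpoint below is part of Ollama's local REST API:
# list locally downloaded models and their sizes
ollama list
# send a one-off, non-streaming prompt to the API
curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:7b", "prompt": "Hello", "stream": false}'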
Step 4 - Setup Open WebUI
Install uv (Python package manager):
curl -LsSf https://astral.sh/uv/install.sh | sh
Create the data directory:
mkdir ~/.open-webui
Open a tmux session and start Open WebUI on port 3001:
tmux new-session -s 'webui'
DATA_DIR=~/.open-webui uvx --python 3.11 open-webui@latest serve --port 3001
We use port 3001 because Caddy will handle TLS on port 3000 and proxy to 3001.
Press Ctrl+b d to detach.
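Before adding TLS, you can confirm the web interface answers locally (plain HTTP at this point):
# should return an HTTP response header block from Open WebUI
curl -I http://localhost:3001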
Step 5 - Setup Cloudflare
- Go to Cloudflare and register a domain (or use an existing one)
- Go to My Profile > API Tokens
- Create a Token:
- Select Create custom token
- Give the token a name
- Set permissions: Zone : DNS : Edit
- Save the API token
This token allows Caddy to verify domain ownership via DNS challenge (since ports 80/443 may not be available).
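You can sanity-check the token before handing it to Caddy, using Cloudflare's token verification endpoint:
# a valid token reports "status": "active" in the JSON response
curl -s -H "Authorization: Bearer YOUR_API_TOKEN" https://api.cloudflare.com/client/v4/user/tokens/verify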
Step 6 - Setup Caddy
Download Caddy with the Cloudflare DNS plugin:
- Go to caddyserver.com/download
- Select Linux AMD64
- Add the caddy-dns/cloudflare plugin
- Copy the download link
On the server:
wget "<DOWNLOAD_URL>" -O /usr/bin/caddy
chmod +x /usr/bin/caddy
Verify the installation:
caddy version
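Also confirm the Cloudflare DNS module is compiled into the binary you downloaded, since the TLS setup below depends on it:
# should list dns.providers.cloudflare among the modules
caddy list-modules | grep cloudflare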
Create the configuration directory and file:
mkdir -p /etc/caddy
Create /etc/caddy/Caddyfile with the following content:
example.com:3000 {
    tls {
        dns cloudflare {env.CLOUDFLARE_API_TOKEN}
    }
    reverse_proxy localhost:3001
}
Replace example.com with your domain.
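Optionally check the file for syntax errors before starting the server:
# loads and validates the Caddyfile without starting the proxy
CLOUDFLARE_API_TOKEN="YOUR_API_TOKEN" caddy validate --config /etc/caddy/Caddyfile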
Start Caddy in a tmux session:
tmux new-session -s 'caddy'
CLOUDFLARE_API_TOKEN="YOUR_API_TOKEN" caddy run --config /etc/caddy/Caddyfile
Press Ctrl+b d to detach.
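Once Caddy has obtained a certificate, you can test TLS from the server itself by pinning the domain name to the local listener (replace example.com with your domain):
# verifies the certificate and the reverse proxy without relying on public DNS
curl --resolve example.com:3000:127.0.0.1 -I https://example.com:3000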
Step 7 - Access and Configure
In Cloudflare, create a DNS A record pointing your domain at <SERVER_IP> (set it to DNS only; Cloudflare's proxy forwards only a fixed set of ports). Then access Open WebUI at https://example.com:<WEB_PORT>/
The first user to register becomes the admin. Configure user management:
- Create groups to manage permissions
- By default, AI models are private
- Grant access to specific groups or make models public
Summary
Your self-hosted AI server is now running with:
- Ollama serving models on port 11434 (internal)
- Open WebUI on port 3001 (internal)
- Caddy handling TLS on port 3000
- Models accessible via https://your-domain.com:<WEB_PORT>/
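Because each service runs in a detached tmux session, a quick health check after reconnecting over SSH is to list the sessions:
# expected sessions: ollama, webui, caddy
tmux ls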