TL;DR
AI is an expensive hobby. When I started building with AI (Pedro), I didn’t want to pay for API access. I was trying to self-fund my Twitch hobby, and the fixed cost of hardware was much more attractive than a recurring subscription, so I bought a refurbished desktop and got started.
Most people building with AI are using a third-party API like Anthropic’s Claude or OpenAI’s ChatGPT, but open-source models are not far behind their closed-source counterparts. With tools like vLLM, Ollama, Llama.cpp, and LM Studio, even a humble laptop or refurbished desktop can become a powerful development environment for self-hosted AI workflows.
This post outlines the local-first AI stack I use as a developer, homelab enthusiast, and open-source streamer, and why I believe self-hosting models is not only doable but often the smartest path forward.
On June 12th, a two-hour Google outage took down Claude and ChatGPT, but Pedro was still live 🤫.
Why Host Your AI Locally
AI is still a highly experimental field, which means it takes a lot of tests and iterations to make something valuable; the faster you can test ideas, the faster you can prove value and ship. To experiment quickly, developers need environments that offer immediate feedback. Local development makes that possible without a per-call cost. It enables faster experimentation by cutting out network delays, removing reliance on API tokens, and eliminating the need for cloud GPU provisioning.

More importantly, developing locally puts you in full control of your workflow, infrastructure, and data. You own the whole stack and everything that goes into and out of the model. There are also data privacy concerns around AI, and keeping everything internal de-risks by default. And finally, running AI models on your own hardware is cost-effective: you can reuse older gear, avoid paying for cloud time, and experiment freely without watching a billing dashboard.
Tools for Local AI
Local development thrives on the right set of tools. Below are the core components I use in my stack.
Serving Models: Ollama, Llama.cpp, and LM Studio
There are several tools for serving models locally, but I only want to dive into the ones I have experience with: Ollama and llama.cpp. LM Studio is also worth a mention since it is UI-first.
Ollama
Ollama is a command-line tool designed to make running large language models locally incredibly easy. It leverages llama.cpp under the hood and provides a smooth experience for spinning up models like Llama 3, Mistral, and Qwen. It’s written in Go and can run as a CLI or as a server that exposes a REST API, making it perfect for integration into other tools and apps. With a single command, you can get a model up and running:
ollama run llama3
If you are running Ollama in server mode, you can pair it with Open WebUI to get a chat interface that behaves just like your favorite web AI client.
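Because the server speaks plain HTTP (on port 11434 by default), any language can talk to it. Here’s a minimal Go sketch against the /api/generate endpoint; the model name and prompt are just placeholders, so swap in whatever you are running:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // Ollama's REST API listens on localhost:11434 by default.
    body, _ := json.Marshal(map[string]any{
        "model":  "llama3",
        "prompt": "Explain retrieval-augmented generation in one sentence.",
        "stream": false, // ask for a single JSON response instead of a stream
    })

    resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The generated text comes back in the "response" field.
    var out struct {
        Response string `json:"response"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        panic(err)
    }
    fmt.Println(out.Response)
}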
Ollama can run on CPU or GPU, but it defaults to CPU. It takes extra configuration to set it up for the GPU (and it is a very hands-on process), so you will need to be comfortable with the command line to get it right.
Llama.cpp
For those who want more performance and control, llama.cpp is the go-to engine. It ships with around 30 different binaries covering just about every use case. If you have a GPU machine, it can be built from source and optimized for your specific hardware. You can download GGUF-format models from Hugging Face and run them locally in server mode.
Here’s a basic example:
llama-server --hf-repo TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  --hf-file mistral-7b-instruct-v0.2.Q3_K_S.gguf
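Once the server is up, it exposes an OpenAI-compatible HTTP API (on port 8080 by default), so any client can talk to it. Here’s a rough Go sketch assuming those defaults; the prompt is a placeholder:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // llama-server only hosts one model, so no model name is needed here.
    body, _ := json.Marshal(map[string]any{
        "messages": []map[string]string{
            {"role": "user", "content": "Write a haiku about self-hosting."},
        },
    })

    resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Responses follow the OpenAI chat-completions shape.
    var out struct {
        Choices []struct {
            Message struct {
                Content string `json:"content"`
            } `json:"message"`
        } `json:"choices"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        panic(err)
    }
    fmt.Println(out.Choices[0].Message.Content)
}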
If you want to leverage llama.cpp on a GPU, you have to build it on that GPU machine. Unless you are on Apple Silicon with Metal, GPU support is not available by default, so keep that in mind. There are instructions for each GPU architecture on their GitHub, but if you run WSL2 like me, you are probably going to hit quite a few hiccups along the way. Ultimately, I've found it's worth it for the performance; I went from 900 tokens per second of prompt processing to 3000 tokens per second.
Llama.cpp also ships with benchmarking tools, CLI tools, Gemma tools, and more. It is efficient, customizable, and ideal for developers who want to tune every layer of their inference pipeline.
LM Studio
For visual users or beginners, LM Studio provides a friendly GUI for downloading, managing, and chatting with models. Like Ollama, it is limited to the models the UI serves up by default.
Databases and Vector Stores
If you’re building systems that need memory or retrieval-augmented generation (RAG), vector stores are key. You can absolutely host your own database/vector store with PostgreSQL. I use pgvector, a Postgres extension that enables high-performance vector similarity search. It adds a vector column type that stores an array of numbers and supports comparative distance queries. These queries will not be the fastest on the market, but I promise that we casual users won't know the difference.
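To make that concrete, here’s a minimal Go sketch of the storage side. The faq table, the connection string, and the 4096 dimension (which matches Llama 3’s embeddings) are illustrative choices of mine, not anything pgvector requires; the query side shows up again in the FAQ example further down.

package main

import (
    "database/sql"
    "fmt"
    "log"
    "strings"

    _ "github.com/lib/pq" // plain Postgres driver; pgvector is just SQL on top
)

// toVector renders a float slice as a pgvector literal, e.g. "[0.1,0.2,0.3]".
func toVector(v []float64) string {
    parts := make([]string, len(v))
    for i, f := range v {
        parts[i] = fmt.Sprintf("%g", f)
    }
    return "[" + strings.Join(parts, ",") + "]"
}

func main() {
    db, err := sql.Open("postgres", "postgres://user:pass@localhost:5432/pedro?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Enable pgvector and create a table with a vector column.
    // vector(4096) matches Llama 3's embedding size; use your model's dimension.
    if _, err := db.Exec(`CREATE EXTENSION IF NOT EXISTS vector`); err != nil {
        log.Fatal(err)
    }
    if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS faq (
        id        bigserial PRIMARY KEY,
        question  text,
        answer    text,
        embedding vector(4096)
    )`); err != nil {
        log.Fatal(err)
    }

    // Store a question/answer pair next to its embedding. The zero-filled
    // slice is a placeholder; in practice the embedding comes from Ollama
    // or llama.cpp, as shown next.
    embedding := make([]float64, 4096)
    _, err = db.Exec(
        `INSERT INTO faq (question, answer, embedding) VALUES ($1, $2, $3::vector)`,
        "a question viewers ask a lot",
        "the canned answer to send back",
        toVector(embedding),
    )
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("stored 1 row with a", len(embedding), "dimension embedding")
}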
Embeddings are vectors created from text payloads. They can be generated by either Ollama or llama.cpp running in server mode, and once generated they can be stored and queried through PostgreSQL with pgvector. OpenAI’s embedding API structure is widely adopted, and Ollama gives you much the same thing locally:
curl http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "prompt": "The food was delicious and the waiter..."
  }'
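The same call from Go looks roughly like this; it decodes the embedding array that Ollama returns (the text and model name are placeholders again):

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // Ollama's embeddings endpoint takes a model name and the text to embed.
    body, _ := json.Marshal(map[string]string{
        "model":  "llama3",
        "prompt": "The food was delicious and the waiter...",
    })

    resp, err := http.Post("http://localhost:11434/api/embeddings", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The response is a single JSON object with an "embedding" array of floats.
    var out struct {
        Embedding []float64 `json:"embedding"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        panic(err)
    }
    fmt.Println(len(out.Embedding), "dimensions") // 4096 for llama3
}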
Instead of self-hosting PostgreSQL, I currently use Supabase’s free tier to spin up quick, full-stack prototypes. This gives me similarity search for driving behaviors via lookup. Rather than sending vectors directly to a model, I typically use them for grounding prompts, contextualizing user input, or performing search-like queries. One situation is when someone in chat asks:
“What is your youtube channel?”
There are many variations on this question that expect the same answer:
“Do you have a youtube channel?”
“Where do you post your videos?”
“Where can I watch past streams?”
“What platforms are you on?”
I can store these as embeddings in a database, and if someone asks a question that clears a similarity threshold, I return a canned response such as:
“You can follow me at https://youtube.com/c/soypetetech”.
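Wired together, the whole lookup is one embedding call plus one SQL query. This is a sketch under a few assumptions: the faq table from the earlier example is already populated with real embeddings, the 0.25 cosine-distance cutoff is arbitrary, and the connection string is a placeholder.

package main

import (
    "bytes"
    "database/sql"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "strings"

    _ "github.com/lib/pq"
)

// embed asks the local Ollama server for an embedding of the given text.
func embed(text string) ([]float64, error) {
    body, _ := json.Marshal(map[string]string{"model": "llama3", "prompt": text})
    resp, err := http.Post("http://localhost:11434/api/embeddings", "application/json", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var out struct {
        Embedding []float64 `json:"embedding"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }
    return out.Embedding, nil
}

// toVector renders a float slice as a pgvector literal, e.g. "[0.1,0.2,0.3]".
func toVector(v []float64) string {
    parts := make([]string, len(v))
    for i, f := range v {
        parts[i] = fmt.Sprintf("%g", f)
    }
    return "[" + strings.Join(parts, ",") + "]"
}

func main() {
    db, err := sql.Open("postgres", "postgres://user:pass@localhost:5432/pedro?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    vec, err := embed("Where can I watch past streams?")
    if err != nil {
        log.Fatal(err)
    }

    // <=> is pgvector's cosine distance: 0 is identical direction, 2 is opposite.
    var answer string
    var distance float64
    err = db.QueryRow(
        `SELECT answer, embedding <=> $1::vector AS distance
           FROM faq ORDER BY distance LIMIT 1`,
        toVector(vec),
    ).Scan(&answer, &distance)
    if err != nil {
        log.Fatal(err)
    }

    if distance < 0.25 { // arbitrary threshold; tune it against real chat questions
        fmt.Println(answer)
    } else {
        fmt.Println("No canned answer close enough; fall back to the model.")
    }
}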
How SoyPete Tech Is Nerding Out!
PedroGPT is my locally-hosted AI assistant that integrates with Discord and Twitch. It answers user questions, summarizes code, and in future iterations, will even make pull requests based on GitHub Actions. This assistant is a live demonstration of what’s possible when you combine local inference, embeddings, and smart retrieval.
Looking forward, I’m building CLI-based agents inspired by Claude Code. These agents will scaffold code, run tests, and deploy updates using a combination of prompt engineering, Regex, and local execution. With pgvector support, the bot will be able to search past commands, summarize stream content, and respond to Twitch viewers with meaningful information from its knowledge base.
Editor Integrations
Tools like llama.vim and llama.vscode allow you to run models directly from your text editor. These extensions talk to llama.cpp running locally on your own machine, meaning you don’t need to rely on the cloud. This local-first developer experience enhances productivity and gives you fast, private access to LLMs while coding. I really want to stop using Copilot on my local stream setup and just run llama.cpp on my Mac with nvim. It would be cool!
Conclusion
“Don’t wait for GPU credits—your machine is already powerful.”
Self-hosting models gives you speed, control, and creativity. You don’t have to depend on cloud providers or APIs. It’s not just about saving money—it’s about building an engineering environment that works for you.
Want more?
Subscribe to @soypetetech on Substack or come hang out on Discord.