I build agents. Here's what I use, what I'm evaluating, and what I've paused.

The ones I use

Ollama

Ollama lets you run models using llama.cpp. It provides a native chat app, a server, and a tool to download models. It works. Start here.
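To give a feel for it, here's a hedged sketch of calling a local Ollama server's chat API from TypeScript. It assumes the default endpoint (http://localhost:11434) and a model you've already pulled; the model name below is illustrative.

```typescript
// Minimal non-streaming chat call against a local Ollama server.

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Build the JSON body for POST /api/chat (stream: false returns one response).
function buildChatRequest(model: string, messages: ChatMessage[]) {
  return { model, messages, stream: false };
}

async function chat(model: string, messages: ChatMessage[]): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatRequest(model, messages)),
  });
  const data: any = await res.json();
  // Ollama returns { message: { role, content }, done, ... }
  return data.message.content;
}
```

Nothing fancy: it's a plain fetch, which is part of why Ollama is a good starting point.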

llama.cpp

llama.cpp runs LLMs on consumer GPUs using the GGUF format. The project keeps getting faster with new quantization methods and architecture support.

huihui

The huihui_ai collection on Ollama publishes small, general-purpose LLMs. The Qwen variants are my favorite. I recommend huihui's abliterated models; I'll go into why in a later post.

Model Context Protocol SDKs

The MCP TypeScript SDK and MCP Go SDK work well. I use both.

Qwen2-VL

I use huihui_ai/Qwen2-VL-7B-Instruct-abliterated, and it handles vision tasks well.

NeuTTS Air

NeuTTS Air works well for text-to-speech. It embeds a watermark (a detectable pattern) in generated audio. I recommend forking it and removing the watermarking; I was able to do so with an agent and a single command. I'll defend this position in a later post.

stable-diffusion-webui

stable-diffusion-webui is the tool to have for SDXL image generation. It's the best choice for everyday use and learning.

ComfyUI is popular and powerful, but it has a restrictive license and a messy UI. I don't recommend it for direct use by humans. My agents love it, though: it has a graph endpoint where you POST a JSON graph and it runs a whole image pipeline. Use it for agentic workflows via the API, and keep the license in mind.

pgvector and full-text search

I use pgvector for vector search. But here's the thing: keyword search and full-text search are at least as useful as vector search, in every case I've seen. Your knowledge base should always have both.
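One way to combine the two result lists is reciprocal rank fusion, sketched here as a pure function. The id lists are assumed to come from a pgvector nearest-neighbor query and a Postgres full-text query respectively; k = 60 is the conventional default.

```typescript
// Merge several ranked id lists into one, rewarding documents that
// rank well in more than one list.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // rank is 0-based, so the top hit contributes 1 / (k + 1).
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Example: "b" is near the top of both lists, so it wins overall.
const merged = reciprocalRankFusion([
  ["a", "b", "c"], // vector search order
  ["b", "d", "a"], // full-text search order
]);
```

The nice part is that you never have to reconcile cosine distances with ts_rank scores; only the ranks matter.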

For chunking, I don't use any tools—I just have the LLM chunk for me.
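A minimal sketch of that approach, with a delimiter and prompt wording of my own invention (not from any library): ask the model to mark chunk boundaries, then split deterministically.

```typescript
// LLM-driven chunking: the model inserts a marker between semantically
// complete chunks; we split on the marker afterward.

const CHUNK_DELIMITER = "<<<CHUNK>>>";

function buildChunkingPrompt(doc: string): string {
  return [
    "Split the following document into self-contained chunks for retrieval.",
    `Insert the marker ${CHUNK_DELIMITER} between chunks.`,
    "Do not rewrite or summarize the text.",
    "",
    doc,
  ].join("\n");
}

function parseChunks(llmOutput: string): string[] {
  return llmOutput
    .split(CHUNK_DELIMITER)
    .map((c) => c.trim())
    .filter((c) => c.length > 0);
}
```

Send buildChunkingPrompt(doc) to your model, run parseChunks on the reply, and embed each chunk.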

Apache AGE

Apache AGE adds graph database capabilities to Postgres. I've integrated it into an MCP successfully. If you need to model relationships—who knows whom, what depends on what—this lets you do it without leaving Postgres. Apache 2.0 license.
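For flavor, here's a hedged sketch of how a query gets wrapped for AGE: openCypher text goes inside AGE's cypher() SQL function as a dollar-quoted string. The graph name, labels, and column list below are illustrative, and the session needs LOAD 'age' plus ag_catalog on the search_path first (per the AGE docs).

```typescript
// Build an AGE query string: SELECT * FROM cypher('<graph>', $$ ... $$) AS (...).
// Don't interpolate untrusted input into the cypher text in real code.
function ageQuery(graph: string, cypher: string, columns = "result agtype"): string {
  return `SELECT * FROM cypher('${graph}', $$ ${cypher} $$) AS (${columns});`;
}

// "What does my service depend on?" as a graph query.
const sql = ageQuery(
  "deps",
  "MATCH (a:Service)-[:DEPENDS_ON]->(b:Service) RETURN b.name",
  "name agtype",
);
```

You then run that string through any Postgres client; the results come back as agtype values.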

Temporal

I use Temporal to model my agent run loops. It gives me the basic architecture I need. For small and medium teams who can host their own, it's likely a good tool. It reminded me a lot of what I learned from Udi Dahan's Distributed Systems Design course.

Small embedding models

I use nomic-embed-text with pgvector for RAG.
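A sketch of the wiring, assuming a local Ollama with nomic-embed-text pulled: fetch an embedding from Ollama's /api/embeddings endpoint and format it as a pgvector literal for an INSERT.

```typescript
// pgvector accepts vectors as text literals like '[1,2,3]'::vector.
function toPgvectorLiteral(vec: number[]): string {
  return `[${vec.join(",")}]`;
}

async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const data: any = await res.json();
  return data.embedding as number[];
}
```

From there, an insert is `INSERT INTO docs (body, embedding) VALUES ($1, $2::vector)` with the literal as $2, alongside a tsvector column for the full-text side.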

PocketBase

PocketBase is a backend in a binary. Simple, self-hosted, has auth and realtime subscriptions built in. I use it for storing prompts and system instructions.

Joplin

Joplin for notes, integrated with a custom MCP server. My agents can read and write to my notes.

Puppeteer

Puppeteer for headless Chrome. Browser automation for agents.

Cursor

Cursor for AI-assisted coding.

Tip: Check out the source repos for your dependencies at the correct tags and add them to your workspace. The agent gets dramatically better context when it can read the actual library code, not just type definitions.

GPU hosting

Runpod

Runpod has been solid. When I have a choice, I rent an RTX 5090. You can run Ollama on these instances easily, and it will download models from Hugging Face automatically.

I recommend finding a public Runpod template on GitHub and customizing it. Create a Docker image, publish it to a registry, and connect Runpod to your registry. This lets you quickly spin up pods with your configured environment.

Runpod doesn't charge for ingress or egress, so I use Tailscale without Global Networking. Once a pod starts, it appears on my Tailscale network shortly after. Store your Tailscale auth key in a Runpod secret. One caveat: I have struggled with building images for Runpod.

Ollama cloud

Ollama cloud has been reliable for basic development. It supports a curated list of cloud models and embedding models.

Configure your app to use your local Ollama server as the primary inference endpoint, with Ollama cloud as a fallback. If you don't need custom models or fine-tuned variants, this is cost-effective and reliable.
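The routing itself is a few lines. Here's a sketch with the two chat functions injected so the logic stays testable; in practice they'd be a call to your local Ollama server and a call to Ollama cloud.

```typescript
type Chat = (prompt: string) => Promise<string>;

// Try the primary endpoint; on any failure (server down, model missing),
// fall through to the secondary.
function withFallback(primary: Chat, fallback: Chat): Chat {
  return async (prompt) => {
    try {
      return await primary(prompt);
    } catch {
      return await fallback(prompt);
    }
  };
}
```

A real version might also want a timeout on the primary, but the shape is the same.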

The ones I might try this year

Unsloth

Unsloth is a fine-tuning library for quantized LoRAs (Low-Rank Adapters). The idea is that you train a tiny add-on (a few adapter layers) instead of the whole model. LoRAs are popular in the SD/SDXL world; they can target various parts of the model and are prone to overfitting.

SGLang

SGLang is an inference engine I want to explore. It has speculative decoding, KV cache optimizations, and continuous batching—the serious inference performance stuff.

vLLM

vLLM is another inference engine. High throughput, PagedAttention for memory efficiency.

Axolotl

Axolotl for fine-tuning. Supports multiple training methods, good for LoRA and QLoRA workflows.

torchtune

torchtune is PyTorch's native fine-tuning library. Clean API, good defaults.

Open WebUI

Open WebUI is a multi-provider chat interface. Self-hosted alternative to ChatGPT's web UI.

LibreChat

LibreChat is another multi-provider chat interface. More features, more complexity.

Vultr

Vultr is the next GPU hosting provider I'll be evaluating.

The ones I no longer use

These are things I've used but moved on from. They didn't stick for me, or I found better alternatives. Unless you're stuck, I'd recommend skipping these.

Devin

Devin is very good, but you'd need to be a billionaire to write a TODO list app. If you have infinite money and no interest in hiring a software team, you might want to look into it. If you need to spend less than a billion dollars on R&D, skip it. It uses Anthropic's closed-weight models under the hood, which are admittedly capable.

AI agents are power tools. They generate a ton of code fast—more than a human would write, structured differently than a human would structure it. Human coders use abstractions to keep codebases small and aligned with business concepts. AI agents write standard-ish code that gets the job done quickly, but there's a lot of it.

Once you've built something with power tools, you don't maintain it with hand tools. You keep using power tools. Same with AI code—if you start a project with agents, assume you'll need agents to maintain it. Don't expect to hand-edit your way through thousands of lines of generated code. Edit the instructions, regenerate.

I spent $20 on Devin in a few, sweetly productive, minutes.

LM Studio

LM Studio has a UI, a server, and more, but the license is restrictive. For beginners and personal use, llama.cpp + Ollama covers everything you need, and it's fully open source.

Llama, Mistral, DeepSeek, other foundation models

I've run all of these. They work. I'll reconsider them as I start doing more QLoRA in production. For general use, IMHO, Qwen is the best-tuned and best-performing.

Agent frameworks (LangChain, CrewAI, AutoGen, etc.)

These frameworks do a few things OK: UI components, streaming UI/servers, provider adapters, agent/tool calling loops, and patterns like summarization and guardrails.

The problems I've hit: The OpenAI Agents SDK (Python and TypeScript) is too limited—it only really does the "handoffs" pattern. Complex code that does little. Brittle. Prevents experimentation. Vendor lock-in.

I've also tried Vercel's ai package. I abandoned it again yesterday. useChat is OK in a pinch, and it has all the parts you need, but again, it's super brittle.

Here's a simple example I couldn't get to work in either library: each time the LLM comes back with a completion, pass it to another agent to apply UX tweaks (adding custom markup), then add it to the first agent's history. Without a framework, you can do let completion = chat(history); completion = anotherChat(completion); history.push(completion). In these frameworks, it was impossible.
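Spelled out as runnable TypeScript (names illustrative), the framework-free version of that loop looks like this: both agents are plain async functions, and the second one rewrites each completion before it enters the first agent's history.

```typescript
type Msg = { role: "user" | "assistant"; content: string };
type Agent = (history: Msg[]) => Promise<string>;

// One turn: get a completion, post-process it with a second agent,
// then record the processed version in the primary agent's history.
async function turn(
  history: Msg[],
  agent: Agent,
  uxAgent: Agent,
  userInput: string,
): Promise<string> {
  history.push({ role: "user", content: userInput });
  let completion = await agent(history);
  // Hand the raw completion to the UX agent (e.g. to add custom markup).
  completion = await uxAgent([{ role: "user", content: completion }]);
  history.push({ role: "assistant", content: completion });
  return completion;
}
```

That's the whole "framework": ordinary control flow you can step through in a debugger.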

In other words, these can get you a very straightforward "hello world" chat app, even with tool calling. Beyond that, they are not what you want.

promptfoo

promptfoo for LLM evaluation. It was harder to read and understand its docs than to ask Cursor to implement the same idea. I use node:test with LLM-as-judge instead. I still can't figure out how promptfoo handles indirect prompt injection testing (which I recommend covering).

Berkeley Function Calling Leaderboard

The Berkeley Function Calling Leaderboard measures whether models can hallucinate the "correct" function call—which is the wrong metric entirely. For serious agent work, "ground truth" and "hallucination" are the same thing: unverified LLM output.

A competent agent system prevents raw LLM outputs from reaching users without verification. Lookups must be real, references must exist, final results must come from deterministic code and human-generated resources.

The BFCL rewards models for fabricating parameters like uber.ride({ loc: ["221B Baker Street, Berkeley, CA"] }). A real agent would reject this: where did that address come from? Was it looked up? If the model invented it, it's a bug.
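A sketch of that rejection as code: track every string that came back from a real lookup, and refuse tool arguments the model invented. The function name and shape are mine; the point is only that the check is deterministic.

```typescript
// "seen" is populated exclusively from verified tool results (lookups,
// database reads), never from model output.
function assertGrounded(args: Record<string, unknown>, seen: Set<string>) {
  for (const [key, value] of Object.entries(args)) {
    if (typeof value === "string" && !seen.has(value)) {
      throw new Error(`ungrounded argument ${key}: the model may have invented it`);
    }
  }
}
```

Under this rule, the fabricated Baker Street address fails before the ride is ever requested.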

The whole thing reads like it was written by PMs encoding "business needs of American public companies." For serious work, ground truth from an LLM is just bugs. (For art or games, do whatever you like.)

There are no trustworthy benchmarks for 'is an LLM good.' LLMs are foundation models or fine-tunes, and either way, only the behavior of the end user's program matters. LLMs are important, but agents are what we should benchmark: they can do real work and prove it (or not).

Elasticsearch, Qdrant

I've moved to pgvector. I worked at Elastic, and I know for sure that Elasticsearch works great for these purposes. But I wanted to try something new that isn't Java.

I like Rust so I tried Qdrant. I couldn't get it to work with some agent-one-shotted-code, so I tried Postgres with pgvector. That did work with some agent-one-shotted-code, so it wins.

Note to library and infrastructure devs: that's the rubric you need to meet. If I can't get value out of your project by loading it in Cursor and saying 'get it running', then you are in trouble.

Agent knowledge base MCP tools

I tried Graphiti, a Python MCP server that uses its own completion calls to automatically build a knowledge graph. That concept could never work, and the project has hard-coded models and prompts. It's a security, ethics, and logistics mess.

Closed model APIs (ChatGPT, Claude, Gemini, etc.)

OpenAI's API, Anthropic's Claude API, Google's Gemini API—I would only suggest trying their chat bots if you have literally never used an LLM even once.

After you have spent 30 minutes on one of these, never go back to any chat bot again; I recommend using Ollama or SGLang for inference. Just rent an RTX 5090, or six of them, or six H100s (at that point you basically have a supercomputer). None of that will cost as much as using Devin AI for 20 minutes.

A note to the reader

I don't know what I'm doing. I earnestly work in tech and have feelings and thoughts I want to share, but I'm anxious about hurting people's feelings with my outlier, starkly delivered opinions.

I tend to be more convincing than I am correct. I often find myself realizing I'm wrong at the very moment I convince the stakeholders—and by then, the conviction is iron-clad and borne of admiration.

Please take it easy on me. I'm an unusual, curious, kind, and earnest guy. Just give me a chance.


I'm looking for a DMV-area or remote role as CTO, VP of Engineering, Director of Engineering, or Founding/Staff Software Engineer. I'm only interested in roles focused on agent development. If you're building agents and want to talk, reach out.