We’ve built several AI features in Rails by now: image classification, image upscaling, similarity search, etc. And every time, the same question came up: which model and prompt should we actually use? The image classification project made this especially painful: a pricing change blew up our budget, smaller images proved to work better than larger ones, and every model switch required re-running the entire evaluation from scratch.
Every change to a prompt opens up a tree of choices. Which provider should we use? Which model? How detailed should the instructions be? Would more examples in the prompt help? How much context per message? Should we use a reasoning model? Or augment the data available to the model with multi-modal input? There’s also the cost vs. accuracy tradeoff: is 10x the price worth a 5% improvement for this specific feature?
The combinatorial explosion gets overwhelming fast, and the result of the process has this feeling of uncertainty… is there a branch I missed that works better? Or that costs less?
The pragmatic choice: spreadsheets
We needed a methodology to track changes across iterations so the team could follow along. Naturally, we took a pragmatic stance: a spreadsheet per feature, tracking results across prompt/provider/model configurations, all run against the same data. It worked quite well, and over several features a workflow started to emerge, but…
Spreadsheets don’t scale
We knew the limits going in, but they became harder to ignore over time:
- They fragment. People make copies. When you’re sharing with non-technical collaborators, you end up with multiple sources of truth.
- No enforced structure. Each feature ended up with its own format. You have to re-learn how to read each one, and not all of them track the same metrics the same way.
- Hard to compare. Eyeballing results across configurations isn’t intuitive, and people get confused.
- No regression baseline. Once you settle on a configuration, how do you catch regressions later?
- Prompts drift. Someone edits the spreadsheet and forgets to update the code. Nobody notices until something breaks.
- Disconnected from code. Prompts and evaluations should live where the application lives.
In one project with many AI features, this all came apart. Links got lost, copies multiplied across different people’s drives with small divergences. Building eval datasets meant downloading images and re-uploading them to sheets. Running prompts required manual dev work because the data lived in Google Drive, but prompts had to go through the LLM provider. We built some internal tooling to help, but since every sheet and feature had a different format, nothing was reusable.
But they were useful for uncovering what we needed: a place to couple a prompt configuration with a curated dataset extracted from real data, one that helps you find the right balance between accuracy and cost for the feature at hand. Ideally, without leaving the Rails app.
So we built RubyLLM::Evals, a Rails engine for testing, comparing, and improving LLM prompts directly inside your application.
RubyLLM::Evals
Since we’re using RubyLLM, it made sense to build on top of it.
The core abstractions are prompts and samples. A prompt captures a full configuration: provider, model, system instructions, message template (with Liquid variables), tools, and output schemas. If you already have tools or schemas in your app, you can reuse them. Samples are your test cases: each one defines an evaluation type (exact match, contains, regex, LLM judge, or human judge) and an expected output.
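To make that concrete, here’s a rough sketch of what a prompt and a sample could look like for the image categorization feature. The attribute names and eval-type symbols are our shorthand for the ideas above, not necessarily the engine’s exact API:

# Illustrative only: attribute names and values below are assumptions.
prompt = RubyLLM::Evals::Prompt.create(
  slug: "image-categorization",
  provider: "openai",
  model: "gpt-4o-mini",
  instructions: "Categorize the photo of a home exterior into one of: deck, patio, pool, garden.",
  template: "Categorize this {{ property_type }} photo."   # Liquid variables
)

# A sample pairs an eval type with the expected output for that test case.
sample = prompt.samples.create(
  eval_type: :exact,                # also :contains, :regex, :llm_judge, :human
  expected_output: "deck"
)
sample.files.attach(Image.first.attachment.blob)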
The interesting design choice was making the LLM-as-judge a first-class eval type. For features like summarization or classification with fuzzy boundaries, exact matching doesn’t cut it. You need another model to assess whether the response is good enough. It’s not perfect: the judge has its own biases and failure modes. But for iterative prompt development, it’s a pragmatic tradeoff: fast feedback now, human review on the edge cases.
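For instance, a judge-evaluated sample for a hypothetical summarization prompt might read like this, with the expected output doubling as the rubric the judge scores against (field names are assumptions):

summarizer = RubyLLM::Evals::Prompt.find_by(slug: "article-summarization")   # hypothetical prompt

# Instead of string-matching, a judge model assesses the response against this description.
summarizer.samples.create(
  eval_type: :llm_judge,
  expected_output: "A 2-3 sentence summary that mentions the pricing change and adds no claims of its own"
)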
Each run saves a snapshot of the prompt settings and records accuracy, cost, and duration. A comparison tool lays all runs of a prompt side by side, so you can spot what changed and why.
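Conceptually, inspecting a run looks like any other ActiveRecord object; the attribute names here are a sketch rather than the engine’s documented interface:

run = prompt.runs.last   # hypothetical association; each run snapshots the prompt's settings

run.accuracy   # e.g. 0.92, the share of samples that passed their eval
run.cost       # e.g. 0.0143, total USD spent across the samples
run.duration   # e.g. 38.2 seconds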
Real application data
One thing we really wanted was the ability to populate samples from the application’s data. For example, in our image categorization feature, we can:
prompt = RubyLLM::Evals::Prompt.find_by(slug: "image-categorization")

Image.uncategorized.limit(50).each do |image|
  # Human-judged samples: a reviewer checks the model's answer for each image.
  sample = prompt.samples.create(eval_type: :human)
  sample.files.attach(image.attachment.blob)
end
Now you’re iterating on your prompt with actual production data, not synthetic examples.
The temptation is to throw hundreds of samples at a prompt and see what sticks. In practice, a smaller curated set that covers your edge cases tells you more than a large random one. We typically start with 20-30 samples: a mix of straightforward cases, known hard cases from production, and a few adversarial examples. If accuracy looks promising, we expand. If not, the small set is faster to iterate on.
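In code, that curation is just a matter of which records you pull. The scopes below are hypothetical placeholders for however your app identifies easy, hard, and ambiguous cases:

prompt = RubyLLM::Evals::Prompt.find_by(slug: "image-categorization")

curated = Image.high_confidence.limit(15) +       # straightforward cases (hypothetical scopes)
          Image.flagged_by_reviewers.limit(10) +  # known hard cases from production
          Image.ambiguous_scenes.limit(5)         # adversarial, e.g. deck vs. patio

curated.each do |image|
  sample = prompt.samples.create(eval_type: :human)
  sample.files.attach(image.attachment.blob)
end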
In production
Once you’re happy with a prompt, you can use it directly in your application:
response = RubyLLM::Evals::Prompt.execute(
  "image-categorization",
  files: [image.attachment.blob]
)

response.content # => "deck"
The configuration lives in the database, versioned through your evaluation runs, always in sync with what you tested. Rolling back to a previous version or A/B testing a new iteration becomes straightforward.
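A minimal A/B split, for example, only needs a second slug. The "-v2" prompt and the columns recording the result are hypothetical:

# Route roughly 10% of traffic to the new prompt version.
slug = rand < 0.1 ? "image-categorization-v2" : "image-categorization"

response = RubyLLM::Evals::Prompt.execute(
  slug,
  files: [image.attachment.blob]
)

image.update(category: response.content, categorized_by_prompt: slug)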
Where this leaves us
Production data has a way of surprising you: new usage patterns, edge cases you never curated a sample for, a provider silently updating a model or its pricing… Your prompt’s accuracy can degrade, or your cost can skyrocket, without a single line of code changing. There’s no single solution to this, but monitoring a prompt’s performance in production is key. Each feature will need its own metrics, but the point is the feedback loop: when those metrics surface drift, lower-quality results, more errors, or higher costs, you can pull new samples into RubyLLM::Evals and adjust the prompt to the new reality.
The pattern we keep seeing across projects is that prompts are never done. Models get updated, data distributions shift, and what worked last month can silently degrade. Continuous testing and monitoring are critical.
RubyLLM::Evals and RubyLLM::Monitoring are how we go from concept to production. Both are open source and built for Rails.
At SINAPTIA, we specialize in helping businesses implement AI solutions that deliver real value. If you’re facing challenges with prompt engineering or AI integration, we’d love to help.