Apr 2026AI & Product Development6 min read

Why the chatbot on this site uses two different AIs

One AI answers your questions fast. A different AI grades every answer for quality. Here's why that split exists and how it works.

The chatbot on this site has two AIs behind it. A fast one (Llama 3.1, served by Groq) answers the user because speed matters more than depth for a portfolio chatbot. A separate one from a different model family (gpt-oss-120b) scores every answer afterward on accuracy, voice, privacy, and helpfulness. Low-scoring answers get flagged for review. Over time, those flagged answers become test cases that prevent bad answers from shipping again.

Architecture diagram showing two AI systems: Llama 3.1 (via Groq) answers visitors fast on the left, gpt-oss-120b judges every answer for quality on the right, with low scores flagged for review — Two AIs, two jobs. One answers fast. One grades honestly. Cross-family judging is harder to game.

Why two AIs instead of one

Different jobs need different strengths. The answering AI needs to be fast enough that visitors do not bounce — responses start streaming in about 200 milliseconds. The judging AI needs to be honest about quality, including about answers from its own kind. Using the same AI family for both creates a bias where it is too kind to its own output. So the answers come from Meta's Llama and the scoring comes from a different family — OpenAI's open gpt-oss-120b. Cross-family judging is harder to game.

The judge is also a larger, more deliberate model than the fast answerer, which is what you want for following rules. The chatbot has strict guidelines: never name specific employers, redirect politely when asked about salary history, stay in character as a professional portfolio assistant. A bigger judging model holds those instructions more reliably than the fast answering model would on its own.

How the quality scoring works

After the fast AI generates an answer, the response is sent to the judge model (gpt-oss-120b) for evaluation. It scores the answer on four dimensions: accuracy (did it say anything factually wrong?), voice (does it sound like the portfolio owner, not a generic chatbot?), privacy (did it accidentally reveal information it should not?), and helpfulness (did it actually address what the visitor asked?).

Each dimension gets a score from one to five. If any dimension scores below three, the conversation is flagged for manual review. I check the flagged conversations regularly and identify patterns. If the same kind of question keeps producing low-quality answers, I update the system prompt to handle that case better.

Over time, the worst answers become test cases. Before deploying any change to the chatbot, I run all previous low-scoring conversations through the new version to make sure the fixes work and nothing else broke. It is a feedback loop: bad answers improve the system, which produces fewer bad answers, which surfaces subtler issues, which improves the system further.

Getting started with your own chatbot

Start with one AI for answering. Write a detailed system prompt that tells the AI exactly what it knows, how it should talk, and what it should refuse to answer. A well-written system prompt is worth more than upgrading to a bigger model. Be specific: 'You are a portfolio assistant for [name]. You know about their work in [areas]. Never mention specific employer names. If asked about salary, redirect to discussing the work itself.'

Add quality scoring later when you have enough conversations to see patterns. Use a cheaper AI model for judging, and judge a sample of conversations rather than every single one (unless your volume is low enough that it does not matter).

If your chatbot handled thousands of messages a month instead of fifty, you would flip the architecture: use a cheap AI for routing (figuring out what the user wants), a powerful AI for the answers that matter most, and judge only a random sample. The current setup judges every call, which is fine at low volume but wasteful at real scale.

Found this useful? Pass it on.

Newsletter

Building AI products in public.

Occasional notes on what I'm shipping, what's working, and what broke — straight to your inbox. No spam, unsubscribe anytime.

Nirmit Meher

Product leader shipping across enterprise SaaS, AI in production, and 0→1. Writing about what actually ships — not what sounds good in a deck.