Operations & quality

AI Debug & Continuous Improvement

The "AI Debug" area is your self-service toolkit for quality: here you find out why the assistant answered the way it did, fix missed answers yourself, and prove with tests that a change has actually improved something. You reach the area via Agents → [Your agent] → AI Debug. The page is divided into five tabs: "Overview", "Retrieval test", "Turn traces", "Eval" and "Routing examples". On top of that there is the "Decision flow", which you open directly from any conversation.

This section walks you through the complete improvement loop once: read health → find a missed answer and adjust the threshold → inspect a single answer in detail → rate it with a thumbs down and promote it to a training example in one click → build a test set → prove with the eval that the result is now better.

Briefly upfront, what "distance" means: The assistant searches for matching knowledge sections using cosine distance (0 = identical in content, 2 = completely opposite). A section is normally used only if its distance falls below the threshold (kb_max_distance, default 0.5). The search now works "recall-first": if nothing at all falls below the threshold, the best near-miss is used anyway (marked yellow in the decision flow as "Recall fallback"). This way the customer gets a best-effort answer rather than none at all, but a high recall-fallback share is a signal that your data or the threshold needs readjusting.
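The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the product's actual implementation: the function names, the dictionary input and the strict "less than" comparison at the threshold are assumptions; only kb_max_distance, the default 0.5 and the top_k default of 5 come from this guide.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, so the range is 0 to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_sections(distances, kb_max_distance=0.5, top_k=5):
    """Pick sections for one question.

    distances: dict mapping a (hypothetical) section id to its distance.
    Returns (chosen_ids, used_recall_fallback).
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    # Normal case: everything below the threshold, best first, capped at top_k.
    within = [sid for sid, dist in ranked if dist < kb_max_distance][:top_k]
    if within:
        return within, False
    # Recall-first: nothing is close enough, so take the best near-miss anyway.
    if ranked:
        return [ranked[0][0]], True
    return [], False
```

Calling select_sections({"menu": 0.61, "imprint": 0.9}) returns the "menu" section with the fallback flag set, which corresponds to the yellow "Recall fallback" marking.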


18.1 "Overview" tab — reading knowledge health

The overview shows aggregated metrics across all recorded conversations within a time period. This is your starting point: a quick health check before you go into detail.

Step 1: Choose the time period. At the top there are three switches — "7d", "30d" and "90d" (days). The default is 30 days. Recommendation: 30 days for everyday use, 7 days directly after a change (to see the effect quickly).

Step 2: Read the metric tiles. Each tile turns green/yellow/red depending on whether the value is good:

  • "Quality index" — overall score from 0 to 100. Green from 70, yellow from 40, below that red. Your most important single number.
  • "Knowledge hit rate" — share of turns where matching knowledge reached the model. Green from 80%, yellow from 50%.
  • "No-context rate" — share of turns entirely without knowledge context. Turns red from 20%. High values = knowledge gaps or too strict a threshold.
  • "Recall-fallback rate" — share of knowledge turns that were answered only via the generous near-miss. Turns yellow from 30%. High = your content only just matches; a sign to upload clean text or raise the threshold.
  • "Embedding errors" — share of failed question embeddings (technical error, e.g. OpenAI key/rate limit). Any value above 0 is red and should not occur.
  • "Iteration-limit rate" — share of turns that reached the internal step limit. Turns yellow from 10%.
  • "Recorded turns" — total count in the period, with a note on how many of them ran with an active knowledge base.
  • "Retrieval p50 (ms)" — typical (median) search time in milliseconds, plus the p95 value (the slow 5%).
  • "Best distance p50" — typical best distance per turn. The smaller, the better your knowledge matches the real questions.
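
The colour rules of the tiles can be written down as a small lookup. The sketch below only encodes the boundaries stated in the list above; the metric key names are hypothetical, and treating every value below a documented warning boundary as green is an assumption.

```python
def higher_is_better(value, green_from, yellow_from):
    """Traffic light for metrics where larger values are good."""
    if value >= green_from:
        return "green"
    if value >= yellow_from:
        return "yellow"
    return "red"


def tile_colours(metrics):
    """Map metric values (percentages or index points) to tile colours."""
    return {
        "quality_index": higher_is_better(metrics["quality_index"], 70, 40),
        "knowledge_hit_rate": higher_is_better(metrics["knowledge_hit_rate"], 80, 50),
        # Only the warning boundary is documented for the next three tiles;
        # "green" below that boundary is an assumption of this sketch.
        "no_context_rate": "red" if metrics["no_context_rate"] >= 20 else "green",
        "recall_fallback_rate": "yellow" if metrics["recall_fallback_rate"] >= 30 else "green",
        "iteration_limit_rate": "yellow" if metrics["iteration_limit_rate"] >= 10 else "green",
        # Any embedding error at all is red.
        "embedding_errors": "red" if metrics["embedding_errors"] > 0 else "green",
    }
```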

Step 3: Check the detail lists below.

  • "Distance distribution" — a bar chart showing at which distances your hits land. Green = reached the model. Many bars just to the right of the threshold? Then you are missing near hits.
  • "Flag frequency" — how often technical flags such as no_kb_context or kb_recall_fallback occurred.
  • "Most frequent failing questions" — the customer questions that most often went without knowledge. This list is gold: copy them straight into the retrieval test (tab 2) to reproduce the failure.
  • "Retrieved documents" / dead documents — shows how many of your documents were ever used at all. A "dead document" was never retrieved in the period — either superfluous, poorly worded, or with too low a quality score (marked red below 50).

Step 4 (optional): Export the report. Via the bar at the top right you can "Copy Markdown", "Copy JSON" or "Download .md" the report — handy for sharing it with support or your team.


18.2 "Retrieval test" tab — finding and fixing a missed answer

This is the most powerful tool in the area. It runs the knowledge search for a test question without letting the model actually answer, which keeps it inexpensive and free of side effects. You see exactly which sections were found, which narrowly missed out, and why.

At the top you choose the mode: "Single question" (default) or "Batch audit".

Single question

Step 1: Enter the question. Type a real customer question into the large text field, e.g. "How much is the roast duck?". Tip: with ⌘/Ctrl + Enter you start straight away.

Step 2 (optional): Override max distance. The "Max distance" field lets you try out a different threshold for testing.

  • What it does: Limits how "far away" a section may be and still be used. Larger = more generous (more, but less certain, hits).
  • Default: Empty = the agent's stored threshold is used, otherwise the global default 0.5. The placeholder shows you the active value.
  • Bounds: 0.0 to 2.0; step size 0.05.
  • Recommendation: Leave it empty at first so that you see what the customer really experiences. Only then experiment.
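
The precedence described in this step (test override, then the agent's stored threshold, then the global default) can be sketched as follows. The function and parameter names are hypothetical; the default 0.5 and the bounds 0.0 to 2.0 are taken from the text.

```python
GLOBAL_DEFAULT_MAX_DISTANCE = 0.5


def effective_max_distance(override=None, agent_threshold=None):
    """Resolve the threshold that a retrieval test would actually use."""
    if override is not None:
        # The "Max distance" field only accepts values between 0.0 and 2.0.
        if not 0.0 <= override <= 2.0:
            raise ValueError("max distance must be between 0.0 and 2.0")
        return override
    if agent_threshold is not None:
        return agent_threshold
    return GLOBAL_DEFAULT_MAX_DISTANCE
```

Leaving the field empty corresponds to calling the function without an override, which is why the first test run shows what the customer really experiences.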

Step 3: Click "Run retrieval". You get:

  • Result banner — either green ("n section(s) reached the model.") or red ("No knowledge reached the model for this question."). Below it a summary: candidates checked, best distance, threshold used and top_k.
  • Recommendation box (blue): Appears when a narrowly missed section could be rescued with a higher threshold, e.g. "A threshold of 0.61 would include the best missed section." It offers two buttons:
      • "Apply & rerun": tries out the recommended threshold for testing only (not yet saved).
      • "Save to agent": adopts the threshold permanently for this agent (one-click fix). Confirmation: "Threshold saved to agent."
  • "Distance distribution" — the same histogram as in the overview, with a dashed threshold line.
  • Candidate table — each found section with a distance bar (green = used), a "Result" label (used / recall fallback / over threshold / below top_k) and a content preview. This way you see immediately which section was missing or why it was dropped.
  • "Context block sent to the model" — the exact text the assistant would have received to answer.
  • Knowledge base overview — all documents with type, status, quality, section and character count. Very short documents (under 200 characters) and low quality are marked red.

The typical repair loop: test question → no hit → save the recommendation or (better for junk hits) re-upload the document in question as clean text → test again until it is green.
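
To make the "Result" labels and the recommendation more concrete, here is a small sketch of how they could be derived from a list of candidates. It is an illustration under assumptions: the function names are hypothetical, and the margin of 0.01 added to the best missed distance is a guess chosen to reproduce the "0.61" example, not a documented rule.

```python
def classify_candidates(candidates, threshold, top_k=5):
    """Label each (section_id, distance) pair the way the candidate table does."""
    ranked = sorted(candidates, key=lambda c: c[1])
    any_within = any(dist < threshold for _, dist in ranked)
    results = []
    for rank, (section_id, dist) in enumerate(ranked):
        if dist < threshold:
            # Close enough, but only the best top_k sections reach the model.
            label = "used" if rank < top_k else "below top_k"
        elif not any_within and rank == 0:
            # Nothing is within the threshold: the best near-miss is used anyway.
            label = "recall fallback"
        else:
            label = "over threshold"
        results.append((section_id, dist, label))
    return results


def recommended_threshold(candidates, threshold, margin=0.01):
    """Smallest threshold that would include the best missed section, or None."""
    missed = [dist for _, dist in candidates if dist >= threshold]
    if not missed:
        return None
    return min(round(min(missed) + margin, 2), 2.0)
```

With the candidates [("menu", 0.6), ("imprint", 0.9)] and a threshold of 0.5, the best section is labelled "recall fallback" and the recommendation is 0.61.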

Batch audit

Instead of one question you check many at once — ideal for content review.

Step 1: Either type one question per line, or leave the field empty and use "Recent messages" to test the most recent real customer messages (default 20, allowed 1–50; the field is locked as soon as you enter your own questions).

Step 2: "Start audit". You get a table with one row per question: sections found (⚠ on embedding error), best distance and "Fix at" (the threshold that would rescue this hit). Rows with no hit at all are highlighted red — your to-do list.


18.3 "Turn traces" tab — forensically inspecting past answers

This is where the stored detailed recordings of real answers ("traces") live. Each trace captures the exact prompt, the search, the timings and all tool calls.

Step 1: Optionally enable the "Failed / flagged turns only" switch to jump straight to the problem cases.

Step 2: The table shows time, customer message, knowledge (used/seen), tokens, duration (ms) and flags. Clicking a row opens the detail dialog with:

  • Customer message and Final answer
  • "Phase timings" (waterfall: how long the search and LLM calls took)
  • "Retrieval candidates" (the same distance table as in the retrieval test)
  • Collapsible sections: "Assembled system prompt", "Messages sent" and the raw JSON block (LLM calls, tools, timings, errors)

Here too you export everything via Markdown/JSON/Download.

Note on data protection: Traces contain message texts. They are therefore PII-scrubbed on saving, deleted automatically after expiry, and removed together with the contact when a GDPR deletion is carried out.


18.4 The "Decision flow" + thumbs feedback (in the conversation)

You do not open the decision flow from the AI Debug page, but directly in a conversation: below every assistant answer you find a small action bar.

Step 1: Rate. Click thumbs up ("Good answer") or thumbs down ("Bad answer"). On thumbs down a correction area automatically expands.

Step 2: "View decision". Opens the "Decision flow" — a flow diagram (n8n-style) of the one answer, top to bottom: Message → Routing → Tools → Knowledge → LLM → Tool calls → Answer. Each node has a traffic light (green/yellow/red). Clicking a node opens a detail inspector on the right — at the Knowledge node, for example, a "Retrieval distance map": each candidate as a point on an axis, green zone = within the threshold (used), red zone = discarded, with a threshold marker. The most informative node is pre-selected automatically. A recall fallback is shown as a yellow node with the note "Recall fallback (generous hit)".

Step 3: On thumbs down, capture the cause. In the correction area:

  • "What went wrong?": choose a category (Wrong tool, Wrong routing, Bad answer, Knowledge missing or Other).
  • Note: optional free-text field.

Step 4: Promote to training data in one click. This is exactly where the loop closes — three promotion options:

  • "Fix routing — as an example for topic:" Choose the correct topic (Knowledge, Appointments, Order, Lead, Small talk), then "Adopt routing example". The preceding customer message is saved (trimmed to 150 characters) as a routing example for this topic — the router learns to assign such questions correctly in future. (Appears in the "Routing examples" tab.)
  • "Teach it the right answer:" In the field, edit the answer that would have been correct, then "Adopt example answer". This saves a question→answer pair (few-shot example) that the assistant will orient itself by in future.
  • "Add to eval set" — turns this turn into a permanent regression test (see "Eval" tab). The customer message becomes the test input, your correction (if any) the reference answer, your note the grading criterion, and the chosen topic the expected topic.

After a successful promotion, a green "Adopted" label appears on the message.
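
The field mapping of the promotion options can be pictured as a small data structure. This is only a sketch: the class, the field names and the way the case name is derived from the message are hypothetical; the 150-character trim and the mapping (message → input, correction → reference answer, note → grading criterion, topic → expected topic) follow the descriptions above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    name: str
    input: str
    reference_answer: Optional[str]
    grading_criterion: Optional[str]
    expected_topic: Optional[str]
    source: str


def to_routing_example(customer_message, max_length=150):
    """Routing examples store the customer message trimmed to 150 characters."""
    return customer_message.strip()[:max_length]


def to_eval_case(customer_message, correction=None, note=None, topic=None):
    """Turn a rated turn into a regression test with source 'feedback'."""
    return EvalCase(
        # Assumption: the case name is derived from the start of the message.
        name=customer_message.strip()[:200],
        input=customer_message,
        reference_answer=correction,
        grading_criterion=note,
        expected_topic=topic,
        source="feedback",
    )
```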


18.5 "Eval" tab — proving that a change works

The eval harness is your regression test: you collect "golden cases" (exemplary question→answer pairs), replay them at the push of a button, and have each answer rated by an LLM judge — with no real side effects (no bookings, no emails).

Step 1: Create golden cases. In the lower "Golden cases" area click "Add case" and fill in:

  • Name (required, 1–200 characters), e.g. "Opening hours".
  • Input (required, 1–4000 characters): the customer message to replay.
  • Reference answer (optional, up to 8000 characters): the ideal answer; gives the judge a benchmark.

Save with "Save case". Each case carries a source (manual / feedback / trace) — cases you promoted from a conversation appear here automatically with source feedback. You delete cases via the trash icon. Recommendation: 10–20 cases that cover your most common and trickiest real questions.
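
If you prepare golden cases outside the interface (for example in a spreadsheet) before typing them in, the length limits from step 1 can be checked up front. A minimal sketch with a hypothetical function name; only the limits themselves come from this guide.

```python
def validate_golden_case(name, input_text, reference_answer=None):
    """Return a list of problems; an empty list means the case fits the limits."""
    problems = []
    if not 1 <= len(name) <= 200:
        problems.append("name must be 1 to 200 characters")
    if not 1 <= len(input_text) <= 4000:
        problems.append("input must be 1 to 4000 characters")
    if reference_answer is not None and len(reference_answer) > 8000:
        problems.append("reference answer must be at most 8000 characters")
    return problems
```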

Step 2: "Start eval". The button at the top right is only active once at least one case exists. The run replays all active cases and rates them. Completion message: "Eval finished: X/Y passed".

Step 3: Read the results. On the left is the "Runs" history (each run with date and pass rate as a percentage badge: green from 80%, yellow from 50%, otherwise red). Clicking a run shows "Results" on the right:

  • the pass rate in large type (e.g. 85%, 17/20),
  • three average judge scores: "Helpful", "Correct" and "Tool" (appropriate tool choice),
  • below that, each case with a tick (passed) or cross (failed), the actually generated answer, the judge's rationale and the individual scores (Helpful/Correct/Tool).

Step 4: Close the loop. Run the eval before a change → make the change (threshold, knowledge, persona, routing examples) → run the eval again → compare the rates of the two runs. If the rate rises, the change has measurably helped. If it falls, you have spotted a regression before real customers notice it.
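
The before/after comparison boils down to simple arithmetic. The following sketch uses hypothetical function names and represents each run as a list of pass/fail values; the badge boundaries (green from 80%, yellow from 50%) are the ones given in step 3.

```python
def pass_rate(results):
    """results: one boolean per replayed case (True = passed). Returns percent."""
    if not results:
        raise ValueError("an eval run needs at least one case")
    return 100.0 * sum(results) / len(results)


def badge_colour(rate):
    """Colour of the percentage badge in the runs history."""
    if rate >= 80:
        return "green"
    if rate >= 50:
        return "yellow"
    return "red"


def compare_runs(before, after):
    """Return (difference in percentage points, verdict) for two runs."""
    delta = pass_rate(after) - pass_rate(before)
    if delta > 0:
        verdict = "improved"
    elif delta < 0:
        verdict = "regression"
    else:
        verdict = "unchanged"
    return delta, verdict
```

A run with 14 of 20 passed cases followed by a run with 17 of 20 yields a difference of 15 percentage points and the verdict "improved". The comparison is only meaningful if both runs replay the same set of cases.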


18.6 "Routing examples" tab — sharpening topic assignment

Here you store, per topic, example sentences that customers really use. The router uses them (together with the FAQ, menu and services) to send each message to the right capability.

Step 1: Per topic card, enter example sentences via "Add phrase", e.g. under "Knowledge / Questions" the sentence "How much is the duck?". Which topics appear depends on the agent type (restaurant, appointments, sales, support); the default is all of them: Knowledge / Questions, Book appointments, Orders, Sales / Leads, Greeting / Small talk.

Step 2: Mind the limits — each phrase 4 to 150 characters (too short/long is outlined red), max. 10 phrases per topic and max. 30 in total (the total counter at the top turns red as soon as 30 is exceeded — then saving is not possible).

Step 3: Save with "Save phrases" ("Routing phrases saved."). Empty fields are discarded automatically. If you leave a topic empty, sensible defaults apply.
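
The limits from steps 2 and 3 can be summarised as a validation routine. This is a sketch with hypothetical names, not the form's actual code; it assumes that lengths are measured after trimming surrounding whitespace.

```python
MIN_LENGTH, MAX_LENGTH = 4, 150
MAX_PER_TOPIC, MAX_TOTAL = 10, 30


def validate_phrases(phrases_by_topic):
    """Return (cleaned phrases per topic, list of problems that block saving)."""
    cleaned, problems, total = {}, [], 0
    for topic, phrases in phrases_by_topic.items():
        # Empty fields are discarded automatically.
        kept = [p.strip() for p in phrases if p.strip()]
        cleaned[topic] = kept
        total += len(kept)
        if len(kept) > MAX_PER_TOPIC:
            problems.append(f"{topic}: more than {MAX_PER_TOPIC} phrases")
        for phrase in kept:
            if not MIN_LENGTH <= len(phrase) <= MAX_LENGTH:
                problems.append(f"{topic}: phrase length must be 4 to 150 characters")
    if total > MAX_TOTAL:
        problems.append(f"more than {MAX_TOTAL} phrases in total")
    return cleaned, problems
```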

Connection to the loop: When in a conversation you flag an answer as Wrong routing and click "Adopt routing example" (section 18.4), the phrase lands here automatically. You can then fine-tune it here.


Tips & pitfalls

  • Diagnose first, then turn the dials. Leave the "Max distance" field empty on the first retrieval test — that way you see what the customer really experiences. Only then experiment or save the recommendation.
  • Dirty crawled pages inflate the distance. If a document contains a lot of navigation and footer text ("boilerplate"), even good sections appear "far away" and get missed. The better fix is then not to crank up the threshold, but to re-upload the clean text. Watch out for dead documents and low quality scores (red) in the overview.
  • Recall fallback is a crutch, not a victory. A yellow "Recall fallback" flag means: there was actually no good hit, and the best near-miss was taken out of desperation. A high recall-fallback rate in the overview is a clear signal to improve your knowledge.
  • The threshold is a double-edged sword. Higher = more hits, but also more wrong/irrelevant sections. Only raise it far enough that the desired section just slips in (the recommendation does exactly that) — not blanket to 2.0.
  • Distinguish "over threshold" vs. "below top_k". "Over threshold" means the section was too far away — a higher threshold helps. "Below top_k" means the section was close enough, but was crowded out by too many others — here a larger top_k helps (default 5), not the threshold.
  • Eval needs cases. The "Start eval" button stays grey until at least one golden case exists. Build the set up early — most easily by promoting bad answers straight from conversations with "Add to eval set".
  • Always measure before/after. A threshold or knowledge change "feels better" — but it can only be proven by running the eval before and after the change and comparing the pass rates.
  • The batch audit saves time. Instead of testing 20 questions one by one: run a batch audit with "Recent messages" — the rows highlighted red are your repair list.
  • Batch audit and tests cost a little. The batch audit embeds up to 50 questions and respects the agent's daily budget (daily_budget_usd) — once it is reached, the run is stopped. Plan for this if you have set a tight budget.
  • Traces must be enabled. If no traces exist, either no turn has been recorded yet or recording is disabled. Without traces the decision flow stays empty ("No decision trace was recorded for this answer.").
  • Thumbs down without promotion achieves little. The rating alone does not improve the agent automatically — only the promotion as a routing example, example answer or eval case turns it into real training. Use the one-click buttons consistently.