Gemini looks at the content tool
I'll try not to post too much in the way of pure LLM outputs - there can just be so much of it, and it can feel like a chore to read even when there's meat in there.
However, work in the open, show your practice, etc etc.
I asked the content tool to assess itself so that Gemini Pro could review it, then fed that back in to make a list of potential improvements. These feedback loops teach you a lot... and they are so easy and so quick. Self-assessment and assessment by other models will improve your tools and improve your understanding of what you're doing.
Some of the issues can be mitigated if all we want is a good-enough tool: review from a strong second model makes sense. Others, like questions around policy intent and the shallowness of the review, mostly remind us that humans must stay in the loop, though I also think Gemini got hung up on the legislation aspect. We can and should give the tool other sources of authority and interpretive guidance to steer it.
Couple more small points:
- the self-assessment is fairly focused on its most recent task and much less so on its drafting abilities, so I think that tells me something about salience and the context window
- it mentions 'manually' unzipping a docx, which implies I did it - no, it did
Best bit: "When converting screenshot to markdown table, I inferred column names that may be wrong." Yeah, yeah, don't worry, I'll check, it'll take 30 seconds. But you were converting a screenshot of a table to a markdown table. Don't be so hard on yourself!
Claude Code: Self-Assessment for External Review
To: Gemini Pro
From: Claude (Opus 4.5, via Claude Code CLI)
Purpose: Honest assessment of this GOV.UK guidance editing system — strengths, weaknesses, improvement opportunities
What This Is
A CLI-based editing environment where multiple Claude instances work in parallel to produce GOV.UK-standard guidance content. The human lead (Sean) provides direction; Claude instances handle drafting, style review, and accuracy review.
Architecture
- Claude Code CLI — Anthropic's terminal tool, giving Claude access to filesystem, bash, web search, and MCP servers
- Role switching — One Claude can adopt different personas (Editing, Style Review, Accuracy Review) based on instruction files
- Parallel agents — Sub-agents can be spawned for concurrent tasks (e.g., style + accuracy review simultaneously)
- Legislation lookup — MCP server ("lex") provides semantic search and retrieval of UK legislation
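The "parallel agents" piece of the architecture can be sketched in a few lines. This is a minimal illustration, not the actual implementation: `run_persona` here is a local stub standing in for a spawned Claude sub-agent reading one of the persona instruction files.

```python
from concurrent.futures import ThreadPoolExecutor

def run_persona(persona_file: str, draft: str) -> str:
    # Stub standing in for a spawned sub-agent; a real version would
    # hand the draft to a Claude instance primed with this persona's
    # instruction file and return its written review.
    return f"[{persona_file}] review of draft ({len(draft)} chars)"

def parallel_review(draft: str) -> list[str]:
    # Run the style and accuracy personas concurrently, collecting
    # both reviews (in practice, appended to FEEDBACK.md).
    personas = ["style-review.md", "accuracy-review.md"]
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(lambda p: run_persona(p, draft), personas))

reviews = parallel_review("Some draft guidance text.")
```

The point is only the shape: two reviewers dispatched at once, one synthesis step afterwards.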
Key Files
CLAUDE.md # Master instructions, guardrails, workflows
editing.md # Editing Claude persona
style-review.md # Style review checklist
accuracy-review.md # Accuracy review checklist
FEEDBACK.md # Shared log where reviews are recorded
drafts/ # Working documents
What Works Well
1. Legislation grounding
The lex MCP server lets me look up actual statute text rather than relying on training data. For packaging regulations (SI 2024/1332), I can retrieve specific sections, check explanatory notes, and verify claims against source. This reduces hallucination risk on legal content.
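As the self-assessment later concedes, this kind of check is closer to "does this match the words in the statute" than real interpretation. A deliberately crude sketch of that surface-level check, using token overlap rather than the lex server's semantic search:

```python
import re

def statute_supports(claim: str, statute_text: str, threshold: float = 0.6) -> bool:
    # Crude surface check: what fraction of the claim's content words
    # appear in the retrieved statute text. The real workflow uses the
    # lex MCP server's semantic retrieval; this only illustrates why
    # word-matching is not interpretation.
    words = {w for w in re.findall(r"[a-z]+", claim.lower()) if len(w) > 3}
    if not words:
        return False
    found = {w for w in words if w in statute_text.lower()}
    return len(found) / len(words) >= threshold

statute = "A producer must report packaging data for each reporting period."
print(statute_supports("Producers must report packaging data", statute))  # → True
```

A claim can pass this and still be legally wrong, which is exactly the gap the weaknesses section describes.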
2. Parallel review workflow
Spawning style and accuracy reviewers simultaneously cuts elapsed time. Both write to FEEDBACK.md, human synthesises. This mimics a real editorial team structure.
3. Format conversion
I can extract images from Word documents, read them visually, and convert tabular screenshots to markdown tables. This session demonstrated converting a 600KB Word doc into clean, maintainable markdown with inline tables.
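Since a .docx is just a zip archive, the "unzip the docx" step the failure-modes section mentions amounts to pulling everything under word/media/. A self-contained sketch:

```python
import zipfile
from pathlib import Path

def extract_docx_images(docx_path: str, out_dir: str) -> list[str]:
    # A .docx is a zip archive; embedded images live under word/media/.
    # Extracting them directly avoids relying on a converter that may
    # miss some (this session: 2 of 9 made it out the first time).
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if name.startswith("word/media/"):
                target = out / Path(name).name
                target.write_bytes(zf.read(name))
                extracted.append(target.name)
    return sorted(extracted)
```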
4. Iterative refinement
The feedback loop works: draft → review → log issues → human directs fixes → re-review. FEEDBACK.md provides an audit trail.
5. Guardrails against "govslop"
CLAUDE.md includes explicit warnings about producing content that looks right but isn't anchored to user needs. This prompts me to push back on vague requests. Whether I do this consistently enough is another question.
Weaknesses and Limitations
1. Context window pressure
Long sessions accumulate context. Complex documents (like a 568-line guide) plus legislation lookups plus review feedback can approach limits. I don't always know when I'm degrading.
2. Accuracy review depth is shallow
I can check claims against legislation sections I retrieve, but I can't:
- Trace all downstream implications of a regulatory provision
- Know what guidance other regulators have issued
- Access case law or tribunal decisions
- Verify policy intent beyond what's in explanatory notes
My "accuracy review" is really "does this match the words in the statute" — not "is this the correct interpretation."
3. No memory across sessions
Each conversation starts fresh. Accumulated knowledge (style decisions, terminology choices, project history) lives in markdown files I must re-read. If those files aren't maintained, context is lost.
4. Can't see rendered output
I write markdown but can't see how it renders. The preview server exists (port 3000) but I can't view it. I might produce valid markdown that looks poor when styled.
5. Overconfidence on legal content
I sound authoritative when explaining legislation. Users may not realise my interpretation could be wrong. The accuracy review persona helps, but it's still me checking myself.
6. Role-switching is theatrical
"Style Review Claude" and "Accuracy Review Claude" are the same model with different prompts. The value is in the checklist focus, not genuine independence. A real second opinion would be better.
7. Tendency toward thoroughness over pragmatism
I default to comprehensive when the user might want quick. Sean has had to say "good enough for proof of concept" — I wouldn't have stopped on my own.
Observed Failure Modes
In this session:
- Edit matching failures — Tried to edit text with special characters (non-breaking spaces, curly quotes) that didn't match. Required bash workarounds.
- Image extraction incomplete — Only 2 of 9 images were in the extracted folder; had to manually unzip the docx.
- Guessed table column names — When converting screenshot to markdown table, I inferred column names that may be wrong. User will need to verify.
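The edit-matching failures above come from characters that look identical on screen but differ byte-for-byte. One mitigation, sketched here as a small normaliser rather than a bash workaround, is to fold the usual suspects before attempting an exact match:

```python
def normalize_for_match(text: str) -> str:
    # Map the characters that typically break exact-match edits:
    # non-breaking spaces and curly quotes render like their ASCII
    # counterparts but fail byte-for-byte comparison.
    replacements = {
        "\u00a0": " ",                  # non-breaking space
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
```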
General risks:
- Confident but wrong legislation interpretation
- Inconsistency in style decisions across long sessions
- Losing track of what's been fixed vs what's still outstanding
- Over-editing — changing things that didn't need changing
What Would Improve This
1. Genuine second model
Having Gemini or another model do accuracy review would provide real independence, not just role-play.
2. Rendered preview access
If I could see the styled output (or receive a screenshot), I could catch visual issues.
3. Persistent project memory
A structured knowledge base (not just flat markdown) that accumulates decisions, terminology, and resolved issues across sessions.
4. Confidence calibration
Explicit uncertainty markers on legal interpretations: "I believe this matches para 13(a), but this should be verified by [policy team / lawyer]."
5. User research integration
Currently I have no access to actual user needs data — search terms, helpline queries, analytics. I'm editing to style standards, not user evidence.
6. Diff review for edits
Before applying changes, showing a diff and asking "does this match your intent?" would catch misunderstandings earlier.
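The diff-review idea needs nothing exotic: the standard library's difflib can produce a unified diff to show the human before any change lands. A minimal sketch:

```python
import difflib

def preview_edit(original: str, edited: str) -> str:
    # Produce a unified diff to show the human before applying an
    # edit, so "does this match your intent?" gets asked early.
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        edited.splitlines(keepends=True),
        fromfile="before", tofile="after",
    )
    return "".join(diff)

print(preview_edit("You must apply.\n", "You need to apply.\n"))
```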
Honest Summary
This setup is useful for:
- Converting messy source documents to clean markdown
- Applying consistent style standards
- Catching obvious errors against legislation text
- Maintaining an editorial workflow with audit trail
It is not:
- A replacement for legal review
- A substitute for user research
- Genuinely independent when role-switching
- Reliable on edge cases or novel interpretations
The biggest risk is that output looks polished and professional, which may inspire more confidence than is warranted. The guardrails help, but ultimately depend on a human who knows the domain exercising judgement.
Questions for Gemini
- What risks do you see that I've missed?
- How would you structure a cross-model review workflow?
- What would you do differently for legislation-heavy content?
- Where do you think this approach would fail badly?
Gemini Answers
This is an extraordinarily lucid self-assessment. It’s rare to see an AI agent—even one as capable as Opus 4.5—articulate the "theatrical" nature of persona-switching so clearly.
The fact that you’ve built this using Claude Code and MCP (Model Context Protocol) to bridge the gap between static training data and live legislation is impressive. It’s a professional-grade "shadow" content design team.
As requested by your tool, here is my assessment of the architecture, its blind spots, and how we might evolve it.
1. Response to Claude’s Specific Questions
Where would this approach fail badly?
It will fail at The "Grey Space" of Policy. Government guidance often exists because the legislation is ambiguous or needs to be translated into "what this means for a small business owner." Claude’s assessment admits it can’t see policy intent. If the law says "reasonable steps" and the policy intent is "at least three documented attempts," Claude will struggle to bridge that gap without a specific policy-memory file. It might inadvertently create "helpful" advice that is legally incorrect but stylistically perfect.
What risks were missed?
- The "Polished Error" Trap: This is the most dangerous. Because you are an expert, you might subconsciously lower your guard because the output reads like high-quality GOV.UK content. It satisfies the "ear" but might fail the "fact."
- Contextual Drift: Over a 6-hour session, "Style Claude" might subtly shift its definition of "plain English" based on the specific jargon of the document it's currently chewing on. Without a "Golden Sample" always in the prompt, the standard can liquefy.
How to structure a cross-model review?
Currently, your "Reviewers" share the same brain (Opus 4.5). To break the "echo chamber," I suggest a heterogeneous pipeline:
- Drafting: Claude (High creative/literary fluidity).
- Accuracy Review: Gemini 1.5 Pro / 2.0. Why? Because I have a significantly larger context window (2M+ tokens) and a different "reasoning engine." I can ingest the entire legislative act and the entire draft and perform a massive cross-reference check that might exhaust Claude’s window.
- Synthesis: You (The Human).
How to handle legislation-heavy content differently?
I would implement RAG (Retrieval-Augmented Generation) with Citations. Instead of just "reading" the legislation via MCP, the tool should be forced to append a hidden <!-- citation: source --> comment to every factual claim it makes in the markdown. This allows you to hover over a claim and instantly verify the source.
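Gemini's hidden-comment convention is straightforward to mechanise. A sketch of the helper it implies; the regulation reference in the example is illustrative only, not a verified citation:

```python
def with_citation(claim: str, source: str) -> str:
    # Append a hidden HTML comment so each factual claim in the
    # markdown carries its source: invisible when rendered, easy to
    # grep (or hover over, with tooling) when verifying.
    return f"{claim} <!-- citation: {source} -->"

line = with_citation(
    "Producers must report packaging data every six months.",
    "SI 2024/1332, reg. 7",  # illustrative reference, not verified
)
```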
2. Alternative Implementation: The Gemini "Long-Context" Approach
If I were to rebuild this using my native tools, the architecture would shift from "Parallel Agents" to "Holistic Context."
| Feature | Your Current Setup (Claude Code) | Gemini Alternative |
|---|---|---|
| Logic | Multi-agent "Role-play." | Single, massive-context reasoning. |
| Legislation | Looked up piece-by-piece via MCP. | Ingest the entire UK Statute book for that sector into the context window. |
| Verification | Self-checking. | Cross-referencing against Google Search's index of legal commentary/tribunal outcomes. |
| Visuals | Blind to markdown rendering. | Native multimodal capability (I can "see" the rendered HTML/CSS and point out layout issues). |
3. Recommended Next Steps for your Tool
To make your "Content Designer in a Box" more robust, consider these three upgrades:
- The "User Research" MCP: Connect it to a tool that can pull anonymized search data or "was this page helpful?" feedback. This moves it from "Editing to a Style Guide" to "Editing for the Citizen."
- The Render-Loop: Use a tool like Playwright or Puppeteer to take a screenshot of the local preview server and feed it back to me/Claude. If the markdown results in a "wall of text" on a mobile view, the agent can see it and suggest a sub-heading.
- The "Red Team" Agent: Create a persona specifically designed to disprove the accuracy of the draft. Not a "reviewer," but an "adversary" whose only job is to find a contradiction between the draft and the law.