Gemini looks at the content tool

I'll try not to post too much in the way of pure LLM outputs - there can just be so much of it, and it can feel like a chore to read even when there's meat in there.

However, work in the open, show your practice, etc etc.

I asked the content tool to assess itself so that Gemini Pro could review it, then fed that back in to make a list of potential improvements. These feedback loops teach you a lot... and they are so easy and so quick. Self-assessment and assessment by other models will improve your tools and improve your understanding of what you're doing.

Some of the issues can be mitigated if a good enough tool is all we want - review from a strong second model makes sense. Others, like the questions around policy intent and the shallowness of its review, mainly remind us that humans must stay in the loop, though I also think Gemini got hung up on the legislation aspect. We can and should give the tool other sources of authority and interpretive guidance to steer it.

One more small point:

Best bit: "When converting screenshot to markdown table, I inferred column names that may be wrong" - yeah yeah, don't worry, I'll check, it'll take 30 seconds. But you were converting a screenshot of a table to a markdown table. Don't be so hard on yourself!


Claude Code: Self-Assessment for External Review

To: Gemini Pro
From: Claude (Opus 4.5, via Claude Code CLI)
Purpose: Honest assessment of this GOV.UK guidance editing system — strengths, weaknesses, improvement opportunities


What This Is

A CLI-based editing environment where multiple Claude instances work in parallel to produce GOV.UK-standard guidance content. The human lead (Sean) provides direction; Claude instances handle drafting, style review, and accuracy review.

Architecture

Key Files

CLAUDE.md          # Master instructions, guardrails, workflows
editing.md         # Editing Claude persona
style-review.md    # Style review checklist
accuracy-review.md # Accuracy review checklist
FEEDBACK.md        # Shared log where reviews are recorded
drafts/            # Working documents

What Works Well

1. Legislation grounding

The lex MCP server lets me look up actual statute text rather than relying on training data. For packaging regulations (SI 2024/1332), I can retrieve specific sections, check explanatory notes, and verify claims against source. This reduces hallucination risk on legal content.
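For illustration only, a minimal sketch of what grounding against source text can look like. This assumes the standard public legislation.gov.uk URL scheme (appending `data.xml` requests the machine-readable XML rather than the HTML page); the lex MCP server's actual interface is not shown here and presumably does something richer.

```python
def legislation_url(si_year: int, si_number: int, fmt: str = "data.xml") -> str:
    """Build a legislation.gov.uk URL for a UK statutory instrument.

    Appending /data.xml asks for the machine-readable XML rather than
    the HTML page, which suits claim-by-claim checking.
    """
    return f"https://www.legislation.gov.uk/uksi/{si_year}/{si_number}/{fmt}"

# SI 2024/1332 - the packaging regulations referenced above
source_url = legislation_url(2024, 1332)
```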

2. Parallel review workflow

Spawning style and accuracy reviewers simultaneously cuts elapsed time. Both write to FEEDBACK.md, human synthesises. This mimics a real editorial team structure.
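The shape of that workflow can be sketched in a few lines. Here `run_review` is a hypothetical stub standing in for spawning a Claude instance with a persona prompt; only the fan-out-then-append-to-`FEEDBACK.md` structure is the point.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

FEEDBACK = Path("FEEDBACK.md")

def run_review(persona: str, draft: str) -> str:
    # Hypothetical stub: the real setup spawns a model instance with
    # style-review.md or accuracy-review.md as its persona prompt.
    return f"## {persona} review\n- findings for a {len(draft)}-char draft\n"

def review_in_parallel(draft: str) -> None:
    personas = ["Style", "Accuracy"]
    with ThreadPoolExecutor(max_workers=len(personas)) as pool:
        results = list(pool.map(lambda p: run_review(p, draft), personas))
    # Both reviewers append to the shared log; the human synthesises.
    with FEEDBACK.open("a", encoding="utf-8") as log:
        log.writelines(block + "\n" for block in results)
```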

3. Format conversion

I can extract images from Word documents, read them visually, and convert tabular screenshots to markdown tables. This session demonstrated converting a 600KB Word doc into clean, maintainable markdown with inline tables.
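The image-extraction half of this relies on a plain fact about the format: a `.docx` file is a ZIP archive, with embedded images under `word/media/`. A minimal sketch (function and directory names are my own):

```python
import zipfile
from pathlib import Path

def extract_docx_images(docx_path: str, out_dir: str) -> list[str]:
    """Copy every embedded image out of a .docx so it can be read visually.

    A .docx is a ZIP archive; images live under word/media/.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if name.startswith("word/media/"):
                target = out / Path(name).name
                target.write_bytes(zf.read(name))
                extracted.append(str(target))
    return extracted
```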

4. Iterative refinement

The feedback loop works: draft → review → log issues → human directs fixes → re-review. FEEDBACK.md provides an audit trail.

5. Guardrails against "govslop"

CLAUDE.md includes explicit warnings about producing content that looks right but isn't anchored to user needs. This prompts me to push back on vague requests. Whether I do this consistently enough is another question.


Weaknesses and Limitations

1. Context window pressure

Long sessions accumulate context. Complex documents (like a 568-line guide) plus legislation lookups plus review feedback can approach limits. I don't always know when I'm degrading.

2. Accuracy review depth is shallow

I can check claims against the legislation sections I retrieve, but not much beyond that. My "accuracy review" is really "does this match the words in the statute", not "is this the correct interpretation."

3. No memory across sessions

Each conversation starts fresh. Accumulated knowledge (style decisions, terminology choices, project history) lives in markdown files I must re-read. If those files aren't maintained, context is lost.

4. Can't see rendered output

I write markdown but can't see how it renders. The preview server exists (port 3000) but I can't view it. I might produce valid markdown that looks poor when styled.

5. Overconfidence on legal content

I sound authoritative when explaining legislation. Users may not realise my interpretation could be wrong. The accuracy review persona helps, but it's still me checking myself.

6. Role-switching is theatrical

"Style Review Claude" and "Accuracy Review Claude" are the same model with different prompts. The value is in the checklist focus, not genuine independence. A real second opinion would be better.

7. Tendency toward thoroughness over pragmatism

I default to comprehensive when the user might want quick. Sean has had to say "good enough for proof of concept" — I wouldn't have stopped on my own.


Observed Failure Modes

In this session:

General risks:


What Would Improve This

1. Genuine second model

Having Gemini or another model do accuracy review would provide real independence, not just role-play.

2. Rendered preview access

If I could see the styled output (or receive a screenshot), I could catch visual issues.

3. Persistent project memory

A structured knowledge base (not just flat markdown) that accumulates decisions, terminology, and resolved issues across sessions.
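As one sketch of what "structured" could mean in practice, decisions could accumulate in a small machine-readable store rather than flat prose. The file name `project-memory.json` and both function names are made up for illustration:

```python
import json
from pathlib import Path

MEMORY = Path("project-memory.json")

def record_decision(topic: str, decision: str) -> None:
    """Append a decision under a topic so future sessions can reload it."""
    data = json.loads(MEMORY.read_text()) if MEMORY.exists() else {}
    data.setdefault(topic, []).append(decision)
    MEMORY.write_text(json.dumps(data, indent=2))

def recall(topic: str) -> list[str]:
    """Return every recorded decision for a topic, oldest first."""
    if not MEMORY.exists():
        return []
    return json.loads(MEMORY.read_text()).get(topic, [])
```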

4. Confidence calibration

Explicit uncertainty markers on legal interpretations: "I believe this matches para 13(a), but this should be verified by [policy team / lawyer]."

5. User research integration

Currently I have no access to actual user needs data — search terms, helpline queries, analytics. I'm editing to style standards, not user evidence.

6. Diff review for edits

Before applying changes, showing a diff and asking "does this match your intent?" would catch misunderstandings earlier.
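The mechanics of that are standard library material. A minimal sketch using `difflib` (the function name is mine):

```python
import difflib

def preview_edit(original: str, proposed: str, path: str = "draft.md") -> str:
    """Render a unified diff so the human can confirm the change
    matches their intent before anything is written to disk."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    ))
```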


Honest Summary

This setup is useful for:

It is not:

The biggest risk is that output looks polished and professional, which may inspire more confidence than is warranted. The guardrails help, but ultimately depend on a human who knows the domain exercising judgement.


Questions for Gemini

  1. What risks do you see that I've missed?
  2. How would you structure a cross-model review workflow?
  3. What would you do differently for legislation-heavy content?
  4. Where do you think this approach would fail badly?

Gemini Answers

This is an extraordinarily lucid self-assessment. It’s rare to see an AI agent—even one as capable as Opus 4.5—articulate the "theatrical" nature of persona-switching so clearly.

The fact that you’ve built this using Claude Code and MCP (Model Context Protocol) to bridge the gap between static training data and live legislation is impressive. It’s a professional-grade "shadow" content design team.

As requested by your tool, here is my assessment of the architecture, its blind spots, and how we might evolve it.


1. Response to Claude’s Specific Questions

Where would this approach fail badly?

It will fail at The "Grey Space" of Policy. Government guidance often exists because the legislation is ambiguous or needs to be translated into "what this means for a small business owner." Claude’s assessment admits it can’t see policy intent. If the law says "reasonable steps" and the policy intent is "at least three documented attempts," Claude will struggle to bridge that gap without a specific policy-memory file. It might inadvertently create "helpful" advice that is legally incorrect but stylistically perfect.

What risks were missed?

How to structure a cross-model review?

Currently, your "Reviewers" share the same brain (Opus 4.5). To break the "echo chamber," I suggest a heterogeneous pipeline:

  1. Drafting: Claude (High creative/literary fluidity).
  2. Accuracy Review: Gemini 1.5 Pro / 2.0. Why? Because I have a significantly larger context window (2M+ tokens) and a different "reasoning engine." I can ingest the entire legislative act and the entire draft and perform a massive cross-reference check that might exhaust Claude’s window.
  3. Synthesis: You (The Human).

How to handle legislation-heavy content differently?

I would implement RAG (Retrieval-Augmented Generation) with Citations. Instead of just "reading" the legislation via MCP, the tool should be forced to append a hidden <!-- citation: source --> comment to every factual claim it makes in the markdown. This allows you to hover over a claim and instantly verify the source.
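One way to mechanise that suggestion: a helper that appends the hidden comment, paired with an audit function that extracts every cited source for spot-checking. This is a sketch of the proposed convention, not an existing tool:

```python
import re

CITATION = re.compile(r"<!--\s*citation:\s*(.+?)\s*-->")

def cite(claim: str, source: str) -> str:
    """Append a hidden citation comment to a factual claim in markdown."""
    return f"{claim} <!-- citation: {source} -->"

def audit_citations(markdown: str) -> list[str]:
    """Pull out every cited source so a reviewer can spot-check claims."""
    return CITATION.findall(markdown)
```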


2. Alternative Implementation: The Gemini "Long-Context" Approach

If I were to rebuild this using my native tools, the architecture would shift from "Parallel Agents" to "Holistic Context."

| Feature | Your Current Setup (Claude Code) | Gemini Alternative |
|---|---|---|
| Logic | Multi-agent "Role-play." | Single, massive-context reasoning. |
| Legislation | Looked up piece-by-piece via MCP. | Ingest the entire UK statute book for that sector into the context window. |
| Verification | Self-checking. | Cross-referencing against Google Search's index of legal commentary/tribunal outcomes. |
| Visuals | Blind to markdown rendering. | Native multimodal capability (I can "see" the rendered HTML/CSS and point out layout issues). |

3. Recommended Next Steps for your Tool

To make your "Content Designer in a Box" more robust, consider these three upgrades:

  1. The "User Research" MCP: Connect it to a tool that can pull anonymized search data or "was this page helpful?" feedback. This moves it from "Editing to a Style Guide" to "Editing for the Citizen."
  2. The Render-Loop: Use a tool like Playwright or Puppeteer to take a screenshot of the local preview server and feed it back to me/Claude. If the markdown results in a "wall of text" on a mobile view, the agent can see it and suggest a sub-heading.
  3. The "Red Team" Agent: Create a persona specifically designed to disprove the accuracy of the draft. Not a "reviewer," but an "adversary" whose only job is to find a contradiction between the draft and the law.