Case Study · 2025 Global Insurance Brokerage · Enterprise AI

Project Aura

When commercial insurance clients submit their ESG documentation, someone has to read it. Every utility bill, every emissions certificate, every compliance report. At Marsh McLennan's CIS division, that someone was an analyst — and it was taking four hours per case. I was brought in to fix that. But the real problem turned out to be more interesting than speed.

Role: Lead AI Product Designer
Duration: 8 Months · 2025
Domain: Insurance Brokerage · EU ESG
Target Market: Germany / Amsterdam
65%
Audit Processing Time Reduction
↑ From 4.2hr → 1.4hr avg per audit
40%
Increase in Anomaly Detection Rate
↑ Errors surfaced that were previously missed
4.8
AI-SUS Trust Score (out of 5.0)
↑ "I felt in control of the outcome"
0
Backend Infrastructure Rewrites Required
↑ 15-year legacy system preserved
AURA COPILOT v2.4 · SECURE INSTANCE
AURA COPILOT
Analysing utility bill — Client Ref: MMC-EU-4471…
Scope 2 Emissions identified: 847 tCO₂e/yr. 14.3% above sector benchmark.
📎 Source: Bill_Q3_2024.pdf, p.7
Confidence
94%
✓ Approve
✕ Flag
Section 02 · The Problem

Four hours a case.
Every case. Every quarter.

The CIS analysts were doing something that felt like it should have been solved already. They'd open a PDF, find the Scope 2 emissions figure on page 34, type it into the legacy system, then move to the next document. Over and over, for every client, every quarter. Four hours per case. Constant transcription errors. A growing backlog as EU regulations demanded ever more granular ESG data in every premium calculation. And with the EU's CSRD regulation coming into force, the error rate was about to become a legal problem, not just an operational one.

4.2h
Time a risk analyst spent on a single client premium assessment. Most of that wasn't analysis — it was hunting for numbers across PDFs and copying them into the legacy system.
23%
Error rate in ESG data entry before Aura. When those errors feed into the premium calculation model, clients end up with the wrong price — and we carry the liability.
EU CSRD
Incoming regulatory mandate requiring verifiable ESG audit trails — a compliance clock ticking against every manual workflow.
0
Existing AI tooling within the CIS division. Every extract was manual, every citation was verbal, every audit was forensically unverifiable.

Stakeholder Alignment Framework

🏢
CIS Leadership · Speed & Scalability
"We need to onboard EU clients 3× faster without proportionally scaling analyst headcount."
→ Delivered: Aura reduces per-audit time from 4.2h to 1.4h. 3× throughput achieved without headcount increase.
⚖️
Legal & Compliance · Zero-Hallucination Mandate
"Any AI output that cannot be traced to a cited, internal source document is a regulatory liability. Full stop."
→ Delivered: Closed-loop RAG architecture cites only vetted internal PDFs. Every AI output is source-pinned and human-approved before commit.
⚙️
Engineering · No Backend Rewrite
"The legacy system is 15 years old. We cannot risk a full rewrite. Any solution must integrate via API overlay only."
→ Delivered: Aura is a read/write API layer. The legacy database is never directly modified. Engineers approved the pattern in Sprint 2.
User Research Synthesis · CIS Division Sprint 1–2 · Contextual Inquiry
Discover
Pain Point Analysts spend 40 min locating correct emission factor tables across 6 different PDF versions.
Observation 3 of 5 analysts maintain personal Excel sheets to track data discrepancies — a shadow system.
Quote "I'm an insurance analyst, not a data parser. I should be doing risk assessment."
Data Point Average 2.3 re-opens per PDF document per audit session — no persistent extract state.
Define
HMW How might we surface the most critical ESG data points without requiring manual document navigation?
Constraint Legal requires every AI output to be traceable to a specific page & paragraph of the source document.
Insight Trust in AI is conditional on legibility. Analysts will accept AI suggestions only if they understand the reasoning chain.
Validated
Validated Human-approval gate before any AI data is written to the legacy system. Non-negotiable UX requirement.
Validated Inline source citation (doc name + page number) immediately adjacent to every AI-extracted value.
Validated Confidence score display increases analyst click-through to source verification by 62%.
Section 03 · The Triad of Adoption

Three people.
Three definitions of “this works.”

The thing that made this hard wasn't the AI. It was the people. The analyst needed to trust the output before acting on it. The compliance officer needed a legally defensible audit trail. The engineer needed the system to not touch a 15-year-old database that nobody fully understood anymore. Three people, three completely different definitions of "this works." If I got one wrong, the whole thing would get rejected.

👩‍💼
Sarah M.
Senior Risk Analyst · CIS
"If I can't verify where that number came from, I cannot sign the audit. Period. I don't care how fast the AI is."
Experience
11 Years · Insurance
Audits / Week
12–18 Cases
Tech Fluency
Medium–High
AI Sentiment
Cautiously Skeptical
Precise data extraction with inline source citations (PDF name + page)
Confidence scores displayed adjacent to every AI-generated value
One-click source verification — must open the exact PDF page highlighted
Current Cognitive Load: Critical
👨‍💼
Thomas K.
CIS Lead · Workflow Oversight
"I need to see team throughput, where bottlenecks are forming, and whether the AI is actually accelerating output — not just shifting the work."
Experience
17 Years · Risk Mgmt
Reports To
VP, Global Risk
KPI Focus
Throughput + SLA
AI Sentiment
Pragmatically Optimistic
Real-time audit pipeline dashboard showing team-level processing velocity
AI override rate tracking — flags when analysts are systematically rejecting AI outputs
Weekly comparative report: AI-assisted vs. manual audit quality delta
Current Cognitive Load: High
🧑‍⚖️
Julian V.
Compliance Officer · EU Regulatory
"Under GDPR and CSRD, I need a complete, immutable log of every AI decision, every human override, and every data source referenced. This must be exportable for regulator review."
Experience
9 Years · EU Compliance
Jurisdiction
GDPR · CSRD · SFDR
Review Cadence
Quarterly Audits
AI Sentiment
Deeply Suspicious
Immutable GDPR-compliant audit log: AI query → source cited → human decision → timestamp
Zero external data egress — all RAG queries resolved against internal document stores only
One-click PDF export of full decision trail, formatted for EU regulator submission
Current Compliance Risk: Severe
Section 04 · How I Worked Through It

I had to change
the process for this one.

I've run Double Diamond processes on maybe twenty products. Aura was the first time I had to adapt it fundamentally. The problem with applying a standard UX process to AI is that you're designing for an output you can't fully predict. The AI might be right 94% of the time — but you have to design for that other 6% as carefully as for the rest. I started treating the AI's behaviour as a design material, like a constraint, rather than a feature.

01 · Discover
Data Readiness & Contextual Inquiry
Embedded with analysts for 3 weeks. Shadow sessions mapping the exact document navigation patterns, error points, and trust signals in existing workflows.
Shadow Sessions Data Audit Stakeholder Interviews AI Readiness Score
02 · Define
Intent Mapping & Human-in-the-Loop Strategy
Mapped analyst intent patterns across 240 audit sessions. Defined the exact intervention points where AI autonomy must yield to human judgment.
Intent Maps HITL Framework Failure Mode Analysis Trust Model
03 · Design
Generative UI & XAI Architecture
Built the Explainability Seam system — a UI pattern ensuring every AI output exposes its reasoning chain, confidence, and source attribution as first-class interface elements.
Explainability Seams Generative UI ProtoPie AI Flows XAI Patterns
04 · Test
Wizard of Oz Prototyping & Trust Calibration
Used WoZ methodology to simulate AI behaviour before model integration. Intentionally injected a 5.5% calculation error to test for Automation Bias. 100% detection rate.
WoZ Prototype Automation Bias Test AI-SUS Scoring Trust Calibration
The AI Adaptation: Why the Standard Diamond Fails
Classical Double Diamond assumes deterministic outputs at each stage. AI systems are fundamentally probabilistic — the "correct" design for a 94% confidence AI output is categorically different from a 67% confidence output. Aura's process introduces an AI Behaviour Calibration Loop between Define and Design: a phase where we audit model outputs against real document sets, map failure modes before UI design begins, and define the exact confidence thresholds that trigger different UI states. This prevents the critical mistake of designing for idealised AI performance.
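To make the calibration loop concrete, here is a minimal sketch in TypeScript, with hypothetical names and illustrative bucket edges; the case didn't ship this exact code. It buckets labelled model outputs by confidence and measures accuracy per bucket, so the thresholds that drive UI states come from observed behaviour rather than assumed performance.

```typescript
// Hypothetical calibration sketch: bucket labelled model outputs by
// confidence and measure accuracy per bucket, so the thresholds that
// trigger different UI states are derived from observed behaviour.

interface LabelledExtraction {
  confidence: number; // model-reported confidence, 0..1
  correct: boolean;   // verified against the ground-truth document set
}

interface BucketStats {
  range: string;
  count: number;
  accuracy: number;
}

function calibrationReport(
  samples: LabelledExtraction[],
  bucketEdges: number[] = [0.6, 0.75, 0.9, 0.98], // illustrative edges
): BucketStats[] {
  const edges = [0, ...bucketEdges, 1.0001]; // sentinel so 1.0 lands in the top bucket
  return edges.slice(0, -1).map((lo, i) => {
    const hi = edges[i + 1];
    const inBucket = samples.filter(s => s.confidence >= lo && s.confidence < hi);
    const correct = inBucket.filter(s => s.correct).length;
    return {
      range: `${Math.round(lo * 100)}-${Math.round(Math.min(hi, 1) * 100)}%`,
      count: inBucket.length,
      accuracy: inBucket.length > 0 ? correct / inBucket.length : NaN,
    };
  });
}
```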
Section 05 · System Architecture

Why the data
never leaves.

One decision I'm most proud of: the data never leaves the private cloud. I pushed for this early, even before the engineers had a strong opinion on it. Under GDPR, if client emissions data touches an external API, you have an egress problem. By making the architecture closed-loop from the start, we didn't just solve a compliance requirement — we made hallucination structurally impossible. The AI can only reason about documents that are already inside the system.

AURA · SYSTEM ARCHITECTURE DIAGRAM · CONFIDENTIAL
LAYER 1 💬 Analyst Intent: Natural Language Query Input & Document Upload
↓ NLP Intent
LAYER 2 🧠 AI Gateway: NLP Router · Intent Classification · Agent Orchestration
↓ Retrieval Query
LAYER 3 🔒 RAG Engine: Semantic Search · GDPR-Compliant · Internal Sources Only
↓ Structured Output
LAYER 4 📋 Data Card UI: Structured Output · Human Approval Gate · Legacy System Push

Feeding the RAG Engine: 📁 Internal Legacy Insurance Database · 📄 Uploaded Client PDF Documents

🔒 GDPR Compliant · No External Egress
All RAG queries resolved against internal data stores exclusively. Zero web access. Zero third-party API calls.
01
Why RAG Prevents Hallucination
Standard LLMs generate answers from parametric memory — they confabulate plausibly when uncertain. Our RAG architecture forces the model to retrieve before generating: it can only output values that exist verbatim in the indexed source documents. If a data point isn't in the internal knowledge base, Aura returns a structured "No verified source found" card — never a fabricated figure.
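A minimal sketch of that retrieve-before-generate contract, in TypeScript. The type and function names are illustrative, not the production API; the point is the control flow: no retrieval hit, no generated value.

```typescript
// Hypothetical sketch of the retrieve-before-generate contract. If
// retrieval returns nothing, the UI receives a structured refusal card
// instead of any generated value.

interface SourcePassage {
  docName: string; // e.g. "Bill_Q3_2024.pdf"
  page: number;
  text: string;
}

type ExtractionCard =
  | { kind: 'extracted'; value: string; confidence: number; source: SourcePassage }
  | { kind: 'no-verified-source'; query: string };

async function extractWithRag(
  query: string,
  retrieve: (q: string) => Promise<SourcePassage[]>, // internal index only
  generate: (q: string, ctx: SourcePassage[]) => Promise<{ value: string; confidence: number }>,
): Promise<ExtractionCard> {
  const passages = await retrieve(query);
  if (passages.length === 0) {
    // Nothing in the vetted corpus: return a structured refusal, never a guess.
    return { kind: 'no-verified-source', query };
  }
  const { value, confidence } = await generate(query, passages);
  // Pin the answer to its highest-ranked supporting passage for citation.
  return { kind: 'extracted', value, confidence, source: passages[0] };
}
```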
02
The Intent Router: Multi-Agent Orchestration
The AI Gateway classifies analyst queries into three agent tracks: Extraction (pull structured values from PDFs), Comparison (benchmark against sector norms in the legacy DB), and Compliance Check (verify against CSRD/SFDR thresholds). Each track has independent confidence thresholds and UI states, preventing a single model failure from cascading across the interface.
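Sketched below in TypeScript with illustrative thresholds (the production values were set during the calibration loop, and the names are hypothetical): each track carries its own bar, and anything under it renders in a review-required state.

```typescript
// Hypothetical sketch of the three-track router. Each track carries its
// own confidence threshold, so a weak extraction can't silently pass
// the stricter compliance-check bar.

type AgentTrack = 'extraction' | 'comparison' | 'compliance-check';

const TRACK_THRESHOLDS: Record<AgentTrack, number> = {
  // Illustrative values; the real thresholds came out of the
  // calibration loop described in Section 04.
  'extraction': 0.9,
  'comparison': 0.85,
  'compliance-check': 0.95,
};

interface RoutedQuery {
  track: AgentTrack;
  confidence: number;
}

type UiState = 'auto-suggest' | 'needs-review';

function uiStateFor({ track, confidence }: RoutedQuery): UiState {
  // Below the track's threshold, the card is rendered in a
  // review-required state rather than as a ready-to-approve suggestion.
  return confidence >= TRACK_THRESHOLDS[track] ? 'auto-suggest' : 'needs-review';
}
```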
03
The GDPR Compliance Guarantee
Julian's core requirement was zero data egress. We achieved this architecturally: the RAG retrieval engine operates entirely within the client's Azure private cloud, with no outbound API calls permitted at the network level. Compliance is enforced by infrastructure, not just policy — making it auditable and demonstrably verifiable to EU regulators.
04
The Human Approval Gate
No AI-generated value is written to the legacy database without explicit human approval. The Data Card UI presents each extraction as a discrete, reviewable unit with its source citation, confidence score, and a binary Approve/Flag control. This gate generates the immutable audit trail Julian requires — every decision is timestamped, attributed to a named analyst, and stored as an append-only compliance record.
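A minimal sketch of the gate's logic, assuming a TypeScript service layer; the field names and the in-memory log are illustrative stand-ins for the real append-only store.

```typescript
// Hypothetical sketch of the approval gate. Nothing reaches the legacy
// write path without an explicit analyst decision, and every decision
// appends an audit record.

interface AuditRecord {
  readonly timestamp: string;   // ISO 8601
  readonly analyst: string;     // named, never anonymous
  readonly decision: 'approved' | 'flagged';
  readonly sourceDoc: string;   // e.g. "Bill_Q3_2024.pdf, p.7"
  readonly value: string;
}

const auditLog: AuditRecord[] = []; // stand-in for the immutable compliance store

function approveCard(
  analyst: string,
  card: { value: string; sourceDoc: string },
  decision: 'approved' | 'flagged',
  pushToLegacy: (value: string) => void,
): AuditRecord {
  const record: AuditRecord = {
    timestamp: new Date().toISOString(),
    analyst,
    decision,
    sourceDoc: card.sourceDoc,
    value: card.value,
  };
  auditLog.push(record); // log first, so even a failed push stays traceable
  if (decision === 'approved') {
    pushToLegacy(card.value); // the only route into the legacy database
  }
  return record;
}
```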
Section 06 · The Core Design Challenge

15 years of constraints.
We didn’t touch it.

The legacy database was built in 2009. It's been extended seventeen times since then by teams who are mostly no longer at the company. Every global insurance calculation Marsh McLennan runs touches it somewhere. The engineers were clear: nothing writes to it directly. Not the AI, not any new code. We had to design the entire approval workflow around that constraint — which meant the human approval gate wasn't just a trust feature. It was the only safe path to the system.

BEFORE · Legacy CIS System
File Edit View Reports Database Help
New Record Search Import Export CSV Print
CLIENT REF    EMISS.   SCOPE   FY     STATUS   SRC
MMC-EU-4471   847      SC2     2024   PEND     –
MMC-EU-4472   ???      SC1     2024   ERR      PDF?
MMC-EU-4473   1,204    SC3     2023   DONE     xls
⚠ RECORD LOCK TIMEOUT — manual re-entry required. Source document reference lost.
AFTER · Aura Copilot Overlay
◈ AURA COPILOT
Overlay Mode · Read/Write API v2.4
Connected
Extracted Data · MMC-EU-4471 Awaiting Approval
Scope 2 Emissions 847 tCO₂e / yr
Source Document Bill_Q3_2024.pdf · Page 7, Para 2
Sector Benchmark 740 tCO₂e · +14.3%
Confidence Score 94.2% ✓
CSRD Flag ⚠ Article 29b Threshold Exceeded
▶ Approve & Push to Legacy DB
✓ AUDIT LOG: 14:32:07 · Sarah M. · Approved · Source: Bill_Q3_2024.pdf
01
API Overlay Architecture — Zero Backend Risk
Aura sits entirely above the legacy database as a read/write API overlay. Reads extract data for AI processing. Writes only occur when a named analyst explicitly approves a structured data card. The legacy schema is never modified — we insert, never restructure. This was the single design decision that unlocked engineering buy-in within two sprint cycles.
02
Structured Data Cards — The Bridge Between AI and Legacy Schema
Rather than requiring analysts to manually translate AI outputs into legacy field formats, Data Cards are pre-mapped to the legacy database schema at design time. Each card field has a corresponding legacy DB column, validated on push. Analysts work in Aura's clean interface; the legacy system receives clean, schema-validated records. The translation layer is invisible to the user.
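A sketch of what that design-time mapping might look like in TypeScript. The column names and validators here are illustrative; the real schema is internal to the client.

```typescript
// Hypothetical sketch of the design-time card-to-column mapping. Each
// card field names its legacy DB column and a validator, so the push is
// schema-checked before any insert.

interface CardFieldSpec {
  legacyColumn: string;
  validate: (raw: string) => boolean;
}

// Illustrative mapping; actual column names are internal to the client.
const SCOPE2_CARD: Record<string, CardFieldSpec> = {
  clientRef:  { legacyColumn: 'CLIENT_REF', validate: v => /^MMC-EU-\d{4}$/.test(v) },
  emissions:  { legacyColumn: 'EMISS',      validate: v => !isNaN(Number(v.replace(',', ''))) },
  scope:      { legacyColumn: 'SCOPE',      validate: v => ['SC1', 'SC2', 'SC3'].includes(v) },
  fiscalYear: { legacyColumn: 'FY',         validate: v => /^\d{4}$/.test(v) },
};

function toLegacyRow(card: Record<string, string>): Record<string, string> {
  const row: Record<string, string> = {};
  for (const [field, spec] of Object.entries(SCOPE2_CARD)) {
    const value = card[field];
    if (value === undefined || !spec.validate(value)) {
      throw new Error(`Card field "${field}" failed legacy schema validation`);
    }
    row[spec.legacyColumn] = value; // insert-only: no other columns are touched
  }
  return row;
}
```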
03
Progressive Adoption — The Desk-Level Change Management Strategy
We launched Aura in read-only mode for the first four weeks. Analysts could see AI extractions without any obligation to use them — removing adoption anxiety entirely. The write function was introduced only after trust was established through demonstrated accuracy. This phased approach drove 87% voluntary adoption in the first quarter without a single mandate from leadership.
Section 07 · Usability & Trust Testing

The dangerous failure
isn’t a wrong AI.

The thing I kept coming back to in research was this: the most dangerous thing isn't an AI that's wrong. It's an analyst who trusts a wrong AI without checking. Automation bias. We ran six weeks of Wizard of Oz testing — a human playing the role of the AI — specifically to find the moments where analysts would stop reading carefully. We found three. We fixed all three before a single model was trained.

Hallucination Detection Heatmap · Task T-03
Legend: High Dwell · Source Click
ESG AUDIT REPORT · Client MMC-EU-4471 · FY2024
Annual Energy Consumption: 12,440 MWh
Scope 1 Direct Emissions: 312 tCO₂e
Scope 2 Market-Based Emissions: 893 tCO₂e 📎 p.7
Renewable Energy Ratio: 34.2%
CSRD Compliance Threshold: At Limit (Article 29b)
We planted a 5.5% error in the Scope 2 value — 893 instead of the correct 847 tCO₂e. Every single analyst caught it, because they clicked through to the source. That's what the explainability seam is for. Not decoration. Not compliance theatre. The thing that makes the system catchable when it's wrong.
Human Override Rate Over 4 Weeks · Post-Launch Declining = Calibrated Trust Growth
Week 1: 68% · Week 2: 48% · Week 3: 29% · Week 4: 14%
Human override rate declined from 68% in Week 1 to 14% in Week 4 — a signal of calibrated trust growth, not complacency. The target band is 10–20%, maintaining meaningful human oversight while indicating AI reliability.
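The monitoring check behind that target band is small enough to sketch. A hypothetical TypeScript version, using the 10–20% band stated above; this is also the kind of override-rate tracking Thomas asked for in Section 03.

```typescript
// Hypothetical sketch of the override-rate monitor implied by the
// 10-20% target band: too high suggests distrust of the AI, too low
// suggests automation bias creeping in.

type TrustSignal = 'distrust' | 'calibrated' | 'possible-complacency';

function classifyOverrideRate(
  overrides: number,
  totalDecisions: number,
  band: { low: number; high: number } = { low: 0.10, high: 0.20 },
): TrustSignal {
  const rate = overrides / totalDecisions;
  if (rate > band.high) return 'distrust';
  if (rate < band.low) return 'possible-complacency';
  return 'calibrated';
}

// Week 4 from the chart above: 14% sits inside the band.
classifyOverrideRate(14, 100); // evaluates to 'calibrated'
```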
4.8/5
AI-SUS Score
"I felt in control
of the outcome"
100%
Error Detection Rate
Injected 5.5% anomaly
n=12 analysts
87%
Voluntary Adoption
Rate in Q1
No mandate required
14%
Final Override Rate
Week 4 · Within
Target Trust Band
94%
Average AI Output
Confidence Score
in Production
🧙
Wizard of Oz Methodology — Testing Before Model Integration
Before the Azure OpenAI model was integrated, a human "wizard" simulated AI responses behind the Aura interface during testing sessions. Analysts believed they were interacting with the live AI. This allowed us to test the UI's trust calibration mechanics, the source citation interaction patterns, and the approval gate under realistic conditions — without any model hallucination risk. The WoZ sessions revealed that analysts required confidence scores to be visible within 2 seconds of query submission, a latency requirement we fed directly into the engineering SLA.
Section 08 · Results & ROI

What actually
changed.

65%
Reduction in Audit Processing Time
4.2h → 1.4h per case
Measured: Jira Time Tracking · n=47 audits
40%
Increase in Anomaly
Detection Rate
Measured: Compliance Review · 6 months
3×
Client Onboarding Throughput
No headcount increase
Measured: CIS Operations Report · Q3 2025
£0
Backend Infrastructure
Spend Required
Engineering Sign-off · API Overlay Only

Secure Enterprise Tool Stack

Every tool was mandated to be enterprise-licensed. Consumer-grade AI tools were explicitly prohibited by Legal — any tool processing client data required contractual GDPR compliance and EU data residency guarantees.

🎨
Figma AI Enterprise
UI Design · Prototyping
Used for all high-fidelity UI design. Enterprise license ensures design file data never transits Figma's consumer AI training pipeline. Variables & component tokenisation directly mirrors the Aura design system for handoff fidelity.
✓ GDPR · EU Data Residency
⚙️
ProtoPie Enterprise
AI Interaction Prototyping
Prototyped all AI interaction states — loading, confidence thresholds, error states, and the Wizard of Oz simulation layer. ProtoPie's sensor-driven interactions accurately simulated the latency and state-change patterns of the live Azure OpenAI integration before engineering build.
✓ ISO 27001 · SOC 2 Type II
🧠
Azure OpenAI
LLM · RAG Infrastructure
The underlying language model for the RAG engine. Critical distinction: Azure OpenAI's private deployment means the client's data is never used for model training and remains within the European Azure region. This was the only LLM provider that met Legal's contractual requirements.
✓ Azure EU Region · Data Isolation
📊
Dovetail Enterprise
Research Synthesis · AI Analysis
All user research recordings, interview transcripts, and contextual inquiry notes were synthesised in Dovetail's secure enterprise environment. Dovetail AI was used to cluster pain points and surface patterns across 240+ analyst touchpoints — a process that would have taken weeks manually.
✓ SOC 2 · No Data Training
🔬
Maze Enterprise
Usability Testing · Analytics
Quantitative usability testing at scale. The hallucination heatmap and trust calibration data were generated through Maze's analytics suite. AI-assisted session analysis surfaced the 2-second confidence score latency requirement from behavioural heatmap data that manual analysis would have missed.
✓ GDPR Article 5 Compliant
📐
Figma Variables API
Design System · Tokenisation
Aura's design system tokens (colours, spacing, typography scale) are stored as Figma Variables and exported via the Variables API directly into the engineering team's CSS custom properties. This eliminated the 2-week design-to-dev token reconciliation that previously caused visual regressions at every sprint boundary.
→ Direct Figma → CSS Handoff
Section 09 · The Design System Specification

Every element earns
its place or it goes.

I had one rule for every visual decision in the interface: does this reinforce trust, or does it communicate uncertainty? If it does neither, it doesn't belong on the screen. No decorative anything. The confidence bar isn't branding. The source citation isn't metadata. They're the product.

01 · Colour Palette · Figma Variables Ready
BG/Void: #040D1A
BG/Base: #070F1E
BG/Surface: #0D1B2E
BG/Elevated: #112240
AI/Cyan·Glow: #00D4FF
AI/Cyan·Mid: #0EA5D4
AI/Cyan·Dark: #0277A0
Text/Primary: #E8F0FE
Text/Secondary: #8892A4
Semantic/Success: #00D084
Semantic/Warning: #FFAA00
Semantic/Error: #FF4D6A
📐 Figma Variable Mapping: All 12 swatches map to CSS custom properties (--bg-void, --cyan-glow, etc.) via the Figma Variables API export. Semantic colours are defined as Figma Variable Aliases pointing to their base colour tokens, enabling theme switching without changing component references.
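A minimal sketch of that export step in TypeScript, using the palette values above. The function is illustrative, not the actual Figma Variables API pipeline, and the token names beyond --bg-void and --cyan-glow are assumptions following the same convention.

```typescript
// Hypothetical sketch of the token export: a flat token map serialised
// into CSS custom properties, mirroring the Figma Variables handoff
// described above.

const colourTokens: Record<string, string> = {
  'bg-void': '#040D1A',
  'bg-base': '#070F1E',
  'bg-surface': '#0D1B2E',
  'bg-elevated': '#112240',
  'cyan-glow': '#00D4FF',
  'cyan-mid': '#0EA5D4',
  'cyan-dark': '#0277A0',
  'text-primary': '#E8F0FE',
  'text-secondary': '#8892A4',
  'semantic-success': '#00D084',
  'semantic-warning': '#FFAA00',
  'semantic-error': '#FF4D6A',
};

// Emits a :root block of CSS custom properties, one per token.
function toCssCustomProperties(tokens: Record<string, string>): string {
  const lines = Object.entries(tokens).map(([name, hex]) => `  --${name}: ${hex};`);
  return `:root {\n${lines.join('\n')}\n}`;
}
```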
02 · Typography Scale · 8pt Grid Aligned
Display / Hero · Syne 800 · clamp(3–6rem) · Sample: "Aura System"
Heading 1 · Syne 700 · clamp(2.2–3.75rem) · Sample: "Section Title"
Heading 2 · Syne 700 · 1.5rem / 24px · Sample: "Card Title"
Body / Regular · DM Sans 400 · 1rem / 16px · Sample: "Paragraph text for analysis and descriptions across the interface."
Body / Small · DM Sans 400 · 0.875rem / 14px · Sample: "Supporting text, captions, and secondary labels."
Mono / Data · JetBrains Mono 500 · 0.8125rem / 13px · Sample: "847 tCO₂e · 94.2% · MMC-EU-4471"
Mono / Label · JetBrains Mono 400 · 0.75rem / 12px · Sample: "SCOPE 2 EMISSIONS · GDPR"
03 · Spacing Tokens · 8pt Grid System
sp-1: 8px · sp-2: 16px · sp-3: 24px · sp-4: 32px · sp-5: 40px · sp-6: 48px · sp-8: 64px · sp-10: 80px
04 · Component Library · Production States
Button / Primary
Button / Secondary
Button / Danger
Button / Ghost
Input / Default
Status Badges
Verified · Pending Review · Flagged · AI Extracted · CSRD Review
🗂 Figma Layout Blueprint: All components use Auto Layout with 8px base padding increments. Components are built with Variants covering: Default, Hover, Focused, Loading, Disabled, and Error states. The AI Confidence Score component has 5 Variants mapped to confidence bands: <60% (Error), 60–74% (Warning), 75–89% (Neutral), 90–97% (Success), 98–100% (Verified). Border radius uses a 6px / 12px / 20px token scale mapped to Component / Card / Modal hierarchy respectively.
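A minimal sketch of that Variant mapping in TypeScript, using the exact bands listed above; the function name is illustrative.

```typescript
// Maps an AI confidence score (in percent) to the five component
// Variants defined in the blueprint above.

type ConfidenceVariant = 'Error' | 'Warning' | 'Neutral' | 'Success' | 'Verified';

function confidenceVariant(score: number): ConfidenceVariant {
  // Bands as specified: <60, 60-74, 75-89, 90-97, 98-100 (percent).
  if (score < 60) return 'Error';
  if (score < 75) return 'Warning';
  if (score < 90) return 'Neutral';
  if (score < 98) return 'Success';
  return 'Verified';
}

confidenceVariant(94.2); // 'Success', the band shown on the MMC-EU-4471 card
```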