Back in March I wrote about building cross-platform AI context persistence: the why, the philosophy, the high-level architecture. That article was deliberately written for a broad audience. This one isn't.
The original skipped the parts a developer would actually need to evaluate this for themselves. So here's the technical update: what's stored, how it's retrieved, what the wiring looks like, and what works today.
A note up front: this is what I built for me. It works for me. It's not a product. If something in here sparks an idea for your own setup, that's the win.
The mechanism people miss
You might be asking yourself: "if you're sending the database along with each request, how does that help your context window?"
You're not. That's the whole point.
The AI tool calls an MCP tool mid-conversation. It says "I need context about X." The MCP server queries Postgres, runs a vector similarity search against stored conversation embeddings, hydrates the matching records, and returns only the slice the AI asked for. The conversation window stays small. The database scales independently. You're not pre-loading anything.
This is what MCP is for. It's not a context-stuffing protocol; it's a tool-calling protocol. The AI requests context on demand, the way you'd query a database from any application.
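Concretely, the tool call is a plain JSON-RPC message over the MCP transport. This is roughly what goes over the wire when the agent asks for context (the query string here is just an example):

```json
{
  "jsonrpc": "2.0",
  "id": 42,
  "method": "tools/call",
  "params": {
    "name": "get-context",
    "arguments": { "query": "auth middleware decisions" }
  }
}
```

The server answers with only the matching records. Nothing else from the store touches the context window.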
What works today
Any MCP-capable development tool can connect. In my stack that's Claude Code (CLI and VS Code extension), Gemini CLI, Google Antigravity, Cline, and Roo Code, but the architecture isn't bound to a specific list. The bridge is tool-agnostic. All of them can read and write to the same memory.
This is the part developers juggling three or four agentic tools actually need. You start a refactor in Claude Code, switch to Antigravity for the UI work, and Antigravity already has the architectural decisions on hand.
What's actually stored
When a conversation is saved, a decomposer prompt parses the transcript and produces a structured summary. That summary is written as one row in the 'conversations' table. Here's the shape of what gets stored:
```json
{
  "claude_conversation_uuid": "uuid-v4",
  "source": "claude_code | antigravity | cline | roo | gemini_cli | claude_mobile",
  "project": "project_identifier",
  "conversation_type": "single_topic | multi_topic",
  "topic": "2-5 word topic",
  "categories": ["tag1", "tag2"],
  "decisions": [
    {
      "decision": "what was decided",
      "reasoning": "why"
    }
  ],
  "open_questions": ["unresolved item"],
  "related_entities": ["person", "system", "file", "concept"],
  "summary": "2-3 paragraph synthesis",
  "full_transcript": "optional comprehensive record"
}
```
Multi-topic conversations nest the same structure: an 'overall_summary', 'cross_topic_insights', and a 'topics' array where each entry has its own topic, categories, decisions, open questions, and summary. The whole thing still writes as one row: the topic breakdown is internal structure, not separate records.
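A sketch of that multi-topic shape, using the field names just described (the nesting is paraphrased from the description, not copied verbatim from my schema):

```json
{
  "claude_conversation_uuid": "uuid-v4",
  "conversation_type": "multi_topic",
  "overall_summary": "synthesis across all topics",
  "cross_topic_insights": ["connection spanning topics"],
  "topics": [
    {
      "topic": "2-5 word topic",
      "categories": ["tag1"],
      "decisions": [{ "decision": "what was decided", "reasoning": "why" }],
      "open_questions": ["unresolved item"],
      "summary": "per-topic synthesis"
    }
  ]
}
```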
The fields that earn their keep:
- 'decisions': what got concluded, with reasoning. This is what AI tools pull when they need to know "what did we decide about X, and why?"
- 'open_questions': explicit markers for unresolved items, so future sessions can pick them up instead of re-litigating
- 'related_entities': people, systems, files, concepts. The semantic glue between conversations
- 'categories': discovery tags, not formal taxonomy
Here's a real record from my store, as the decomposer produced it:

Topic: Implementation of Secretary Auto-Invocation in Quality Gates

Summary: Integrated Secretary services (SecretaryBriefService, SecretaryUpdateService, SecretaryContextService) into the Judge (TaskReviewService), Sheriff (TaskSecurityService), and Task Completion (TaskManagementService) quality gates. The implementation fetches a Secretary brief before gate execution and updates the context after. A graceful degradation pattern was adopted to handle Secretary service unavailability without failing the entire process. New mock-based tests were created, and five existing test files were updated to accommodate new constructor dependencies.

Categories: backend, feature, architecture, testing

Decisions:
- A graceful degradation pattern (try-catch on SecretaryException) was used for all Secretary service calls within the quality gates. Reasoning: To prevent the entire quality gate process from failing if the Secretary service is unavailable, ensuring existing workflows are not broken.
- Utilized mock-based testing for the new integrations instead of full integration tests. Reasoning: To avoid the complexity and dependency of setting up a PostgreSQL test database for automated tests.
- Kept the Sheriff integration minimal, only adding the service via dependency injection. Reasoning: The full security review flow where the service would be actively used is not yet implemented, so a full integration was not yet necessary.
- Rejected creating a shared trait or middleware for Secretary injection. Reasoning: This was considered over-engineering for the current scope.
- Rejected making Secretary a mandatory, blocking component for quality gates. Reasoning: This would break existing workflows for projects where Secretary has not been initialized.

Open Questions:
- End-to-end testing of the full Secretary pipeline with real AI calls is still needed.
- Prompt slugs (e.g., secretary-distill-spec, secretary-build-judge-brief) need to be verified in the production database.
- Pre-existing failures in TaskReviewServiceMockTest.php related to AiHubClient::$token need to be addressed in a separate fix.

Related Entities: app/Services/Tasks/TaskReviewService.php, app/Services/Security/TaskSecurityService.php, app/Services/Tasks/TaskManagementService.php, SecretaryBriefService, SecretaryUpdateService, SecretaryContextService, tests/Feature/QualityGates/SecretaryAutoInvocationTest.php, COMMON_PITFALLS.md, Pest.php
Notice the rejected decisions. That's a feature, not noise. Most knowledge systems only capture what got done. Capturing what got considered and rejected, with reasoning, is what makes this useful three months later when you're staring at the same architectural choice from a different angle. Future-you doesn't have to re-derive why the shared trait was a bad idea.
Why not just chunk?
Standard RAG wisdom says: chunk the document into fixed-size pieces, embed each chunk, retrieve the top matches. It works well for reference material (documentation, articles, knowledge bases) where the meaningful unit is a paragraph or section.
Conversations are different. The meaningful units are decisions, open questions, rejected paths, and the reasoning behind them. Those don't live in tidy 500-token windows. A decision and its reasoning might be three exchanges apart. A rejection might happen 2,000 tokens after the original proposal. Chunk that conversation by character count and you get fragments that read like nonsense out of context: "we agreed to use the second approach." What approach? With what reasoning?
The decomposer doesn't avoid breaking conversations apart. It breaks them apart along the seams that matter: by semantic role rather than by length. A decision stays attached to its reasoning. An open question stays attached to its topic. The embedding text aggregates all of those, so search finds the conversation regardless of which semantic surface the query hits.
This is the part that took me a while to figure out. The first version stored full transcripts and embedded them whole. Retrieval was vague and inconsistent, exactly the "blurry average" problem. The second version chunked them. Retrieval improved on recall but the chunks came back stripped of context. The current version is the third try, and it's the one that actually works for this use case.
If your data is genuinely document-like, with large blocks of prose where any section can stand alone, standard chunking is probably fine. If your data is transactional, however (conversations, decisions, threads: anywhere meaning depends on what got rejected as much as what got chosen), then structure matters more than length, and you need a decomposer that knows the difference.
The decomposer
Naive vector search over whole-conversation summaries has a problem: a single conversation that covers three topics gets compressed into one embedding, and that embedding represents a blurry average of all three. Search for one of those topics specifically and the conversation might not rank well.
The decomposer doesn't solve this by splitting conversations into separate rows. It solves it by enriching the embedding text. When the embedding gets generated, it concatenates the topic, summary, overall summary, all categories, all decisions, all open questions, and all related entities into one block of text, then vectorizes that. The resulting embedding captures every distinct semantic surface the conversation touched, not just an averaged summary.
For multi-topic conversations, this is where the 'topics' array earns its place. Each topic contributes its own decisions, questions, and entities to the embedding text. The result is one row with much higher recall across diverse queries.
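Here's a minimal sketch of that aggregation in TypeScript. The field names follow the schema above; the concatenation logic is my illustration of the idea, not the production code:

```typescript
interface Decision { decision: string; reasoning: string; }

interface TopicBlock {
  topic: string;
  categories: string[];
  decisions: Decision[];
  open_questions: string[];
  summary: string;
}

interface ConversationRecord {
  topic?: string;
  summary?: string;
  overall_summary?: string;
  categories?: string[];
  decisions?: Decision[];
  open_questions?: string[];
  related_entities?: string[];
  topics?: TopicBlock[]; // present on multi_topic records
}

// Flatten every semantic surface of the record into one string,
// so a single embedding can match queries against any of them.
function buildEmbeddingText(record: ConversationRecord): string {
  const parts: string[] = [
    record.topic ?? "",
    record.overall_summary ?? "",
    record.summary ?? "",
    ...(record.categories ?? []),
    ...(record.decisions ?? []).map(d => `${d.decision} ${d.reasoning}`),
    ...(record.open_questions ?? []),
    ...(record.related_entities ?? []),
  ];

  // Multi-topic records contribute each topic's surfaces too
  for (const t of record.topics ?? []) {
    parts.push(
      t.topic,
      t.summary,
      ...t.categories,
      ...t.decisions.map(d => `${d.decision} ${d.reasoning}`),
      ...t.open_questions,
    );
  }

  return parts.filter(Boolean).join("\n");
}
```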
Coding agents tend to produce single-topic records: a focused session on one bug, one refactor, one integration. Exploratory conversations with broader-purpose tools tend to produce multi-topic records. Both shapes work. The decomposer figures out which is which.
Who does the actual work
The decomposer isn't an algorithm. It's a prompted LLM call. When 'save-conversation' fires, the transcript and a strict instruction prompt go to whatever model is currently routed for this task through AIHub. The model returns JSON matching the schema above. AIHub validates it, stores it, and generates the embedding.
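The pipeline itself is small. Here's a sketch, assuming an OpenAI-compatible chat endpoint behind AIHub's routing; the URL constant, prompt constant, and validation helper are stand-ins, not AIHub's actual names:

```typescript
// Stand-in names: AIHUB_CHAT_URL, DECOMPOSER_PROMPT, and
// validateRecord are illustrative, not the real API surface.
declare const AIHUB_CHAT_URL: string;
declare const DECOMPOSER_PROMPT: string;
declare function validateRecord(value: unknown): asserts value is object;

async function decompose(transcript: string): Promise<object> {
  const response = await fetch(AIHUB_CHAT_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      // AIHub decides which model actually serves this call
      messages: [
        { role: "system", content: DECOMPOSER_PROMPT },
        { role: "user", content: transcript },
      ],
    }),
  });

  // The model is instructed to return JSON matching the schema above
  const body = await response.json();
  const record = JSON.parse(body.choices[0].message.content);
  validateRecord(record); // reject malformed output before storing
  return record;
}
```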
This matters because the quality of every record depends on the model and the prompt. A weaker model produces flatter decisions and vaguer summaries. A stronger model surfaces nuance, including the rejected paths and the reasoning behind them. The prompt itself has been through several revisions; what you see in the example record above is the current output, which is markedly better than what the first version produced.
Embeddings are generated separately, by a dedicated embedding model. Search-time hydration is just SQL.
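Something in this vein, using pgvector's cosine-distance operator; the table and column names match the schema above, but the exact query is an approximation of mine, not a copy:

```sql
-- $1 is the embedding of the search query; <=> is pgvector's
-- cosine-distance operator, so smaller means more similar.
SELECT claude_conversation_uuid, topic, summary,
       decisions, open_questions, related_entities
FROM conversations
ORDER BY embedding <=> $1
LIMIT 5;
```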
The bridge
The cross-tool part requires a hosted MCP server reachable from every tool. I run one. HMAC authentication, scoped access tokens, obscured paths. Standard infrastructure.
The bridge exposes three tools:
- 'generate-conversation-id': called at the start of a session to mint a stable UUID. Lets multiple agent calls within one session reference the same conversation before it's saved.
- 'get-context': vector search plus retrieval. Takes a query, returns the top matching records hydrated into structured markdown.
- 'save-conversation': accepts the structured JSON, runs it through the decomposer, generates the embedding, writes to Postgres.
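To make the shape concrete, here's roughly what registering those tools looks like with the TypeScript MCP SDK. The handler bodies and helper functions are placeholders, and my actual bridge sits behind HMAC auth on an HTTP transport rather than stdio:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { randomUUID } from "node:crypto";
import { z } from "zod";

// Placeholder helpers standing in for the Postgres/pgvector layer
declare function searchMemory(query: string): Promise<string>;
declare function saveRecord(record: unknown): Promise<void>;

const server = new McpServer({ name: "memory-bridge", version: "1.0.0" });

server.tool(
  "generate-conversation-id",
  "Mint a stable UUID for this session",
  async () => ({ content: [{ type: "text", text: randomUUID() }] })
);

server.tool(
  "get-context",
  "Vector search over stored conversation memory",
  { query: z.string() },
  async ({ query }) => ({
    content: [{ type: "text", text: await searchMemory(query) }],
  })
);

server.tool(
  "save-conversation",
  "Decompose, embed, and store a conversation record",
  { record: z.string() },
  async ({ record }) => {
    await saveRecord(JSON.parse(record));
    return { content: [{ type: "text", text: "saved" }] };
  }
);

await server.connect(new StdioServerTransport());
```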
How agents know to use it
A tool's existence doesn't guarantee its use. You can wire up the MCP server, connect every agent, and watch them happily ignore the whole thing if they don't know they're supposed to call it.
The trigger lives in the project's instruction file. CLAUDE.md, AGENTS.md, whatever your tool's convention is. Two lines do most of the work:
- At the start of any task or question, call 'get-context' with a relevant query first. This makes retrieval reflexive rather than opportunistic. The agent doesn't wait to be asked; it checks memory before answering.
- At the end of any significant session, call 'save-conversation' with a structured summary. This makes saving routine instead of dependent on the user remembering to ask.
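In a CLAUDE.md, that might read something like this; the exact wording is illustrative, not a prescription:

```markdown
## Memory

- At the start of any task or question, call `get-context` with a
  relevant semantic query before answering.
- At the end of any significant session, call `save-conversation`
  with a structured summary of decisions and open questions.
```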
There's a refinement worth mentioning. You probably don't want the agent calling 'get-context' for every trivial message ("yes," "thanks," "can you reformat this"). The instruction file can be more specific: search when the user references prior work, asks a question that depends on project state, or starts a new task. The exact phrasing depends on how disciplined your tooling is about following instructions, and how chatty you want the search behavior to be.
This is also where the .md file pattern earns its keep beyond memory access. The same file holds project conventions, code style rules, build commands, and now memory-access habits. It's the static layer that tells the agent how to behave; IMaaS is the dynamic layer that tells the agent what's been decided. Together they turn a generic agent into a project-aware one.
Storage split
Small stuff in Postgres, large stuff in object storage. The threshold is rough: anything under ~10k tokens lives in the database; larger transcripts get pushed to S3 with a reference in the metadata. The summaries are always in Postgres because they're what searches hit. Full transcripts are only fetched when explicitly requested.
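The routing is a few lines. A sketch, with the token heuristic and helper names as stand-ins:

```typescript
const INLINE_TOKEN_LIMIT = 10_000; // the rough threshold

// Crude heuristic: roughly four characters per token for English text
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Stand-in helpers for the Postgres and S3 layers
declare function storeInline(id: string, transcript: string): Promise<void>;
declare function uploadToS3(id: string, transcript: string): Promise<string>;
declare function storeReference(id: string, s3Key: string): Promise<void>;

async function storeTranscript(id: string, transcript: string): Promise<void> {
  if (estimateTokens(transcript) <= INLINE_TOKEN_LIMIT) {
    await storeInline(id, transcript); // small: lives in Postgres
  } else {
    const key = await uploadToS3(id, transcript); // large: object storage
    await storeReference(id, key); // only the reference stays in metadata
  }
}
```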
This keeps query latency low. Vector search runs against tightly-scoped summary embeddings, not bloated transcripts.
What this doesn't solve
Since this is built for my own purposes, there are honest limits:
- It's not magic recall. If a conversation got saved with a bad summary, retrieval will surface a bad summary. The quality of what comes back is bounded by the quality of what went in.
- Retrieval depends on good queries. A vague prompt like "check memory for that thing we discussed" returns worse results than a real semantic query.
- Project scoping isn't a security boundary. Records are tagged by project, but the database doesn't enforce isolation; anything in the store is searchable by anything with bridge access. The auth layer controls who can query; it doesn't constrain what they see. For a single-operator setup that's fine. For multi-tenant use, you'd need a different model. (The next iteration solves this with tenant-scoped access, see the closing note.)
- The decomposer can be wrong. It's an LLM doing classification and structuring, which means it occasionally marks a multi-topic conversation as single-topic, misses a decision, or summarizes flatly. The current setup has no quality gate on the output. The next iteration runs decomposer output through Quality Gates review before commit, which is a meaningful upgrade.
- It's bespoke infrastructure. Not a package, not a skill, not installable. The MCP client side could absolutely be packaged as a skill or VS Code extension; that's a reasonable thing to build if anyone wants to. The backend you'd run yourself: Laravel, Postgres with pgvector, an MCP server, and a bit of routing logic. None of those pieces are exotic.
One more thing
This architecture is what works today. It's also being consolidated. A separate effort is unifying AIHub's multiple services, multiple uploaders, multiple overlapping responsibilities into a single interoperable codebase. That work has its own article coming. When it lands, some of what's described here will be replaced by simpler primitives.
For now, this is the current state. If any of it sparks an idea for your own setup, that's the win.