GCP Architecture for a Single-User, Multi-Source Conversation Archive

Executive Summary

BigQuery is the right hub, with two caveats. At 10⁴–10⁵ chunks the vector index will populate (rows > 5,000, base table > 10 MB), but "Search indexes are designed for large tables; if the indexed base table is smaller than 10GB, BigQuery does not populate the index and coverage_percentage remains 0." So the SEARCH function works, but its inverted index won't accelerate it at this corpus size.
Don't pick Vertex Vector Search at this scale. "Vector Search pricing … is infrastructure-based … billed per node-hour. … A moderately sized index on three replicas runs roughly $700-$800/month." and "when using Vector Search, you are billed for 2 replicas." BigQuery's "1-10 seconds, not the sub-100ms latency of specialized vector databases" is fine for MCP/RAG.
Vertex AI Search ("Agent Search") gives turnkey grounded answers but breaks Tokyo residency. "All machine learning (ML) processes, such as training, prediction, and tuning a model, occur within the US when requests are made to regionally-available APIs in the US, or within the EU when requests are made to regionally-available APIs in Europe." — no Tokyo region. Use it only on a redacted mirror.
Hybrid retrieval (full text ⊕ vector via RRF) raises recall meaningfully. "Hybrid search hits 91% recall@10 vs 78% dense-only" and it's expressible in a single BigQuery SQL query with ROW_NUMBER() CTEs.
Chunking is the lever. Conversation logs need turn-group + token-window hybrid chunking; tool calls as first-class chunks; metadata (source, gitBranch, tool_name, permissionMode, isSidechain, model, ts) drives both pre-filtering and analytics.

Q1 — Decision Matrix (abridged; full table in the file)

Axis	BigQuery hub	Vertex AI Search	Vertex Vector Search	AlloyDB + ScaNN
Hybrid full-text + vector + analytics	✓✓✓	✓✓ (no analytics)	✗	✓✓
Online latency p95	1–10 s	sub-second	< 100 ms	tens of ms
Ops burden	serverless	managed	always-on endpoint	cluster
Cost at ~10⁵ chunks single user	pennies	free tier (10k q/mo)	~$700+/mo idle	~$100–300+/mo idle
Tokyo region	✓	✗ (US/EU/global)	✓	✓
MCP integration	first-party	tool-wrapped	tool-wrapped	tool-wrapped
Grounded answers turnkey	build it	first-class	n/a	n/a

The BigQuery hub wins on 9/12 axes. "Agent Search works as an out-of-the-box semantic & keyword search engine for RAG with the ability to process a variety of document types and with connectors to a variety of source systems including BigQuery" — but the residency constraint kicks it down to a secondary surface.

Q2 — BigQuery Limits to Plan Around

Vector index minimums: "If you create a vector index on a table that is smaller than 10 MB, then the vector index isn't populated." + "BigQuery only allows to create indexes on tables with at least 5000 rows" — you clear both.
Search index minimum 10 GB — you won't clear this, so SEARCH does brute scan; still functionally fine.
Auto-refresh on inserts: "As new data is ingested, the index automatically and asynchronously refreshes to reflect the changes."
Subquery anti-pattern: "Using temporary tables (e.g. nested subquery) bypasses the index, leading to brute force searches" — always reference the base table directly.
Recall tuning: "you can increase the value of the vector search's fraction_lists_to_search option to improve recall, but this typically makes the search query slower."
Pre-filter + STORING: "for a table with 1 billion 96-dimensional embeddings, returning the 1,000 most similar candidates for an embedding is ~4x faster with ~200x less slots using vector indexes with stored columns"
Index choice: "For small batch queries with the TreeAH index type, VECTOR_SEARCH or AI.SEARCH might revert to brute-force search." → IVF is correct for interactive MCP.
Embedding region rule: "The model and input table must be in the same region." Keep everything in asia-northeast1.
In-region multilingual model: "The text-multilingual-embedding-002 model additionally supports and has been evaluated on the following languages: Arabic (ar), Bengali (bn), English (en), … Japanese (ja), Korean (ko)…"
Strongest residency: "The only supported value is the embeddinggemma-300m model. … your data stays in BigQuery and your slots are used to create the embeddings; no data is sent to Agent Platform"
Real cost drivers: bytes scanned (mitigated via STORING + partition prune), Vertex embedding calls, and AI.GENERATE Gemini calls ("To strictly control the number of rows processed by your complex queries, we recommended that you write the results of your query to a separate table beforehand, and then perform the Gemini inference directly on that materialized table.")
Escape hatch when BQ stops being adequate: "use BigQuery integration with Gemini Enterprise Agent Platform Vector Search to import your BigQuery embeddings into Vector Search and deploy low latency endpoints."

Q3 — Consumption Pattern

Two surfaces, one corpus:

Claude Code via MCP → first-party BigQuery MCP server. "The BigQuery remote MCP server is enabled when you enable the BigQuery API." Tools include "ask_data_insights: perform data analysis, get insights, or answer complex questions about the contents of BigQuery tables."
Cited Q&A → Cloud Run service that does hybrid RRF in BigQuery → Gemini in asia-northeast1 → Check Grounding API.

Where Vertex AI Search beats a custom path:

First-class grounding + citations: "The Check Grounding API provides an overall support score (0-1) for how well an 'answer candidate' (e.g., an LLM response) aligns with 'facts' (reference texts)."
Built-in ranker: "the ranking API can give you precise scores for how well a document answers a given query."
BQ-native connector: "if your data is already in one of Google's databases like Cloud SQL, Spanner, or BigQuery, you can access it via connectors to Vertex AI Search"
Free tier: "Vertex AI Search includes 10,000 free queries per month."

Why it's still secondary here: the residency constraint above, and no analytics. Note also that the Ranking API "is stateless, meaning documents do not need to be indexed by Vertex AI Search before calling it. This makes it suitable for reranking documents retrieved from other search solutions, such as Vector Search." — i.e. you can use the Vertex Ranking API on top of BigQuery candidates without using Vertex AI Search.

Q4 — Normalization & Chunking

Schema: one messages fact table with source, conversation_id, parent_message_id, role, ts, model, text, tool_name, tool_input (JSON), tool_output, plus the Claude Code metadata (is_sidechain, permission_mode, git_branch, tokens). Keep a lossless raw JSON column — "you can load semi-structured JSON into BigQuery without providing a schema for the JSON data upfront. This lets you store and query data that doesn't always adhere to fixed schemas and data types."

Chunking — multi-strategy, since "applying 'the same chunking logic to all data' … chunking must adapt to the structure of the underlying text". Use the Slack-RAG pattern: "First, chunk by thread. Then, within threads, chunk by token count if they're too big. For non-threaded conversations, use timestamp-based chunking to group messages naturally." Concretely:

Primary unit = conversation turn group (user + assistant + adjacent tool calls).
Secondary = 300–500 token sliding window with 50–100 token overlap ("20–50% overlap … Chunk sizes of 200–400 tokens are common defaults").
Tool calls as dedicated chunks (so you can later filter "all bash commands run via Claude Code on branch X").

Metadata-driven retrieval: use STORING on the vector index for filterable columns ("storing frequently accessed columns with the index alongside the embeddings improves performance and reduces costs by avoiding base table scans"), partition by day, cluster by source + conversation_id. The hybrid SQL pattern (full RRF query in the file) combines SEARCH(text, q) + VECTOR_SEARCH(...) ranks via 1 / (60 + rank) — robust because "RRF sidesteps the score normalization problem entirely. Instead of combining raw scores, it looks only at rank positions."

For derived analytics (vendors, amounts, tool inventory, topics), use AI.GENERATE_TABLE: "AI.GENERATE can also generate structured output. By using the output_schema parameter, you can define the names and types of output fields, making the results immediately parseable and ready for use in downstream applications."

Target Architecture

asia-northeast1 (Tokyo)
GCS (existing) → Cloud Run job (daily) → BigQuery `archive`:
  raw_messages (JSON) → messages (typed, partitioned) → chunks (text + embedding)
                                                       └ VECTOR INDEX IVF/COSINE STORING(...)
  derived_facts (AI.GENERATE_TABLE outputs)
Cloud Run `archive-mcp`: BQ MCP server + /retrieve (hybrid RRF) + /answer (Gemini + Check Grounding)
   ↑ MCP (SSE/stdio): Claude Code / Claude Desktop / any MCP client

PoC Sequence

Dataset + connection in asia-northeast1, grant connection SA roles/aiplatform.user.
Cloud Run job: GCS → raw_messages/messages (lossless JSON column).
SQL MERGE produces chunks with the turn-group + window rules.
Embed via text-multilingual-embedding-002 with task_type=RETRIEVAL_DOCUMENT (or embeddinggemma-300m for strongest residency).
CREATE VECTOR INDEX ... OPTIONS(index_type='IVF', distance_type='COSINE') STORING(...).
Evaluate recall@10 vs use_brute_force=true on a 20-query golden set; tune fraction_lists_to_search.
Hybrid SQL stored procedure (RRF, see file).
Wire BQ MCP server into Claude Code (.mcp.json), plus a custom Cloud Run MCP exposing /retrieve and /answer.
Gemini answer step + Check Grounding ("latency less than 500ms") for citation scoring.
(Optional) Vertex AI Search on a redacted mirror in US/EU.

Priority Ranking

Must-have: BQ dataset in Tokyo · normalized schema with source + tool-call structure · multilingual embeddings with retrieval task types · IVF vector index with STORING · BQ MCP wired into Claude Code · IAM-tight service accounts.

Should-have: hybrid RRF retrieval · Cloud Run MCP for reusable retrieve/answer tools · Vertex Ranking API as reranker on top-50 · eval harness with golden set · AI.GENERATE_TABLE derived facts · Check Grounding on every answer.

Nice-to-have: Vertex AI Search on a redacted mirror · promotion of hot embeddings to Vertex Vector Search if interactive latency becomes a real complaint · autonomous embedding generation on the text column · multimodal upgrade to gemini-embedding-2-preview · custom Japanese tokenization via PATTERN_ANALYZER/NFKC if you adopt full-text scanning heavily.

All references (BigQuery vector + FTS docs, Vertex AI Search/grounding docs, Vector Search pricing, AlloyDB ScaNN posts, MCP guides, embedding model docs, RRF/chunking literature) are in the deliverable at /mnt/session/outputs/gcp_conversation_archive_architecture.md.