---
title: "Cloudflare Vectorize Id 64-Byte Limit: The Hash-with-Metadata-Roundtrip Pattern"
description: "Vectorize caps vector ids at 64 bytes (not chars). The fix is SHA-256 hex hashing with the original id preserved in metadata so query results round-trip back to your source-of-truth table. Includes the exact partial-failure mode and a one-shot orphan cleanup endpoint."
url: https://agent-zone.ai/knowledge/serverless/vectorize-id-64-byte-limit-hash-pattern/
section: knowledge
date: 2026-05-20
categories: ["serverless"]
tags: ["cloudflare","vectorize","embeddings","data-modeling","production-gotcha","id-strategy"]
skills: ["vectorize-index-design","embedding-pipeline-development","production-debugging"]
tools: ["vectorize","cloudflare-workers","workers-ai","typescript"]
levels: ["intermediate"]
word_count: 826
formats:
  json: https://agent-zone.ai/knowledge/serverless/vectorize-id-64-byte-limit-hash-pattern/index.json
  html: https://agent-zone.ai/knowledge/serverless/vectorize-id-64-byte-limit-hash-pattern/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Cloudflare+Vectorize+Id+64-Byte+Limit%3A+The+Hash-with-Metadata-Roundtrip+Pattern
---

# Cloudflare Vectorize Id 64-Byte Limit

Cloudflare Vectorize caps vector ids at **64 BYTES**, not 64 characters. The naive `if id.length <= 64` skip-hashing check passes Unicode through and then fails at upsert time. The right pattern is unconditional SHA-256 hex hashing with the original id stored in metadata so query results round-trip back to your source-of-truth row.

## TL;DR

- The limit is **64 bytes**, not 64 chars. Multibyte UTF-8 hits it sooner than ASCII.
- Always hash the id. Never branch on length.
- Put the original id in `metadata.id`. Resolve back at query time.
- A single oversized id fails the WHOLE batch: upserts are all-or-nothing, not partial-success.

## The error

```
VECTOR_UPSERT_ERROR (code = 40008): id too long; max is 64 bytes, got 67 bytes
```

This is a 4xx-class refusal at the upsert API.
One bad id in a `vectorize.upsert([...])` batch rejects every vector in the call; it is not partial-success-with-warnings. If you batch 100 vectors and one has a 67-byte id, all 100 fail to land.

## The wrong "fix"

```ts
// BROKEN: String.length counts UTF-16 code units, not bytes
async function vectorId(id: string): Promise<string> {
  if (id.length <= 64) return id;
  // ... hash ...
}
```

Why it breaks:

- 64 CJK chars in UTF-8 = up to 192 bytes. Passes the `length` check, fails the upsert.
- Emoji and combining marks: same story. Surrogate pairs hide additional bytes from `length`.
- Two id formats now coexist in your index. Migrate the threshold later and you create orphans and duplicates.

## The right fix: always hash

```ts
async function vectorId(id: string): Promise<string> {
  // Hash the UTF-8 bytes of the original id.
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(id),
  );
  // Hex-encode the 32-byte digest.
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  // 64 hex chars = 64 ASCII bytes, always within the limit.
}
```

Deterministic. ASCII-only output. No multibyte trap. Costs ~5µs per id on a Worker, which is irrelevant next to the embedding call.

## The metadata round-trip

Vectorize accepts a `metadata` blob per vector. Put the original id there so query results can find the source row in your D1 (or whatever) table:

```ts
const vectors = await Promise.all(
  rows.map(async (r, j) => ({
    id: await vectorId(r.id),
    values: embeddings[j],
    metadata: { id: r.id, section: r.section },
  })),
);
await env.VECTOR.upsert(vectors);
```

At query time, resolve back to the original id:

```ts
const result = await env.VECTOR.query(qEmbedding, {
  topK: 30,
  returnMetadata: true,
});
const ids = result.matches.map(
  (m) => ((m.metadata as { id?: string })?.id) ?? m.id,
);
// Now SELECT * FROM content WHERE id IN (...) by these ids.
```

The `?? m.id` fallback covers vectors written under any earlier scheme where you didn't yet store the original.
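
The last comment in the query snippet leaves the lookup itself implicit. A minimal sketch of building that parameterized `IN (...)` query might look like the following. It assumes the `content` table and `id` column named in that comment; `buildContentLookup` is a hypothetical helper, not part of the Vectorize or D1 APIs.

```ts
// Hypothetical helper: builds a parameterized lookup for the resolved ids.
// Table `content` and column `id` come from the comment in the query
// snippet; adjust both to your own schema.
function buildContentLookup(
  ids: string[],
): { sql: string; params: string[] } | null {
  // The same original id can appear more than once in the matches, so
  // collapse duplicates while keeping first-seen order.
  const unique = ids.filter((id, i) => ids.indexOf(id) === i);

  // Nothing matched: signal the caller to skip the database call.
  if (unique.length === 0) return null;

  // One "?" placeholder per id, so ids are bound rather than concatenated.
  const placeholders = unique.map(() => "?").join(", ");
  return {
    sql: `SELECT * FROM content WHERE id IN (${placeholders})`,
    params: unique,
  };
}
```

In a Worker the result would typically be run as `env.DB.prepare(sql).bind(...params).all()`. SQL does not promise rows in `IN`-list order, so re-sort the returned rows by their position in `params` if the ranking matters.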
## Dedup if you migrated

If you previously used "original slug when short, hash when long" and switched to always-hash, the same article may exist twice in the index: once under its slug, once under its hash. Dedup at query time by `metadata.id`:

```ts
const seen = new Map<string, number>();
for (const m of result.matches) {
  const id = ((m.metadata as { id?: string })?.id) ?? m.id;
  if (!seen.has(id)) seen.set(id, m.score);
}
```

Higher-scored copy wins (the iteration order from Vectorize is already by score desc).

## Cleanup orphans (one-shot)

To physically evict the old-scheme ids, call `deleteByIds` with the list of originals that were ≤64 chars under the old branch:

```ts
// Minimal batching helper used below.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

const orphanIds = (await env.DB.prepare(
  "SELECT id FROM content_search WHERE LENGTH(id) <= 64"
).all()).results.map((r) => r.id as string);
for (const batch of chunk(orphanIds, 100)) {
  await env.VECTOR.deleteByIds(batch);
}
```

Vectorize is eventually consistent. `vectorCount` from `describe()` may lag the delete by several minutes. Don't gate your deploy on the count returning to the expected value within seconds.

## Why 64 bytes

The docs don't justify it. Plausible reasons: leveldb-style key sizing in the index storage layer, parity with other CF KV-like products (Workers KV keys are 512 bytes; Vectorize is tighter), or page-alignment of the id column in the underlying store. It is not configurable. It is not negotiable. Build for it.

## Reference

Implemented in `agent-zone@69a9e89`: the admin reindex endpoint hashes ids, preserves originals in metadata, and includes the cleanup helper. Of 456 article slugs, 7 exceeded 64 chars. The first deploy used the broken `id.length <= 64` skip and silently dropped those 7. The second deploy with always-hash captured all 456.

## Common Mistakes

**Trusting `String.length`** as a byte count. It is a UTF-16 code-unit count. Use `new TextEncoder().encode(s).byteLength` if you ever need a real byte length, but for vector ids, just hash unconditionally and skip the question.
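
To make the gap between the two counts concrete, here is a small sketch comparing `String.length` with the UTF-8 byte length for three sample strings. `utf8ByteLength` is an illustrative name, not a platform API.

```ts
// Byte length as UTF-8, the unit the 64-byte limit is stated in.
function utf8ByteLength(s: string): number {
  return new TextEncoder().encode(s).byteLength;
}

const ascii = "a".repeat(64); // one byte per character
const cjk = "漢".repeat(64);  // three bytes per character in UTF-8
const emoji = "😀";           // one surrogate pair, four bytes in UTF-8

console.log(ascii.length, utf8ByteLength(ascii)); // 64 64
console.log(cjk.length, utf8ByteLength(cjk));     // 64 192
console.log(emoji.length, utf8ByteLength(emoji)); // 2 4
```

Only the ASCII string would survive an upsert as a raw id, even though the first two report the same `length`.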
**Forgetting `returnMetadata: true`** on query. Without it, `m.metadata` is `undefined` and your round-trip silently falls through to the hash. Your search results "work" but every id is a 64-char hex string instead of your slug.

**Storing the embedding model name only in the index**. If you rotate models, you need to know which vectors are from which model. Add `model: "@cf/baai/bge-base-en-v1.5"` to metadata too, alongside the id.

**Assuming partial-success on batch upsert.** One 65-byte id in a 100-vector batch rejects all 100. Validate (or hash) every id before the batch leaves the Worker. If you see code 40008 in production, this is the pattern.
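
If you prefer to validate rather than rely on hashing alone, a pre-flight check along these lines could run before each batch. This is a sketch under assumptions: `findOversizedIds` and `MAX_ID_BYTES` are hypothetical names, and the 64-byte figure is taken from the error message quoted earlier.

```ts
// Limit as stated in the upsert error message.
const MAX_ID_BYTES = 64;

// Hypothetical pre-flight check: returns every id whose UTF-8 encoding
// exceeds the limit, so the caller can fix or reject the batch up front.
function findOversizedIds(
  ids: string[],
  maxBytes: number = MAX_ID_BYTES,
): string[] {
  const encoder = new TextEncoder();
  return ids.filter((id) => encoder.encode(id).byteLength > maxBytes);
}

// 65 ASCII bytes and 22 CJK characters (66 bytes) both exceed the limit.
const batch = ["short-slug", "x".repeat(65), "漢".repeat(22)];
const bad = findOversizedIds(batch);
if (bad.length > 0) {
  console.warn(`refusing to upsert: ${bad.length} id(s) over ${MAX_ID_BYTES} bytes`);
}
```

With unconditional hashing this check should never fire, which makes it a cheap guard against a regression in the id pipeline.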