feat(knowledge): add Ollama embedding provider support #3714
teedonk wants to merge 47 commits into simstudioai:staging
Conversation
@teedonk is attempting to deploy a commit to the Sim Team on Vercel. A member of the Team first needs to authorize it.
PR Summary (Cursor Bugbot): High Risk. Introduces per-KB embedding tables and updates embedding/chunking to be Ollama-aware. Written by Cursor Bugbot for commit 075b005; this will update automatically on new commits.
Greptile Summary: This PR adds Ollama as a fully local/private embedding provider for knowledge bases, implementing per-KB dynamic pgvector tables to handle variable embedding dimensions across different Ollama models. The overall architecture is well-reasoned, auto-detecting dimensions from Ollama's /api/show endpoint. Key issues found are detailed in the inline review comments below.
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant UI as CreateBaseModal
    participant API as POST /api/knowledge
    participant OllamaAPI as Ollama /api/show
    participant DB as Database
    participant DynTable as Per-KB Table
    UI->>API: POST {name, embeddingModel: "ollama/nomic-embed-text", ollamaBaseUrl}
    API->>OllamaAPI: POST /api/show {name: "nomic-embed-text"}
    OllamaAPI-->>API: {embedding_length: 768, context_length: 2048}
    API->>DB: createKnowledgeBase(data)
    DB-->>API: newKnowledgeBase {id}
    API->>DynTable: CREATE TABLE kb_embeddings_{id} (embedding vector(768))
    DynTable-->>API: OK
    Note over UI,DynTable: Document Upload Flow
    UI->>API: POST /api/knowledge/{id}/documents
    API->>OllamaAPI: GET context length (cached)
    OllamaAPI-->>API: contextLength=2048
    API->>API: chunk document (effectiveChunkSize = 0.3 × contextLength)
    API->>OllamaAPI: POST /api/embed {model, input: chunks[]}
    OllamaAPI-->>API: embeddings[]
    API->>DynTable: DELETE WHERE document_id=X
    API->>DynTable: INSERT embeddings (batched, no transaction)
    Note over UI,DynTable: Search Flow
    UI->>API: POST /api/knowledge/search
    API->>DB: SELECT embeddingModel FROM knowledgeBase WHERE id IN (...)
    DB-->>API: [{id, embeddingModel: "ollama/..."}]
    API->>OllamaAPI: POST /api/embed {model, input: [query]}
    OllamaAPI-->>API: queryVector[]
    API->>DynTable: SELECT ... ORDER BY embedding <=> queryVector LIMIT topK
    DynTable-->>API: SearchResult[]
    API-->>UI: merged results
```
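The chunk-size heuristic in the upload flow above can be sketched directly. This is a minimal sketch assuming both values are token counts; the function name is illustrative, and the interpretation of the 0.3 factor (headroom below the model's context window) is an inference, not stated in the PR.

```typescript
// Sketch of the "effectiveChunkSize = 0.3 × contextLength" step from the
// diagram. Assumes both quantities are token counts (an assumption).
function effectiveChunkSize(contextLength: number, ratio = 0.3): number {
  // Round down so a chunk never exceeds the budget.
  return Math.floor(contextLength * ratio)
}
```

For nomic-embed-text's 2048-token context this yields 614-token chunks.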
Reviews (1). Last reviewed commit: "test(knowledge): update search tests for..."
```ts
if (embeddingProvider === 'ollama') {
  // Per-KB table: delete old chunks then bulk-insert new ones
  await deleteKBDocumentEmbeddings(knowledgeBaseId, documentId)
  await insertKBEmbeddings(knowledgeBaseId, embeddingRecords, kb[0].embeddingDimension)
```
Non-atomic delete + insert risks data loss
The Ollama path deletes all existing embeddings for the document and then inserts new ones without a wrapping transaction. If insertKBEmbeddings fails mid-way (e.g., after a partial batch insert), the old embeddings have already been deleted but the new ones are only partially written — leaving the document with fewer embeddings or none at all, with no way to recover automatically.
The OpenAI path below correctly wraps both operations in a db.transaction. Consider wrapping the Ollama path similarly. Since deleteKBDocumentEmbeddings and insertKBEmbeddings both use db.execute / db.insert, they should participate in a transaction too.
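The atomicity concern can be illustrated with an in-memory stand-in. This is a sketch of the snapshot-and-rollback semantics a database transaction provides, not the actual Drizzle `db.transaction` API; all names here are hypothetical.

```typescript
// In-memory stand-in showing why delete + insert must be atomic.
type Row = { documentId: string; chunk: string }

class Store {
  rows: Row[] = []
  // Snapshot-and-restore gives all-or-nothing semantics, like a DB transaction.
  transaction(fn: (s: Store) => void): void {
    const snapshot = [...this.rows]
    try {
      fn(this)
    } catch (err) {
      this.rows = snapshot // roll back on failure
      throw err
    }
  }
  deleteByDoc(id: string): void {
    this.rows = this.rows.filter((r) => r.documentId !== id)
  }
  insert(batch: Row[]): void {
    this.rows.push(...batch)
  }
}

const store = new Store()
store.insert([{ documentId: 'doc1', chunk: 'old' }])
try {
  store.transaction((s) => {
    s.deleteByDoc('doc1')
    s.insert([{ documentId: 'doc1', chunk: 'new-1' }])
    throw new Error('embed batch 2 failed') // simulated mid-insert failure
  })
} catch {
  // failure surfaced to caller; store state rolled back
}
// With the transaction, the old embedding survives the failed re-embed.
```

Without the `transaction` wrapper, the same failure would leave `doc1` with one partial chunk and the old data gone, which is exactly the data-loss window described above.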
```ts
const newKnowledgeBase = await createKnowledgeBase(createData, requestId)

if (provider === 'ollama') {
  try {
    await createKBEmbeddingTable(newKnowledgeBase.id, effectiveDimension)
  } catch (tableError) {
    logger.error(
      `[${requestId}] Failed to create embedding table for KB ${newKnowledgeBase.id}`,
      tableError
    )
    throw tableError
  }
```
Orphaned KB row when embedding table creation fails
createKnowledgeBase persists the KB row to the database at line 136 before createKBEmbeddingTable is called. If createKBEmbeddingTable throws (e.g. the pgvector extension isn't enabled or a naming collision occurs), the error is re-thrown and the request returns 500 — but the KB record is left behind in the database without a corresponding embedding table. Any subsequent document upload to this KB will fail with a table-not-found error and the user has no way to fix it from the UI.
A safe fix is to delete the orphaned KB row in the catch block before re-throwing:
```ts
} catch (tableError) {
  logger.error(...)
  // Clean up orphaned KB row
  await deleteKnowledgeBase(newKnowledgeBase.id, requestId).catch(() => {})
  throw tableError
}
```

```ts
// Use drizzle's insert API with dynamic table schema
await db.insert(dynamicTable).values(batch)
} catch (err: unknown) {
  const pg = err as { code?: string; detail?: string; message?: string; cause?: unknown }
  logger.error(`insertKBEmbeddings failed for table ${table}`, {
    code: pg.code,
    detail: pg.detail,
    message: pg.message,
    cause: pg.cause,
```
Vector literal injected via `sql.raw` (fragile and risky pattern)

queryVector is constructed by the caller as JSON.stringify(emb) where emb is a number[], so in practice it contains no single quotes. However, interpolating it directly into a sql.raw string template is fragile: a future refactor that passes a different value could silently introduce a SQL injection, or at minimum a syntax error.

Drizzle's parameterized approach is safer and just as performant:

```ts
// Instead of:
const vecLiteral = sql.raw(`'${queryVector}'::vector`)
// Prefer:
const vecLiteral = sql`${queryVector}::vector`
```

Using a Drizzle placeholder ensures the value is always parameterized correctly, regardless of its content.
```ts
await db
  .update(document)
  .set({
    chunkCount: processed.metadata.chunkCount,
    tokenCount: processed.metadata.tokenCount,
    characterCount: processed.metadata.characterCount,
    processingStatus: 'completed',
    processingCompletedAt: now,
    processingError: null,
  })
  .where(eq(document.id, documentId))
```
Document status update moved outside the transaction (regression)
In the original code, both the embedding inserts and the processingStatus: 'completed' update were inside a single db.transaction, so they were atomic. This PR extracts the status update to a separate await db.update(document)... call that runs after the transaction.
For the OpenAI path, if the transaction (embedding inserts) succeeds but the subsequent status update fails, the document stays permanently in 'processing' state — even though its embeddings are fully in place. There's no retry that would recover this (the catch block at line 656 sets status to 'failed', which is also incorrect since the embeddings are already there).
This is a regression for the OpenAI path that was introduced by the refactor to support the Ollama code path. The processingStatus update should remain inside the db.transaction for the OpenAI path.
```tsx
onChange={(e) => {
  const preset = OLLAMA_PRESET_MODELS.find(
    (m) => m.value === e.target.value
  )
  setOllamaPreset(e.target.value)
  if (preset && preset.dimension > 0) {
    setOllamaDimension(preset.dimension)
  }
}}
className='rounded-[4px] border border-[var(--border-1)] bg-[var(--surface-1)] px-[10px] py-[6px] text-[12px] text-[var(--text-primary)]'
>
{OLLAMA_PRESET_MODELS.map((m) => (
  <option key={m.value} value={m.value}>
    {m.label}
  </option>
))}
</select>
</div>
```
Native `<select>` instead of the component library's `Select`
The rest of the form uses components from the existing @/components/ui library (e.g. Input, Label, Button). This <select> is a raw HTML element with hand-written Tailwind classes for the border/background/text, which won't automatically respect dark mode tokens or any future design-system updates.
Consider replacing it with the project's Select / SelectContent / SelectItem / SelectTrigger pattern used elsewhere in the codebase.
```ts
  baseUrl = 'http://localhost:11434'
): Promise<OllamaModelInfo> {
  const cacheKey = `${modelName}@${baseUrl}`
  const cached = ollamaModelInfoCache.get(cacheKey)
```
Module-level cache ineffective in serverless/edge deployments
ollamaModelInfoCache is a module-level Map. In serverless environments (Next.js API routes deployed to Vercel, AWS Lambda, etc.) each invocation may spawn a fresh process where the module is re-initialized, so the cache is effectively a no-op. Additionally, multiple concurrent invocations will bypass the cache and issue redundant /api/show calls.
If the project already uses a shared cache layer (e.g. Redis via lib/core/storage), consider using it here. Otherwise, this is worth documenting as a known limitation so future maintainers aren't surprised.
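For single-process deployments the module-level pattern still works if entries also expire. Below is a minimal TTL-cache sketch with illustrative names (it is not the PR's `ollamaModelInfoCache`), and it carries the same per-process limitation noted above.

```typescript
// Minimal TTL-memoization sketch: cache a fetcher's result per key for ttlMs.
// Module-level state, so serverless cold starts still reset it (see comment above).
type Entry<T> = { value: T; expires: number }

function memoizeWithTtl<T>(fetcher: (key: string) => T, ttlMs: number) {
  const cache = new Map<string, Entry<T>>()
  return (key: string): T => {
    const hit = cache.get(key)
    if (hit && hit.expires > Date.now()) return hit.value
    const value = fetcher(key)
    cache.set(key, { value, expires: Date.now() + ttlMs })
    return value
  }
}

// Hypothetical usage: count how often the underlying "fetch" actually runs.
let calls = 0
const getInfo = memoizeWithTtl((model) => {
  calls++
  return { model, dimension: 768 }
}, 60_000)
getInfo('nomic-embed-text')
getInfo('nomic-embed-text') // second call served from cache
```

A shared store (e.g. Redis) would replace the `Map` to survive process recycling.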
```ts
)
results = [...openaiResults, ...ollamaResults]
  .sort((a, b) => a.distance - b.distance)
  .slice(0, validatedData.topK)
```
Cross-provider normalization skips single-result sets causing unfair ranking
Low Severity
normalizeScores returns single-item arrays unchanged (if (items.length <= 1) return items). In cross-provider search, this means if one provider returns only 1 result, its raw distance (e.g., 0.5) is sorted alongside the other provider's normalized 0–1 distances. The raw score from one embedding space isn't comparable to normalized scores from another, producing incorrect ranking when merged.
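One way to handle the degenerate case is to map single-item (or all-equal) sets to the best normalized score instead of passing raw distances through. The sketch below is illustrative only; it is not the PR's actual `normalizeScores`, and treating a lone result as "best" is itself a design choice worth debating.

```typescript
// Per-provider min-max normalization before merging result sets.
type SearchHit = { id: string; distance: number }

function normalizeDistances(items: SearchHit[]): SearchHit[] {
  if (items.length === 0) return items
  const ds = items.map((r) => r.distance)
  const min = Math.min(...ds)
  const max = Math.max(...ds)
  const span = max - min
  return items.map((r) => ({
    ...r,
    // Degenerate sets (one item, or all-equal distances) map to 0 rather
    // than keeping raw, cross-embedding-space-incomparable values.
    distance: span === 0 ? 0 : (r.distance - min) / span,
  }))
}

const openai = normalizeDistances([
  { id: 'a', distance: 0.1 },
  { id: 'b', distance: 0.3 },
])
const ollama = normalizeDistances([{ id: 'c', distance: 0.5 }])
const merged = [...openai, ...ollama].sort((x, y) => x.distance - y.distance)
// 'c' now normalizes to 0 and competes fairly, instead of its raw 0.5
// losing to both normalized OpenAI scores.
```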
```ts
// Store ollamaBaseUrl inside chunkingConfig JSONB to avoid a schema migration
chunkingConfig: data.ollamaBaseUrl
  ? { ...data.chunkingConfig, ollamaBaseUrl: data.ollamaBaseUrl }
  : data.chunkingConfig,
```
Returned chunkingConfig omits stored ollamaBaseUrl field
Low Severity
createKnowledgeBase stores ollamaBaseUrl inside chunkingConfig JSONB in the database (line 104–106), but the returned object uses the original data.chunkingConfig without ollamaBaseUrl (line 123). This means the API response from KB creation reflects a different chunkingConfig than what's persisted, which could cause inconsistencies if consumers cache or rely on the returned value.
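A drift-proof approach is to build the merged config once and use the same object for both the DB write and the API response. The helper name below is hypothetical; the real `createKnowledgeBase` is structured differently.

```typescript
// Compute the merged chunkingConfig once so the persisted value and the
// returned value cannot diverge.
type ChunkingConfig = Record<string, unknown>

function buildChunkingConfig(
  base: ChunkingConfig,
  ollamaBaseUrl?: string
): ChunkingConfig {
  return ollamaBaseUrl ? { ...base, ollamaBaseUrl } : base
}

const persisted = buildChunkingConfig(
  { chunkSize: 512 },
  'http://localhost:11434'
)
// Write `persisted` to the JSONB column AND return it from the API,
// instead of echoing the original data.chunkingConfig.
```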
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
```ts
let openaiQueryVector: string | null = null
if (hasQuery && openaiKbIds.length > 0) {
  const emb = await generateSearchEmbedding(validatedData.query!, undefined, workspaceId)
  openaiQueryVector = JSON.stringify(emb)
```
OpenAI search embedding ignores KB's configured model
High Severity
The search route generates a single OpenAI query embedding using the hardcoded default text-embedding-3-small, ignoring each KB's actual embeddingModel. This diff newly allows text-embedding-3-large in the creation schema, but search always calls generateSearchEmbedding(query, undefined, workspaceId) where undefined defaults to text-embedding-3-small. A KB created with text-embedding-3-large (3072 dimensions) would fail at search time because the query vector (1536 dimensions) can't be compared against stored vectors via pgvector's <=> operator. The Ollama path correctly generates per-model embeddings, but the OpenAI path does not.
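A cheap guard is to compare the query vector's dimension against the KB model's expected dimension before handing it to pgvector, failing fast with a clear error instead of a cryptic operator mismatch. The function and variable names below are hypothetical; the dimensions are OpenAI's published sizes for these models.

```typescript
// Published dimensions for OpenAI embedding models referenced in this PR.
const OPENAI_EMBEDDING_DIMENSIONS: Record<string, number> = {
  'text-embedding-3-small': 1536,
  'text-embedding-3-large': 3072,
}

// Throw before querying if the query embedding can't match the stored vectors.
function assertQueryMatchesKb(kbModel: string, queryDim: number): void {
  const expected = OPENAI_EMBEDDING_DIMENSIONS[kbModel]
  if (expected !== undefined && expected !== queryDim) {
    throw new Error(
      `query embedding has ${queryDim} dims but KB model ${kbModel} stores ${expected}`
    )
  }
}
```

The real fix is generating the query embedding with each KB's configured model, but a guard like this makes the failure mode diagnosable in the meantime.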
Additional Locations (1)
```diff
  tokenCount: Math.ceil(chunk.text.length / 4),
  embedding: embeddings[chunkIndex] || null,
- embeddingModel: 'text-embedding-3-small',
+ embeddingModel: embeddingModelName,
```
Stored token count uses wrong ratio for Ollama
Low Severity
The tokenCount in embedding records is always calculated as Math.ceil(chunk.text.length / 4), using the OpenAI token estimation ratio. For Ollama models, the TextChunker uses a ratio of 3 characters per token, so the stored tokenCount is inconsistent with how the chunker actually estimated tokens. This affects any downstream analytics or display that relies on per-chunk token counts.
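A provider-aware estimate would follow the chunker's own ratio. This sketch uses the ratios cited in this comment (~4 chars/token for OpenAI, ~3 for Ollama); they are rough heuristics, not tokenizer-accurate counts, and the function name is illustrative.

```typescript
// Character-per-token estimate matching the provider-specific ratio the
// TextChunker uses, per the review note above.
function estimateTokens(text: string, provider: 'openai' | 'ollama'): number {
  const charsPerToken = provider === 'ollama' ? 3 : 4
  return Math.ceil(text.length / charsPerToken)
}

// A 12-character chunk: 3 estimated tokens for OpenAI, 4 for Ollama.
```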


Summary
- Adds Ollama as a local embedding provider for knowledge bases, auto-detecting model dimensions and context length via Ollama's `/api/show` endpoint
- Fixes #3037
Type of Change
Testing
Tested with Ollama (nomic-embed-text) on local setup:
Checklist
Video.Project.1.mp4