The Bitfab TypeScript SDK captures your AI function calls to automatically generate evaluations. Re-run your prompts with different models, parameters, and inputs to iterate faster.
Copy this prompt into your coding agent (tested with Cursor and Claude Code using Sonnet 4.5):
Modify existing TypeScript code to add Bitfab tracing.

Do NOT browse or web search. Use ONLY the API described below.

Bitfab TypeScript SDK (authoritative excerpt):
- Install: `npm install bitfab` or `pnpm add bitfab`
- Init:
    import { Bitfab } from "bitfab"
    const bitfab = new Bitfab({ apiKey: process.env.BITFAB_API_KEY })
- Instrumentation (ONLY allowed form - use getFunction):
    // Declare trace function key once
    const myService = bitfab.getFunction("<trace_function_key>")
    // Wrap functions with withSpan
    const tracedFn = myService.withSpan(originalFunction)
    // Or with options:
    const tracedFn = myService.withSpan({ name: "DisplayName", type: "function" }, originalFunction)
    // Span types: "llm", "agent", "function", "guardrail", "handoff", "custom"
- DO NOT modify the original function.
- DO NOT extract helper methods.

Task:
1) Ensure bitfab is installed and initialization exists.
2) Read the codebase and identify ALL AI workflows (LLM calls, agent runs, AI-driven decisions).
3) Present me with a numbered list of workflows you found. For each, describe:
   - What it does
   - Why it's worth instrumenting: what visibility tracing gives you into each step
4) After I choose which workflow(s) to instrument:
   - Create a function wrapper with `bitfab.getFunction("<trace_function_key>")`
   - Wrap the functions with `myService.withSpan(originalFunction)`
   - Instrument intermediate steps (not just the final output) so each trace has enough context to diagnose issues
   - Replace usages of the original functions with the traced versions
5) Do not change function signature, behavior, or return value. Minimal diff.

Output:
- First: your numbered list of workflows with why each is worth instrumenting
- After my selection: minimal diffs for dependencies, initialization, and the function wrapping
    new Bitfab({ apiKey: string })

    // Disable tracing (functions still execute, but no spans are sent)
    new Bitfab({ apiKey: string, enabled: false })
Missing API key doesn’t crash. If the API key is missing, empty, or whitespace-only, the SDK automatically disables tracing and logs a warning. All wrapped functions still execute normally — no spans are sent, no errors are thrown. You don’t need any conditional logic around the API key.
For projects with instrumented functions spread across multiple files, create a dedicated file that initializes Bitfab and exports the function. Import it wherever you need to instrument.
    // lib/bitfab.ts: single source of truth
    import { Bitfab } from "bitfab"

    // Exported so replay scripts can reuse this client instead of constructing another
    export const bitfab = new Bitfab({ apiKey: process.env.BITFAB_API_KEY })
    export const orderService = bitfab.getFunction("order-processing")
    // services/processOrder.ts
    import { orderService } from "../lib/bitfab"

    async function processOrder(orderId: string) {
      return { orderId }
    }

    export const tracedProcessOrder = orderService.withSpan(processOrder)
    // services/validateOrder.ts
    import { orderService } from "../lib/bitfab"

    async function validateOrder(orderId: string) {
      return { valid: true }
    }

    export const tracedValidateOrder = orderService.withSpan(validateOrder)
Spans from different files are automatically linked as parent-child when one wrapped function calls another.
When wrapping a function you didn’t define (e.g. an SDK or library call), pass it directly to withSpan and call the result immediately. This ensures the arguments are captured as span input.
    // ✅ GOOD: pass function directly, arguments are captured as span input
    const result = await orderService.withSpan(
      { name: "ProcessOrder", type: "function" },
      processOrder,
    )(orderId)

    // ❌ BAD: anonymous wrapper loses all input capture (span has no input)
    const result = await orderService.withSpan(
      { name: "ProcessOrder", type: "function" },
      async () => processOrder(orderId),
    )()
Never wrap functions in an anonymous function like async () => fn(args). The SDK captures the wrapper function’s arguments as span input — an anonymous wrapper has no arguments, so the span records nothing.
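To make the reason concrete, here is a small self-contained sketch. `withSpanSketch` is a hypothetical stand-in written for this example, not the SDK's `withSpan`; it only mimics the one behavior described above, namely recording the arguments the wrapper is called with.

```typescript
// Hypothetical stand-in for withSpan: it records the arguments that the
// returned wrapper is called with, as the text above describes.
type RecordedSpan = { name: string; input: unknown[] }
const spans: RecordedSpan[] = []

function withSpanSketch<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => R,
): (...args: A) => R {
  return (...args: A) => {
    spans.push({ name, input: args })
    return fn(...args)
  }
}

function processOrder(orderId: string) {
  return { orderId }
}

// Direct: the wrapper itself receives orderId, so the span sees it.
withSpanSketch("direct", processOrder)("order-42")

// Anonymous closure: the wrapper is called with no arguments,
// so there is nothing to record as input.
withSpanSketch("closure", () => processOrder("order-42"))()

console.log(JSON.stringify(spans))
// [{"name":"direct","input":["order-42"]},{"name":"closure","input":[]}]
```

Both calls run `processOrder` with the same ID; only the first leaves a usable input on the span.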
Use getCurrentSpan() to get a handle to the active span, then call .addContext() to attach contextual key-value pairs from inside a traced function — useful for runtime values like request IDs, computed scores, or dynamic context:
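A minimal sketch of that pattern follows. So that it runs on its own, it defines a local stand-in for `getCurrentSpan`; in a real project you would import it from "bitfab" instead. The `SpanHandle` type, the `scoreDocument` function, and its scoring rule are assumptions made for illustration.

```typescript
// Stand-in so the sketch is self-contained. In a real project, replace this
// block with: import { getCurrentSpan } from "bitfab"
type SpanHandle = { addContext(entries: Record<string, unknown>): void }
const recordedContext: Record<string, unknown> = {}
const fakeSpan: SpanHandle = {
  addContext: (entries) => {
    Object.assign(recordedContext, entries)
  },
}
const getCurrentSpan = (): SpanHandle | undefined => fakeSpan

// Code as it might look inside a traced function.
function scoreDocument(requestId: string, text: string): number {
  const score = Math.min(1, text.length / 100) // hypothetical scoring rule
  // Optional chaining keeps this safe when no span is active.
  getCurrentSpan()?.addContext({ request_id: requestId, score })
  return score
}

scoreDocument("req-123", "a short document")
console.log(recordedContext)
```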
Access the current trace ID from within a span using getCurrentSpan().traceId. This is useful for capturing trace IDs to use with replay or for logging:
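As a sketch, the snippet below collects trace IDs for requests that look suspicious so they can later be passed to replay through `traceIds`. It uses a local stand-in for `getCurrentSpan` so it runs without the SDK; the `traceId` type (string), the possibility of `undefined` outside a span, and the names `handleRequest` and `suspectTraceIds` are assumptions for this example.

```typescript
// Stand-in so the sketch is self-contained. In a real project, replace this
// block with: import { getCurrentSpan } from "bitfab"
type SpanHandle = { traceId: string }
let activeSpan: SpanHandle | undefined = { traceId: "trace-abc" } // made-up ID
const getCurrentSpan = (): SpanHandle | undefined => activeSpan

const suspectTraceIds: string[] = []

function handleRequest(payload: string): string {
  const traceId = getCurrentSpan()?.traceId
  if (payload.length === 0 && traceId) {
    // Remember the trace so it can be replayed later via the traceIds option.
    suspectTraceIds.push(traceId)
  }
  console.log(`[trace ${traceId ?? "none"}] handled ${payload.length} chars`)
  return traceId ?? "none"
}

handleRequest("") // logged with trace-abc and remembered
activeSpan = undefined // simulate running outside a span
handleRequest("hello") // logged with "none"
```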
Use getCurrentSpan()?.setPrompt() to set the prompt string on the current span. The prompt is stored in span_data.prompt, which is useful for capturing the exact prompt text sent to an LLM:
    import { getCurrentSpan } from "bitfab"

    // `bitfab` is your initialized client; `llm` stands in for your LLM client.
    async function classifyText(text: string) {
      const prompt = `Classify the following text: ${text}`
      getCurrentSpan()?.setPrompt(prompt)
      const result = await llm.complete(prompt)
      return result
    }

    const traced = bitfab.withSpan("classification", { type: "llm" }, classifyText)
The last setPrompt call wins — it overwrites any previously set prompt on the span. Calling setPrompt outside a span context is a no-op (it never crashes).
Use getCurrentTrace() to set context that applies to the entire trace (all spans within a single execution). This is useful for grouping traces by session or attaching trace-level metadata:
    import { getCurrentTrace } from "bitfab"

    const traced = bitfab.withSpan("order-processing", { type: "function" }, async () => {
      const trace = getCurrentTrace()

      // Set session ID (stored as database column, filterable in dashboard)
      trace?.setSessionId("session-123")

      // Set trace metadata (stored in raw trace data)
      trace?.setMetadata({ region: "us-west-2", environment: "production" })

      // Add context entries (stored as key-value pairs, accumulates across calls)
      trace?.addContext({ workflow: "checkout-flow", batch_id: "batch-2024-01" })

      return { status: "completed" }
    })
setSessionId(id) — Groups traces by user session. Stored as a database column for efficient filtering.
setMetadata(obj) — Arbitrary key-value metadata on the trace. Merges with existing metadata.
addContext(obj) — Key-value context entries. Accumulates across multiple calls.
Replay historical traces through an updated function version to compare outputs:
    const pipeline = bitfab.getFunction("my-function")

    const updatedFn = pipeline.withSpan(
      { name: "UpdatedPipeline", type: "function" },
      async (input: string) => {
        // New implementation to test
        return { result: input.toUpperCase() }
      },
    )

    // Replay all traces (up to limit)
    const result = await bitfab.replay("my-function", updatedFn, {
      limit: 10,
      maxConcurrency: 5, // Default: 10
    })

    // Replay specific traces by ID
    const result2 = await bitfab.replay("my-function", updatedFn, {
      traceIds: ["trace-id-1", "trace-id-2"],
    })

    // Result structure
    console.log(result.testRunId) // Test run identifier
    console.log(result.testRunUrl) // Dashboard URL
    for (const item of result.items) {
      console.log(item.input) // Original input
      console.log(item.result) // New output
      console.log(item.originalOutput) // Original output
      console.log(item.error) // Error if any
      console.log(item.durationMs) // Original trace duration in ms (or null)
      console.log(item.tokens) // { input, output, cached, total } or null
      console.log(item.model) // Original model name, or null
    }
Per-item metrics (durationMs, tokens, model) come from the historical trace that fed this replay item. Use them to reason about latency and cost: average tokens.total tells you how much the old runs cost, and durationMs tells you how long they took. Each field is null when the underlying trace didn’t capture it.

Options:
limit — Maximum number of traces to replay (default: all)
traceIds — Specific trace IDs to replay
maxConcurrency — Number of traces to replay in parallel (default: 10)
codeChangeDescription — Optional rationale for the code change being tested in this replay (stored on the experiment)
codeChangeFiles — Optional list of edited files, each as { path, before, after } (use "" for newly created or deleted files)
mock — Mock strategy for child spans during replay: "none" (default, run real code), "all" (return historical output for every child), or "marked" (only return historical output for child spans declared with mockOnReplay: true). See Mocking child spans during replay below.
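Returning to the per-item metrics described before the options list: the sketch below averages `durationMs` and `tokens.total` while skipping items whose fields are null. The `ReplayItemMetrics` type is inferred from the fields named on this page and is not an official SDK type; the helper names and sample values are made up for illustration.

```typescript
// Shape assumed from the fields listed on this page; not an official SDK type.
type ReplayItemMetrics = {
  durationMs: number | null
  tokens: { input: number; output: number; cached: number; total: number } | null
  model: string | null
}

// Average over the values that were actually captured; null when none were.
function average(values: Array<number | null>): number | null {
  const present = values.filter((v): v is number => v !== null)
  if (present.length === 0) return null
  return present.reduce((sum, v) => sum + v, 0) / present.length
}

function summarizeMetrics(items: ReplayItemMetrics[]) {
  return {
    avgDurationMs: average(items.map((i) => i.durationMs)),
    avgTotalTokens: average(items.map((i) => i.tokens?.total ?? null)),
  }
}

// Made-up sample items.
const sampleItems: ReplayItemMetrics[] = [
  { durationMs: 1200, tokens: { input: 300, output: 100, cached: 0, total: 400 }, model: "model-a" },
  { durationMs: null, tokens: null, model: null },
  { durationMs: 800, tokens: { input: 150, output: 50, cached: 0, total: 200 }, model: "model-a" },
]

console.log(summarizeMetrics(sampleItems))
// { avgDurationMs: 1000, avgTotalTokens: 300 }
```

Excluding null entries rather than counting them as zero keeps traces that never captured a metric from dragging the averages down.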
When iterating on a root function, child spans sometimes fail in your local environment for reasons unrelated to the code under test: a paid API key is missing, an external service is flaky, or a production-only DB row isn’t seeded locally. The mock option lets the child return its recorded output so the root function can still run.

Three strategies on replay():
"none" (default): every child span runs real code. Use when your local environment can faithfully reproduce the trace.
"all": every descendant withSpan returns its historical output. The root function still runs real, but every child is short-circuited. Useful for a quick sanity-check against recorded data; not the recommended iteration strategy because changes to descendants won’t actually execute.
"marked": only descendants declared with mockOnReplay: true are short-circuited; everything else runs real. This is the iteration-friendly mode.
Per-span opt-in via SpanOptions.mockOnReplay:
    const fetchArticle = bitfab.withSpan(
      "fetch-article-from-db",
      { mockOnReplay: true },
      async (id: string) => db.articles.findById(id),
    )

    const summarize = bitfab.withSpan(
      "summarize-article",
      async (article: Article) => {
        /* real summarization, no flag */
      },
    )

    const processArticle = bitfab.withSpan(
      "process-article",
      async (id: string) => summarize(await fetchArticle(id)),
    )

    // During replay, fetch-article-from-db returns its recorded output;
    // summarize-article runs real so you can iterate on it.
    const result = await bitfab.replay("process-article", processArticle, {
      limit: 10,
      mock: "marked",
    })
mockOnReplay is a per-span tag at definition time: it has no effect outside replay, and it’s only read under mock: "marked". The root function always runs real code; only descendants can be mocked.

When no historical span matches a child call (e.g. the recorded trace didn’t reach that branch), execution falls through to the real function, never silent omission.
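The rules for child spans can be condensed into a small decision function. This is an illustrative restatement of the behavior described above, not the SDK's implementation; `ChildCall` and `usesRecordedOutput` are hypothetical names introduced for the sketch.

```typescript
type MockStrategy = "none" | "all" | "marked"

type ChildCall = {
  mockOnReplay: boolean // flag set in the child span's options
  hasRecordedOutput: boolean // whether the historical trace has a matching span
}

// True when the child returns its recorded output instead of running real code.
function usesRecordedOutput(strategy: MockStrategy, child: ChildCall): boolean {
  if (!child.hasRecordedOutput) return false // no match: fall through to real code
  if (strategy === "all") return true
  if (strategy === "marked") return child.mockOnReplay
  return false // "none": every child runs real code
}

const strategies: MockStrategy[] = ["none", "all", "marked"]
for (const strategy of strategies) {
  for (const mockOnReplay of [false, true]) {
    const mocked = usesRecordedOutput(strategy, { mockOnReplay, hasRecordedOutput: true })
    console.log(`${strategy} / mockOnReplay=${mockOnReplay} -> ${mocked ? "recorded" : "real"}`)
  }
}
```

The root function is not modeled here because, per the text above, it always runs real code regardless of strategy.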
Each replay creates an experiment (test run). When you’re iterating on a function and replaying after every edit, attach the change so the dashboard can show exactly what was edited alongside the results. The agent reads each file before editing, edits, then reads it again — the two strings go straight into codeChangeFiles. There’s no diff format to construct.
    import { readFileSync } from "node:fs"

    const before = readFileSync("src/foo.ts", "utf8")
    // ...edit src/foo.ts...
    const after = readFileSync("src/foo.ts", "utf8")

    const result = await bitfab.replay("my-function", updatedFn, {
      codeChangeDescription: "fix off-by-one in retry logic",
      codeChangeFiles: [{ path: "src/foo.ts", before, after }],
    })
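The options list says to use "" for newly created or deleted files. One way to get that for free is to snapshot files through a helper that returns "" when the file is absent. `snapshot` and `recordChange` below are hypothetical helpers written for this sketch, not part of the SDK, and the demo runs against a throwaway temporary directory.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs"
import { tmpdir } from "node:os"
import { join } from "node:path"

type CodeChangeFile = { path: string; before: string; after: string }

// Contents of the file, or "" when it does not exist (created or deleted).
function snapshot(path: string): string {
  return existsSync(path) ? readFileSync(path, "utf8") : ""
}

// Capture `before`, apply the edit, then capture `after`.
function recordChange(path: string, edit: () => void): CodeChangeFile {
  const before = snapshot(path)
  edit()
  const after = snapshot(path)
  return { path, before, after }
}

// Demo: the file does not exist yet, so the first entry has before === "".
const demoDir = mkdtempSync(join(tmpdir(), "replay-demo-"))
const demoFile = join(demoDir, "foo.ts")

const created = recordChange(demoFile, () => writeFileSync(demoFile, "export const retries = 3\n"))
const edited = recordChange(demoFile, () => writeFileSync(demoFile, "export const retries = 4\n"))

console.log(JSON.stringify([created, edited], null, 2))
```

The resulting array has the `{ path, before, after }` shape that `codeChangeFiles` expects, so it could be passed to replay as-is.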
Both options are optional and independent: you can pass just codeChangeDescription for a quick rationale-only annotation, or just codeChangeFiles to record the literal edits.

Notes:
Use a single Bitfab client across instrumentation and replay. If your instrumented module constructs new Bitfab() at import and your replay script constructs another, they do not share registered trace functions — import the client from the instrumented module (or a shared singleton) rather than constructing a new one in the replay script.
Replay results are typically consumed by automation (CI logs, code reviewers, and coding agents reading stdout). Emit the full ReplayResult as a single JSON block so a consumer can JSON.parse it and reason about every field, including the new per-item durationMs, tokens, and model. Never print only lengths, counts, hashes, or truncated previews, and never replace the JSON block with ad-hoc per-field log lines.

Recommended script tail (TypeScript):
    const result = await bitfab.replay("my-function", updatedFn, { limit })

    // Optional: human-readable summary first.
    console.log(`Test run: ${result.testRunUrl}`)
    console.log(`Items: ${result.items.length}`)

    // Then: full structured dump, ready for JSON.parse.
    console.log(JSON.stringify(result, null, 2))
The dumped object includes every item’s input, result, originalOutput, error, durationMs, tokens, and model, plus testRunId and testRunUrl. Writing the same JSON to scripts/replay-result.json in parallel is optional but useful for later analysis.

Per-item errors are part of the contract. If the wrapped function throws on a given trace, bitfab.replay catches it, sets item.error, leaves item.result undefined, and continues. Treat items with item.error set as unreplayable, not as failing outputs: compute pass/fail only over items where it’s unset. This matters most for DB reads/writes: a stale FK, missing record, or rejected write is an infra failure, not a regression.

Don’t swallow per-item errors in the script. A custom try/catch that returns a placeholder turns infra failures into fake successes. Let the SDK record them. The only allowed top-level catch is a fatal handler around main() that exits non-zero, so callers can tell a whole-replay crash from a clean run with some unreplayable items.

Environment. Replay executes in the app’s own process: the instrumented function is imported as a library, and its DB clients, env vars, config loaders, and model IDs resolve from whatever environment the replay script is run under. The script must bootstrap the same environment the app uses (e.g. import "dotenv/config" at the top, or run via pnpm with-env tsx scripts/replay.ts). Do not mock these; they’re the same dependencies the app resolves in production. For replay to see the same DB rows the trace was captured against, point the script at the trace’s source environment (the environment field on the trace: production, staging, or development).

Input serialization caveat. Replay deserializes historical span inputs and passes them back to your function. This works for strings, numbers, and plain objects. If your span wraps a function that takes hydrated domain objects (ORM models, class instances, DB records), they won’t round-trip through serialization. Move the span to where inputs are IDs or plain data and let the function fetch objects internally, or reshape arguments in the wrapper.
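The round-trip problem is easy to reproduce with plain JSON. The sketch below assumes JSON-style serialization as a stand-in, since this page does not specify the SDK's exact serializer; the `Order` class and `isRecentById` function are hypothetical.

```typescript
class Order {
  id: string
  placedAt: Date

  constructor(id: string, placedAt: Date) {
    this.id = id
    this.placedAt = placedAt
  }

  isRecent(now: Date): boolean {
    return now.getTime() - this.placedAt.getTime() < 24 * 60 * 60 * 1000
  }
}

const originalOrder = new Order("order-42", new Date("2024-01-01T00:00:00.000Z"))
const roundTripped = JSON.parse(JSON.stringify(originalOrder))

console.log(roundTripped instanceof Order) // false: the prototype is gone
console.log(typeof roundTripped.isRecent) // "undefined": methods are not serialized
console.log(typeof roundTripped.placedAt) // "string": the Date became an ISO string

// Replay-friendly alternative: take the ID as input and hydrate inside.
const orderStore = new Map<string, Order>([[originalOrder.id, originalOrder]])

function isRecentById(orderId: string, now: Date): boolean {
  const order = orderStore.get(orderId) // stands in for a DB lookup
  return order ? order.isRecent(now) : false
}

console.log(isRecentById("order-42", new Date("2024-01-01T12:00:00.000Z"))) // true
```

Wrapping `isRecentById` rather than a function that takes an `Order` means the recorded input is a plain string, which survives serialization unchanged.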
Create a standalone script to regression-test your trace functions against production data with one command. The script maps pipeline names to their replay functions, accepts CLI flags, and prints a side-by-side comparison with delta summaries.
    /**
     * Replay production traces through instrumented functions.
     *
     * Uses bitfab.replay() to fetch real traces and re-run them through
     * the current code, creating a test run for side-by-side comparison.
     *
     * Usage:
     *   npx tsx scripts/replay.ts <pipeline>
     *   npx tsx scripts/replay.ts <pipeline> --limit 20
     *   npx tsx scripts/replay.ts <pipeline> --trace-ids id1,id2
     */
    import "dotenv/config"
    import { bitfab } from "../lib/bitfab"
    import { extractMemories } from "../services/extraction"
    import { searchDocuments } from "../services/search"

    const FUNCTIONS = {
      extraction: "my-extraction-pipeline",
      search: "my-search-pipeline",
    } as const

    type Pipeline = keyof typeof FUNCTIONS

    const pipeline = process.argv[2] as Pipeline | undefined
    const args = process.argv.slice(3)

    if (!pipeline || !FUNCTIONS[pipeline]) {
      console.error(
        `Usage: npx tsx scripts/replay.ts <${Object.keys(FUNCTIONS).join("|")}> [--limit N] [--trace-ids id1,id2]`,
      )
      process.exit(1)
    }

    let limit = 10
    let traceIds: string[] | undefined

    for (let i = 0; i < args.length; i++) {
      if (args[i] === "--limit" && args[i + 1]) {
        limit = Number.parseInt(args[i + 1], 10)
        i++
      } else if (args[i] === "--trace-ids" && args[i + 1]) {
        traceIds = args[i + 1].split(",").map((id) => id.trim())
        i++
      }
    }

    // Each pipeline gets its own replay function. Replay deserializes
    // historical inputs, so if the function signature doesn't match the
    // raw input shape, reshape the arguments in a thin wrapper here.
    async function replayExtraction() {
      const fn = async (conversation: string, existingItems: unknown[]) => {
        return extractMemories(conversation, existingItems)
      }
      return bitfab.replay(FUNCTIONS.extraction, fn, {
        limit,
        ...(traceIds ? { traceIds } : {}),
      })
    }

    async function replaySearch() {
      const fn = async (query: string, opts: Record<string, unknown>) => {
        return searchDocuments(query, {
          userId: opts.userId as string,
          limit: (opts.limit as number) ?? 10,
        })
      }
      return bitfab.replay(FUNCTIONS.search, fn, {
        limit,
        ...(traceIds ? { traceIds } : {}),
      })
    }

    const REPLAY_FNS: Record<Pipeline, () => ReturnType<typeof bitfab.replay>> = {
      extraction: replayExtraction,
      search: replaySearch,
    }

    async function main() {
      const functionKey = FUNCTIONS[pipeline!]
      console.log(`[replay] Replaying ${traceIds?.length ?? limit} traces from "${functionKey}"...\n`)

      const result = await REPLAY_FNS[pipeline!]()
      console.log(`Test run: ${result.testRunUrl}\n`)

      let changed = 0,
        same = 0,
        errors = 0

      for (const item of result.items) {
        const input = item.input as unknown[] | undefined
        const label =
          typeof input?.[0] === "string"
            ? input[0].slice(0, 80)
            : JSON.stringify(input?.[0] ?? "unknown").slice(0, 80)

        if (item.error) {
          console.log(` ✗ "${label}"`)
          console.log(` Error: ${item.error}`)
          errors++
        } else {
          const origStr =
            typeof item.originalOutput === "string"
              ? item.originalOutput
              : JSON.stringify(item.originalOutput)
          const newStr = typeof item.result === "string" ? item.result : JSON.stringify(item.result)
          const isSame = origStr === newStr
          const marker = isSame ? "=" : "Δ"
          console.log(` ${marker} "${label}"`)
          console.log(` Original: ${origStr}`)
          console.log(` New: ${newStr}`)
          isSame ? same++ : changed++
        }
      }

      console.log(`\n─── Summary ───`)
      console.log(` Pipeline: ${pipeline}`)
      console.log(` Replayed: ${result.items.length}`)
      console.log(` Same: ${same}`)
      console.log(` Changed: ${changed}`)
      if (errors > 0) console.log(` Errors: ${errors}`)
      console.log(`\n ${result.testRunUrl}`)
    }

    main().catch((err) => {
      console.error("[replay] Fatal error:", err)
      process.exit(1)
    })
Adapt the imports, pipeline names, and per-pipeline replay functions to match your project’s instrumented workflows.
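The script above counts errored items separately from changed and unchanged ones. If you want that rule in reusable form, a sketch like the following keeps unreplayable items out of the pass/fail denominator. The `ReplayItem` shape is inferred from the fields documented on this page, and `partitionItems` plus the sample data are hypothetical.

```typescript
// Item shape assumed from the fields documented on this page.
type ReplayItem = {
  input: unknown
  result?: unknown
  originalOutput?: unknown
  error?: unknown
}

function partitionItems(items: ReplayItem[]) {
  // Items with an error are unreplayable, not failing outputs.
  const replayable = items.filter((i) => i.error == null)
  const unchanged = replayable.filter(
    (i) => JSON.stringify(i.result) === JSON.stringify(i.originalOutput),
  )
  return {
    unreplayable: items.length - replayable.length,
    compared: replayable.length,
    unchanged: unchanged.length,
    changed: replayable.length - unchanged.length,
  }
}

// Made-up sample: one unchanged, one changed, one that hit a missing DB row.
const replayItems: ReplayItem[] = [
  { input: ["a"], result: { ok: true }, originalOutput: { ok: true } },
  { input: ["b"], result: { ok: false }, originalOutput: { ok: true } },
  { input: ["c"], originalOutput: { ok: true }, error: "record not found" },
]

console.log(partitionItems(replayItems))
// { unreplayable: 1, compared: 2, unchanged: 1, changed: 1 }
```

Comparing via `JSON.stringify` mirrors the string comparison in the script above; it is sensitive to key order, so a structural comparison may be preferable for objects built in varying order.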