The Bitfab Python SDK captures your AI function calls to automatically generate evaluations. Re-run your prompts with different models, parameters, and inputs to iterate faster.
Copy this prompt into your coding agent (tested with Cursor and Claude Code using Sonnet 4.5):
Modify existing Python code to add Bitfab tracing.

Do NOT browse or web search. Use ONLY the API described below.

Bitfab Python SDK (authoritative excerpt):
- Install: `pip install bitfab-py` or `poetry add bitfab-py` or `uv add bitfab-py`
- Init:
    import os
    from bitfab import Bitfab
    bitfab = Bitfab(api_key=os.environ["BITFAB_API_KEY"])
- Instrumentation (ONLY allowed form - use get_function):
    # Declare trace function key once
    my_service = bitfab.get_function("<trace_function_key>")

    # Decorate methods with span
    @my_service.span()
    def method_name(): ...

    # Or with options:
    @my_service.span(name="DisplayName", type="function")
    def method_name(): ...

    # Span types: "llm", "agent", "function", "guardrail", "handoff", "custom"
- Decorator form ONLY; must be placed immediately ABOVE the `def` it instruments.
- DO NOT use context managers or manual span creation.
- DO NOT extract helper methods.

Task:
1) Ensure bitfab-py is installed and initialization exists.
2) Read the codebase and identify ALL AI workflows (LLM calls, agent runs, AI-driven decisions).
3) Present me with a numbered list of workflows you found. For each, describe:
   - What it does
   - Why it's worth instrumenting: what visibility tracing gives you into each step
4) After I choose which workflow(s) to instrument:
   - Create a function wrapper with `bitfab.get_function("<trace_function_key>")`
   - Add `@my_service.span()` directly ABOVE each method's `def`
   - Instrument intermediate steps (not just the final output) so each trace has enough context to diagnose issues
   - Ensure the bitfab client is initialized and accessible
5) Do not change method signature, behavior, or return value. Minimal diff.

Output:
- First: your numbered list of workflows with why each is worth instrumenting
- After my selection: minimal diffs for dependencies, initialization, and the method changes
Bitfab(api_key: str)

# Disable tracing (functions still execute, but no spans are sent)
Bitfab(api_key: str, enabled: bool = True)
Missing API key doesn’t crash. If the API key is missing, empty, or whitespace-only, the SDK automatically disables tracing and logs a warning. All decorated functions still execute normally — no spans are sent, no errors are thrown. You don’t need any conditional logic around the API key.
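One caveat worth illustrating: the SDK tolerates a missing key, but `os.environ["BITFAB_API_KEY"]` raises `KeyError` before the SDK ever sees the value. The following is a minimal sketch, assuming you want the documented fall-back-to-disabled behavior rather than a crash at import. `read_api_key` is a hypothetical helper, not part of the SDK.

```python
import os


def read_api_key(env=None, var="BITFAB_API_KEY"):
    """Return the configured key, or "" when the variable is unset.

    Hypothetical helper (not part of the SDK).
    """
    env = os.environ if env is None else env
    # .get() avoids the KeyError that env[var] raises for a missing variable.
    return env.get(var, "")


# Intended usage, mirroring the documented constructor:
#   bitfab = Bitfab(api_key=read_api_key())
print(repr(read_api_key({})))
print(repr(read_api_key({"BITFAB_API_KEY": "k1"})))
```

With an empty string passed through, the documented behavior (tracing disabled, warning logged) should apply instead of an import-time exception.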
For projects with instrumented functions spread across multiple files, create a dedicated file that initializes Bitfab and exports the function. Import it wherever you need to instrument.
# lib/bitfab_client.py: single source of truth
import os

from bitfab import Bitfab

bitfab = Bitfab(api_key=os.environ["BITFAB_API_KEY"])
order_service = bitfab.get_function("order-processing")
Use get_current_span() to get a handle to the active span, then call .add_context() to attach contextual key-value pairs from inside a traced function — useful for runtime values like request IDs, computed scores, or dynamic context:
Use get_current_span() to set the prompt string on the current span. This is stored in span_data.prompt and is useful for capturing the exact prompt text sent to an LLM:
from bitfab import get_current_span

@bitfab.span("classification", type="llm")
def classify_text(text: str) -> str:
    prompt = f"Classify the following text: {text}"
    get_current_span().set_prompt(prompt)
    result = llm.complete(prompt)
    return result
The last set_prompt call wins — it overwrites any previously set prompt on the span. Calling set_prompt outside a span context is a no-op (it never crashes).
Use get_current_trace() to set context that applies to the entire trace (all spans within a single execution). This is useful for grouping traces by session or attaching trace-level metadata:
from bitfab import get_current_trace

@bitfab.span("order-processing", type="function")
def process_order(order_id: str) -> dict:
    trace = get_current_trace()

    # Set session ID (stored as database column, filterable in dashboard)
    trace.set_session_id("session-123")

    # Set trace metadata (stored in raw trace data)
    trace.set_metadata({"region": "us-west-2", "environment": "production"})

    # Add context entries (stored as key-value pairs, accumulates across calls)
    trace.add_context({"workflow": "checkout-flow", "batch_id": "batch-2024-01"})

    return {"order_id": order_id, "status": "completed"}
set_session_id(id) — Groups traces by user session. Stored as a database column for efficient filtering.
set_metadata(dict) — Arbitrary key-value metadata on the trace. Merges with existing metadata.
add_context(dict) — Key-value context entries. Accumulates across multiple calls.
Replay historical traces through a function and create a test run with comparison data. This is useful for testing changes to your functions against real production inputs.
fn (required): The function to replay (must be decorated with @span)
limit (optional): Maximum number of traces to replay. Default: 5
trace_ids (optional): List of trace IDs to filter which traces are replayed
max_concurrency (optional): Maximum items processed in parallel. 1 for sequential, None for unlimited. Default: 10
code_change_description (optional): Rationale for the code change being tested in this replay (stored on the experiment)
code_change_files (optional): List of edited files, each as {"path": str, "before": str, "after": str} (use "" for newly created or deleted files)
Returns:
{
    "items": [
        {
            "input": [...],              # The inputs passed to fn
            "result": ...,               # What fn returned
            "original_output": ...,      # What the original trace produced
            "error": None | str,         # Error message if fn raised
            "duration_ms": int | None,   # Original trace duration in ms
            "tokens": {                  # Original trace token usage, or None
                "input": int | None,
                "output": int | None,
                "cached": int | None,
                "total": int | None,
            } | None,
            "model": str | None,         # Original model name, or None
        }
    ],
    "test_run_id": "...",
    "test_run_url": "..."
}
Per-item duration_ms, tokens, and model come from the historical trace that fed the item. Use them to reason about the cost and latency of the old runs. Each field is None when the underlying trace didn’t capture it.
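As a rough illustration of reasoning about the old runs, the sketch below aggregates latency and token usage over items shaped like the return value above, skipping fields that are None. `summarize_history` and the sample data are made up for this example; only the field names come from the documented shape.

```python
def summarize_history(items):
    """Aggregate original-trace latency and token totals, skipping None fields."""
    durations = [i["duration_ms"] for i in items if i.get("duration_ms") is not None]
    totals = [
        i["tokens"]["total"]
        for i in items
        if i.get("tokens") and i["tokens"].get("total") is not None
    ]
    return {
        "items_with_duration": len(durations),
        "mean_duration_ms": sum(durations) / len(durations) if durations else None,
        "items_with_tokens": len(totals),
        "total_tokens": sum(totals) if totals else None,
    }


# Made-up items in the documented shape (only the fields used here).
sample_items = [
    {"duration_ms": 1200, "tokens": {"input": 300, "output": 100, "cached": 0, "total": 400}, "model": "model-a"},
    {"duration_ms": None, "tokens": None, "model": None},
    {"duration_ms": 800, "tokens": {"input": 150, "output": 50, "cached": None, "total": 200}, "model": "model-a"},
]
print(summarize_history(sample_items))
```

In a real script you would pass `result["items"]` from a replay call; items whose trace captured nothing simply drop out of the averages.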
When iterating on a root function, child spans sometimes fail in your local environment for reasons unrelated to the code under test: a paid API key is missing, an external service is flaky, or a production-only DB row isn’t seeded locally. The mock keyword lets the child return its recorded output so the root function can still run.
Three strategies on replay():
"none" (default): every child span runs real code.
"all": every descendant span returns its historical output. The root function still runs real, but every child is short-circuited. Useful for a quick sanity-check against recorded data; not the recommended iteration strategy because changes to descendants won’t actually execute.
"marked": only descendants declared with mock_on_replay=True are short-circuited; everything else runs real. This is the iteration-friendly mode.
Per-span opt-in via the mock_on_replay kwarg on @bitfab.span(...):
@bitfab.span("fetch-article-from-db", mock_on_replay=True)
def fetch_article_from_db(article_id: str) -> Article:
    return db.articles.find_by_id(article_id)

@bitfab.span("summarize-article")
def summarize_article(article: Article) -> Summary:
    # Real summarization, no flag; this is what we're iterating on.
    return Summary(...)

@bitfab.span("process-article")
def process_article(article_id: str) -> Summary:
    return summarize_article(fetch_article_from_db(article_id))

# During replay, fetch-article-from-db returns its recorded output;
# summarize-article runs real so you can iterate on it.
result = bitfab.replay(process_article, limit=10, mock="marked")
mock_on_replay is a per-span tag at definition time — it has no effect outside replay, and it’s only read under mock="marked". The root function always runs real code; only descendants can be mocked. When no historical span matches a child call, execution falls through to the real function — never silent omission.
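The rules above can be summarized as a small decision function. This is only a model of the documented behavior for reasoning about which spans execute, not SDK code; `runs_real_code` and its parameters are names invented for the sketch.

```python
def runs_real_code(mock, is_root, mock_on_replay=False, has_recorded_output=True):
    """Model of the documented rules: True if the span executes its real body."""
    if mock not in ("none", "all", "marked"):
        raise ValueError(f"unknown mock strategy: {mock!r}")
    if is_root:
        return True  # the root function always runs real code
    if not has_recorded_output:
        return True  # no matching historical span: fall through to real code
    if mock == "all":
        return False  # every descendant returns its historical output
    if mock == "marked":
        return not mock_on_replay  # only tagged descendants are short-circuited
    return True  # "none": every child span runs real code


for strategy in ("none", "all", "marked"):
    tagged = runs_real_code(strategy, is_root=False, mock_on_replay=True)
    untagged = runs_real_code(strategy, is_root=False, mock_on_replay=False)
    print(f"{strategy:>6}: tagged child real={tagged}, untagged child real={untagged}")
```

Reading the table this prints makes the trade-off visible: only "marked" lets the span you are editing run for real while its flaky dependencies stay mocked.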
Each replay creates an experiment (test run). When you’re iterating on a function and replaying after every edit, attach the change so the dashboard can show exactly what was edited alongside the results. Read each file before editing, edit, then read it again — the two strings go straight into code_change_files. There’s no diff format to construct.
with open("src/foo.py") as f:
    before = f.read()

# ...edit src/foo.py...

with open("src/foo.py") as f:
    after = f.read()

result = bitfab.replay(
    my_function,
    code_change_description="fix off-by-one in retry logic",
    code_change_files=[{"path": "src/foo.py", "before": before, "after": after}],
)
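For files that are created or deleted by the edit, the parameter description says to use "" for the missing side. One way to handle that uniformly is a snapshot helper that returns "" when the file is absent. This is a sketch under that assumption; `snapshot` and `change_entry` are hypothetical helpers, not SDK functions.

```python
import os
import tempfile


def snapshot(path):
    """Return the file's text, or "" if it does not exist (created/deleted case)."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return ""


def change_entry(path, before):
    """Build one code_change_files entry from a pre-edit snapshot."""
    return {"path": path, "before": before, "after": snapshot(path)}


# Demo in a temp directory: the "edit" creates a new file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "new_module.py")
    before = snapshot(path)  # "" because the file is not there yet
    with open(path, "w", encoding="utf-8") as f:
        f.write("VALUE = 1\n")
    entry = change_entry(path, before)

print(entry["before"] == "", repr(entry["after"]))
```

The resulting list of entries can then be passed as `code_change_files`, exactly as in the example above.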
Both options are optional and independent: you can pass just code_change_description for a quick rationale-only annotation, or just code_change_files to record the literal edits.
Notes:
The function must be decorated with @span; the trace function key is read from the decorator. Pass the decorated function itself, not an undecorated wrapper around it, because the @span attribute is what identifies the trace key. For nested decorators (e.g. retry(cache(span(fn)))), pass the outermost; replay walks the __wrapped__ chain to find @span.
For decorated methods on classes, pass the unbound function on the class (MyClass.method) to replay traces for all instances, or a bound method on a specific instance (instance.method) to replay through that instance’s state. Both resolve to the same trace function key.
Use a single Bitfab client across instrumentation and replay. If your instrumented module constructs Bitfab() at import and your replay script constructs another, they do not share registered trace functions — import the client from the instrumented module (or a shared singleton) rather than constructing a new one in the replay script.
The function can be sync or async (async functions are detected and run automatically)
If the function raises an error for one input, replay continues with the remaining inputs
Each replay creates a test run visible in the Bitfab dashboard
Works through nested decorators (e.g. @retry, @cache) — walks the __wrapped__ chain to find @span
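To make the `__wrapped__` note concrete, here is a self-contained sketch of how such a lookup can work in plain Python. The `span` and `retry` decorators, the `_TRACE_KEYS` registry, and `find_trace_key` are stand-ins written for this example; the SDK's actual internals may differ. It relies only on the standard behavior that `functools.wraps` sets `__wrapped__` on the wrapper.

```python
import functools

_TRACE_KEYS = {}  # stand-in registry: span wrapper -> trace function key


def span(key):
    """Stand-in tracing decorator that registers its wrapper under a key."""
    def decorate(fn):
        @functools.wraps(fn)  # sets wrapper.__wrapped__ = fn
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        _TRACE_KEYS[wrapper] = key
        return wrapper
    return decorate


def retry(fn):
    """Stand-in outer decorator; also preserves the __wrapped__ chain."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def find_trace_key(fn):
    """Walk the __wrapped__ chain until a registered span wrapper is found."""
    seen = set()
    while fn is not None and id(fn) not in seen:
        seen.add(id(fn))
        if fn in _TRACE_KEYS:
            return _TRACE_KEYS[fn]
        fn = getattr(fn, "__wrapped__", None)
    return None


@retry
@span("demo-key")
def double(x):
    return x * 2


def plain_wrapper(x):
    # Undecorated wrapper: no __wrapped__ link, so the key cannot be found.
    return double(x)


print(find_trace_key(double), find_trace_key(plain_wrapper))
```

This is why an outer decorator built with `functools.wraps` is safe to pass, while a hand-written wrapper function hides the span from the lookup.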
Replay results are typically consumed by automation (CI logs, code reviewers, and coding agents reading stdout). Emit the full ReplayResult as a single JSON block so a consumer can json.loads it and reason about every field, including the new per-item duration_ms, tokens, and model. Never print only lengths, counts, hashes, or truncated previews, and never replace the JSON block with ad-hoc per-field log lines.
Recommended script tail:
import json  # at the top of the script

result = bitfab.replay(my_function, limit=limit)

# Optional: human-readable summary first.
print(f"Test run: {result['test_run_url']}")
print(f"Items: {len(result['items'])}")

# Then: full structured dump, ready for json.loads.
print(json.dumps(result, indent=2, default=str))
The dumped object includes every item’s input, result, original_output, error, duration_ms, tokens, and model, plus test_run_id and test_run_url. Writing the same JSON to scripts/replay-result.json in parallel is optional but useful for later analysis.
Per-item errors are part of the contract. If the wrapped function raises on a given trace, bitfab.replay catches it, sets item['error'], leaves item['result'] as None, and continues. Treat items with item['error'] set as unreplayable, not as failing outputs: compute pass/fail only over items where it’s None. This matters most for DB reads/writes: a stale FK, missing record, or rejected write is infra failure, not a regression.
Don’t swallow per-item errors in the script. A custom try/except that returns a placeholder turns infra failures into fake successes. Let the SDK record them. The only allowed top-level except is a fatal handler around main() that exits non-zero, so callers can tell a whole-replay crash from a clean run with some unreplayable items.
Environment. Replay executes in the app’s own process: the instrumented function is imported as a library, and its DB clients, env vars, config loaders, and model IDs resolve from whatever environment the replay script is run under. The script must bootstrap the same environment the app uses (e.g. load_dotenv() at the top, or run via dotenv run -- python scripts/replay.py). Do not mock these; they’re the same dependencies the app resolves in production. For replay to see the same DB rows the trace was captured against, point the script at the trace’s source environment (the environment field on the trace: production / staging / development).
Input serialization caveat. Replay deserializes historical span inputs and passes them back to your function. This works for strings, numbers, and plain dicts. If your span wraps a function that takes hydrated domain objects (ORM models, class instances, DB records), they won’t round-trip through serialization. Move the span to where inputs are IDs or plain data and let the function fetch objects internally, or reshape arguments in the wrapper.
Create a standalone script to regression-test your trace functions against production data with one command. The script maps pipeline names to their replay functions, accepts CLI flags, and prints a side-by-side comparison with delta summaries.
#!/usr/bin/env python3
"""Replay production traces through instrumented functions.

Uses bitfab.replay() to fetch real traces and re-run them through
the current code, creating a test run for side-by-side comparison.

Usage:
    python scripts/replay.py <pipeline>
    python scripts/replay.py <pipeline> --limit 20
    python scripts/replay.py <pipeline> --trace-ids id1,id2
"""
import argparse
import json

from dotenv import load_dotenv

# Load .env before importing project modules: lib.bitfab_client reads
# BITFAB_API_KEY from the environment at import time.
load_dotenv()

from lib.bitfab_client import bitfab  # noqa: E402
from services.extraction import extract_memories  # noqa: E402
from services.search import search_documents  # noqa: E402

FUNCTIONS = {
    "extraction": "my-extraction-pipeline",
    "search": "my-search-pipeline",
}


# Each pipeline gets its own replay function. Replay deserializes
# historical inputs, so if the function signature doesn't match the
# raw input shape, reshape the arguments in a thin wrapper here.
def replay_extraction(limit: int, trace_ids: list[str] | None):
    def fn(conversation: str, existing_items: list):
        return extract_memories(conversation, existing_items)

    return bitfab.replay(fn, limit=limit, trace_ids=trace_ids)


def replay_search(limit: int, trace_ids: list[str] | None):
    def fn(query: str, opts: dict):
        return search_documents(query, user_id=opts["user_id"], limit=opts.get("limit", 10))

    return bitfab.replay(fn, limit=limit, trace_ids=trace_ids)


REPLAY_FNS = {
    "extraction": replay_extraction,
    "search": replay_search,
}


def main():
    parser = argparse.ArgumentParser(description="Replay production traces")
    parser.add_argument("pipeline", choices=FUNCTIONS.keys())
    parser.add_argument("--limit", type=int, default=10)
    parser.add_argument("--trace-ids", type=str)
    args = parser.parse_args()

    trace_ids = [tid.strip() for tid in args.trace_ids.split(",")] if args.trace_ids else None
    function_key = FUNCTIONS[args.pipeline]

    print(f"[replay] Replaying {len(trace_ids) if trace_ids else args.limit} traces from \"{function_key}\"...\n")
    result = REPLAY_FNS[args.pipeline](args.limit, trace_ids)
    print(f"Test run: {result['test_run_url']}\n")

    # Counters for the summary printed at the end.
    changed = same = errors = 0
    for item in result["items"]:
        raw_input = item.get("input") or []
        label = str(raw_input[0])[:80] if raw_input else "unknown"
        if item["error"]:
            print(f' ✗ "{label}"')
            print(f" Error: {item['error']}")
            errors += 1
        else:
            orig = item["original_output"]
            new = item["result"]
            # Compare outputs as strings; non-string values are serialized first.
            orig_str = orig if isinstance(orig, str) else json.dumps(orig, default=str)
            new_str = new if isinstance(new, str) else json.dumps(new, default=str)
            is_same = orig_str == new_str
            marker = "=" if is_same else "Δ"
            print(f' {marker} "{label}"')
            print(f" Original: {orig_str}")
            print(f" New: {new_str}")
            if is_same:
                same += 1
            else:
                changed += 1

    print("\n─── Summary ───")
    print(f" Pipeline: {args.pipeline}")
    print(f" Replayed: {len(result['items'])}")
    print(f" Same: {same}")
    print(f" Changed: {changed}")
    if errors > 0:
        print(f" Errors: {errors}")
    print(f"\n {result['test_run_url']}")


if __name__ == "__main__":
    main()
Adapt the imports, pipeline names, and per-pipeline replay functions to match your project’s instrumented workflows.