Python SDK - Bitfab

The Bitfab Python SDK captures your AI function calls to automatically generate evaluations. Re-run your prompts with different models, parameters, and inputs to iterate faster.

Installation

# pip
pip install bitfab-py

# Poetry
poetry add bitfab-py

# uv
uv add bitfab-py

Quick Start

import os
from bitfab import Bitfab

bitfab = Bitfab(api_key=os.environ["BITFAB_API_KEY"])

Need an API key? Get one from the Bitfab dashboard or see the API Keys guide for detailed setup instructions.

Coding Agent Prompt (Cursor, Claude Code)

Copy this prompt into your coding agent (tested with Cursor and Claude Code using Sonnet 4.5):

Modify existing Python code to add Bitfab tracing.
Do NOT browse or web search. Use ONLY the API described below.

Bitfab Python SDK (authoritative excerpt):
- Install: `pip install bitfab-py` or `poetry add bitfab-py` or `uv add bitfab-py`
- Init:
  import os
  from bitfab import Bitfab
  bitfab = Bitfab(api_key=os.environ["BITFAB_API_KEY"])
- Framework integrations:
  If the codebase uses LangGraph or LangChain (`langgraph`, `langchain`, `langchain_core`, or other `langchain_*` packages), use the callback handler instead of manually decorating graph nodes, tools, retrievers, or model calls:
    handler = bitfab.get_langgraph_callback_handler("<trace_function_key>")
    graph.invoke(input, config={"callbacks": [handler]})
  For plain LangChain chains, `get_langchain_callback_handler("<trace_function_key>")` is an identical alias. The handler records a replayable root from the framework input, so no outer `@span` root is needed when the workflow is just the graph/chain invocation. Add a same-key outer root only for meaningful surrounding application work.
- Manual instrumentation (when no framework handler applies, or for meaningful work around a framework call):
  # Declare trace function key once
  my_service = bitfab.get_function("<trace_function_key>")

  # Decorate methods with span
  @my_service.span()
  def method_name(): ...

  # Or with options:
  @my_service.span(name="DisplayName", type="function")
  def method_name(): ...

  # Span types: "llm", "agent", "function", "guardrail", "handoff", "custom"
- Decorator form ONLY; must be placed immediately ABOVE the `def` it instruments.
- DO NOT use context managers or manual span creation.
- DO NOT extract helper methods.

Task:
1) Ensure bitfab-py is installed and initialization exists.
2) Read the codebase and identify ALL AI workflows (LLM calls, agent runs, AI-driven decisions). Check for LangGraph/LangChain before planning manual instrumentation.
3) Present me with a numbered list of workflows you found. For each, describe:
   - What it does
   - Why it's worth instrumenting -- what visibility tracing gives you into each step
4) After I choose which workflow(s) to instrument:
   - If it uses LangGraph/LangChain, add the Bitfab callback handler to the framework invoke config instead of decorating framework-managed internals. Use `get_function("<trace_function_key>").get_langgraph_callback_handler()` only when a same-key outer `@span` root is needed for surrounding application work.
   - For non-framework workflows, create a function wrapper with `bitfab.get_function("<trace_function_key>")`
   - Add `@my_service.span()` directly ABOVE each non-framework method's `def`
   - Instrument intermediate steps (not just the final output) so each trace has enough context to diagnose issues
   - Ensure the bitfab client is initialized and accessible
5) Do not change method signature, behavior, or return value. Minimal diff.

Output:
- First: your numbered list of workflows with why each is worth instrumenting
- After my selection: minimal diffs for dependencies, initialization, and the method changes

Basic Configuration

Bitfab(api_key="...")

# Omit api_key entirely: the SDK reads BITFAB_API_KEY from the environment
Bitfab()

# Disable tracing (functions still execute, but no spans are sent)
Bitfab(api_key="...", enabled=False)

Missing API key doesn’t crash. If the API key is missing, empty, or whitespace-only, the SDK automatically disables tracing and logs a one-time warning at first use. All decorated functions still execute normally — no spans are sent, no errors are thrown. You don’t need any conditional logic around the API key.
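The behavior described above can be pictured with a small stand-in. This is a sketch under stated assumptions, not the SDK's implementation: `ToyTracer`, its `sent` list, and its `_warned` flag are hypothetical names used only to show "function still runs, nothing is sent, warn once".

```python
import functools
import logging

logger = logging.getLogger("toy_tracer")


class ToyTracer:
    """Hypothetical stand-in for a tracing client; not the Bitfab SDK."""

    def __init__(self, api_key=None):
        self._api_key = api_key
        self._warned = False
        self.sent = []  # spans that would have been uploaded

    def _enabled(self):
        # Missing, empty, and whitespace-only keys all count as "no key".
        ok = bool(self._api_key and self._api_key.strip())
        if not ok and not self._warned:
            logger.warning("No API key resolved; tracing is disabled.")
            self._warned = True  # warn once, not on every call
        return ok

    def span(self, key):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                result = fn(*args, **kwargs)  # the function always executes
                if self._enabled():
                    self.sent.append((key, result))
                return result
            return wrapper
        return decorator


tracer = ToyTracer(api_key="   ")  # whitespace-only key


@tracer.span("order-processing")
def process_order(order_id):
    return {"order_id": order_id}


order = process_order("o-1")
```

The caller never branches on whether a key exists; the decorated function returns its normal value either way.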

API key resolution

The key is resolved lazily, the first time a span runs, not when the client is constructed. This matters in scripts: a module that builds the client at import time can run before the entrypoint calls load_dotenv(), so a key read at construction would be empty even though it is set moments later. Resolving at first use reads the key after env loading has happened.

# Pass a callable to defer resolution explicitly (resolved at first use):
Bitfab(api_key=lambda: os.environ.get("BITFAB_API_KEY"))
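To see why first-use resolution matters, the sketch below compares a value read eagerly with one resolved lazily. It is illustrative only: `LazyKey` and the `DEMO_API_KEY` variable are hypothetical and exist just to keep the demo isolated from a real environment.

```python
import os

ENV_VAR = "DEMO_API_KEY"  # hypothetical variable name for this demo


class LazyKey:
    """Hypothetical helper: resolve a key (or a callable returning one) on first use."""

    def __init__(self, source=None):
        self._source = source
        self._resolved = False
        self._value = None

    def get(self):
        if not self._resolved:
            value = self._source() if callable(self._source) else self._source
            # Fall back to the environment when nothing usable was passed.
            if not (value and value.strip()):
                value = os.environ.get(ENV_VAR)
            self._value = value
            self._resolved = True
        return self._value


os.environ.pop(ENV_VAR, None)                 # key not set yet at "import time"
eager = os.environ.get(ENV_VAR)               # read at construction: None
lazy = LazyKey(lambda: os.environ.get(ENV_VAR))

os.environ[ENV_VAR] = "demo-key"              # env loaded later, e.g. by load_dotenv()
resolved = lazy.get()                         # first use happens after loading
```

The eager read stays empty because it ran before the environment was populated, while the lazy read sees the key.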

When no key is passed (or it resolves empty), the SDK falls back to reading BITFAB_API_KEY from the environment, again at first use. For standalone scripts where a run that emits no traces should be treated as a failure rather than silently skipped, set strict=True:

# Raises on the first traced call if no key resolves, instead of disabling quietly
Bitfab(api_key=os.environ.get("BITFAB_API_KEY"), strict=True)

If you load environment variables with dotenv in a script, prefer loading them before the module graph is imported, for example with dotenv run -- python script.py, so every module-level read sees the key.

Tracing

Custom (Recommended)

Using `get_function()` to Link Spans

Declare the trace function key once and link multiple spans together:

order_service = bitfab.get_function("order-processing")

@order_service.span()
def process_order(order_id: str) -> dict:
    return {"order_id": order_id}

@order_service.span()
def validate_order(order_id: str) -> dict:
    return {"valid": True}

Multi-File Projects

For projects with instrumented functions spread across multiple files, create a dedicated file that initializes Bitfab and exports the function. Import it wherever you need to instrument.

# lib/bitfab_client.py -- single source of truth
import os
from bitfab import Bitfab
bitfab = Bitfab(api_key=os.environ["BITFAB_API_KEY"])
order_service = bitfab.get_function("order-processing")

# services/process_order.py
from lib.bitfab_client import order_service

@order_service.span()
def process_order(order_id: str) -> dict:
    return {"order_id": order_id}

# services/validate_order.py
from lib.bitfab_client import order_service

@order_service.span()
def validate_order(order_id: str) -> dict:
    return {"valid": True}

Spans from different files are automatically linked as parent-child when one decorated function calls another.

Using `@bitfab.span()` Directly

For a single span without linking to a function group:

@bitfab.span("one-off-operation")
def standalone_task() -> str:
    return "done"

Automatic Nesting

Spans nest automatically based on call stack:

@bitfab.span("outer", type="agent")
def outer():
    inner()  # Becomes a child of "outer"

@bitfab.span("inner", type="function")
def inner():
    pass

Span Options

Parameters:

trace_function_key (required): String identifier for grouping spans
name (optional): Display name. Defaults to function name, then trace function key
type (optional): Span type. Defaults to "custom". A label only, used to organize and filter spans in the dashboard; it does not change how the span is traced, replayed, or evaluated
finalize (optional): Callable[[result], serializable]. Records a serializable view of a non-serializable result (for example, a live stream). See Tracing Streaming Functions

Span Types:

SpanType = Literal[
    "llm",        # LLM calls
    "agent",      # Agent workflows
    "function",   # Function calls
    "guardrail",  # Safety checks
    "handoff",    # Human handoffs
    "custom"      # Default
]

Examples:

# Function name is automatically captured as span name
@bitfab.span("order-processing")
def process_order(order_id: str) -> dict:
    return {"order_id": order_id}
# Span name: "process_order"

# Override with name option
@bitfab.span("order-processing", name="OrderProcessor")
def process_order(order_id: str) -> dict:
    return {"order_id": order_id}
# Span name: "OrderProcessor"

# Set span type
@bitfab.span("safety-check", type="guardrail")
def check_content(content: str) -> dict:
    return {"safe": True}
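The naming fallback listed under Parameters (explicit name, then function name, then trace function key) can be written out as a tiny helper. `resolve_span_name` is a hypothetical function for illustration; the SDK does not necessarily expose anything like it.

```python
import functools


def resolve_span_name(trace_function_key, fn=None, name=None):
    """Hypothetical helper mirroring the documented fallback order."""
    if name:
        return name  # an explicit name option wins
    fn_name = getattr(fn, "__name__", None)
    return fn_name or trace_function_key  # function name, then the key


def process_order(order_id):
    return {"order_id": order_id}


# functools.partial objects carry no __name__, so the key is used instead.
nameless = functools.partial(process_order, "o-1")
```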

Tracing Streaming Functions

A streaming function hands chunks to the caller as they arrive; the raw stream isn’t serializable as a trace output, and consuming it to record a summary would break streaming. The finalize option records a serializable, replayable view of the stream while the caller still receives every chunk. Because Python streams are single-consumer (unlike a JS stream you can tee), the non-destructive way to trace streaming is an async generator that yields its chunks. The span collects the chunks as they pass through to the caller, and finalize turns the collected chunks into a summary. Use the prebuilt finalizers.openai_chunks or finalizers.anthropic_events:

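from openai import AsyncOpenAI

from bitfab import finalizers

client = AsyncOpenAI()  # OpenAI client; reads OPENAI_API_KEY from the environment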

@bitfab.span("chat", type="llm", finalize=finalizers.openai_chunks)
async def chat(messages):
    stream = await client.chat.completions.create(
        model="gpt-4o", messages=messages, stream=True
    )
    async for chunk in stream:
        yield chunk  # caller still receives every chunk

# The span records { text, finish_reason, usage, tool_calls } in the background.

finalize may also be a plain callable that builds whatever shape you want from the collected chunks:

@bitfab.span("chat", type="llm", finalize=lambda chunks: {"text": "".join(
    c.choices[0].delta.content or "" for c in chunks
)})
async def chat(messages):
    ...
    async for chunk in stream:
        yield chunk

For a non-generator function, finalize receives the return value instead of the collected chunks and is applied inline before the span is recorded (awaited on an async span). The caller’s return value is always the raw result, but a live single-consumer stream returned here will be blocked on and consumed, so use an async generator for streaming, and reserve the non-generator form for plain return values or results with non-destructive accessors. A finalize that raises records an error on the span instead of crashing the host. Inputs to the wrapped function must still be serializable for the trace to replay.
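The pass-through idea for async generators can be sketched without any SDK at all. This is a simplified illustration, assuming a toy `traced_stream` decorator and a `recorded` attribute that are not part of Bitfab: chunks are forwarded to the caller as they arrive, collected on the side, and summarized once the stream ends. A finalize that raises is turned into a recorded error rather than propagating.

```python
import asyncio


def traced_stream(finalize):
    """Toy decorator for async generators: forward chunks, summarize at the end."""
    def decorator(gen_fn):
        async def wrapper(*args, **kwargs):
            chunks = []
            try:
                async for chunk in gen_fn(*args, **kwargs):
                    chunks.append(chunk)  # collect a copy for the summary
                    yield chunk           # caller still receives every chunk
            finally:
                try:
                    wrapper.recorded = finalize(chunks)
                except Exception as exc:  # a failing finalize must not crash the host
                    wrapper.recorded = {"error": str(exc)}
        wrapper.recorded = None
        return wrapper
    return decorator


@traced_stream(finalize=lambda chunks: {"text": "".join(chunks)})
async def chat():
    for piece in ["Hel", "lo"]:
        yield piece


async def consume():
    return [chunk async for chunk in chat()]


received = asyncio.run(consume())
```

The caller iterates the stream exactly once, and the summary is built from the copies gathered along the way, so nothing is consumed twice.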

Span Context

Use get_current_span() to get a handle to the active span, then call .add_context() to attach contextual key-value pairs from inside a traced function — useful for runtime values like request IDs, computed scores, or dynamic context:

from bitfab import get_current_span

@bitfab.span("order-processing", type="function")
def process_order(order_id: str) -> dict:
    user_id = get_current_user()
    get_current_span().add_context({"user_id": user_id, "order_id": order_id})
    return {"order_id": order_id, "status": "completed"}

Each add_context call pushes the entire dictionary as one entry. Multiple calls accumulate entries:

get_current_span().add_context({"user_id": "u-123"})
get_current_span().add_context({"request_id": "req-789"})
# Result: contexts: [{"user_id": "u-123"}, {"request_id": "req-789"}]

get_current_span().id and .trace_id expose the canonical Bitfab span and trace IDs. Both are empty strings outside a span context.

Span Prompt

Use get_current_span() to set the prompt string on the current span. This is stored in span_data.prompt and is useful for capturing the exact prompt text sent to an LLM:

from bitfab import get_current_span

@bitfab.span("classification", type="llm")
def classify_text(text: str) -> str:
    prompt = f"Classify the following text: {text}"
    get_current_span().set_prompt(prompt)
    result = llm.complete(prompt)
    return result

The prompt is metadata only. It records the prompt text for display and reference in the dashboard; it does not send the prompt to any model or change what the span executes. The last set_prompt call wins — it overwrites any previously set prompt on the span. Calling set_prompt outside a span context is a no-op (it never crashes).

Framework Integrations

Bitfab provides automatic tracing for popular AI frameworks. See the dedicated guides for full API references:

LangGraph / LangChain

Callback handler that records a replayable framework root plus graph nodes, LLM calls, tools, and retrievers

OpenAI Agents SDK

Trace processor for agent runs

BAML

Auto-capture prompts and LLM metadata

Claude Agent SDK

Capture LLM turns, tool calls, and subagents

Trace Context

Use get_current_trace() to set context that applies to the entire trace (all spans within a single execution). This is useful for grouping traces by session or attaching trace-level metadata:

from bitfab import get_current_trace

@bitfab.span("order-processing", type="function")
def process_order(order_id: str) -> dict:
    trace = get_current_trace()

    # Set session ID (stored as database column, filterable in dashboard)
    trace.set_session_id("session-123")

    # Set trace metadata (stored in raw trace data)
    trace.set_metadata({"region": "us-west-2", "environment": "production"})

    # Add context entries (stored as key-value pairs, accumulates across calls)
    trace.add_context({"workflow": "checkout-flow", "batch_id": "batch-2024-01"})

    return {"order_id": order_id, "status": "completed"}

set_session_id(id) — Groups traces by user session. Stored as a database column for efficient filtering.
set_metadata(dict) — Arbitrary key-value metadata on the trace. Merges with existing metadata.
add_context(dict) — Key-value context entries. Accumulates across multiple calls.

Dropping a Trace

Call .drop() on the current-trace handle to discard the in-flight trace. Once the trace is flagged, spans that complete afterward are not uploaded at all. The flag is also sent with the completion payload, so when the trace completes the server scrubs any payloads that were already uploaded (the trace, its external trace, and sibling spans), deletes the archived S3 objects, and marks the trace dropped instead of completed, keeping only a skeleton audit row. Use it to discard runs you never want stored (health checks, test traffic) or a run you know carries sensitive data.

from bitfab import get_current_trace

@bitfab.span("order-processing", type="function")
def process_order(order_id: str) -> dict:
    if is_health_check(order_id):
        get_current_trace().drop()

    return {"order_id": order_id, "status": "completed"}

.drop() is safe to call outside a trace (it is a no-op there) and never raises into your application.

Detached Trace

Use client.get_trace(trace_id) to get a handle to a trace that has already closed (client here is the Bitfab instance, named bitfab in the earlier examples). This lets you add context, merge metadata, or set the session ID from any process, thread, or agent that knows the trace ID, with no shared in-memory state.

trace = client.get_trace(trace_id)
trace.add_context({"refund_status": "approved"})
trace.set_metadata({"region": "us-west"})
trace.set_session_id("session_xyz")

# Optional: wait for confirmation
thread = trace.set_metadata({"status": "complete"})
if thread:
    thread.join(timeout=5.0)

The trace_id is Bitfab’s canonical trace ID—the same UUID exposed by get_current_span().trace_id for native SDK traces and used in Bitfab trace URLs. All methods are fire-and-forget (return an optional threading.Thread you can .join() for confirmation, or None if the client is disabled). Pending requests are tracked so flush_traces() waits for them.

add_context(context) — Appends a context entry. Existing entries are preserved.
set_metadata(metadata) — Shallow-merges new keys into existing metadata.
set_session_id(session_id) — Replaces any existing session ID.
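The three update semantics (append, shallow-merge, replace) are easy to confuse, so here is a plain-dictionary sketch of each. The record layout and the helper functions are hypothetical stand-ins, not the server's storage format; the point is that a shallow merge replaces nested values wholesale.

```python
# Hypothetical in-memory picture of a stored trace.
trace_record = {
    "contexts": [{"workflow": "checkout-flow"}],
    "metadata": {"region": "us-west-2", "tags": {"tier": "gold", "beta": True}},
    "session_id": "session-123",
}


def add_context(record, context):
    record["contexts"].append(context)  # append: existing entries stay


def set_metadata(record, metadata):
    # Shallow merge: top-level keys are merged, nested dicts are replaced whole.
    record["metadata"] = {**record["metadata"], **metadata}


def set_session_id(record, session_id):
    record["session_id"] = session_id  # replace


add_context(trace_record, {"refund_status": "approved"})
set_metadata(trace_record, {"tags": {"tier": "silver"}, "status": "complete"})
set_session_id(trace_record, "session_xyz")
```

Note that the nested "beta" flag is gone after the merge: to keep nested keys under a shallow merge, send the full nested object.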

Read One Persisted Span

Use get_trace_span to fetch one span without loading the full trace. Both the trace ID and exact span ID are canonical Bitfab IDs; ingestion source IDs are not accepted. Repeated name matches default to the last span.

span = client.get_trace_span(trace_id, name="GenerateAnswer")
first = client.get_trace_span(
    trace_id, name="GenerateAnswer", occurrence="first"
)
exact = client.get_trace_span(trace_id, id=span_id)

occurrence also accepts a zero-based integer. A missing trace or span returns None.
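The selection rules for repeated names can be sketched over a plain list of spans. `select_span` is a hypothetical local helper, and using the string "last" as the default value is an assumption made for the sketch; the text above only states that the last match is the default.

```python
def select_span(spans, name, occurrence="last"):
    """Pick one span by name; returns None when nothing matches."""
    matches = [s for s in spans if s["name"] == name]
    if not matches:
        return None
    if occurrence == "last":
        return matches[-1]
    if occurrence == "first":
        return matches[0]
    # Otherwise treat occurrence as a zero-based index into the matches.
    if isinstance(occurrence, int) and 0 <= occurrence < len(matches):
        return matches[occurrence]
    return None


spans = [
    {"id": "s1", "name": "GenerateAnswer"},
    {"id": "s2", "name": "Retrieve"},
    {"id": "s3", "name": "GenerateAnswer"},
]
```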

Error Handling

Errors are captured in the span and re-raised:

@bitfab.span("risky-service")
def risky():
    raise ValueError("error")

try:
    risky()
except ValueError:
    pass
# Span records error and timing

Each error is classified by source. Errors raised by your code are recorded with source: "code"; SDK-internal errors are recorded with source: "sdk". Both appear in the span’s errors array in the Bitfab dashboard.
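The capture-then-re-raise pattern looks roughly like this in plain Python. It is a sketch of the general technique, with a toy `toy_span` decorator and a `captured` list that are not SDK names; the recorded field names are illustrative too.

```python
import functools
import time

captured = []  # toy stand-in for spans that would be uploaded


def toy_span(key):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            record = {"key": key, "errors": []}
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                # Record the failure, then let it propagate unchanged.
                record["errors"].append(
                    {"type": type(exc).__name__, "message": str(exc)}
                )
                raise
            finally:
                record["duration_ms"] = (time.perf_counter() - started) * 1000
                captured.append(record)  # timing is kept on success and failure
        return wrapper
    return decorator


@toy_span("risky-service")
def risky():
    raise ValueError("error")


try:
    risky()
except ValueError:
    reraised = True
else:
    reraised = False
```

The bare `raise` preserves the original exception and traceback, so callers handle errors exactly as they would without tracing.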

Flushing Traces

from bitfab import flush_traces

flush_traces(timeout=30.0)  # Default: 30s

Traces flush automatically on process exit via an atexit hook.

Replay

Replay runs historical traces through a function and creates a test run with comparison data. This is useful for testing changes to your functions against real production inputs. A trace is replayable when its root span has serializable inputs, or when the workflow is instrumented through a framework handler (whose recorded root input is itself serializable); one of the two must hold for replay to work.

@bitfab.span("my-function-key")
def my_function(text: str) -> dict:
    return {"processed": text.upper()}

result = bitfab.replay(my_function, limit=5)

# Or replay specific traces by ID
result = bitfab.replay(my_function, trace_ids=["trace-abc", "trace-def"])

print(f"Test Run: {result['test_run_url']}")
for item in result["items"]:
    print(f"  Input: {item['input']}")
    print(f"  Result: {item['result']}")
    print(f"  Original: {item['original_output']}")
    print(f"  Duration (ms): {item['duration_ms']}")
    print(f"  Tokens: {item['tokens']}")       # {"input", "output", "cached", "total"} or None
    print(f"  Model: {item['model']}")

Pass replay() either an already-@span-decorated function (it carries its trace function key, so it runs as-is) or, with an explicit key, a plain callable that re-invokes a raw entrypoint (which replay() wraps for you). Do not pass a plain closure that itself calls a @span-decorated function: replay() wraps the closure as the root span while the inner decorated function records its own span underneath, nesting a duplicate. If your root is already decorated, pass it directly: bitfab.replay(my_function, limit=5).

Parameters:

fn (required): The function to replay. Two call forms: replay(decorated_fn) reads the trace function key from the @span decorator; replay("key", fn) takes an explicit key with any plain callable (the SDK wraps it internally). Use the explicit-key form for handler-instrumented functions with no decorated root in the app; see Replaying handler-instrumented functions below.
limit (optional): Maximum number of recent traces to replay. Default: 5; maximum: 5,000. Ignored when trace_ids or dataset_id is passed: an explicit ID list or dataset already determines how many traces replay.
trace_ids (optional): List of trace IDs to replay (max 100). The ID count determines how many traces replay; limit is ignored when both are passed.
name (optional): Display name for the resulting experiment/test run.
max_concurrency (optional): Maximum items processed in parallel. 1 for sequential, None for unlimited. Default: 10
code_change_description (optional): Rationale for the code change being tested in this replay (stored on the experiment)
code_change_files (optional): List of edited files, each as {"path": str, "before": str, "after": str} (use "" for newly created or deleted files)
experiment_group_id (optional): UUID string that groups multiple replay runs into a single experiment batch. Pass the same ID across successive replay() calls to link them together in the dashboard.
grader_ids (optional): Array of grader UUIDs (max 100) attached directly to this replay run, independent of the dataset’s own graders. The resulting experiment is graded by the union of these and the dataset’s runnable graders. Use it to grade a single run with a check you don’t want to add to the dataset permanently. Each id must be an active grader in the same organization and trace function, or the replay is rejected with a 400. A replay with no dataset can still carry graders this way.
adapt_inputs (optional): Hook to reshape recorded inputs onto the function’s current signature when its shape changed after the traces were captured. See Adapting inputs after a signature change below.
on_progress (optional): Callback fired once per item as it settles, with running totals plus the settled item payload (source trace id, local replay trace id, input, result, original output, error, duration, tokens/model metadata). Use it to render live progress or start evaluating completed items while replay runs. A raising callback never crashes the run. Bitfab plugin replay scripts can pass the SDK’s ready-made report_replay_progress callback straight in (on_progress=report_replay_progress); it writes the event to stderr, which the Bitfab plugin polls to report live progress and write per-item result files while replay runs, while stdout remains available for direct-run ReplayResult JSON.
environment (optional): A ReplayEnvironment. When passed, the Bitfab server resolves a per-trace database branch from each source trace’s captured snapshot reference, and the SDK exposes that branch’s URL via environment.database_url inside the replayed function (releasing the branch after each item). Read environment.active to fall back to your live database when no branch was resolved (e.g. the trace predates snapshot capture, or DB branching isn’t configured). Construct one with ReplayEnvironment() and read it only inside the replayed function.
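Several of these parameters are meant to be combined. As a hedged sketch, the helper below replays a few candidate functions under one experiment_group_id so they appear together in the dashboard. `replay_variants` and `RecordingClient` are hypothetical: the helper only assumes an object with a replay(fn, ...) method accepting the keyword arguments listed above, and the recording stand-in exists so the sketch runs offline. With a real client, each function would be @span-decorated.

```python
import uuid


def replay_variants(bitfab, variants, limit=5):
    """Replay each named variant under a shared experiment group ID."""
    group_id = str(uuid.uuid4())  # one UUID string shared by every run
    results = {}
    for name, fn in variants.items():
        results[name] = bitfab.replay(
            fn,
            limit=limit,
            name=name,                     # display name for the test run
            max_concurrency=1,             # sequential, easier to debug
            experiment_group_id=group_id,  # links the runs together
        )
    return group_id, results


class RecordingClient:
    """Offline stand-in: records the keyword arguments it receives."""

    def __init__(self):
        self.calls = []

    def replay(self, fn, **kwargs):
        self.calls.append(kwargs)
        return {"items": [], "test_run_id": "stub", "test_run_url": "stub"}


def baseline(text):
    return {"processed": text.upper()}


def candidate(text):
    return {"processed": text.strip().upper()}


fake = RecordingClient()
group_id, results = replay_variants(fake, {"baseline": baseline, "candidate": candidate})
```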

Returns:

{
    "items": [
        {
            "input": [...],             # The inputs passed to fn
            "result": ...,              # What fn returned
            "original_output": ...,     # What the original trace produced
            "error": None | str,        # Error message if fn raised
            "duration_ms": int | None,  # Original trace duration in ms
            "tokens": {                 # Original trace token usage, or None
                "input": int | None,
                "output": int | None,
                "cached": int | None,
                "total": int | None,
            } | None,
            "model": str | None,        # Original model name, or None
            "trace_id": str | None,     # Server trace ID for the replayed execution
            "db_snapshot_ref": dict | None,  # The source trace's snapshot pin, if any
        }
    ],
    "test_run_id": "...",
    "test_run_url": "..."
}

Replay waits for each item’s trace (spans + completion) to be persisted server-side before completing the test run, so trace_id is a real server trace ID for completed items. If NO completed item’s trace persisted (uploads wholesale failed, or the replayed function isn’t decorated with @span), replay() raises a RuntimeError instead of silently returning None trace IDs. If only SOME items’ traces are missing (a transient per-item upload failure), those items get None trace IDs with a logged error and the rest of the run is returned intact. trace_id is also None for errored (unreplayable) items, and for all items when the server predates the trace-ID mapping (a logged warning explains which). Per-item duration_ms, tokens, and model come from the historical trace that fed the item (the same numbers Studio’s experiments view shows), as the field comments in the return shape above indicate, so they describe the original run rather than the replayed one. Each field is None when it wasn’t captured.
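A result in the shape documented above can be summarized with a few list comprehensions. The helper name `summarize_replay` is hypothetical, and the `sample` dictionary is made-up data that only follows the documented keys; it is not real replay output.

```python
def summarize_replay(result):
    """Count errored items, changed outputs, and items without a trace ID."""
    items = result["items"]
    errored = [i for i in items if i["error"] is not None]
    changed = [
        i for i in items
        if i["error"] is None and i["result"] != i["original_output"]
    ]
    missing = [i for i in items if i["trace_id"] is None]
    return {
        "total": len(items),
        "errored": len(errored),
        "changed": len(changed),
        "missing_trace_id": len(missing),
    }


# Made-up data in the documented shape (subset of fields).
sample = {
    "items": [
        {"input": ["a"], "result": {"processed": "A"},
         "original_output": {"processed": "A"}, "error": None, "trace_id": "t-1"},
        {"input": ["b"], "result": {"processed": "B!"},
         "original_output": {"processed": "B"}, "error": None, "trace_id": "t-2"},
        {"input": ["c"], "result": None,
         "original_output": {"processed": "C"}, "error": "boom", "trace_id": None},
    ],
    "test_run_id": "run-1",
    "test_run_url": "https://example.invalid/run-1",
}

summary = summarize_replay(sample)
```

Errored items are excluded from the "changed" count so that a failure is not double-reported as an output difference.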

Replaying handler-instrumented functions

Workflows instrumented through a framework handler (get_langgraph_callback_handler, get_langchain_callback_handler, get_claude_agent_handler, get_openai_agent_handler) have no @span-decorated root in the application code: the handler (or run wrapper) records the framework invocation itself as the root span, with the framework’s own input (a LangGraph initial state, an agent prompt, the run input) as the recorded root input. These traces are fully replayable.

The OpenAI Agents SDK uses get_openai_agent_handler(key).wrap_run(agent, input) (a drop-in for Runner.run) for the replayable root; the bare get_openai_tracing_processor captures internals only and records an empty-input root. The Claude Agent SDK handler needs a hint: the prompt is not present in the message stream, so pass it explicitly (wrap_query(stream, input=prompt), or wrap_response(stream, input=prompt)) for the handler to record a replayable root.

Pass the handler’s trace function key explicitly, plus any plain callable that re-invokes the framework entrypoint:

# scripts/replay.py
from my_app.agent import graph              # the compiled LangGraph graph
from my_app.bitfab_client import bitfab     # same client as instrumentation

handler = bitfab.get_langgraph_callback_handler("my-agent")  # same key


def replay_my_agent(state):
    config = {"callbacks": [handler], "configurable": build_replay_config()}
    return graph.invoke(state, config=config)


result = bitfab.replay("my-agent", replay_my_agent, limit=10)

How it fits together:

replay("key", fn) fetches the handler-recorded production traces under the key and wraps fn in a span under that key internally, so each replayed invocation records a trace tied to the test run. No decorator needed; the key is the only link between the production traces and the replay callable.
When the SDK auto-wraps a plain callable this way, a recorded dict root input (e.g. a LangGraph state) is passed to fn as a single positional argument (matching the TypeScript SDK) and reported faithfully on item["input"]. Decorated functions keep the decorated-path keyword-args semantics even when a matching key is also passed.
Attaching the handler inside the callable makes the replayed graph’s node/LLM/tool spans nest under the replay span, so replay traces have the same tree as production ones.
The callable rebuilds the runtime environment the trace never captured: framework config, dependency objects, API keys. Use safe no-op substitutes for side-effectful wiring (billing or credit callbacks, notification senders); replay should never charge or notify anyone.
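The example above calls a build_replay_config() helper without showing it. One possible shape, purely as an assumption about an application's own wiring, is a factory that swaps side-effectful dependencies for recording no-ops. `NoOpBilling`, the `billing` and `notify` keys, and the `charge` method are all hypothetical names.

```python
class NoOpBilling:
    """Hypothetical billing substitute: records requests, never charges."""

    def __init__(self):
        self.calls = []

    def charge(self, user_id, credits):
        self.calls.append((user_id, credits))
        return {"charged": False}


def build_replay_config(billing=None, notifier=None):
    """Return replay-safe wiring; real dependencies must be passed in explicitly."""
    return {
        "billing": billing or NoOpBilling(),
        "notify": notifier or (lambda message: None),  # swallow notifications
    }


config = build_replay_config()
receipt = config["billing"].charge("u-1", 5)
sent = config["notify"]("order shipped")
```

Defaulting to the no-op variants means a replay script cannot charge or notify anyone unless someone deliberately passes a live dependency.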

Older SDKs (before explicit-key replay): decorate a wrapper in the replay script with the same key instead: @bitfab.span("my-agent") on def replay_my_agent(**state) (on that path the recorded dict splats into keyword args and item["input"] reports []), then call bitfab.replay(replay_my_agent, limit=10).

Mocking child spans during replay

For the workflow-level guide, see Replay Mocking. When iterating on a root function, child spans sometimes fail in your local environment for reasons unrelated to the code under test: a paid API key is missing, an external service is flaky, or a production-only DB row isn’t seeded locally. The mock keyword lets the child return its recorded output so the root function can still run. Three strategies on replay():

"marked" (default): only descendants declared with mock_on_replay=True are short-circuited; everything else runs real. This is the iteration-friendly mode.
"none": every child span runs real code.
"all": every descendant span returns its historical output. The root function still runs real, but every child is short-circuited. Useful for a quick sanity-check against recorded data; not the recommended iteration strategy because changes to descendants won’t actually execute.

Per-span opt-in via the mock_on_replay kwarg on @bitfab.span(...):

@bitfab.span("fetch-article-from-db", mock_on_replay=True)
def fetch_article_from_db(article_id: str) -> Article:
    return db.articles.find_by_id(article_id)


@bitfab.span("summarize-article")
def summarize_article(article: Article) -> Summary:
    # Real summarization, no flag -- this is what we're iterating on.
    return Summary(...)


@bitfab.span("process-article")
def process_article(article_id: str) -> Summary:
    return summarize_article(fetch_article_from_db(article_id))


# During replay, fetch-article-from-db returns its recorded output;
# summarize-article runs real so you can iterate on it.
result = bitfab.replay(process_article, limit=10)

mock_on_replay is a per-span tag at definition time — it has no effect outside replay, and it’s read by the default mock="marked" strategy. The root function always runs real code; only descendants can be mocked. When no historical span matches a child call, execution falls through to the real function — never silent omission.
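The rules for the three strategies can be condensed into one decision function. This is a reading of the text above expressed as code, not the SDK's internals; `should_mock` and its parameter names are hypothetical.

```python
def should_mock(strategy, *, is_root, marked, has_recorded_output):
    """Decide whether a span returns its recorded output during replay."""
    if is_root:
        return False  # the root function always runs real code
    if not has_recorded_output:
        return False  # no historical match: fall through to the real function
    if strategy == "all":
        return True
    if strategy == "marked":
        return marked  # only spans declared with mock_on_replay=True
    if strategy == "none":
        return False
    raise ValueError(f"unknown mock strategy: {strategy!r}")
```

For example, a child tagged mock_on_replay=True is mocked under "marked" and "all" but runs real under "none", and an untagged child is mocked only under "all".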

Injecting custom values with overrides

A mock override substitutes a value you supply for a matched span, so downstream real code runs against it — for “what if this step returned X” experiments without editing the traced code. An override is a MockOverride(match, value): match selects spans by structural metadata (node.trace_function_key, node.span_name, node.type, node.original_span_id); value is a flat value injected as-is, or a callable that returns one.

from bitfab import MockOverride

result = client.replay(
    process_article,
    mock="none",  # run everything real...
    mock_override=MockOverride(
        # ...except this span, which gets the value you supply
        match=lambda node: node.trace_function_key == "fetch-article-from-db",
        value={"id": "fixed", "title": "Fixed title"},  # flat value
    ),
)

A callable value receives a context with the live positional inputs, the live keyword kwargs (empty when the call used none), and get_original_output() (synchronous in Python) to tweak the recorded output instead of replacing it:

value=lambda ctx: {**ctx.get_original_output(), "score": 1}

Under marked/override replay the recorded output is fetched lazily on first access, so get_original_output() (and a marked span’s own recorded output) may block on a short HTTP request. Replay offloads that fetch off the event loop for async spans, so concurrent items are not stalled. A synchronous span tagged mock_on_replay (or matched by an override), when called from an async replay root, cannot offload and does the fetch on the loop thread, briefly serializing concurrent items. Make such a span async, or use mock="all" (eager, no per-span fetch), to avoid it.

Register overrides on the client to apply them to every replay (object or ordered form), and reset with clear_mock_overrides():

client.register_mock_override(
    MockOverride(match=lambda node: node.type == "llm", value={"label": "refund"})
)
# Ordered form (equivalent): client.register_mock_override(match, value)
client.clear_mock_overrides()

Precedence per span: per-call mock_override, then registered overrides, then the base mock strategy (a span no override matches falls back to it). Pass a single MockOverride or a list (first matcher wins).
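The precedence order can be sketched as a single lookup over two lists. `Node`, `Override`, `NO_OVERRIDE`, and `resolve_override` are simplified stand-ins invented for this illustration (the real MockOverride also accepts callable values, which this sketch does not evaluate).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Node:
    """Simplified stand-in for the span metadata a matcher sees."""
    trace_function_key: str
    type: str = "custom"


@dataclass
class Override:
    """Simplified stand-in for MockOverride(match, value)."""
    match: Callable[[Node], bool]
    value: Any


NO_OVERRIDE = object()  # sentinel: defer to the base mock strategy


def resolve_override(node, per_call=(), registered=()):
    # Per-call overrides are checked first, then registered ones;
    # within each list the first matching override wins.
    for override in list(per_call) + list(registered):
        if override.match(node):
            return override.value
    return NO_OVERRIDE


per_call = [
    Override(lambda n: n.trace_function_key == "fetch-article-from-db", {"id": "fixed"}),
    Override(lambda n: True, "catch-all"),
]
registered = [Override(lambda n: n.type == "llm", {"label": "refund"})]
```

A sentinel is used instead of None so that an override can legitimately inject None as its value.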

Adapting inputs after a signature change

Replay deserializes each trace’s inputs exactly as they were captured against the function’s signature at trace time, then calls the current function with them. If the signature drifted since capture (a param renamed, reordered, folded into a dict, or a new required arg added), fn(*args, **kwargs) no longer lines up and raises. The adapt_inputs hook reshapes the recorded inputs onto the current signature so replay can still run:

# Recorded as (user_id, limit); current signature is (opts: dict).
def adapt(args, kwargs, ctx):
    user_id, limit = args
    return [{"user_id": user_id, "limit": limit}], {}


result = bitfab.replay(my_function, adapt_inputs=adapt)

The hook receives the deserialized (args, kwargs) plus a per-trace ctx ({"original_trace_id", "original_span_id"}, with deprecated source_* aliases) and returns the (args, kwargs) actually passed to the function. The returned args is what item["input"] reports. It runs once per item, inside the same error boundary as the function: if it raises, that item’s error is set and the run continues, so one unmappable trace never crashes the batch.

ctx["original_trace_id"] (the original Bitfab trace ID) lets a table-driven adapter look up a per-trace transform. That’s the escape hatch for reshapes that need judgement rather than mechanical rearrangement: compute the adapted inputs per trace up front, then have the hook look them up by original_trace_id, keeping replay deterministic instead of calling a model mid-replay.

When the new signature has a genuinely new required input with no analog in the recorded trace, don’t fabricate one; there’s nothing faithful to map it to. Leave those traces unmapped (let them raise) rather than inventing test inputs.

For anything beyond a one-liner, keep the adapter in its own file next to the replay script and import it:

# scripts/replay_adapters/extraction.py
def adapt_inputs(args, kwargs, ctx):
    user_id, limit = args
    return [{"user_id": user_id, "limit": limit}], {}

# scripts/replay.py
from replay_adapters.extraction import adapt_inputs

bitfab.replay(my_function, limit=limit, adapt_inputs=adapt_inputs)

That keeps the transform versioned and reviewable alongside the function it adapts, and you add the import only when a drift actually needs it.
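
The table-driven variant described earlier might look like the sketch below. The trace IDs, the ADAPTED_INPUTS table, and the input shapes are invented for illustration; only the hook signature (args, kwargs, ctx) and the ctx["original_trace_id"] key come from the text above.

```python
import copy

# Hypothetical table, prepared ahead of time (by hand or by an offline script)
# and keyed by the original Bitfab trace ID. IDs and values are made up.
ADAPTED_INPUTS = {
    "trace-001": ([{"user_id": "u1", "limit": 5}], {}),
    "trace-002": ([{"user_id": "u2", "limit": 20}], {}),
}


def adapt_inputs(args, kwargs, ctx):
    # A missing entry raises KeyError on purpose: replay then records an error
    # for that item instead of running the function on invented inputs.
    new_args, new_kwargs = ADAPTED_INPUTS[ctx["original_trace_id"]]
    # Deep-copy so the replayed function cannot mutate the shared table.
    return copy.deepcopy(new_args), copy.deepcopy(new_kwargs)


# Calling the hook directly, the way replay would for one recorded trace.
ctx = {"original_trace_id": "trace-001", "original_span_id": "span-abc"}
print(adapt_inputs(("u1", 5), {}, ctx))
```

Because the table is built before the replay starts, two runs over the same traces see identical inputs, and the table itself can be reviewed like any other fixture.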

Attaching a Code Change

Each replay creates an experiment (test run). When you’re iterating on a function and replaying after every edit, attach the change so the dashboard can show exactly what was edited alongside the results. Read each file before editing, edit, then read it again — the two strings go straight into code_change_files. There’s no diff format to construct.

with open("src/foo.py") as f:
    before = f.read()

# ...edit src/foo.py...

with open("src/foo.py") as f:
    after = f.read()

result = bitfab.replay(
    my_function,
    code_change_description="fix off-by-one in retry logic",
    code_change_files=[{"path": "src/foo.py", "before": before, "after": after}],
)
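
When an edit touches several files, the read-edit-read pattern can be wrapped in a small helper. The sketch below is one possible shape, not part of the SDK: capture_code_change is a hypothetical name, and skipping untouched files is a choice made here rather than a documented requirement.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def capture_code_change(*paths):
    """Hypothetical helper: snapshot files before the block and again after it."""
    before = {p: Path(p).read_text() for p in paths}
    changes = []
    yield changes
    for p in paths:
        after = Path(p).read_text()
        if after != before[p]:  # leave out files the edit did not touch
            changes.append({"path": str(p), "before": before[p], "after": after})


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "foo.py"
    target.write_text("retries = range(1, n)\n")

    with capture_code_change(target) as code_change_files:
        target.write_text("retries = range(1, n + 1)\n")  # the "edit"

    print(code_change_files[0]["before"], end="")
    print(code_change_files[0]["after"], end="")
    # The list has the documented shape and could then be passed as
    # bitfab.replay(my_function, code_change_files=code_change_files).
```

The list is filled in only when the with block exits cleanly, so an edit that raises leaves it empty.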

Both options are optional and independent: you can pass just code_change_description for a quick rationale-only annotation, or just code_change_files to record the literal edits. If you pass neither, replay() falls back to capturing your working-tree diff against the trunk merge-base (best-effort, only inside a git repo), so an experiment still shows a diff. Passing code_change_files explicitly always wins and is the way to record a precise per-edit before/after. Set BITFAB_DISABLE_CODE_CHANGE_CAPTURE to turn the fallback off.

Notes:

In the replay(fn) form the function must be decorated with @span — the trace function key is read from the decorator. When the production code has a decorated root, pass the decorated function itself, not an undecorated wrapper around it; the @span attribute is what identifies the trace key. An undecorated wrapper has no key, so replay() wraps it as the root and the inner decorated function then records its own span underneath, nesting a duplicate. For nested decorators (e.g. @retry(@cache(@span(fn)))), pass the outermost — replay walks the __wrapped__ chain to find @span. For handler-instrumented functions with no decorated root, use the explicit-key form replay("key", fn) with any plain callable (see Replaying handler-instrumented functions above). Passing an explicit key that contradicts the decorator’s key raises.
For decorated methods on classes, pass the unbound function on the class (MyClass.method) to replay traces for all instances, or a bound method on a specific instance (instance.method) to replay through that instance’s state. Both resolve to the same trace function key.
Use a single Bitfab client across instrumentation and replay. If your instrumented module constructs Bitfab() at import and your replay script constructs another, they do not share registered trace functions — import the client from the instrumented module (or a shared singleton) rather than constructing a new one in the replay script.
The function can be sync or async (async functions are detected and run automatically)
If the function raises an error for one input, replay continues with the remaining inputs
Each replay creates a test run visible in the Bitfab dashboard
Works through nested decorators (e.g. @retry, @cache) — walks the __wrapped__ chain to find @span
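
To see why passing the outermost decorated function works, here is a toy model of a __wrapped__ walk. It is a sketch under stated assumptions: toy_span, _TOY_KEYS, and find_trace_key are hypothetical stand-ins, and how the real @span stores its key is not shown in this document. The one Python fact it relies on is that functools.wraps sets __wrapped__ on the wrapper.

```python
import functools

_TOY_KEYS = {}  # function object -> trace key (stand-in for whatever @span records)


def toy_span(key):
    """Hypothetical stand-in for @span: remembers a trace key for the wrapper."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        _TOY_KEYS[wrapper] = key
        return wrapper
    return decorate


def retry(fn):
    @functools.wraps(fn)  # sets wrapper.__wrapped__ = fn
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def careless(fn):
    # No functools.wraps, so there is no __wrapped__ link to follow.
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def find_trace_key(fn):
    # Walk from the outermost wrapper inward until a keyed function is found.
    seen = set()
    while fn is not None and id(fn) not in seen:
        seen.add(id(fn))
        if fn in _TOY_KEYS:
            return _TOY_KEYS[fn]
        fn = getattr(fn, "__wrapped__", None)
    return None


@retry
@toy_span("my-extraction-pipeline")
def extract(text):
    return text.upper()


print(find_trace_key(extract))            # key found through the retry wrapper
print(find_trace_key(careless(extract)))  # chain broken: nothing found
```

In this model a decorator that skips functools.wraps hides the key; whether the SDK behaves the same way in that case is an assumption, so custom decorators are safest written with functools.wraps.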

Replay Output Contract

Replay results are typically consumed by automation (CI logs, code reviewers, and coding agents). When BITFAB_REPLAY_RESULT_PATH is set, bitfab.replay() automatically writes the full ReplayResult JSON to that file. For direct/manual runs, emit the full ReplayResult as a single stdout JSON block so a consumer can json.loads it and reason about every field, including the new per-item duration_ms, tokens, and model. Never print only lengths, counts, hashes, or truncated previews, and never replace the JSON block with ad-hoc per-field log lines. Recommended script tail:

import json
import sys

result = bitfab.replay(my_function, limit=limit)

# Human-readable summary goes to stderr, so stdout stays pure JSON.
print(f"Test run: {result['test_run_url']}", file=sys.stderr)
print(f"Items:    {len(result['items'])}", file=sys.stderr)

# Then: full structured dump, ready for json.loads.
print(json.dumps(result, indent=2, default=str))

The dumped object includes every item’s input, result, original_output, error, duration_ms, tokens, model, and trace_id, plus test_run_id and test_run_url. When the Bitfab plugin runs this script, it sets BITFAB_REPLAY_RESULT_PATH; the SDK writes the final result there, and the plugin reads that file into the replay run’s .bitfab/replays/<run-id>/events.jsonl while writing large per-item payloads under .bitfab/replays/<run-id>/items/.

Per-item errors are part of the contract. If the wrapped function raises on a given trace, bitfab.replay catches it, sets item['error'], leaves item['result'] as None, and continues. Treat items with item['error'] set as unreplayable, not as failing outputs: compute pass/fail only over items where it’s None. This matters most for DB reads/writes: a stale FK, missing record, or rejected write is infra failure, not a regression.

Don’t swallow per-item errors in the script. A custom try/except that returns a placeholder turns infra failures into fake successes. Let the SDK record them. The only allowed top-level except is a fatal handler around main() that exits non-zero, so callers can tell a whole-replay crash from a clean run with some unreplayable items.

Environment. Replay executes in the app’s own process: the instrumented function is imported as a library, and its DB clients, env vars, config loaders, and model IDs resolve from whatever environment the replay script is run under. The script must bootstrap the same environment the app uses (e.g. load_dotenv() at the top, or run via dotenv run -- python scripts/replay.py). Do not mock these; they’re the same dependencies the app resolves in production. For replay to see the same DB rows the trace was captured against, point the script at the trace’s source environment (the environment field on the trace: production / staging / development).

Input serialization caveat. Replay deserializes historical span inputs and passes them back to your function. This works for strings, numbers, and plain dicts. If your span wraps a function that takes hydrated domain objects (ORM models, class instances, DB records), they won’t round-trip through serialization. Move the span to where inputs are IDs or plain data and let the function fetch objects internally, or reshape arguments in the wrapper.
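
On the consuming side, the per-item error rule could be applied as in the sketch below. It assumes the ReplayResult JSON has already been captured from stdout (or read from the result file); the payload is made up, summarize is a hypothetical helper, and "output unchanged from the original" is only a placeholder pass criterion.

```python
import json


def summarize(result: dict) -> dict:
    """Split items into unreplayable vs. replayed, then score only the replayed ones."""
    unreplayable = [item for item in result["items"] if item.get("error")]
    replayed = [item for item in result["items"] if not item.get("error")]
    # Placeholder criterion for this sketch: the new output equals the original.
    passed = [item for item in replayed if item["result"] == item["original_output"]]
    return {
        "replayed": len(replayed),
        "passed": len(passed),
        "failed": len(replayed) - len(passed),
        "unreplayable": len(unreplayable),
    }


# Made-up payload in the documented shape (only the fields used here).
stdout_json = json.dumps({
    "test_run_url": "https://example.invalid/run",
    "items": [
        {"result": "a", "original_output": "a", "error": None},
        {"result": "b", "original_output": "c", "error": None},
        {"result": None, "original_output": "d", "error": "missing record"},
    ],
})

print(summarize(json.loads(stdout_json)))
```

Keeping unreplayable items in their own bucket means a missing database row shows up as an infrastructure problem rather than lowering the pass rate.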

Replay Script

Create a standalone script to regression-test your trace functions against production data with one command. The script maps pipeline names to their replay functions, accepts CLI flags, and prints a side-by-side comparison with delta summaries.

#!/usr/bin/env python3
"""
Replay production traces through instrumented functions.

Uses bitfab.replay() to fetch real traces and re-run them through
the current code, creating a test run for side-by-side comparison.

Usage:
    python scripts/replay.py <pipeline>
    python scripts/replay.py <pipeline> --limit 20
    python scripts/replay.py <pipeline> --trace-ids id1,id2
"""
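import argparse
import json
import sys

from dotenv import load_dotenv

# Load .env before importing app modules: they may read environment variables
# (e.g. BITFAB_API_KEY) at import time.
load_dotenv()

from bitfab import report_replay_progress  # noqa: E402
from lib.bitfab_client import bitfab  # noqa: E402
from services.extraction import extract_memories  # noqa: E402
from services.search import search_documents  # noqa: E402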

FUNCTIONS = {
    "extraction": "my-extraction-pipeline",
    "search": "my-search-pipeline",
}


# Each pipeline gets its own replay function. If a signature drifts after traces
# were captured, add an input adapter (see the "Adapting inputs" section): write
# scripts/replay_adapters/<name>.py with an `adapt_inputs(args, kwargs, ctx)`
# function, import it, and pass it here, e.g.
#   from replay_adapters.extraction import adapt_inputs
#   return bitfab.replay(fn, limit=limit, trace_ids=trace_ids, adapt_inputs=adapt_inputs)

# limit is ignored when trace_ids is passed -- an explicit ID list already
# determines how many traces replay.

def replay_extraction(limit: int, trace_ids: list[str] | None):
    def fn(conversation: str, existing_items: list):
        return extract_memories(conversation, existing_items)
    if trace_ids:
        return bitfab.replay(fn, trace_ids=trace_ids, on_progress=report_replay_progress)
    return bitfab.replay(fn, limit=limit, on_progress=report_replay_progress)


def replay_search(limit: int, trace_ids: list[str] | None):
    def fn(query: str, opts: dict):
        return search_documents(query, user_id=opts["user_id"], limit=opts.get("limit", 10))
    if trace_ids:
        return bitfab.replay(fn, trace_ids=trace_ids, on_progress=report_replay_progress)
    return bitfab.replay(fn, limit=limit, on_progress=report_replay_progress)


REPLAY_FNS = {
    "extraction": replay_extraction,
    "search": replay_search,
}


def main():
    parser = argparse.ArgumentParser(description="Replay production traces")
    parser.add_argument("pipeline", choices=FUNCTIONS.keys())
    parser.add_argument("--limit", type=int, default=10)
    parser.add_argument("--trace-ids", type=str)
    args = parser.parse_args()

    trace_ids = [tid.strip() for tid in args.trace_ids.split(",")] if args.trace_ids else None
    function_key = FUNCTIONS[args.pipeline]

    # Human-readable output goes to stderr; stdout carries only the ReplayResult JSON.
    print(f"[replay] Replaying {len(trace_ids) if trace_ids else args.limit} traces from \"{function_key}\"...\n", file=sys.stderr)

    result = REPLAY_FNS[args.pipeline](args.limit, trace_ids)
    print(f"Test run: {result['test_run_url']}\n", file=sys.stderr)

    changed = same = errors = 0
    for item in result["items"]:
        raw_input = item.get("input") or []
        label = str(raw_input[0])[:80] if raw_input else "unknown"

        if item["error"]:
            print(f'  ✗ "{label}"', file=sys.stderr)
            print(f"    Error: {item['error']}", file=sys.stderr)
            errors += 1
        else:
            orig = item["original_output"]
            new = item["result"]
            orig_str = orig if isinstance(orig, str) else json.dumps(orig, default=str)
            new_str = new if isinstance(new, str) else json.dumps(new, default=str)
            is_same = orig_str == new_str
            marker = "=" if is_same else "Δ"

            print(f'  {marker} "{label}"', file=sys.stderr)
            print(f"    Original: {orig_str}", file=sys.stderr)
            print(f"    New:      {new_str}", file=sys.stderr)

            if is_same:
                same += 1
            else:
                changed += 1

    print("\n─── Summary ───", file=sys.stderr)
    print(f"  Pipeline: {args.pipeline}", file=sys.stderr)
    print(f"  Replayed: {len(result['items'])}", file=sys.stderr)
    print(f"  Same:     {same}", file=sys.stderr)
    print(f"  Changed:  {changed}", file=sys.stderr)
    if errors > 0:
        print(f"  Errors:   {errors}", file=sys.stderr)
    print(f"\n  {result['test_run_url']}", file=sys.stderr)

    # Full ReplayResult as JSON to stdout (the Replay Output Contract).
    print(json.dumps(result, indent=2, default=str))


if __name__ == "__main__":
    main()

Adapt the imports, pipeline names, and per-pipeline replay functions to match your project’s instrumented workflows.

Advanced Configuration

Bitfab(
    api_key: str,                    # Required
    service_url: str | None = None,  # Default: https://bitfab.ai
    env_vars: dict[str, str] | None = None,  # For local function execution
    enabled: bool = True,            # Enable/disable tracing
    baml_client: Any = None          # Generated BAML client (for wrap_baml)
)

env_vars: Pass LLM provider API keys for local execution (e.g., {"OPENAI_API_KEY": "..."})
enabled: When False, all tracing is disabled. Decorated functions still execute normally but no spans are sent.
baml_client: The generated BAML client instance (e.g., b from baml_client). See BAML framework guide for full usage.
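
One way to assemble these options from the environment is sketched below. The constructor parameters (api_key, enabled, env_vars) are the ones listed above; bitfab_kwargs and the BITFAB_TRACING variable are hypothetical and exist only in this example.

```python
from typing import Mapping


def bitfab_kwargs(env: Mapping[str, str]) -> dict:
    """Build constructor keyword arguments from an environment mapping."""
    kwargs = {
        "api_key": env["BITFAB_API_KEY"],
        # BITFAB_TRACING is a made-up switch for this sketch, not an SDK setting.
        "enabled": env.get("BITFAB_TRACING", "1") != "0",
    }
    if "OPENAI_API_KEY" in env:
        # Forward the provider key for local function execution.
        kwargs["env_vars"] = {"OPENAI_API_KEY": env["OPENAI_API_KEY"]}
    return kwargs


# With the SDK installed, this would be used as:
#   import os
#   from bitfab import Bitfab
#   bitfab = Bitfab(**bitfab_kwargs(os.environ))
print(bitfab_kwargs({"BITFAB_API_KEY": "example-key", "BITFAB_TRACING": "0"}))
```

Building the arguments in a plain function keeps the on/off decision testable without constructing a client.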
