The Bitfab Python SDK captures your AI function calls to automatically generate evaluations. Re-run your prompts with different models, parameters, and inputs to iterate faster.

Installation

# pip
pip install bitfab-py

# Poetry
poetry add bitfab-py

# uv
uv add bitfab-py

Quick Start

import os
from bitfab import Bitfab

bitfab = Bitfab(api_key=os.environ["BITFAB_API_KEY"])
Need an API key? Get one from the Bitfab dashboard or see the API Keys guide for detailed setup instructions.
Copy this prompt into your coding agent (tested with Cursor and Claude Code using Sonnet 4.5):
Modify existing Python code to add Bitfab tracing.
Do NOT browse or web search. Use ONLY the API described below.

Bitfab Python SDK (authoritative excerpt):
- Install: `pip install bitfab-py` or `poetry add bitfab-py` or `uv add bitfab-py`
- Init:
  import os
  from bitfab import Bitfab
  bitfab = Bitfab(api_key=os.environ["BITFAB_API_KEY"])
- Instrumentation (ONLY allowed form - use get_function):
  # Declare trace function key once
  my_service = bitfab.get_function("<trace_function_key>")

  # Decorate methods with span
  @my_service.span()
  def method_name(): ...

  # Or with options:
  @my_service.span(name="DisplayName", type="function")
  def method_name(): ...

  # Span types: "llm", "agent", "function", "guardrail", "handoff", "custom"
- Decorator form ONLY; must be placed immediately ABOVE the `def` it instruments.
- DO NOT use context managers or manual span creation.
- DO NOT extract helper methods.

Task:
1) Ensure bitfab-py is installed and initialization exists.
2) Read the codebase and identify ALL AI workflows (LLM calls, agent runs, AI-driven decisions).
3) Present me with a numbered list of workflows you found. For each, describe:
   - What it does
   - Why it's worth instrumenting — what visibility tracing gives you into each step
4) After I choose which workflow(s) to instrument:
   - Create a function wrapper with `bitfab.get_function("<trace_function_key>")`
   - Add `@my_service.span()` directly ABOVE each method's `def`
   - Instrument intermediate steps (not just the final output) so each trace has enough context to diagnose issues
   - Ensure the bitfab client is initialized and accessible
5) Do not change method signature, behavior, or return value. Minimal diff.

Output:
- First: your numbered list of workflows with why each is worth instrumenting
- After my selection: minimal diffs for dependencies, initialization, and the method changes

Basic Configuration

Bitfab(api_key: str)

# Disable tracing (functions still execute, but no spans are sent)
Bitfab(api_key: str, enabled: bool = True)
Missing API key doesn’t crash. If the API key is missing, empty, or whitespace-only, the SDK automatically disables tracing and logs a warning. All decorated functions still execute normally — no spans are sent, no errors are thrown. You don’t need any conditional logic around the API key.
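The rule above describes what the SDK does once it receives a key. In plain Python, `os.environ["BITFAB_API_KEY"]` raises `KeyError` when the variable is unset, before the SDK is ever called. A minimal sketch of reading the key defensively follows. `read_api_key` and `looks_usable` are hypothetical helpers, not part of the SDK, and `looks_usable` only mirrors the missing/empty/whitespace-only rule as described here.

```python
import os


def read_api_key(env=None) -> str:
    """Return the configured key, or "" when the variable is unset (hypothetical helper)."""
    env = os.environ if env is None else env
    # .get() avoids the KeyError that env["BITFAB_API_KEY"] raises when unset.
    return env.get("BITFAB_API_KEY", "")


def looks_usable(api_key) -> bool:
    """Mirror the documented rule: missing, empty, or whitespace-only keys disable tracing."""
    return bool(api_key and api_key.strip())


# Passing read_api_key() as api_key hands the SDK an empty string when the
# variable is unset, which the text above says disables tracing with a warning.
```

With this shape, the same startup code runs in environments with and without a key.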

Tracing

Declare the trace function key once and link multiple spans together:
order_service = bitfab.get_function("order-processing")

@order_service.span()
def process_order(order_id: str) -> dict:
    return {"order_id": order_id}

@order_service.span()
def validate_order(order_id: str) -> dict:
    return {"valid": True}

Multi-File Projects

For projects with instrumented functions spread across multiple files, create a dedicated file that initializes Bitfab and exports the function. Import it wherever you need to instrument.
# lib/bitfab_client.py: single source of truth
import os
from bitfab import Bitfab

bitfab = Bitfab(api_key=os.environ["BITFAB_API_KEY"])
order_service = bitfab.get_function("order-processing")

# services/process_order.py
from lib.bitfab_client import order_service

@order_service.span()
def process_order(order_id: str) -> dict:
    return {"order_id": order_id}

# services/validate_order.py
from lib.bitfab_client import order_service

@order_service.span()
def validate_order(order_id: str) -> dict:
    return {"valid": True}
Spans from different files are automatically linked as parent-child when one decorated function calls another.

Using @bitfab.span() Directly

For a single span without linking to a function group:
@bitfab.span("one-off-operation")
def standalone_task() -> str:
    return "done"

Automatic Nesting

Spans nest automatically based on call stack:
@bitfab.span("outer", type="agent")
def outer():
    inner()  # Becomes a child of "outer"

@bitfab.span("inner", type="function")
def inner():
    pass

Span Options

Parameters:
  • trace_function_key (required): String identifier for grouping spans
  • name (optional): Display name. Defaults to the function name, falling back to the trace function key
  • type (optional): Span type. Defaults to "custom"
Span Types:
SpanType = Literal[
    "llm",        # LLM calls
    "agent",      # Agent workflows
    "function",   # Function calls
    "guardrail",  # Safety checks
    "handoff",    # Human handoffs
    "custom"      # Default
]
Examples:
# Function name is automatically captured as span name
@bitfab.span("order-processing")
def process_order(order_id: str) -> dict:
    return {"order_id": order_id}
# Span name: "process_order"

# Override with name option
@bitfab.span("order-processing", name="OrderProcessor")
def process_order(order_id: str) -> dict:
    return {"order_id": order_id}
# Span name: "OrderProcessor"

# Set span type
@bitfab.span("safety-check", type="guardrail")
def check_content(content: str) -> dict:
    return {"safe": True}

Span Context

Use get_current_span() to get a handle to the active span, then call .add_context() to attach contextual key-value pairs from inside a traced function — useful for runtime values like request IDs, computed scores, or dynamic context:
from bitfab import get_current_span

@bitfab.span("order-processing", type="function")
def process_order(order_id: str) -> dict:
    user_id = get_current_user()
    get_current_span().add_context({"user_id": user_id, "order_id": order_id})
    return {"order_id": order_id, "status": "completed"}
Each add_context call pushes the entire dictionary as one entry. Multiple calls accumulate entries:
get_current_span().add_context({"user_id": "u-123"})
get_current_span().add_context({"request_id": "req-789"})
# Result: contexts: [{"user_id": "u-123"}, {"request_id": "req-789"}]

Span Prompt

Use get_current_span().set_prompt() to set the prompt string on the current span. The value is stored in span_data.prompt and is useful for capturing the exact prompt text sent to an LLM:
from bitfab import get_current_span

@bitfab.span("classification", type="llm")
def classify_text(text: str) -> str:
    prompt = f"Classify the following text: {text}"
    get_current_span().set_prompt(prompt)
    result = llm.complete(prompt)
    return result
The last set_prompt call wins — it overwrites any previously set prompt on the span. Calling set_prompt outside a span context is a no-op (it never crashes).
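
The two rules above (last write wins, and a silent no-op outside a span) can be modeled with a null-object pattern. This is a toy sketch under that assumption, not the SDK's internals; `ToySpan`, `NoOpSpan`, and `toy_get_current_span` are hypothetical names.

```python
import contextvars


class ToySpan:
    def __init__(self):
        self.prompt = None

    def set_prompt(self, prompt):
        self.prompt = prompt  # overwrite: the last call wins


class NoOpSpan:
    """Returned when no span is active; accepts the same call and discards it."""
    def set_prompt(self, prompt):
        pass


_active = contextvars.ContextVar("active_span", default=None)


def toy_get_current_span():
    span = _active.get()
    return span if span is not None else NoOpSpan()


# Outside any span: the call is accepted and ignored, with no exception.
toy_get_current_span().set_prompt("ignored")

# Inside a span: two set_prompt calls leave only the second value.
span = ToySpan()
token = _active.set(span)
try:
    toy_get_current_span().set_prompt("draft prompt")
    toy_get_current_span().set_prompt("final prompt")
finally:
    _active.reset(token)

print(span.prompt)  # final prompt
```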

Supported Frameworks

Bitfab provides automatic tracing for popular AI frameworks. See the dedicated guides for full API references:

  • LangGraph / LangChain: Callback handler for graph nodes, LLM calls, and tools
  • OpenAI Agents SDK: Trace processor for agent runs
  • BAML: Auto-capture prompts and LLM metadata
  • Claude Agent SDK: Capture LLM turns, tool calls, and subagents

Trace Context

Use get_current_trace() to set context that applies to the entire trace (all spans within a single execution). This is useful for grouping traces by session or attaching trace-level metadata:
from bitfab import get_current_trace

@bitfab.span("order-processing", type="function")
def process_order(order_id: str) -> dict:
    trace = get_current_trace()

    # Set session ID (stored as database column, filterable in dashboard)
    trace.set_session_id("session-123")

    # Set trace metadata (stored in raw trace data)
    trace.set_metadata({"region": "us-west-2", "environment": "production"})

    # Add context entries (stored as key-value pairs, accumulates across calls)
    trace.add_context({"workflow": "checkout-flow", "batch_id": "batch-2024-01"})

    return {"order_id": order_id, "status": "completed"}
  • set_session_id(id) — Groups traces by user session. Stored as a database column for efficient filtering.
  • set_metadata(dict) — Arbitrary key-value metadata on the trace. Merges with existing metadata.
  • add_context(dict) — Key-value context entries. Accumulates across multiple calls.
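
The difference between merging metadata and accumulating context can be seen in a toy model. `ToyTrace` below is hypothetical and only mirrors the semantics listed above. In particular, later metadata keys overwriting earlier ones is an assumption about what "merges" means here.

```python
class ToyTrace:
    def __init__(self):
        self.session_id = None
        self.metadata = {}
        self.contexts = []

    def set_session_id(self, session_id):
        self.session_id = session_id  # single value per trace

    def set_metadata(self, metadata):
        # Merge into one dict (assumed: later keys overwrite earlier ones).
        self.metadata.update(metadata)

    def add_context(self, entry):
        # Accumulate: each call appends one entry.
        self.contexts.append(dict(entry))


trace = ToyTrace()
trace.set_metadata({"region": "us-west-2"})
trace.set_metadata({"environment": "production"})
trace.add_context({"workflow": "checkout-flow"})
trace.add_context({"batch_id": "batch-2024-01"})

print(trace.metadata)  # {'region': 'us-west-2', 'environment': 'production'}
print(trace.contexts)  # [{'workflow': 'checkout-flow'}, {'batch_id': 'batch-2024-01'}]
```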

Error Handling

Errors are captured in the span and re-raised:
@bitfab.span("risky-service")
def risky():
    raise ValueError("error")

try:
    risky()
except ValueError:
    pass
# Span records error and timing
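
The capture-and-re-raise behavior can be sketched with a plain decorator. This is an illustration of the pattern only, not the SDK's code; `toy_error_span` and `spans` are hypothetical names, and the recorded error format is made up for the example.

```python
import functools
import time

spans = []  # one record per finished call


def toy_error_span(name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {"name": name, "error": None}
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                record["error"] = f"{type(exc).__name__}: {exc}"
                raise  # re-raise so callers still see the original exception
            finally:
                # Timing is recorded on both the success and the error path.
                record["duration_ms"] = (time.perf_counter() - start) * 1000
                spans.append(record)
        return wrapper
    return decorator


@toy_error_span("risky-service")
def risky():
    raise ValueError("error")


caught = False
try:
    risky()
except ValueError:
    caught = True

print(spans[0]["error"])  # ValueError: error
```

Because the exception is re-raised unchanged, adding the decorator does not alter the control flow of existing error handling.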

Flushing Traces

from bitfab import flush_traces

flush_traces(timeout=30.0)  # Default: 30s
Traces flush automatically on process exit via an atexit hook.

Replay

Replay historical traces through a function and create a test run with comparison data. This is useful for testing changes to your functions against real production inputs.
@bitfab.span("my-function-key")
def my_function(text: str) -> dict:
    return {"processed": text.upper()}

result = bitfab.replay(my_function, limit=5)

# Or replay specific traces by ID
result = bitfab.replay(my_function, trace_ids=["trace-abc", "trace-def"])

print(f"Test Run: {result['test_run_url']}")
for item in result["items"]:
    print(f"  Input: {item['input']}")
    print(f"  Result: {item['result']}")
    print(f"  Original: {item['original_output']}")
    print(f"  Duration (ms): {item['duration_ms']}")
    print(f"  Tokens: {item['tokens']}")       # {"input", "output", "cached", "total"} or None
    print(f"  Model: {item['model']}")
Parameters:
  • fn (required): The function to replay (must be decorated with @span)
  • limit (optional): Maximum number of traces to replay. Default: 5
  • trace_ids (optional): List of trace IDs to filter which traces are replayed
  • max_concurrency (optional): Maximum items processed in parallel. 1 for sequential, None for unlimited. Default: 10
  • code_change_description (optional): Rationale for the code change being tested in this replay (stored on the experiment)
  • code_change_files (optional): List of edited files, each as {"path": str, "before": str, "after": str} (use "" for newly created or deleted files)
Returns:
{
    "items": [
        {
            "input": [...],             # The inputs passed to fn
            "result": ...,              # What fn returned
            "original_output": ...,     # What the original trace produced
            "error": None | str,        # Error message if fn raised
            "duration_ms": int | None,  # Original trace duration in ms
            "tokens": {                 # Original trace token usage, or None
                "input": int | None,
                "output": int | None,
                "cached": int | None,
                "total": int | None,
            } | None,
            "model": str | None,        # Original model name, or None
        }
    ],
    "test_run_id": "...",
    "test_run_url": "..."
}
Per-item duration_ms, tokens, and model come from the historical trace that fed the item. Use them to reason about the cost and latency of the old runs. Each field is None when the underlying trace didn’t capture it.
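
As an illustration of reasoning about historical cost and latency, the sketch below aggregates those fields over a result dict in the documented shape while skipping None values. `summarize_history` is a hypothetical helper, and the sample data, model name, and URL are invented for the example.

```python
from statistics import mean


def summarize_history(result: dict) -> dict:
    """Aggregate historical duration, token, and model fields, ignoring None."""
    items = result["items"]
    durations = [i["duration_ms"] for i in items if i["duration_ms"] is not None]
    totals = [
        i["tokens"]["total"]
        for i in items
        if i["tokens"] is not None and i["tokens"]["total"] is not None
    ]
    models = sorted({i["model"] for i in items if i["model"] is not None})
    return {
        "mean_duration_ms": mean(durations) if durations else None,
        "total_tokens": sum(totals) if totals else None,
        "models": models,
    }


sample = {  # hypothetical data in the documented shape
    "items": [
        {"input": ["a"], "result": 1, "original_output": 1, "error": None,
         "duration_ms": 120, "model": "model-a",
         "tokens": {"input": 10, "output": 5, "cached": 0, "total": 15}},
        {"input": ["b"], "result": 2, "original_output": 2, "error": None,
         "duration_ms": None, "model": None, "tokens": None},
        {"input": ["c"], "result": 3, "original_output": 3, "error": None,
         "duration_ms": 80, "model": "model-a",
         "tokens": {"input": 20, "output": None, "cached": None, "total": 25}},
    ],
    "test_run_id": "run-1",
    "test_run_url": "https://example.invalid/run-1",
}

summary = summarize_history(sample)
print(summary["total_tokens"], summary["models"])  # 40 ['model-a']
```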

Mocking child spans during replay

When iterating on a root function, child spans sometimes fail in your local environment for reasons unrelated to the code under test: a paid API key is missing, an external service is flaky, or a production-only DB row isn’t seeded locally. The mock keyword on replay() lets a child return its recorded output so the root function can still run. It accepts three strategies:
  • "none" (default): every child span runs real code.
  • "all": every descendant span returns its historical output. The root function still runs real, but every child is short-circuited. Useful for a quick sanity-check against recorded data; not the recommended iteration strategy because changes to descendants won’t actually execute.
  • "marked": only descendants declared with mock_on_replay=True are short-circuited; everything else runs real. This is the iteration-friendly mode.
Per-span opt-in via the mock_on_replay kwarg on @bitfab.span(...):
@bitfab.span("fetch-article-from-db", mock_on_replay=True)
def fetch_article_from_db(article_id: str) -> Article:
    return db.articles.find_by_id(article_id)


@bitfab.span("summarize-article")
def summarize_article(article: Article) -> Summary:
    # Real summarization, no flag — this is what we're iterating on.
    return Summary(...)


@bitfab.span("process-article")
def process_article(article_id: str) -> Summary:
    return summarize_article(fetch_article_from_db(article_id))


# During replay, fetch-article-from-db returns its recorded output;
# summarize-article runs real so you can iterate on it.
result = bitfab.replay(process_article, limit=10, mock="marked")
mock_on_replay is a per-span tag at definition time — it has no effect outside replay, and it’s only read under mock="marked". The root function always runs real code; only descendants can be mocked. When no historical span matches a child call, execution falls through to the real function — never silent omission.
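
The rules above reduce to a small decision function. The sketch below restates them as code for clarity. `should_mock` and its parameters are hypothetical and do not describe how the SDK is structured internally.

```python
def should_mock(strategy: str, *, is_root: bool, marked: bool, has_recording: bool) -> bool:
    """Return True when a span call should return its recorded output during replay."""
    if strategy not in ("none", "all", "marked"):
        raise ValueError(f"unknown mock strategy: {strategy!r}")
    # The root always runs real code, and a child without a matching
    # historical span falls through to the real function.
    if is_root or not has_recording:
        return False
    if strategy == "all":
        return True
    if strategy == "marked":
        return marked  # only spans declared with mock_on_replay=True
    return False  # "none": everything runs real


# A marked child with a recording is mocked under "marked" but not under "none".
print(should_mock("marked", is_root=False, marked=True, has_recording=True))  # True
print(should_mock("none", is_root=False, marked=True, has_recording=True))    # False
```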

Attaching a Code Change

Each replay creates an experiment (test run). When you’re iterating on a function and replaying after every edit, attach the change so the dashboard can show exactly what was edited alongside the results. Read each file before editing, edit, then read it again — the two strings go straight into code_change_files. There’s no diff format to construct.
with open("src/foo.py") as f:
    before = f.read()

# ...edit src/foo.py...

with open("src/foo.py") as f:
    after = f.read()

result = bitfab.replay(
    my_function,
    code_change_description="fix off-by-one in retry logic",
    code_change_files=[{"path": "src/foo.py", "before": before, "after": after}],
)
Both options are optional and independent: you can pass just code_change_description for a quick rationale-only annotation, or just code_change_files to record the literal edits.
Notes:
  • The function must be decorated with @span; the trace function key is read from the decorator. Pass the decorated function itself, not an undecorated wrapper around it, because the @span attribute is what identifies the trace key.
  • Nested decorators are supported (e.g. @retry and @cache stacked on top of @span): pass the outermost function, and replay walks the __wrapped__ chain to find @span.
  • For decorated methods on classes, pass the unbound function on the class (MyClass.method) to replay traces for all instances, or a bound method on a specific instance (instance.method) to replay through that instance’s state. Both resolve to the same trace function key.
  • Use a single Bitfab client across instrumentation and replay. If your instrumented module constructs Bitfab() at import and your replay script constructs another, they do not share registered trace functions. Import the client from the instrumented module (or a shared singleton) rather than constructing a new one in the replay script.
  • The function can be sync or async (async functions are detected and run automatically).
  • If the function raises an error for one input, replay continues with the remaining inputs.
  • Each replay creates a test run visible in the Bitfab dashboard.
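
When an edit touches several files, the before/after reads described in this section can be wrapped in two small helpers. This is a sketch under stated assumptions: `snapshot` and `build_code_change_files` are hypothetical helpers, and dropping unchanged files from the list is a design choice of this example rather than an SDK requirement.

```python
import tempfile
from pathlib import Path


def snapshot(paths) -> dict:
    """Read each file's text; a file that does not exist is recorded as ""."""
    return {str(p): (Path(p).read_text() if Path(p).exists() else "") for p in paths}


def build_code_change_files(before: dict, after: dict) -> list:
    """Build {"path", "before", "after"} entries for files whose content changed."""
    return [
        {"path": path, "before": before[path], "after": after[path]}
        for path in before
        if before[path] != after[path]
    ]


with tempfile.TemporaryDirectory() as tmp:
    edited = Path(tmp) / "foo.py"
    created = Path(tmp) / "new.py"
    untouched = Path(tmp) / "same.py"
    edited.write_text("x = 1\n")
    untouched.write_text("y = 2\n")
    paths = [edited, created, untouched]

    before = snapshot(paths)        # read before editing
    edited.write_text("x = 2\n")    # edit an existing file
    created.write_text("z = 3\n")   # create a new file
    after = snapshot(paths)         # read again after editing

    changes = build_code_change_files(before, after)

print(len(changes))  # 2
```

The resulting list has the entry shape given for code_change_files, with "" as the before text of the newly created file.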

Replay Output Contract

Replay results are typically consumed by automation (CI logs, code reviewers, and coding agents reading stdout). Emit the full ReplayResult as a single JSON block so a consumer can json.loads it and reason about every field, including the new per-item duration_ms, tokens, and model. Never print only lengths, counts, hashes, or truncated previews, and never replace the JSON block with ad-hoc per-field log lines. Recommended script tail:
result = bitfab.replay(my_function, limit=limit)

# Optional: human-readable summary first.
print(f"Test run: {result['test_run_url']}")
print(f"Items:    {len(result['items'])}")

# Then: full structured dump, ready for json.loads.
print(json.dumps(result, indent=2, default=str))
The dumped object includes every item’s input, result, original_output, error, duration_ms, tokens, and model, plus test_run_id and test_run_url. Writing the same JSON to scripts/replay-result.json in parallel is optional but useful for later analysis.

Per-item errors are part of the contract. If the wrapped function raises on a given trace, bitfab.replay catches it, sets item['error'], leaves item['result'] as None, and continues. Treat items with item['error'] set as unreplayable, not as failing outputs: compute pass/fail only over items where it’s None. This matters most for DB reads/writes: a stale FK, missing record, or rejected write is infra failure, not a regression.

Don’t swallow per-item errors in the script. A custom try/except that returns a placeholder turns infra failures into fake successes. Let the SDK record them. The only allowed top-level except is a fatal handler around main() that exits non-zero, so callers can tell a whole-replay crash from a clean run with some unreplayable items.

Environment. Replay executes in the app’s own process: the instrumented function is imported as a library, and its DB clients, env vars, config loaders, and model IDs resolve from whatever environment the replay script is run under. The script must bootstrap the same environment the app uses (e.g. load_dotenv() at the top, or run via dotenv run -- python scripts/replay.py). Do not mock these; they’re the same dependencies the app resolves in production. For replay to see the same DB rows the trace was captured against, point the script at the trace’s source environment (the environment field on the trace: production, staging, or development).

Input serialization caveat. Replay deserializes historical span inputs and passes them back to your function. This works for strings, numbers, and plain dicts. If your span wraps a function that takes hydrated domain objects (ORM models, class instances, DB records), they won’t round-trip through serialization. Move the span to where inputs are IDs or plain data and let the function fetch objects internally, or reshape arguments in the wrapper.
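
The error-handling part of the contract can be sketched as two helpers: one computes pass/fail only over replayable items, the other is the single fatal handler around main(). `pass_rate`, `run_guarded`, and the sample data are hypothetical; the result dict only follows the shape documented for replay().

```python
import traceback


def pass_rate(result: dict, matches):
    """Fraction of replayable items whose new result matches the original output."""
    replayable = [i for i in result["items"] if i["error"] is None]
    if not replayable:
        return None  # nothing could be replayed; not the same as 0% passing
    passed = sum(1 for i in replayable if matches(i["result"], i["original_output"]))
    return passed / len(replayable)


def run_guarded(main) -> int:
    """Fatal handler: 0 for a clean run, 1 when the whole replay crashed."""
    try:
        main()
        return 0
    except Exception:
        traceback.print_exc()
        return 1


sample = {  # hypothetical replay result
    "items": [
        {"result": "A", "original_output": "A", "error": None},
        {"result": "B2", "original_output": "B", "error": None},
        # Unreplayable item: excluded from pass/fail, not counted as a failure.
        {"result": None, "original_output": "C", "error": "KeyError: missing row"},
    ]
}

rate = pass_rate(sample, lambda new, old: new == old)
print(rate)  # 0.5

# In a real script the exit code would be passed on: sys.exit(run_guarded(main))
```

Returning None when every item is unreplayable keeps an infrastructure outage from being reported as a 0% pass rate.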

Replay Script

Create a standalone script to regression-test your trace functions against production data with one command. The script maps pipeline names to their replay functions, accepts CLI flags, and prints a side-by-side comparison with delta summaries.
#!/usr/bin/env python3
"""
Replay production traces through instrumented functions.

Uses bitfab.replay() to fetch real traces and re-run them through
the current code, creating a test run for side-by-side comparison.

Usage:
    python scripts/replay.py <pipeline>
    python scripts/replay.py <pipeline> --limit 20
    python scripts/replay.py <pipeline> --trace-ids id1,id2
"""
import argparse
import json
from dotenv import load_dotenv
from lib.bitfab_client import bitfab
from services.extraction import extract_memories
from services.search import search_documents

load_dotenv()

FUNCTIONS = {
    "extraction": "my-extraction-pipeline",
    "search": "my-search-pipeline",
}


# Each pipeline gets its own replay function — replay deserializes
# historical inputs, so if the function signature doesn't match the
# raw input shape, reshape the arguments in a thin wrapper here.

def replay_extraction(limit: int, trace_ids: list[str] | None):
    def fn(conversation: str, existing_items: list):
        return extract_memories(conversation, existing_items)
    return bitfab.replay(fn, limit=limit, trace_ids=trace_ids)


def replay_search(limit: int, trace_ids: list[str] | None):
    def fn(query: str, opts: dict):
        return search_documents(query, user_id=opts["user_id"], limit=opts.get("limit", 10))
    return bitfab.replay(fn, limit=limit, trace_ids=trace_ids)


REPLAY_FNS = {
    "extraction": replay_extraction,
    "search": replay_search,
}


def main():
    parser = argparse.ArgumentParser(description="Replay production traces")
    parser.add_argument("pipeline", choices=FUNCTIONS.keys())
    parser.add_argument("--limit", type=int, default=10)
    parser.add_argument("--trace-ids", type=str)
    args = parser.parse_args()

    trace_ids = [tid.strip() for tid in args.trace_ids.split(",")] if args.trace_ids else None
    function_key = FUNCTIONS[args.pipeline]

    print(f"[replay] Replaying {len(trace_ids) if trace_ids else args.limit} traces from \"{function_key}\"...\n")

    result = REPLAY_FNS[args.pipeline](args.limit, trace_ids)
    print(f"Test run: {result['test_run_url']}\n")

    changed = same = errors = 0
    for item in result["items"]:
        raw_input = item.get("input") or []
        label = str(raw_input[0])[:80] if raw_input else "unknown"

        if item["error"]:
            print(f'  ✗ "{label}"')
            print(f"    Error: {item['error']}")
            errors += 1
        else:
            orig = item["original_output"]
            new = item["result"]
            orig_str = orig if isinstance(orig, str) else json.dumps(orig, default=str)
            new_str = new if isinstance(new, str) else json.dumps(new, default=str)
            is_same = orig_str == new_str
            marker = "=" if is_same else "Δ"

            print(f'  {marker} "{label}"')
            print(f"    Original: {orig_str}")
            print(f"    New:      {new_str}")

            if is_same:
                same += 1
            else:
                changed += 1

    print(f"\n─── Summary ───")
    print(f"  Pipeline: {args.pipeline}")
    print(f"  Replayed: {len(result['items'])}")
    print(f"  Same:     {same}")
    print(f"  Changed:  {changed}")
    if errors > 0:
        print(f"  Errors:   {errors}")
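    print(f"\n  {result['test_run_url']}")

    # Full structured dump, as required by the Replay Output Contract.
    print(json.dumps(result, indent=2, default=str))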


if __name__ == "__main__":
    main()
Adapt the imports, pipeline names, and per-pipeline replay functions to match your project’s instrumented workflows.

Advanced Configuration

Bitfab(
    api_key: str,                    # Required
    service_url: str | None = None,  # Default: https://bitfab.ai
    env_vars: dict[str, str] | None = None,  # For local function execution
    enabled: bool = True,            # Enable/disable tracing
    baml_client: Any = None          # Generated BAML client (for wrap_baml)
)
  • env_vars: Pass LLM provider API keys for local execution (e.g., {"OPENAI_API_KEY": "..."})
  • enabled: When False, all tracing is disabled. Decorated functions still execute normally but no spans are sent.
  • baml_client: The generated BAML client instance (e.g., b from baml_client). See BAML framework guide for full usage.