Learn how to use Portkey's Universal API to orchestrate multiple LLMs in a structured debate while tracking performance and evaluating outputs with Arize.
Overview
This guide demonstrates how to:
- Use Portkey's Universal API to seamlessly switch between different LLMs (GPT-4, Claude, Gemini)
- Implement distributed tracing with Arize and OpenTelemetry
- Build a multi-agent debate system where LLMs take different roles
- Export traces and run toxicity evaluations on outputs
Prerequisites
Before starting, you'll need:
- A Portkey API key, with integrations configured for your OpenAI, Anthropic, and Google providers
- An Arize Space ID and API key
Installation
Install the required packages:
```bash
pip install portkey-ai openinference-instrumentation-portkey arize-otel arize-phoenix "arize[Tracing]>=7.1.0"
```
Setting Up Tracing
First, configure Arize tracing with Portkey's instrumentor to capture all LLM calls:
```python
import os

from arize.otel import register
from openinference.instrumentation.portkey import PortkeyInstrumentor

# Set up OpenTelemetry with Arize
tracer_provider = register(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY"),
    project_name="portkey-debate",
)

# Enable Portkey instrumentation
PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)
```
Implementing the Multi-LLM Debate
Here's how to set up different LLMs for different roles using Portkey's Universal API:
```python
import os

from portkey_ai import Portkey

# Initialize the Portkey client
PORTKEY_API_KEY = os.getenv("PORTKEY_API_KEY")
portkey = Portkey(api_key=PORTKEY_API_KEY)

# Target different providers by specifying the model in @provider-slug/model format:
# - GPT-4 for "against" arguments: @openai-prod/gpt-4
# - Claude for "pro" arguments: @anthropic-prod/claude-3-opus-20240229
# - Gemini for moderation: @google-prod/gemini-1.5-pro
```
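The `@provider-slug/model` convention packs both the provider integration and the model choice into one string. As an illustration only (this is a hypothetical helper, not part of the Portkey SDK — Portkey resolves these references server-side), the reference can be unpacked like this:

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split a Universal API model reference like "@openai-prod/gpt-4"
    into (provider_slug, model_name).

    Hypothetical helper for illustration; Portkey parses these
    references on the gateway side.
    """
    if not ref.startswith("@"):
        raise ValueError(f"expected a @provider-slug/model reference, got {ref!r}")
    provider_slug, _, model_name = ref[1:].partition("/")
    if not model_name:
        raise ValueError(f"missing model name in {ref!r}")
    return provider_slug, model_name

print(split_model_ref("@anthropic-prod/claude-3-opus-20240229"))
# → ('anthropic-prod', 'claude-3-opus-20240229')
```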
Debate Round Function
Create a function that orchestrates a single debate round:
```python
def debate_round(topic: str, debate_prompt: str) -> dict:
    """
    Runs one debate round:
    1. Claude makes the PRO argument
    2. GPT-4 makes the CON argument
    3. Gemini scores both and suggests a refined prompt
    """
    # PRO side (Claude)
    pro_resp = portkey.chat.completions.create(
        model="@anthropic-prod/claude-3-opus-20240229",
        messages=[{
            "role": "user",
            "content": f"Argue in favor of: {topic}\n\nContext: {debate_prompt}"
        }],
        max_tokens=250
    )
    pro_text = pro_resp.choices[0].message.content

    # CON side (GPT-4) — same token budget as the PRO side to keep the debate fair
    con_resp = portkey.chat.completions.create(
        model="@openai-prod/gpt-4",
        messages=[{
            "role": "user",
            "content": f"Argue against: {topic}\n\nContext: {debate_prompt}"
        }],
        max_tokens=250
    )
    con_text = con_resp.choices[0].message.content

    # Moderator (Gemini)
    mod_resp = portkey.chat.completions.create(
        model="@google-prod/gemini-1.5-pro",
        messages=[{
            "role": "user",
            "content": f"""You are a debate moderator. Evaluate these arguments on "{topic}":

PRO: {pro_text}

CON: {con_text}

Suggest an improved debate prompt for more balanced arguments."""
        }]
    )
    new_prompt = mod_resp.choices[0].message.content.strip()

    return {"pro": pro_text, "con": con_text, "new_prompt": new_prompt}
```
Running Multiple Rounds
Execute the debate across multiple rounds with progressively refined prompts:
```python
topic = "Implementing a nationwide four-day workweek"
initial_prompt = "Debate the pros and cons of a four-day workweek."
rounds = 3

prompt = initial_prompt
for i in range(1, rounds + 1):
    result = debate_round(topic, prompt)
    print(f"\n── Round {i} ──")
    print("🔵 PRO:", result["pro"])
    print("\n🔴 CON:", result["con"])
    print("\n🛠️ Suggested New Prompt:", result["new_prompt"])
    prompt = result["new_prompt"]
```
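If you want the full transcript for later inspection rather than just console output, the same loop can accumulate each round's result. The sketch below uses a stubbed `debate_round` so it runs without API keys; in practice you would use the API-backed function defined earlier:

```python
def debate_round(topic: str, debate_prompt: str) -> dict:
    # Stub standing in for the real API-backed function,
    # so this snippet is runnable without credentials.
    return {
        "pro": f"pro: {topic}",
        "con": f"con: {topic}",
        "new_prompt": debate_prompt + " (refined)",
    }

topic = "Implementing a nationwide four-day workweek"
prompt = "Debate the pros and cons of a four-day workweek."

transcript = []
for i in range(1, 4):
    result = debate_round(topic, prompt)
    # Record the prompt each round actually saw alongside its outputs
    transcript.append({"round": i, "prompt": prompt, **result})
    prompt = result["new_prompt"]

print(len(transcript))  # → 3
```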
Adding Evaluations
After running the debate, evaluate outputs for toxicity using Arize evals:
Export Traces to Dataset
```python
from datetime import datetime

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

client = ArizeExportClient()

# Export traces from Arize
primary_df = client.export_model_to_df(
    space_id='YOUR_SPACE_ID',
    model_id='portkey-debate',
    environment=Environments.TRACING,
    start_time=datetime.fromisoformat('2025-06-19T07:00:00.000+00:00'),
    end_time=datetime.fromisoformat('2025-06-28T06:59:59.999+00:00')
)

# Flatten the span attributes into the columns the evaluator expects
primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]
```
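Exported traces can include spans with no usable input or output text (for example, intermediate spans). An optional pandas cleanup step — not required by the evaluator, but it avoids wasting eval calls on empty rows — looks like this, shown here on a toy DataFrame:

```python
import pandas as pd

# Toy stand-in for the exported traces DataFrame
primary_df = pd.DataFrame({
    "input": ["Argue in favor of...", None, "Argue against..."],
    "output": ["PRO argument text", "orphan span", None],
})

# Keep only spans that have both an input and an output
eval_df = primary_df.dropna(subset=["input", "output"]).reset_index(drop=True)
print(len(eval_df))  # → 1
```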
Run Toxicity Evaluation
```python
from phoenix.evals import (
    TOXICITY_PROMPT_RAILS_MAP,
    TOXICITY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Configure the evaluation model
model = OpenAIModel(model_name="@openai-prod/gpt-4", temperature=0.0)

# Run toxicity classification
rails = list(TOXICITY_PROMPT_RAILS_MAP.values())
toxic_classifications = llm_classify(
    dataframe=primary_df,
    template=TOXICITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,
)
```
Send Results Back to Arize
```python
import os

from arize.pandas.logger import Client

arize_client = Client(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY")
)

# Log evaluation results back to the same project
arize_client.log_evaluations_sync(toxic_classifications, 'portkey-debate')
```
Benefits of This Approach
- Unified API: Use the same interface for all LLMs, making it easy to switch providers
- Automatic Tracing: All LLM calls are automatically traced without modifying your code
- Multi-Agent Orchestration: Different LLMs can play different roles based on their strengths
- Comprehensive Observability: Monitor latency, costs, and outputs across all providers
- Quality Assurance: Automated evaluations ensure outputs meet safety standards
Next Steps
- Try different LLM combinations for various roles
- Add more evaluation criteria beyond toxicity
- Implement fallback strategies using Portkey's gateway features
- Set up alerts in Arize for performance degradation
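On the fallback idea: Portkey gateway configs express routing strategies as a JSON object. The sketch below shows the general shape such a config can take; treat the exact field names as an assumption and check Portkey's gateway config documentation for the schema your account supports (the `openai-prod`/`anthropic-prod` slugs are this guide's example integration names):

```python
# Sketch of a fallback gateway config (field names assumed from
# Portkey's config schema; verify against the official docs).
fallback_config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {"override_params": {"model": "@openai-prod/gpt-4"}},
        {"override_params": {"model": "@anthropic-prod/claude-3-opus-20240229"}},
    ],
}

# A config like this can be attached when constructing the client, e.g.:
# portkey = Portkey(api_key=PORTKEY_API_KEY, config=fallback_config)
print(fallback_config["strategy"]["mode"])  # → fallback
```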