Learn how to use Portkey's Universal API to orchestrate multiple LLMs in a structured debate while tracking performance and evaluating outputs with Arize.
Overview
This guide demonstrates how to:
- Use Portkey's Universal API to seamlessly switch between different LLMs (GPT-4, Claude, Gemini)
- Implement distributed tracing with Arize and OpenTelemetry
- Build a multi-agent debate system where LLMs take different roles
- Export traces and run toxicity evaluations on outputs
Prerequisites
Before starting, you'll need:
- A Portkey API key, with integrations configured for your OpenAI, Anthropic, and Google providers
- An Arize Space ID and API key
Installation
Install the required packages:
```bash
pip install portkey-ai openinference-instrumentation-portkey arize-otel arize-phoenix "arize[Tracing]>=7.1.0"
```
Setting Up Tracing
First, configure Arize tracing with Portkey's instrumentor to capture all LLM calls:
```python
import os

from arize.otel import register
from openinference.instrumentation.portkey import PortkeyInstrumentor

# Set up OpenTelemetry with Arize
tracer_provider = register(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY"),
    project_name="portkey-debate",
)

# Enable Portkey instrumentation
PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)
```
Implementing the Multi-LLM Debate
Here's how to set up different LLMs for different roles using Portkey's Universal API:
```python
import os

from portkey_ai import Portkey

# Initialize the Portkey client
PORTKEY_API_KEY = os.getenv("PORTKEY_API_KEY")
portkey = Portkey(api_key=PORTKEY_API_KEY)

# Target different providers by specifying the model in @provider-slug/model format:
# - GPT-4 for "against" arguments: @openai-prod/gpt-4
# - Claude for "pro" arguments: @anthropic-prod/claude-3-opus-20240229
# - Gemini for moderation: @google-prod/gemini-1.5-pro
```
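The `@provider-slug/model` convention packs both the provider integration and the model choice into one string. As an illustration only (this is a hypothetical helper, not part of the Portkey SDK — Portkey resolves these references server-side), the reference can be unpacked like this:

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split a Universal API model reference like "@openai-prod/gpt-4"
    into (provider_slug, model_name).

    Hypothetical helper for illustration; Portkey parses these
    references on the gateway side.
    """
    if not ref.startswith("@"):
        raise ValueError(f"expected a @provider-slug/model reference, got {ref!r}")
    provider_slug, _, model_name = ref[1:].partition("/")
    if not model_name:
        raise ValueError(f"missing model name in {ref!r}")
    return provider_slug, model_name

print(split_model_ref("@anthropic-prod/claude-3-opus-20240229"))
# → ('anthropic-prod', 'claude-3-opus-20240229')
```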
Debate Round Function
Create a function that orchestrates a single debate round:
```python
def debate_round(topic: str, debate_prompt: str) -> dict:
    """
    Runs one debate round:
    1. Claude makes the PRO argument
    2. GPT-4 makes the CON argument
    3. Gemini scores both and suggests a refined prompt
    """
    # PRO side (Claude)
    pro_resp = portkey.chat.completions.create(
        model="@anthropic-prod/claude-3-opus-20240229",
        messages=[{
            "role": "user",
            "content": f"Argue in favor of: {topic}\n\nContext: {debate_prompt}"
        }],
        max_tokens=250
    )
    pro_text = pro_resp.choices[0].message.content

    # CON side (GPT-4) — same token budget as the PRO side to keep the debate fair
    con_resp = portkey.chat.completions.create(
        model="@openai-prod/gpt-4",
        messages=[{
            "role": "user",
            "content": f"Argue against: {topic}\n\nContext: {debate_prompt}"
        }],
        max_tokens=250
    )
    con_text = con_resp.choices[0].message.content

    # Moderator (Gemini)
    mod_resp = portkey.chat.completions.create(
        model="@google-prod/gemini-1.5-pro",
        messages=[{
            "role": "user",
            "content": f"""You are a debate moderator. Evaluate these arguments on "{topic}":

PRO: {pro_text}

CON: {con_text}

Suggest an improved debate prompt for more balanced arguments."""
        }]
    )
    new_prompt = mod_resp.choices[0].message.content.strip()

    return {"pro": pro_text, "con": con_text, "new_prompt": new_prompt}
```
Running Multiple Rounds
Execute the debate across multiple rounds with progressively refined prompts:
```python
topic = "Implementing a nationwide four-day workweek"
initial_prompt = "Debate the pros and cons of a four-day workweek."
rounds = 3

prompt = initial_prompt
for i in range(1, rounds + 1):
    result = debate_round(topic, prompt)
    print(f"\n── Round {i} ──")
    print("🔵 PRO:", result["pro"])
    print("\n🔴 CON:", result["con"])
    print("\n🛠️ Suggested New Prompt:", result["new_prompt"])
    prompt = result["new_prompt"]
```
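If you want the full transcript for later inspection rather than just console output, the same loop can accumulate each round's result. The sketch below uses a stubbed `debate_round` so it runs without API keys; in practice you would use the API-backed function defined earlier:

```python
def debate_round(topic: str, debate_prompt: str) -> dict:
    # Stub standing in for the real API-backed function,
    # so this snippet is runnable without credentials.
    return {
        "pro": f"pro: {topic}",
        "con": f"con: {topic}",
        "new_prompt": debate_prompt + " (refined)",
    }

topic = "Implementing a nationwide four-day workweek"
prompt = "Debate the pros and cons of a four-day workweek."

transcript = []
for i in range(1, 4):
    result = debate_round(topic, prompt)
    # Record the prompt each round actually saw alongside its outputs
    transcript.append({"round": i, "prompt": prompt, **result})
    prompt = result["new_prompt"]

print(len(transcript))  # → 3
```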
Adding Evaluations
After running the debate, evaluate outputs for toxicity using Arize evals:
Export Traces to Dataset
```python
from datetime import datetime

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

client = ArizeExportClient()

# Export traces from Arize
primary_df = client.export_model_to_df(
    space_id='YOUR_SPACE_ID',
    model_id='portkey-debate',
    environment=Environments.TRACING,
    start_time=datetime.fromisoformat('2025-06-19T07:00:00.000+00:00'),
    end_time=datetime.fromisoformat('2025-06-28T06:59:59.999+00:00')
)

# Flatten the span attributes into the columns the evaluator expects
primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]
```
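Exported traces can include spans with no usable input or output text (for example, intermediate spans). An optional pandas cleanup step — not required by the evaluator, but it avoids wasting eval calls on empty rows — looks like this, shown here on a toy DataFrame:

```python
import pandas as pd

# Toy stand-in for the exported traces DataFrame
primary_df = pd.DataFrame({
    "input": ["Argue in favor of...", None, "Argue against..."],
    "output": ["PRO argument text", "orphan span", None],
})

# Keep only spans that have both an input and an output
eval_df = primary_df.dropna(subset=["input", "output"]).reset_index(drop=True)
print(len(eval_df))  # → 1
```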
Run Toxicity Evaluation
```python
from phoenix.evals import (
    TOXICITY_PROMPT_RAILS_MAP,
    TOXICITY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Configure the evaluation model
model = OpenAIModel(model_name="@openai-prod/gpt-4", temperature=0.0)

# Run toxicity classification
rails = list(TOXICITY_PROMPT_RAILS_MAP.values())
toxic_classifications = llm_classify(
    dataframe=primary_df,
    template=TOXICITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,
)
```
Send Results Back to Arize
```python
import os

from arize.pandas.logger import Client

arize_client = Client(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY")
)

# Log evaluation results back to the same project
arize_client.log_evaluations_sync(toxic_classifications, 'portkey-debate')
```
Benefits of This Approach
- Unified API: Use the same interface for all LLMs, making it easy to switch providers
- Automatic Tracing: All LLM calls are automatically traced without modifying your code
- Multi-Agent Orchestration: Different LLMs can play different roles based on their strengths
- Comprehensive Observability: Monitor latency, costs, and outputs across all providers
- Quality Assurance: Automated evaluations ensure outputs meet safety standards
Next Steps
- Try different LLM combinations for various roles
- Add more evaluation criteria beyond toxicity
- Implement fallback strategies using Portkey's gateway features
- Set up alerts in Arize for performance degradation
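On the fallback idea: Portkey gateway configs express routing strategies as a JSON object. The sketch below shows the general shape such a config can take; treat the exact field names as an assumption and check Portkey's gateway config documentation for the schema your account supports (the `openai-prod`/`anthropic-prod` slugs are this guide's example integration names):

```python
# Sketch of a fallback gateway config (field names assumed from
# Portkey's config schema; verify against the official docs).
fallback_config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {"override_params": {"model": "@openai-prod/gpt-4"}},
        {"override_params": {"model": "@anthropic-prod/claude-3-opus-20240229"}},
    ],
}

# A config like this can be attached when constructing the client, e.g.:
# portkey = Portkey(api_key=PORTKEY_API_KEY, config=fallback_config)
print(fallback_config["strategy"]["mode"])  # → fallback
```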