GET /v1/realtime — bidirectional WebSocket API

The Realtime API lets you hold low-latency, bidirectional voice and text conversations with a model over a WebSocket connection. Instead of sending discrete HTTP requests, you open a persistent connection and exchange event messages in both directions: you send audio or text input and receive streamed audio or text responses as they are generated. Anyone’s Realtime endpoint is compatible with the OpenAI Realtime API format and also supports Azure OpenAI Realtime.

Realtime requires a channel configured with a Realtime-capable provider (OpenAI or Azure OpenAI). Contact your Anyone administrator if the WebSocket connection is rejected.

Connecting

Endpoint: GET /v1/realtime

Upgrade an HTTP GET request to a WebSocket connection. You can pass your API key as a query parameter or as an Authorization header in the WebSocket handshake.

For instances without TLS:

ws://api.anyone.ai/v1/realtime

For TLS-secured instances:

wss://api.anyone.ai/v1/realtime

Authentication

Pass your API key in one of the following ways:

Query parameter: ?token=YOUR_TOKEN
Authorization header: Authorization: Bearer YOUR_TOKEN (set during the HTTP upgrade handshake)
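
The two options above can be sketched as a small helper that prepares the handshake URL and headers. This is an illustrative sketch, not part of the API: buildRealtimeHandshake and its options are hypothetical names, and the host is the one used in the examples on this page.

```javascript
// Hypothetical helper (not part of the API): prepares the URL and headers
// for the WebSocket handshake using one of the two authentication options.
function buildRealtimeHandshake(token, { useHeader = false, secure = true } = {}) {
  const scheme = secure ? "wss" : "ws";
  const url = new URL(`${scheme}://api.anyone.ai/v1/realtime`);
  const headers = {};

  if (useHeader) {
    // Option 2: Authorization header, sent during the HTTP upgrade.
    headers.Authorization = `Bearer ${token}`;
  } else {
    // Option 1: token as a query parameter.
    url.searchParams.set("token", token);
  }
  return { url: url.toString(), headers };
}

const viaQuery = buildRealtimeHandshake("YOUR_TOKEN");
const viaHeader = buildRealtimeHandshake("YOUR_TOKEN", { useHeader: true });
console.log(viaQuery.url);
console.log(viaHeader.headers);
```

The browser WebSocket constructor does not let you set custom headers, so browser clients typically use the query parameter. Server-side clients that accept handshake headers (for example the ws package for Node.js, which takes a headers option) can use the Authorization header instead.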

Event types

Once connected, you exchange JSON event messages. Each message has a type field that identifies its purpose. The following are the core event types.

Client → server

Event type	Description
`session.update`	Configure session parameters such as voice, audio format, tools, and instructions.
`input_audio_buffer.append`	Stream base64-encoded audio bytes to the model’s input buffer.
`conversation.item.create`	Add a text message to the conversation.
`response.create`	Prompt the model to generate a response based on the current conversation and buffer.
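
As a rough illustration of the client events above, the following sketch wraps a chunk of 16-bit PCM samples in an input_audio_buffer.append event. The audio field name is an assumption taken from the OpenAI Realtime format that this endpoint is compatible with; toBase64 and audioAppendEvent are hypothetical helper names.

```javascript
// Encode raw bytes as base64 using btoa, which is available in browsers
// and as a global in recent Node.js versions.
function toBase64(bytes) {
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Wrap 16-bit PCM samples (an Int16Array) in an append event.
// Assumption: the payload field is named `audio`, as in the OpenAI format.
function audioAppendEvent(samples) {
  const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
  return { type: "input_audio_buffer.append", audio: toBase64(bytes) };
}

// Usage: serialize the event and send it over an open connection, e.g.
//   ws.send(JSON.stringify(audioAppendEvent(chunk)));
const event = audioAppendEvent(new Int16Array(4)); // four samples of silence
console.log(event.type, event.audio.length);
```

With automatic turn detection disabled, you would follow the appended audio with a response.create event to ask the model to respond.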

Server → client

Event type	Description
`session.created`	Sent immediately after connecting, confirming the session is ready.
`session.updated`	Confirms that a `session.update` was applied.
`response.audio.delta`	A chunk of base64-encoded audio from the model’s response.
`response.audio_transcript.delta`	A chunk of the text transcript of the model’s audio output.
`response.function_call_arguments.delta`	Streamed function call arguments, when the model calls a tool.
`response.function_call_arguments.done`	Signals that function call arguments are complete.
`response.done`	Signals that the model has finished generating a response. Contains usage information.
`conversation.item.created`	Confirms that a conversation item was added.
`error`	An error occurred. Contains an error object with a message and code.
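
One way to consume the server events above is a small dispatcher keyed on the type field. The sketch below accumulates transcript and audio chunks until response.done arrives. It assumes each delta event carries its payload in a delta field, as in the OpenAI Realtime format; createEventHandler is a hypothetical name.

```javascript
// Hypothetical dispatcher: collects streamed output for one response.
function createEventHandler() {
  const state = { transcript: "", audioChunks: [], done: false, error: null };

  function handle(msg) {
    switch (msg.type) {
      case "response.audio_transcript.delta":
        // Assumption: the text chunk is in `delta`.
        state.transcript += msg.delta ?? "";
        break;
      case "response.audio.delta":
        // Keep the base64 chunks; decode them when playing or saving audio.
        state.audioChunks.push(msg.delta);
        break;
      case "response.done":
        state.done = true;
        break;
      case "error":
        state.error = msg.error;
        break;
      default:
        // Ignore event types this sketch does not handle.
        break;
    }
    return state;
  }

  return { handle, state };
}

// Usage with a WebSocket:
//   const handler = createEventHandler();
//   ws.addEventListener("message", (e) => handler.handle(JSON.parse(e.data)));
```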

Session configuration

After connecting, send a session.update event to configure the session:

modalities

string[]

The interaction modes to enable. For example, ["text", "audio"].

instructions

string

System-level instructions that guide the model’s behavior for the session.

voice

string

The voice to use for audio output. For example, alloy, echo, nova, or shimmer.

input_audio_format

string

Format of the audio you send. For example, pcm16, g711_ulaw, or g711_alaw.

output_audio_format

string

Format of the audio the model returns. For example, pcm16.

input_audio_transcription

object

Configuration for transcribing your audio input.

model

string

The transcription model to use, for example whisper-1.

turn_detection

object | null

Controls how the server detects end-of-turn in audio input. Set to null to disable automatic turn detection and manage it manually.

tools

object[]

A list of tool definitions available to the model during this session, following the OpenAI function-calling schema.

tool_choice

string

Controls when the model uses tools. Accepts auto, none, or the name of a specific tool.

temperature

number

Sampling temperature for the model. Defaults to 0.8.
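
To make the fields above concrete, here is a sketch of a text-only session.update payload that disables automatic turn detection and registers one tool. The get_weather tool is purely illustrative, and the flat tool shape (type, name, description, parameters) is an assumption based on the OpenAI Realtime format; buildSessionUpdate is a hypothetical helper.

```javascript
// Hypothetical helper: builds a session.update event with sensible defaults
// that callers can override field by field.
function buildSessionUpdate(overrides = {}) {
  return {
    type: "session.update",
    session: {
      modalities: ["text"],
      instructions: "Answer briefly.",
      turn_detection: null, // manage turns manually
      tools: [
        {
          // Illustrative tool definition; the exact shape is an assumption.
          type: "function",
          name: "get_weather",
          description: "Look up the current weather for a city.",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      ],
      tool_choice: "auto",
      temperature: 0.8,
      ...overrides,
    },
  };
}

// Usage: ws.send(JSON.stringify(buildSessionUpdate({ temperature: 0.6 })));
console.log(JSON.stringify(buildSessionUpdate(), null, 2));
```

The server should reply with a session.updated event once the configuration has been applied.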

Usage tracking

When the model finishes a response, the response.done event includes a usage object:

total_tokens

integer

Total tokens consumed by this response turn.

input_tokens

integer

Tokens in the input (audio + text).

output_tokens

integer

Tokens in the model’s output (audio + text).

input_token_details

object

Breakdown of input token types (e.g., cached, audio).

output_token_details

object

Breakdown of output token types (e.g., audio, text).
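
A running token count across turns could be kept as sketched below. The exact location of the usage object inside the response.done event is an assumption: the sketch checks response.usage first (where the OpenAI format places it) and falls back to a top-level usage field. createUsageTracker is a hypothetical name.

```javascript
// Hypothetical tracker: sums token usage over every response.done event.
function createUsageTracker() {
  const totals = { total_tokens: 0, input_tokens: 0, output_tokens: 0 };

  function record(msg) {
    if (msg.type !== "response.done") return totals;
    // Assumption: usage is nested under `response`, or at the top level.
    const usage = msg.response?.usage ?? msg.usage;
    if (!usage) return totals;
    for (const key of Object.keys(totals)) {
      totals[key] += usage[key] ?? 0;
    }
    return totals;
  }

  return { record, totals };
}

// Usage with a WebSocket:
//   const tracker = createUsageTracker();
//   ws.addEventListener("message", (e) => tracker.record(JSON.parse(e.data)));
```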

Example

The following JavaScript example connects to the Realtime endpoint, configures a session, and logs events as they arrive.

javascript

const ws = new WebSocket(
  "wss://api.anyone.ai/v1/realtime?token=YOUR_TOKEN"
);

ws.addEventListener("open", () => {
  // Configure the session
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        modalities: ["text", "audio"],
        instructions: "You are a helpful assistant. Respond concisely.",
        voice: "alloy",
        input_audio_format: "pcm16",
        output_audio_format: "pcm16",
        temperature: 0.8,
      },
    })
  );

  // Send a text message
  ws.send(
    JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "user",
        content: [{ type: "input_text", text: "Hello, how are you?" }],
      },
    })
  );

  // Ask the model to respond
  ws.send(JSON.stringify({ type: "response.create" }));
});

ws.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);
  console.log(msg.type, msg);
});

ws.addEventListener("error", (err) => {
  console.error("WebSocket error", err);
});
