Create Chat Completion

curl --request POST \ --url https://api.example.com/v1/chat/completions/ \ --header 'Authorization: Bearer <token>' \ --header 'Content-Type: application/json' \ --data ' { "messages": [ { "content": "<string>", "role": "<string>", "name": "<string>" } ], "model": "<string>", "frequency_penalty": 0, "logit_bias": {}, "logprobs": false, "top_logprobs": 0, "max_tokens": 123, "max_completion_tokens": 123, "n": 1, "presence_penalty": 0, "response_format": { "type": "text", "json_schema": { "name": "<string>", "description": "<string>", "schema": {}, "strict": true } }, "seed": 0, "stop": [], "stream": false, "stream_options": { "include_usage": true, "continuous_usage_stats": false }, "temperature": 123, "top_p": 123, "tools": [ { "function": { "name": "<string>", "description": "<string>", "parameters": {} }, "type": "function" } ], "tool_choice": "none", "reasoning_effort": "low", "include_reasoning": true, "parallel_tool_calls": true, "user": "<string>", "use_beam_search": false, "top_k": 123, "min_p": 123, "repetition_penalty": 123, "length_penalty": 1, "stop_token_ids": [], "include_stop_str_in_output": false, "ignore_eos": false, "min_tokens": 0, "skip_special_tokens": true, "spaces_between_special_tokens": true, "truncate_prompt_tokens": 4611686018427388000, "prompt_logprobs": 123, "allowed_token_ids": [ 123 ], "bad_words": [ "<string>" ], "echo": false, "add_generation_prompt": true, "continue_final_message": false, "add_special_tokens": false, "documents": [ {} ], "chat_template": "<string>", "chat_template_kwargs": {}, "mm_processor_kwargs": {}, "structured_outputs": { "json": "<string>", "regex": "<string>", "choice": [ "<string>" ], "grammar": "<string>", "json_object": true, "disable_fallback": false, "disable_any_whitespace": false, "disable_additional_properties": false, "whitespace_pattern": "<string>", "structural_tag": "<string>", "_backend": "<string>", "_backend_was_auto": false }, "priority": 0, "request_id": "<string>", "return_tokens_as_token_ids": true, "return_token_ids": true, "cache_salt": "<string>", "kv_transfer_params": {}, "vllm_xargs": {}, "repetition_detection": { "max_pattern_size": 0, "min_pattern_size": 0, "min_count": 0 } } '

{ "model": "<string>", "choices": [ { "index": 123, "message": { "role": "<string>", "content": "<string>", "refusal": "<string>", "annotations": { "type": "<string>", "url_citation": { "end_index": 123, "start_index": 123, "title": "<string>", "url": "<string>" } }, "audio": { "id": "<string>", "data": "<string>", "expires_at": 123, "transcript": "<string>" }, "function_call": { "name": "<string>", "arguments": "<string>" }, "tool_calls": [ { "function": { "name": "<string>", "arguments": "<string>" }, "id": "<string>", "type": "function" } ], "reasoning": "<string>" }, "logprobs": { "content": [ { "token": "<string>", "logprob": -9999, "bytes": [ 123 ], "top_logprobs": [ { "token": "<string>", "logprob": -9999, "bytes": [ 123 ] } ] } ] }, "finish_reason": "stop", "stop_reason": 123, "token_ids": [ 123 ] } ], "usage": { "prompt_tokens": 0, "total_tokens": 0, "completion_tokens": 0, "prompt_tokens_details": { "cached_tokens": 123 } }, "id": "<string>", "object": "chat.completion", "created": 123, "service_tier": "auto", "system_fingerprint": "<string>", "prompt_logprobs": [ {} ], "prompt_token_ids": [ 123 ], "kv_transfer_params": {} }

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

messages

required

Developer-provided instructions that the model should follow, regardless of messages sent by the user. With o1 models and newer, developer messages replace the previous system messages.

Show child attributes

model

string | null

frequency_penalty

number | null

default:0

logit_bias

Logit Bias · object

Show child attributes

logprobs

boolean | null

default:false

top_logprobs

integer | null

default:0

max_tokens

integer | null

deprecated

max_completion_tokens

integer | null

default:1

presence_penalty

number | null

default:0

response_format

ResponseFormat · object

ResponseFormat
StructuralTagResponseFormat
LegacyStructuralTagResponseFormat

Show child attributes

seed

integer | null

Required range: -9223372036854776000 <= x <= 9223372036854776000

stop

default:[]

stream

boolean | null

default:false

stream_options

StreamOptions · object

Show child attributes

temperature

number | null

top_p

number | null

tools

ChatCompletionToolsParam · object[] | null

Show child attributes

tool_choice

default:none

Allowed value: "none"

reasoning_effort

enum<string> | null

Available options:

low,

medium,

high

include_reasoning

boolean

default:true

parallel_tool_calls

boolean | null

default:true

user

string | null

use_beam_search

boolean

default:false

top_k

integer | null

min_p

number | null

repetition_penalty

number | null

length_penalty

number

default:1

stop_token_ids

integer[] | null

include_stop_str_in_output

boolean

default:false

ignore_eos

boolean

default:false

min_tokens

integer

default:0

skip_special_tokens

boolean

default:true

spaces_between_special_tokens

boolean

default:true

truncate_prompt_tokens

integer | null

Required range: -1 <= x <= 9223372036854776000

prompt_logprobs

integer | null

allowed_token_ids

integer[] | null

bad_words

string[]

echo

boolean

default:false

If true, the new message will be prepended with the last message if they belong to the same role.

add_generation_prompt

boolean

default:true

If true, the generation prompt will be added to the chat template. This is a parameter used by chat template in tokenizer config of the model.

continue_final_message

boolean

default:false

If this is set, the chat will be formatted so that the final message in the chat is open-ended, without any EOS tokens. The model will continue this message rather than starting a new one. This allows you to "prefill" part of the model's response for it. Cannot be used at the same time as add_generation_prompt.

add_special_tokens

boolean

default:false

If true, special tokens (e.g. BOS) will be added to the prompt on top of what is added by the chat template. For most models, the chat template takes care of adding the special tokens so this should be set to false (as is the default).

documents

Documents · object[] | null

A list of dicts representing documents that will be accessible to the model if it is performing RAG (retrieval-augmented generation). If the template does not support RAG, this argument will have no effect. We recommend that each document should be a dict containing "title" and "text" keys.

Show child attributes

chat_template

string | null

A Jinja template to use for this conversion. As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.

chat_template_kwargs

Chat Template Kwargs · object

Additional keyword args to pass to the template renderer. Will be accessible by the chat template.

mm_processor_kwargs

Mm Processor Kwargs · object

Additional kwargs to pass to the HF processor.

structured_outputs

StructuredOutputsParams · object

Additional kwargs for structured outputs

Show child attributes

priority

integer

default:0

The priority of the request (lower means earlier handling; default: 0). Any priority other than 0 will raise an error if the served model does not use priority scheduling.

request_id

string

The request_id related to this request. If the caller does not set it, a random_uuid will be generated. This id is used through out the inference process and return in response.

return_tokens_as_token_ids

boolean | null

If specified with 'logprobs', tokens are represented as strings of the form 'token_id:{token_id}' so that tokens that are not JSON-encodable can be identified.

return_token_ids

boolean | null

If specified, the result will include token IDs alongside the generated text. In streaming mode, prompt_token_ids is included only in the first chunk, and token_ids contains the delta tokens for each chunk. This is useful for debugging or when you need to map generated text back to input tokens.

cache_salt

string | null

If specified, the prefix cache will be salted with the provided string to prevent an attacker to guess prompts in multi-user environments. The salt should be random, protected from access by 3rd parties, and long enough to be unpredictable (e.g., 43 characters base64-encoded, corresponding to 256 bit).

kv_transfer_params

Kv Transfer Params · object

KVTransfer parameters used for disaggregated serving.

vllm_xargs

Vllm Xargs · object

Additional request parameters with (list of) string or numeric values, used by custom extensions.

Show child attributes

repetition_detection

RepetitionDetectionParams · object

Parameters for detecting repetitive N-gram patterns in output tokens. If such repetition is detected, generation will be ended early. LLMs can sometimes generate repetitive, unhelpful token patterns, stopping only when they hit the maximum output length (e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature can detect such behavior and terminate early, saving time and tokens.

Show child attributes

Response

Successful Response

model

string

required

choices

ChatCompletionResponseChoice · object[]

required

Show child attributes

usage

UsageInfo · object

required

Show child attributes

string

object

string

default:chat.completion

Allowed value: "chat.completion"

created

integer

service_tier

enum<string> | null

Available options:

auto,

default,

flex,

scale,

priority

system_fingerprint

string | null

prompt_logprobs

(object | null)[] | null

Show child attributes

prompt_token_ids

integer[] | null

kv_transfer_params

Kv Transfer Params · object

KVTransfer parameters.

Serverless RL

Serverless SFT

API Reference

Authorizations

Body

Response