Voice models enable speech-to-text (transcription) and text-to-speech (speech output) in VARIOS AI. Each voice model is assigned a model type that determines the available configuration options and cost fields.

Model Types

| Model Type | Description |
| --- | --- |
| Transcription | Converts spoken language into text (speech-to-text). Example: gpt-4o-mini-transcribe. |
| Speech | Converts text into spoken language (text-to-speech). Example: gpt-4o-mini-tts. |

Basic Data (Both Types)

| Field | Required | Description |
| --- | --- | --- |
| Image / Title | Yes | Display name and optional profile image of the model. |
| Model Name | Yes | Technical model name (e.g. gpt-4o-mini-transcribe). |
| Credentials | Yes | Stored credentials for the selected provider (dropdown selection). |
| Model Type | Yes | Type of voice model: Transcription or Speech (dropdown selection). |
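
To make the required and optional fields concrete, the basic data could be modeled roughly as follows. This is only an illustrative sketch: the class `VoiceModel`, its attribute names, and the `missing_required` helper are hypothetical and not part of VARIOS AI, and `credentials` simply stands in for whichever stored credential entry is picked from the dropdown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ModelType(Enum):
    """The two voice model types described above."""
    TRANSCRIPTION = "Transcription"
    SPEECH = "Speech"


@dataclass
class VoiceModel:
    """Hypothetical in-memory representation of the basic data form."""
    title: str                    # display name (required)
    model_name: str               # technical model name (required)
    credentials: str              # name of the stored credentials (required)
    model_type: ModelType         # Transcription or Speech (required)
    image: Optional[str] = None   # profile image is optional

    def missing_required(self) -> List[str]:
        """Return the names of required text fields that are empty."""
        required = {
            "title": self.title,
            "model_name": self.model_name,
            "credentials": self.credentials,
        }
        return [name for name, value in required.items() if not value.strip()]


model = VoiceModel(
    title="Transcriber",
    model_name="gpt-4o-mini-transcribe",
    credentials="openai-default",  # assumed credential name, for illustration
    model_type=ModelType.TRANSCRIPTION,
)
print(model.missing_required())
```

An empty list means every required field has a value; the image can be left unset without affecting the check.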

Costs by Model Type

Transcription Model Costs

Transcription models process audio inputs and produce text outputs.

| Field | Required | Description |
| --- | --- | --- |
| Cost in $ per Million Text Input Tokens | No | Cost for text-based inputs (e.g. prompt). |
| Cost in $ per Million Text Output Tokens | No | Cost for the generated text output (transcription). |
| Cost in $ per Million Audio Input Tokens | No | Cost for the audio input (spoken language). |

Speech Model Costs

Speech models process text inputs and produce audio outputs.

| Field | Required | Description |
| --- | --- | --- |
| Cost in $ per Million Text Input Tokens | No | Cost for the text input (text to be spoken). |
| Cost in $ per Million Audio Output Tokens | No | Cost for the generated audio output (spoken language). |

Token costs for voice models differ depending on the model type. Transcription models have three cost fields (text input, text output, audio input), while speech models have only two (text input, audio output). All cost fields are optional.
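
Because every cost field is a price per million tokens, the cost of a single request can be estimated by multiplying each token count by its rate and dividing by one million. The sketch below illustrates that arithmetic only. The function `estimate_cost`, the dictionary keys, and all rates and token counts are hypothetical, and treating an unset cost field as zero is an assumption rather than documented VARIOS AI behavior.

```python
# Cost fields per model type, mirroring the two cost tables above.
COST_FIELDS = {
    "Transcription": ("text_input", "text_output", "audio_input"),
    "Speech": ("text_input", "audio_output"),
}


def estimate_cost(model_type, rates_per_million, token_counts):
    """Estimate the cost in $ of one request.

    rates_per_million: $ per million tokens for each cost field.
    token_counts: number of tokens used for each cost field.
    Missing or unset rates are treated as 0 (an assumption).
    """
    total = 0.0
    for field in COST_FIELDS[model_type]:
        rate = rates_per_million.get(field) or 0.0
        tokens = token_counts.get(field, 0)
        total += rate * tokens / 1_000_000
    return total


# Made-up example rates, not real provider prices.
transcription_rates = {"text_input": 1.25, "text_output": 5.0, "audio_input": 3.0}
usage = {"text_input": 200, "text_output": 1_500, "audio_input": 12_000}

cost = estimate_cost("Transcription", transcription_rates, usage)
print(round(cost, 6))
```

With these made-up numbers the three components are $0.00025, $0.0075, and $0.036, for a total of about $0.04375. A speech model would be estimated the same way, using only the text input and audio output fields.
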
Voice models do not have DLP security settings, as data processing is secured through the associated chat and embedding models.