Voice models enable speech-to-text (transcription) and text-to-speech (speech output) in VARIOS AI. Each voice model is assigned a model type that determines the available configuration options and cost fields.
Model Types
| Model Type | Description |
|---|---|
| Transcription | Converts spoken language into text (speech-to-text). Example: gpt-4o-mini-transcribe. |
| Speech | Converts text into spoken language (text-to-speech). Example: gpt-4o-mini-tts. |
Basic Data (Both Types)
| Field | Required | Description |
|---|---|---|
| Image / Title | Yes | Display name and optional profile image of the model. |
| Model Name | Yes | Technical model name (e.g. gpt-4o-mini-transcribe). |
| Credentials | Yes | Stored credentials for the selected provider (dropdown selection). |
| Model Type | Yes | Type of voice model: Transcription or Speech (dropdown selection). |
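As an illustration, the basic data above could be modeled as a small record with a required-field check. This is a minimal sketch, not the VARIOS AI data model: the class `VoiceModelBasicData`, the enum `ModelType`, and the method `missing_required` are hypothetical names chosen for this example, and it assumes that the title is the required part of "Image / Title" while the image is optional.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ModelType(Enum):
    """The two voice model types listed in the Model Types table."""
    TRANSCRIPTION = "Transcription"
    SPEECH = "Speech"


@dataclass
class VoiceModelBasicData:
    """Hypothetical container mirroring the Basic Data fields."""
    title: str                   # display name (required)
    model_name: str              # technical model name (required)
    credentials: str             # reference to stored credentials (required)
    model_type: ModelType        # Transcription or Speech (required)
    image: Optional[str] = None  # profile image (assumed optional)

    def missing_required(self) -> List[str]:
        """Return the labels of required text fields that are empty."""
        required = {
            "Image / Title": self.title,
            "Model Name": self.model_name,
            "Credentials": self.credentials,
        }
        return [label for label, value in required.items() if not value.strip()]


model = VoiceModelBasicData(
    title="Transcriber",
    model_name="gpt-4o-mini-transcribe",
    credentials="my-provider-credentials",  # placeholder value
    model_type=ModelType.TRANSCRIPTION,
)
print(model.missing_required())
```

Using an enum for the model type keeps the value restricted to the two options offered in the dropdown.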
Costs by Model Type
Transcription Model Costs
Transcription models process audio inputs and produce text outputs.
| Field | Required | Description |
|---|---|---|
| Cost in $ per Million Text Input Tokens | No | Cost for text-based inputs (e.g. a prompt). |
| Cost in $ per Million Text Output Tokens | No | Cost for the generated text output (transcription). |
| Cost in $ per Million Audio Input Tokens | No | Cost for the audio input (spoken language). |
Speech Model Costs
Speech models process text inputs and produce audio outputs.
| Field | Required | Description |
|---|---|---|
| Cost in $ per Million Text Input Tokens | No | Cost for the text input (text to be spoken). |
| Cost in $ per Million Audio Output Tokens | No | Cost for the generated audio output (spoken language). |
Token costs for voice models differ depending on the model type.
Transcription models have three cost fields (text input, text output,
audio input), while speech models have only two (text input, audio
output). All cost fields are optional.
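Because every cost field is expressed in dollars per million tokens, the cost of a single request can be estimated by multiplying each token count by its rate and dividing by one million. The following is a minimal sketch of that arithmetic, not how VARIOS AI computes costs internally. The class and function names are hypothetical, the rates in the usage example are placeholder values rather than real provider prices, and treating an unset optional cost field as 0 is an assumption.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass
class TranscriptionCosts:
    """Rates in $ per million tokens; unset fields are assumed to be 0."""
    text_input: float = 0.0
    text_output: float = 0.0
    audio_input: float = 0.0


@dataclass
class SpeechCosts:
    """Rates in $ per million tokens; unset fields are assumed to be 0."""
    text_input: float = 0.0
    audio_output: float = 0.0


def transcription_cost(costs: TranscriptionCosts, text_in: int,
                       text_out: int, audio_in: int) -> float:
    """Estimate the dollar cost of one transcription request."""
    total = (text_in * costs.text_input
             + text_out * costs.text_output
             + audio_in * costs.audio_input)
    return total / TOKENS_PER_MILLION


def speech_cost(costs: SpeechCosts, text_in: int, audio_out: int) -> float:
    """Estimate the dollar cost of one speech request."""
    total = text_in * costs.text_input + audio_out * costs.audio_output
    return total / TOKENS_PER_MILLION


# Placeholder rates, chosen only to keep the arithmetic easy to follow.
stt = TranscriptionCosts(text_input=2.0, text_output=4.0, audio_input=8.0)
tts = SpeechCosts(text_input=1.0, audio_output=16.0)

print(transcription_cost(stt, text_in=500_000, text_out=250_000, audio_in=1_000_000))
print(speech_cost(tts, text_in=2_000_000, audio_out=500_000))
```

Keeping the two sets of cost fields in separate types mirrors the rule that the model type determines which cost fields exist.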
Voice models do not have DLP security settings, as data processing is
secured through the associated chat and embedding models.