Speech Services

Elftia integrates two categories of speech service: speech recognition (ASR, automatic speech recognition) and speech synthesis (TTS, text-to-speech). Together they let you move freely between voice and text in conversations and content creation.

Use Cases

Automatically transcribe voice messages or recorded audio files to text
Convert AI replies or custom text to speech for playback
Use voice input when typing is inconvenient
Add narration to generated content

Speech Recognition (ASR)

The speech recognition feature transcribes speech from audio files into text.

Supported Providers

Provider	Type identifier	Description
OpenAI Whisper	`openai-whisper`	OpenAI's Whisper speech recognition API; supports multiple languages
ElevenLabs Scribe	`elevenlabs-stt`	ElevenLabs' speech-to-text service
Fish Audio	`fishaudio-asr`	Fish Audio's speech recognition service

Configuration Steps

Open Settings > Media Providers > Speech Recognition
Select an ASR provider
Enter the API Key
Select a default model (e.g. `whisper-1`)
Save and enable
(Optional) Set it as the default provider

Configuration Parameters

Setting	Description	Required
API Key	The provider's API key	Yes
Default model	The model used for transcription	No (uses provider default)
Default provider	Whether to set this as the default ASR provider	No

How to Use

Select or record an audio clip
The system automatically sends the audio to the configured ASR provider
The transcription result is returned as text

Automatic Format Conversion

The ASR service converts audio formats automatically. If an uploaded audio file is in a format that the Whisper API does not natively support, the system tries to convert it to WAV with ffmpeg before transcription.

Natively Supported Audio Formats

Format	Extension
FLAC	`.flac`
M4A	`.m4a`
MP3	`.mp3`
MP4	`.mp4`
MPEG	`.mpeg`, `.mpga`
OGG	`.oga`, `.ogg`
WAV	`.wav`
WebM	`.webm`

Other formats (such as `.amr` or `.silk`) are converted to WAV before processing.

:::info ffmpeg Dependency
Automatic format conversion requires ffmpeg to be installed on the system. If ffmpeg is not installed, unsupported formats are sent directly to the API, which may refuse to process them.
:::
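The conversion decision described above can be sketched roughly as follows. This is an illustrative outline only, not Elftia's actual implementation: the function names `needs_conversion` and `build_ffmpeg_command` are hypothetical, and the sketch assumes the decision is made purely from the file extension.

```python
import shutil
from pathlib import Path
from typing import List, Optional

# Extensions taken from the "Natively Supported Audio Formats" table.
NATIVE_EXTENSIONS = {
    ".flac", ".m4a", ".mp3", ".mp4", ".mpeg", ".mpga",
    ".oga", ".ogg", ".wav", ".webm",
}


def needs_conversion(audio_path: str) -> bool:
    """Return True when the file extension is not in the native list."""
    return Path(audio_path).suffix.lower() not in NATIVE_EXTENSIONS


def build_ffmpeg_command(audio_path: str) -> Optional[List[str]]:
    """Build an ffmpeg command that converts the file to WAV.

    Returns None when no conversion is needed, or when ffmpeg is not
    found on PATH (the original file would then be sent as-is).
    """
    if not needs_conversion(audio_path):
        return None
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        return None
    wav_path = str(Path(audio_path).with_suffix(".wav"))
    # -y overwrites an existing output file; -i names the input file.
    return [ffmpeg, "-y", "-i", audio_path, wav_path]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Returning `None` for the missing-ffmpeg case mirrors the fallback in the note above, where the unconverted file is sent to the API.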

Transcription Parameters

Parameter	Description
`providerId`	Specifies the ASR provider to use (optional; defaults to the default provider)
`modelId`	Specifies the model to use (optional)
`language`	Specifies the audio language (optional; auto-detected if not set)
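As a rough illustration of how these optional parameters might fall back to configured defaults, consider the sketch below. The `AsrProvider` and `ResolvedRequest` types and the `resolve_transcription` function are hypothetical names, and using the type identifiers as provider IDs is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AsrProvider:
    provider_id: str
    default_model: Optional[str] = None
    is_default: bool = False


@dataclass
class ResolvedRequest:
    provider_id: str
    model_id: Optional[str]   # None means "use the provider's own default"
    language: Optional[str]   # None means "auto-detect"


def resolve_transcription(
    providers: Dict[str, AsrProvider],
    provider_id: Optional[str] = None,
    model_id: Optional[str] = None,
    language: Optional[str] = None,
) -> ResolvedRequest:
    """Fill in omitted parameters from the configured providers."""
    if provider_id is None:
        # No explicit provider: fall back to the one marked as default.
        defaults = [p for p in providers.values() if p.is_default]
        if not defaults:
            raise ValueError("no default ASR provider is configured")
        provider = defaults[0]
    else:
        if provider_id not in providers:
            raise KeyError(f"unknown ASR provider: {provider_id}")
        provider = providers[provider_id]
    return ResolvedRequest(
        provider_id=provider.provider_id,
        model_id=model_id or provider.default_model,
        language=language,
    )
```

For example, calling `resolve_transcription(providers)` with no other arguments selects the default provider and its default model, and leaves `language` unset so that it is auto-detected.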

Speech Synthesis (TTS)

The speech synthesis feature converts text content into natural-sounding speech.

Supported Providers

Provider	Type identifier	Description
ElevenLabs	`elevenlabs-tts`	High-quality multilingual speech synthesis with a rich selection of voices
Fish Audio	`fishaudio-tts`	Fish Audio's speech synthesis service

Configuration Steps

Open Settings > Media Providers > Speech Synthesis
Select a TTS provider
Enter the API Key
Select a default model
Select a default voice
Save and enable
(Optional) Set it as the default provider

Configuration Parameters

Setting	Description	Required
API Key	The provider's API key	Yes
Default model	The model used for speech synthesis	No
Default voice	The default voice ID to use	No
Default provider	Whether to set this as the default TTS provider	No

How to Use

Select the text content you want to convert
Choose a voice
The system sends the text to the TTS provider
The generated audio can be played directly or saved

Voice Selection

Each TTS provider offers multiple voices to choose from. You can:

View the list of available voices in the TTS settings
Preview the effect of different voices
Select one voice as the default

The voice list is fetched dynamically via API and will change as the provider updates its offerings.
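Because the voice list can change, a saved voice ID may no longer exist when it is used. One way to handle this is a simple fallback chain, sketched below. The function name `choose_voice` and the fallback order (requested voice, then configured default, then first available voice) are assumptions for illustration, not documented Elftia behavior.

```python
from typing import List, Optional


def choose_voice(
    available: List[str],
    requested: Optional[str] = None,
    default: Optional[str] = None,
) -> Optional[str]:
    """Pick a voice ID that is present in the freshly fetched list."""
    for candidate in (requested, default):
        if candidate is not None and candidate in available:
            return candidate
    # Neither the requested nor the default voice is available any more.
    return available[0] if available else None
```

An empty list yields `None`, which corresponds to the "Voice list is empty" case in the FAQ.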

Output Formats

Format	Description
`mp3`	Universal audio format (default)
`wav`	Lossless audio format
`ogg`	Open-source compressed format
`opus`	Efficient compressed format
`pcm`	Raw audio data

Synthesis Parameters

Parameter	Description
`providerId`	Specifies the TTS provider to use (optional)
`modelId`	Specifies the model to use (optional)
`voiceId`	Specifies the voice to use (optional)
`outputFormat`	Output audio format (optional; defaults to `mp3`)
`outputDir`	Save directory (optional)
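The handling of `outputFormat` and `outputDir` could look something like the following sketch. The helper name `output_path` is hypothetical, and using the format name as the file extension and the current directory as the fallback save location are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# Formats from the "Output Formats" table.
SUPPORTED_FORMATS = ("mp3", "wav", "ogg", "opus", "pcm")


def output_path(
    basename: str,
    output_format: Optional[str] = None,
    output_dir: Optional[str] = None,
) -> Path:
    """Build the target file path for a synthesized audio clip."""
    fmt = (output_format or "mp3").lower()  # mp3 is the documented default
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {fmt}")
    directory = Path(output_dir) if output_dir else Path(".")
    return directory / f"{basename}.{fmt}"
```

Validating the format before calling the provider gives a clear local error instead of a failed API request.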

Frequently Asked Questions

Question	Solution
ASR transcription is empty	Check that the audio file contains valid speech and is not corrupted
Unsupported audio format	Install ffmpeg to enable automatic format conversion
TTS-generated speech sounds unnatural	Try switching to a different voice or model; some voices perform better in specific languages
Voice list is empty	Confirm the API key is valid and the provider service is operating normally
Transcription result is inaccurate	Try specifying the correct language parameter, or select a more suitable model
TTS produces no audio output	Check the system volume and audio output device; confirm the generated file is not empty
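Several of the checks above start with confirming that an audio file exists and is not empty. A minimal sketch of such a check is shown below. The function name `check_audio_file` is hypothetical, and the check only looks at file size; it does not detect corrupted audio.

```python
from pathlib import Path


def check_audio_file(path: str) -> str:
    """Return "missing", "empty", or "ok" for the given file path."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    return "ok"
```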

Related Links

Media Generation Overview: Overview of all media types
Chat: Using speech features in a conversation
Music Generation: AI music creation
