Create transcription
POST https://api.fastapi.ai/v1/audio/transcriptions
Transcribes audio into the input language.
Request body
file file Required
The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
model string Required
ID of the model to use. Available models include gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe-diarize, and whisper-1.
language string Optional
The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format can improve accuracy and latency.
prompt string Optional
An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
Not supported for gpt-4o-transcribe-diarize.
response_format string Optional Defaults to json
The format of the output.
whisper-1 supports json, text, srt, verbose_json, and vtt. gpt-4o-mini-transcribe supports json, text, and verbose_json. gpt-4o-transcribe only supports json. gpt-4o-transcribe-diarize supports json, text, and diarized_json.
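The model/format compatibility rules above can be captured in a small lookup table. A minimal sketch (the model and format names come from this page; the helper function name is hypothetical):

```python
# Supported response formats per model, per the compatibility notes above.
SUPPORTED_FORMATS = {
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
    "gpt-4o-mini-transcribe": {"json", "text", "verbose_json"},
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}

def check_response_format(model: str, response_format: str = "json") -> None:
    """Raise before sending a request the API would reject."""
    formats = SUPPORTED_FORMATS.get(model)
    if formats is not None and response_format not in formats:
        raise ValueError(
            f"{model} does not support response_format={response_format!r}"
        )
```

Validating locally avoids a round trip that would fail with a 400 error.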
temperature number Optional Defaults to 0
The sampling temperature, between 0 and 1. Higher values make the output more random, while lower values make it more focused and deterministic.
timestamp_granularities[] array Optional Defaults to segment
The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word or segment.
Not supported for gpt-4o-transcribe-diarize.
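When word granularity is requested, the verbose_json response carries per-word timing that can be read directly. A sketch under stated assumptions: the words array with word, start, and end fields reflects typical verbose_json output, and the sample payload here is invented for illustration:

```python
# Hypothetical verbose_json fragment; the "words" array is populated when
# timestamp_granularities[] includes "word" (field names are assumptions).
response = {
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.42},
        {"word": "world", "start": 0.42, "end": 0.9},
    ],
}

# Collect (word, start_seconds) pairs, e.g. for caption alignment.
timeline = [(w["word"], w["start"]) for w in response.get("words", [])]
```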
include[] array Optional
Additional information to include in the transcription response. Supported values include logprobs. logprobs is only supported for gpt-4o-transcribe and gpt-4o-mini-transcribe, and only when response_format is json.
chunking_strategy object or string Optional
Controls how the audio is chunked for transcription. Defaults to auto, which uses server-side voice activity detection (VAD).
You can also pass {"type":"server_vad", "prefix_padding_ms": 0, "silence_duration_ms": 500} to customize chunking.
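Because the request body is multipart form data, an object-valued field like chunking_strategy is sent as a JSON-encoded string. A minimal sketch building the form fields (the chunking values are the ones from the example above; sending the request itself is omitted):

```python
import json

# JSON-encode the chunking_strategy object for a multipart form field.
chunking = {
    "type": "server_vad",
    "prefix_padding_ms": 0,
    "silence_duration_ms": 500,
}

form_fields = {
    "model": "gpt-4o-mini-transcribe",
    "chunking_strategy": json.dumps(chunking),
}
```

With curl, the equivalent is a single -F chunking_strategy='{"type": "server_vad", ...}' field.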
known_speaker_names array Optional
List of speaker names used by gpt-4o-transcribe-diarize to label speakers.
references array Optional
Reference examples for speaker voice attribution. Supported by gpt-4o-transcribe-diarize.
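These two parameters are described only briefly above; the sketch below assumes each entry in references pairs positionally with a name in known_speaker_names, which is an assumption, and the file paths are hypothetical:

```python
# Assumption: reference sample i is the voice of known_speaker_names[i].
known_speaker_names = ["Alice", "Bob"]
references = ["/path/to/alice_sample.wav", "/path/to/bob_sample.wav"]

if len(known_speaker_names) != len(references):
    raise ValueError("each known speaker needs exactly one reference sample")

# Map each speaker label to its reference sample for local bookkeeping.
speakers = dict(zip(known_speaker_names, references))
```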
stream boolean Optional Defaults to false
If true, returns a stream of transcription events. Not supported for whisper-1.
Returns
Returns the transcription object, a verbose transcription object, or a diarized transcription object.
The transcription object (JSON)
text string
The transcribed text.
{
"text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
"usage": {
"input_tokens": 333,
"input_duration_ms": 29801,
"output_tokens": 67,
"output_duration_ms": 0,
"total_tokens": 400
}
}
Example
Request
curl https://api.fastapi.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $FAST_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@/path/to/file/audio.mp3" \
-F model="gpt-4o-mini-transcribe"

curl https://api.fastapi.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $FAST_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@/path/to/file/audio.mp3" \
-F "timestamp_granularities[]=word" \
-F model="whisper-1" \
-F response_format="verbose_json"

curl https://api.fastapi.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $FAST_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@/path/to/file/audio.mp3" \
-F model="gpt-4o-mini-transcribe" \
-F stream=true