
Create transcription

POST https://api.fastapi.ai/v1/audio/transcriptions

Transcribes audio into the input language.

Request body


file file Required
The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.


model string Required
ID of the model to use. Available models include gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe-diarize, and whisper-1.


language string Optional
The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format can improve accuracy and latency.


prompt string Optional
Optional text to guide the model's style or to continue a previous audio segment. The prompt should match the audio language.
Not supported for gpt-4o-transcribe-diarize.


response_format string Optional Defaults to json
The format of the output, which depends on the model:

  • whisper-1 supports json, text, srt, verbose_json, and vtt.
  • gpt-4o-mini-transcribe supports json, text, and verbose_json.
  • gpt-4o-transcribe only supports json.
  • gpt-4o-transcribe-diarize supports json, text, and diarized_json.
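The model/format compatibility rules above can be captured in a small lookup table. This is an illustrative sketch, not part of any client library; the helper name `supports_format` is hypothetical, and the model IDs and format names are taken directly from the list above.

```python
# Response formats each model supports, per the reference list above.
SUPPORTED_FORMATS = {
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
    "gpt-4o-mini-transcribe": {"json", "text", "verbose_json"},
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}

def supports_format(model: str, response_format: str) -> bool:
    """Return True if the given model supports the given response_format."""
    return response_format in SUPPORTED_FORMATS.get(model, set())
```

Validating client-side before the request avoids a round trip that would fail with an invalid-parameter error.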

temperature number Optional Defaults to 0
The sampling temperature, between 0 and 1.


timestamp_granularities[] array Optional Defaults to segment
The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of word and segment are supported.
Not supported for gpt-4o-transcribe-diarize.
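When word granularity is requested, the verbose_json response carries per-word timing. The sketch below flattens that into tuples; the field names (`words`, `word`, `start`, `end`) are assumptions based on typical verbose_json payloads, and the sample response is trimmed for illustration.

```python
import json

# A trimmed, hypothetical verbose_json response with word-level timestamps.
sample = json.loads("""
{
  "task": "transcribe",
  "text": "Hello there",
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.42},
    {"word": "there", "start": 0.48, "end": 0.91}
  ]
}
""")

def word_timestamps(response: dict) -> list[tuple[str, float, float]]:
    """Flatten the words array into (word, start_seconds, end_seconds) tuples."""
    return [(w["word"], w["start"], w["end"]) for w in response.get("words", [])]
```

A response without a words array (e.g. segment-only granularity) simply yields an empty list.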


include[] array Optional
Additional information to include in the transcription response. Supported values include logprobs.
logprobs is only supported for gpt-4o-transcribe and gpt-4o-mini-transcribe, and only when response_format is json.
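Returned logprobs can be turned into a rough confidence score by averaging the per-token probabilities. This is a sketch under the assumption that each entry is an object with a `logprob` field; the helper name is illustrative.

```python
import math

def mean_token_probability(logprobs: list[dict]) -> float:
    """Average per-token probability implied by a list of logprob entries.

    Assumes entries shaped like {"token": "...", "logprob": -0.1}; only the
    "logprob" field is read.
    """
    if not logprobs:
        return 0.0
    return sum(math.exp(entry["logprob"]) for entry in logprobs) / len(logprobs)
```

A value near 1.0 suggests the model was confident in its transcription; low averages can flag audio worth re-recording or reviewing.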


chunking_strategy object or string Optional
Controls how the audio is chunked for transcription. Defaults to auto, which uses server-side VAD.
You can also pass {"type":"server_vad", "prefix_padding_ms": 0, "silence_duration_ms": 500} to customize chunking.
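When sent as a multipart form field, the chunking_strategy object is encoded as a JSON string. A minimal sketch of building that field value, using the exact object from the example above:

```python
import json

# The custom server_vad chunking strategy shown above.
chunking_strategy = {
    "type": "server_vad",
    "prefix_padding_ms": 0,
    "silence_duration_ms": 500,
}

# Serialize to the JSON string that goes into the multipart form field.
form_value = json.dumps(chunking_strategy)
```

The equivalent curl flag would be `-F chunking_strategy='<that JSON string>'`.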


known_speaker_names array Optional
List of speaker names used by gpt-4o-transcribe-diarize to label speakers.


references array Optional
Reference examples for speaker voice attribution. Supported by gpt-4o-transcribe-diarize.


stream boolean Optional Defaults to false
If true, returns a stream of transcription events. Not supported for whisper-1.
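Streamed responses typically arrive as server-sent events. The parser below is a sketch under assumed framing: each event is a `data: {...}` line, the stream ends with `data: [DONE]`, and delta events carry a `delta` text field. None of those details are confirmed by this reference; check the actual wire format before relying on them.

```python
import json

def parse_stream(lines):
    """Yield parsed JSON payloads from SSE-style 'data: {...}' lines.

    The 'data:' framing and the [DONE] sentinel are assumptions about the
    stream format, used here for illustration only.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("data: ") and line != "data: [DONE]":
            yield json.loads(line[len("data: "):])

# Hypothetical sample stream.
sample = [
    'data: {"type": "transcript.text.delta", "delta": "Hello"}',
    'data: {"type": "transcript.text.delta", "delta": " world"}',
    "data: [DONE]",
]
text = "".join(event.get("delta", "") for event in parse_stream(sample))
```

Accumulating the deltas as they arrive lets a UI show the transcript incrementally instead of waiting for the full response.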

Returns


Returns the transcription object, a verbose transcription object, or a diarized transcription object.


The transcription object (JSON)


text string
The transcribed text.


OBJECT The transcription object (JSON)
json
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "input_tokens": 333,
    "input_duration_ms": 29801,
    "output_tokens": 67,
    "output_duration_ms": 0,
    "total_tokens": 400
  }
}
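Reading the usage block out of the example object above is straightforward; the sketch below also checks that input and output tokens sum to the reported total, which holds for the example shown (333 + 67 = 400).

```python
import json

# The example transcription object from the reference above (text abbreviated).
response = json.loads("""
{
  "text": "Imagine the wildest idea that you've ever had...",
  "usage": {
    "input_tokens": 333,
    "input_duration_ms": 29801,
    "output_tokens": 67,
    "output_duration_ms": 0,
    "total_tokens": 400
  }
}
""")

usage = response["usage"]
computed_total = usage["input_tokens"] + usage["output_tokens"]
```

Tracking these fields per request is a simple way to monitor transcription cost.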

Example

Request

bash
curl https://api.fastapi.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $FAST_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="gpt-4o-mini-transcribe"
bash
curl https://api.fastapi.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $FAST_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F "timestamp_granularities[]=word" \
  -F model="whisper-1" \
  -F response_format="verbose_json"
bash
curl https://api.fastapi.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $FAST_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="gpt-4o-mini-transcribe" \
  -F stream=true
