victor-upmeet/whisperx

Accelerated transcription, word-level timestamps and diarization with whisperX large-v3

Input
Configure the inputs for the AI model.

- Audio file: the audio to transcribe.
- Language: ISO code of the language spoken in the audio; specify None to perform language detection.
- Language detection min probability: if no language is specified, the language is detected recursively on different parts of the file until detection reaches the given probability.
- Language detection max retries: if no language is specified, detection follows the logic of the min-probability setting above, but stops after the given number of retries; if the limit is reached, the most probable language is kept.
- Initial prompt: optional text to provide as a prompt for the first window.
- Batch size: parallelization of the input audio transcription.
- Temperature: temperature to use for sampling.
- VAD onset: voice activity detection (VAD) onset threshold.
- VAD offset: voice activity detection (VAD) offset threshold.
- Align output: aligns the Whisper output to get accurate word-level timestamps.
- Diarization: assigns speaker ID labels.
- HuggingFace access token: required to enable diarization. Provide a token with read permissions, and accept the user agreement for the models specified in the README.
- Min speakers: minimum number of speakers if diarization is activated (leave blank if unknown).
- Max speakers: maximum number of speakers if diarization is activated (leave blank if unknown).
- Debug: print out compute/inference times and memory usage information.
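As a quick orientation, here is a minimal sketch of invoking the model through the Replicate Python client. The snake_case input names (audio_file, align_output, huggingface_access_token, and so on) are assumptions inferred from the field labels above, not a documented schema; check the model's API reference for the exact names and defaults.

```python
# Minimal sketch: transcription with alignment and diarization enabled.
# All input keys below are assumed from the field labels above.
import replicate

with open("interview.mp3", "rb") as audio:  # hypothetical local file
    output = replicate.run(
        "victor-upmeet/whisperx",  # resolves to the model's latest version
        input={
            "audio_file": audio,
            "language": "en",          # ISO code; omit to auto-detect
            "batch_size": 64,          # parallelization of transcription
            "temperature": 0.0,        # sampling temperature
            "align_output": True,      # word-level timestamps
            "diarization": True,       # assign speaker ID labels
            "huggingface_access_token": "hf_...",  # read token; model agreements accepted
            "min_speakers": 2,         # omit if unknown
            "max_speakers": 4,         # omit if unknown
            "debug": True,             # print timing/memory info
        },
    )
print(output)
```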

Output
The output contains the transcription with word-level timestamps and, when diarization is enabled, speaker labels.

