Version: 2.31-unstable

FunASR

haystack_integrations.components.audio.funasr.transcriber

FunASRTranscriber

Transcribes audio files to Documents using FunASR.

FunASR is an open-source speech recognition toolkit from Alibaba DAMO Academy. It supports 50+ languages, speaker diarization, and timestamp extraction, and runs entirely locally — no API key required.

Models are downloaded from ModelScope on first use and cached in ~/.cache/modelscope.

Usage Example:

python

from haystack_integrations.components.audio.funasr import FunASRTranscriber

transcriber = FunASRTranscriber()
result = transcriber.run(sources=["speech.wav", "interview.mp3"])
documents = result["documents"]

Speaker diarization and punctuation:

python

from haystack.utils import ComponentDevice

transcriber = FunASRTranscriber(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",
    device=ComponentDevice.from_str("cuda"),
)

SenseVoice with inverse text normalisation:

python

transcriber = FunASRTranscriber(
    model="iic/SenseVoiceSmall",
    generation_kwargs={"use_itn": True, "merge_vad": True, "language": "auto"},
)

init

python

__init__(
    *,
    model: str = "iic/SenseVoiceSmall",
    vad_model: str | None = "fsmn-vad",
    punc_model: str | None = "ct-punc",
    spk_model: str | None = None,
    device: ComponentDevice | None = None,
    batch_size_s: int = 300,
    store_full_path: bool = False,
    generation_kwargs: dict[str, Any] | None = None
) -> None

Create a FunASRTranscriber component.

Parameters:

model (str) – FunASR model name or local path. Defaults to "iic/SenseVoiceSmall", a multilingual model supporting 50+ languages that is 5-10x faster than Whisper. Alternatives include "paraformer-zh" (Chinese) or "paraformer-en" (English). Browse available models at https://modelscope.github.io/FunASR/model-selection.html.
vad_model (str | None) – Voice activity detection model used to split long audio into segments. Set to None to process the audio as a single stream. Browse available VAD models at https://www.modelscope.cn/models.
punc_model (str | None) – Punctuation restoration model. Set to None to disable punctuation. Browse available punctuation models at https://www.modelscope.cn/models.
spk_model (str | None) – Speaker diarization model (e.g. "cam++"). When set, a "speakers" key is included in the Document metadata. Defaults to None (diarization disabled). Browse available speaker diarization models at https://www.modelscope.cn/models.
device (ComponentDevice | None) – The device to run inference on. If None, the default device is selected automatically. Use ComponentDevice.from_str("cuda") for GPU inference.
batch_size_s (int) – Batch size in seconds for VAD-segmented audio. Larger values improve throughput at the cost of memory.
store_full_path (bool) – If True, store the full audio file path in Document metadata. If False (default), store only the file name.
generation_kwargs (dict[str, Any] | None) – Extra keyword arguments forwarded to AutoModel.generate(). Use this for model-specific options such as use_itn=True or merge_vad=True for SenseVoice, or hotword="..." for contextual recognition.

warm_up

python

warm_up() -> None

Load the FunASR model into memory.

Models are downloaded from ModelScope on first call and cached locally. This method is idempotent — calling it multiple times is safe.

to_dict

python

to_dict() -> dict[str, Any]

Serialize the component to a dictionary.

Returns:

dict[str, Any] – Dictionary with serialized data.

from_dict

python

from_dict(data: dict[str, Any]) -> FunASRTranscriber

Deserialize the component from a dictionary.

Parameters:

data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

FunASRTranscriber – Deserialized component.

run

python

run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]

Transcribe audio sources to Documents.

Parameters:

sources (list[str | Path | ByteStream]) – Audio file paths (str or Path) or ByteStream objects. Supported formats: WAV, MP3, FLAC, OGG, M4A, AAC, and any format that FunASR's underlying audio backend (soundfile/ffmpeg) can decode.
meta (dict[str, Any] | list[dict[str, Any]] | None) – Metadata to attach to the produced Documents. Pass a single dict to apply the same metadata to all Documents, or a list aligned with sources.

Returns:

dict[str, list[Document]] – Dictionary with key "documents" — one Document per source whose content holds the full transcript text.

haystack_integrations.components.audio.funasr.transcriber​

FunASRTranscriber​

init​

warm_up​

to_dict​

from_dict​

run​

haystack_integrations.components.audio.funasr.transcriber

FunASRTranscriber

init

warm_up

to_dict

from_dict

run