Version: 2.26

Llama.cpp

haystack_integrations.components.generators.llama_cpp.chat.chat_generator

LlamaCppChatGenerator

Provides an interface to generate text using an LLM via llama.cpp.

llama.cpp is a project written in C/C++ for efficient inference of LLMs. It employs the quantized GGUF format, suitable for running these models on standard machines (even without GPUs). This component supports both text-only and multimodal (text + image) models such as LLaVA.

Usage example:

python

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

user_message = [ChatMessage.from_user("Who is the best American actor?")]
generator = LlamaCppChatGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048, n_batch=512)

print(generator.run(user_message, generation_kwargs={"max_tokens": 128}))
# {"replies": [ChatMessage(content="John Cusack", role=<ChatRole.ASSISTANT: "assistant">, name=None, meta={...})]}

Usage example with multimodal (image + text):

python

from haystack.dataclasses import ChatMessage, ImageContent
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Create an image from file path or base64
image_content = ImageContent.from_file_path("path/to/your/image.jpg")

# Create a multimodal message with both text and image
messages = [ChatMessage.from_user(content_parts=["What's in this image?", image_content])]

# Initialize with multimodal support
generator = LlamaCppChatGenerator(
    model="llava-v1.5-7b-q4_0.gguf",
    chat_handler_name="Llava15ChatHandler",  # Use llava-1-5 handler
    model_clip_path="mmproj-model-f16.gguf",  # CLIP model
    n_ctx=4096  # Larger context for image processing
)

result = generator.run(messages)
print(result)

__init__

python

__init__(
    model: str,
    n_ctx: int | None = 0,
    n_batch: int | None = 512,
    model_kwargs: dict[str, Any] | None = None,
    generation_kwargs: dict[str, Any] | None = None,
    *,
    tools: ToolsType | None = None,
    streaming_callback: StreamingCallbackT | None = None,
    chat_handler_name: str | None = None,
    model_clip_path: str | None = None
) -> None

Initialize LlamaCppChatGenerator.

Parameters:

model (str) – The path to a quantized model for text generation, for example, "zephyr-7b-beta.Q4_0.gguf". If the model path is also specified in model_kwargs, this parameter is ignored.
n_ctx (int | None) – The number of tokens in the context. When set to 0, the context will be taken from the model.
n_batch (int | None) – Maximum batch size for prompt processing.
model_kwargs (dict[str, Any] | None) – Dictionary containing keyword arguments used to initialize the LLM for text generation. These keyword arguments provide fine-grained control over model loading. If a key duplicates an init parameter, the value in model_kwargs overrides model, n_ctx, or n_batch. For more information on the available kwargs, see llama.cpp documentation.
generation_kwargs (dict[str, Any] | None) – A dictionary containing keyword arguments to customize text generation. For more information on the available kwargs, see llama.cpp documentation.
tools (ToolsType | None) – A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls. Each tool should have a unique name.
streaming_callback (StreamingCallbackT | None) – A callback function that is called when a new token is received from the stream.
chat_handler_name (str | None) – Name of the chat handler for multimodal models. Common options include: "Llava16ChatHandler", "MoondreamChatHandler", "Qwen25VLChatHandler". For other handlers, check llama-cpp-python documentation.
model_clip_path (str | None) – Path to the CLIP model for vision processing (e.g., "mmproj.bin"). Required when chat_handler_name is provided for multimodal models.
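
The precedence between model_kwargs and the explicit init parameters can be sketched in plain Python. This is an illustration of the rule described above, not the component's actual code: the helper name resolve_model_kwargs is hypothetical, and the sketch assumes the model path travels under the model_path key, which is the name the llama-cpp-python Llama constructor uses.

```python
from __future__ import annotations

from typing import Any


def resolve_model_kwargs(
    model: str,
    n_ctx: int | None = 0,
    n_batch: int | None = 512,
    model_kwargs: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: merge the init parameters into model_kwargs.

    Keys already present in model_kwargs win over the explicit arguments.
    """
    resolved = dict(model_kwargs or {})  # copy so the caller's dict is untouched
    resolved.setdefault("model_path", model)  # assumed key name (llama-cpp-python)
    resolved.setdefault("n_ctx", n_ctx)
    resolved.setdefault("n_batch", n_batch)
    return resolved


merged = resolve_model_kwargs(
    model="zephyr-7b-beta.Q4_0.gguf",
    n_ctx=2048,
    model_kwargs={"n_ctx": 4096, "n_gpu_layers": -1},
)
print(merged)
```

In this sketch the n_ctx value from model_kwargs replaces the explicit n_ctx=2048, while model and n_batch fall back to the explicit or default values because model_kwargs does not mention them.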

warm_up

python

warm_up() -> None

Load and initialize the llama.cpp model.

to_dict

python

to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

dict[str, Any] – Dictionary with serialized data.

from_dict

python

from_dict(data: dict[str, Any]) -> LlamaCppChatGenerator

Deserializes the component from a dictionary.

Parameters:

data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

LlamaCppChatGenerator – Deserialized component.

run

python

run(
    messages: list[ChatMessage] | str,
    generation_kwargs: dict[str, Any] | None = None,
    *,
    tools: ToolsType | None = None,
    streaming_callback: StreamingCallbackT | None = None
) -> dict[str, list[ChatMessage]]

Run the text generation model on the given list of ChatMessages.

Parameters:

messages (list[ChatMessage] | str) – A list of ChatMessage instances representing the input messages. If a string is provided, it is converted to a list containing a ChatMessage with user role.
generation_kwargs (dict[str, Any] | None) – A dictionary containing keyword arguments to customize text generation. For more information on the available kwargs, see llama.cpp documentation.
tools (ToolsType | None) – A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls. Each tool should have a unique name. If set, it will override the tools parameter set during component initialization.
streaming_callback (StreamingCallbackT | None) – A callback function that is called when a new token is received from the stream. If set, it will override the streaming_callback parameter set during component initialization.

Returns:

dict[str, list[ChatMessage]] – A dictionary with the following keys:
replies: The responses from the model
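
A streaming callback is a callable that receives each chunk as it arrives. The sketch below prints and collects the chunk text. It assumes each chunk exposes its text on a content attribute, as Haystack's StreamingChunk does, and uses SimpleNamespace objects as stand-ins so that it runs without a model. The names collect_chunk and collected are illustrative.

```python
from types import SimpleNamespace

collected: list[str] = []


def collect_chunk(chunk) -> None:
    """Print each piece of text as it arrives and keep it for later."""
    print(chunk.content, end="", flush=True)
    collected.append(chunk.content)


# Stand-in chunks; in real use the generator invokes the callback itself, e.g.
# generator.run(messages, streaming_callback=collect_chunk)
for piece in ["John", " ", "Cusack"]:
    collect_chunk(SimpleNamespace(content=piece))

full_text = "".join(collected)
```

Passing the callback to run overrides any callback given at initialization, as described in the parameter list above.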

run_async

python

run_async(
    messages: list[ChatMessage] | str,
    generation_kwargs: dict[str, Any] | None = None,
    *,
    tools: ToolsType | None = None,
    streaming_callback: StreamingCallbackT | None = None
) -> dict[str, list[ChatMessage]]

Async version of run. Runs the text generation model on the given list of ChatMessages.

Uses a thread pool to avoid blocking the event loop, since llama-cpp-python provides only synchronous inference.

Parameters:

messages (list[ChatMessage] | str) – A list of ChatMessage instances representing the input messages. If a string is provided, it is converted to a list containing a ChatMessage with user role.
generation_kwargs (dict[str, Any] | None) – A dictionary containing keyword arguments to customize text generation. For more information on the available kwargs, see llama.cpp documentation.
tools (ToolsType | None) – A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls. Each tool should have a unique name. If set, it will override the tools parameter set during component initialization.
streaming_callback (StreamingCallbackT | None) – A callback function that is called when a new token is received from the stream. If set, it will override the streaming_callback parameter set during component initialization.

Returns:

dict[str, list[ChatMessage]] – A dictionary with the following keys:
replies: The responses from the model
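
The thread-pool approach described for run_async can be sketched with the standard library alone. This is not the component's implementation: blocking_generate is a hypothetical stand-in for a synchronous inference call, and asyncio.to_thread is only one of several ways to offload such a call to a worker thread.

```python
import asyncio
import time


def blocking_generate(prompt: str) -> str:
    """Hypothetical stand-in for a synchronous inference call."""
    time.sleep(0.1)  # simulate work that would otherwise block the event loop
    return f"reply to: {prompt}"


async def generate_async(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_generate, prompt)


async def main() -> list[str]:
    # Both calls are offloaded to worker threads and awaited concurrently.
    return await asyncio.gather(generate_async("first"), generate_async("second"))


replies = asyncio.run(main())
print(replies)
```

With the real component, the corresponding call inside a coroutine would be await generator.run_async(messages).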

haystack_integrations.components.generators.llama_cpp.generator

LlamaCppGenerator

Provides an interface to generate text using an LLM via llama.cpp.

llama.cpp is a project written in C/C++ for efficient inference of LLMs. It employs the quantized GGUF format, suitable for running these models on standard machines (even without GPUs).

Usage example:

python

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator
generator = LlamaCppGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048, n_batch=512)

print(generator.run("Who is the best American actor?", generation_kwargs={"max_tokens": 128}))
# {'replies': ['John Cusack'], 'meta': [{"object": "text_completion", ...}]}

__init__

python

__init__(
    model: str,
    n_ctx: int | None = 0,
    n_batch: int | None = 512,
    model_kwargs: dict[str, Any] | None = None,
    generation_kwargs: dict[str, Any] | None = None,
) -> None

Initialize LlamaCppGenerator.

Parameters:

model (str) – The path to a quantized model for text generation, for example, "zephyr-7b-beta.Q4_0.gguf". If the model path is also specified in model_kwargs, this parameter is ignored.
n_ctx (int | None) – The number of tokens in the context. When set to 0, the context will be taken from the model.
n_batch (int | None) – Maximum batch size for prompt processing.
model_kwargs (dict[str, Any] | None) – Dictionary containing keyword arguments used to initialize the LLM for text generation. These keyword arguments provide fine-grained control over model loading. If a key duplicates an init parameter, the value in model_kwargs overrides model, n_ctx, or n_batch. For more information on the available kwargs, see llama.cpp documentation.
generation_kwargs (dict[str, Any] | None) – A dictionary containing keyword arguments to customize text generation. For more information on the available kwargs, see llama.cpp documentation.

warm_up

python

warm_up() -> None

Load and initialize the llama.cpp model.

run

python

run(
    prompt: str, generation_kwargs: dict[str, Any] | None = None
) -> dict[str, list[str] | list[dict[str, Any]]]

Run the text generation model on the given prompt.

Parameters:

prompt (str) – The prompt to be sent to the generative model.
generation_kwargs (dict[str, Any] | None) – A dictionary containing keyword arguments to customize text generation. For more information on the available kwargs, see llama.cpp documentation.

Returns:

dict[str, list[str] | list[dict[str, Any]]] – A dictionary with the following keys:
replies: The list of replies generated by the model.
meta: Metadata about the request.
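
The replies and meta lists can be read side by side. The sketch below uses a hand-written dictionary shaped like the usage example's output rather than a real model call, and it assumes that each meta entry corresponds to the reply at the same index.

```python
# Hand-written result in the documented shape; a real call would be
# result = generator.run("Who is the best American actor?")
result = {
    "replies": ["John Cusack"],
    "meta": [{"object": "text_completion"}],
}

# Pair each reply with its metadata entry (assumed to share the same index).
pairs = list(zip(result["replies"], result["meta"]))
for reply, meta in pairs:
    print(f"{reply!r} ({meta.get('object')})")
```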
