agents.models

The following model specification classes define a common interface for the initialization parameters of ML models across supported model serving platforms.

Module Contents

Classes

GenericLLM

A generic LLM configuration for OpenAI-compatible /v1/chat/completions APIs.

GenericMLLM

A generic Multimodal LLM configuration for OpenAI-compatible APIs.

GenericTTS

A generic Text-to-Speech model for OpenAI-compatible /v1/audio/speech APIs.

GenericSTT

A generic Speech-to-Text model for OpenAI-compatible /v1/audio/transcriptions APIs.

TransformersLLM

An LLM that can be initialized with any LLM checkpoint available on HuggingFace Transformers. This model can be used with a roboml client.

TransformersMLLM

An MLLM that can be initialized with any MLLM checkpoint available on HuggingFace Transformers. This model can be used with a roboml client.

OllamaModel

An Ollama model that is initialized with an Ollama tag as its checkpoint.

Whisper

Whisper is an automatic speech recognition (ASR) system by OpenAI, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. See OpenAI's Whisper documentation for details.

TransformersTTS

Generic text-to-speech model from HuggingFace Transformers.

VisionModel

Object Detection Model with Optional Tracking.

RoboBrain2

RoboBrain 2.0 by BAAI supports interactive reasoning with long-horizon planning and closed-loop feedback, spatial perception for precise point and bounding-box prediction from complex instructions, and temporal perception for future trajectory estimation.

@article{RoboBrain2.0TechnicalReport,
  title={RoboBrain 2.0 Technical Report},
  author={BAAI RoboBrain Team},
  journal={arXiv preprint arXiv:2507.02029},
  year={2025}
}

API

class agents.models.GenericLLM

Bases: agents.models.Model

A generic LLM configuration for OpenAI-compatible /v1/chat/completions APIs.

This class supports any model served via an OpenAI-compatible endpoint (e.g., vLLM, LMDeploy, DeepSeek, Groq, or OpenAI itself).

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The model identifier on the remote server (e.g., “gpt-4o”, “meta-llama/Llama-3-70b”). For OpenAI models, consult: https://platform.openai.com/docs/models

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

  • options (dict, optional) –

    Optional dictionary to configure default inference behavior. Options that conflict with component config options (such as max_tokens and temperature) will be overridden if set in the component config. Supported keys match standard OpenAI API parameters:

    • temperature (float): Sampling temperature (0-2).

    • top_p (float): Nucleus sampling probability.

    • max_tokens (int): Max tokens to generate.

    • presence_penalty (float): Penalty for new tokens (-2.0 to 2.0).

    • frequency_penalty (float): Penalty for frequent tokens (-2.0 to 2.0).

    • stop (str or list): Stop sequences.

    • seed (int): Random seed for deterministic sampling.

Example usage:

gpt4 = GenericLLM(
    name='gpt4',
    checkpoint="gpt-4o",
    options={"temperature": 0.7, "max_tokens": 500}
)
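The options above correspond to standard fields in the JSON body of an OpenAI-compatible /v1/chat/completions request. A minimal sketch of how model options and component config could combine into such a payload (illustrative only; the merge shown here is not the library's internal implementation):

```python
import json

# Inference options configured on the model (as in the GenericLLM example above)
model_options = {"temperature": 0.7, "max_tokens": 500}

# Per the docs, conflicting keys set in the component config take precedence
component_config = {"max_tokens": 200}

payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}],
    **model_options,
    **component_config,  # later keys win, so component config overrides
}

print(json.dumps(payload, indent=2))
```

The resulting payload uses the model's temperature (0.7) but the component's max_tokens (200), matching the documented override rule.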
get_init_params() Dict

Get init params from models

class agents.models.GenericMLLM

Bases: agents.models.GenericLLM

A generic Multimodal LLM configuration for OpenAI-compatible APIs.

Use this for models that accept image/audio inputs alongside text (e.g., GPT-4o, Claude 3.5 Sonnet via wrapper, Gemini via OpenAI adapter).

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The model identifier. Consult provider documentation.

  • options (dict, optional) – Optional dictionary for default inference parameters (see GenericLLM).

Example usage:

gpt4_vision = GenericMLLM(name='gpt4v', checkpoint="gpt-4o")
get_init_params() Dict

Get init params from models

class agents.models.GenericTTS

Bases: agents.models.Model

A generic Text-to-Speech model for OpenAI-compatible /v1/audio/speech APIs.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The model identifier (e.g., “tts-1”, “tts-1-hd”). For details: https://platform.openai.com/docs/models/tts

  • voice (str) – The voice ID to use. OpenAI standard voices: ‘alloy’, ‘echo’, ‘fable’, ‘onyx’, ‘nova’, ‘shimmer’. Other providers may have different IDs.

  • speed (float) – The speed of the generated audio. Select a value from 0.25 to 4.0. Default is 1.0.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

tts = GenericTTS(
    name='openai_tts',
    checkpoint="tts-1-hd",
    voice="nova",
    speed=1.2
)
get_init_params() Dict

Get init params from models

class agents.models.GenericSTT

Bases: agents.models.Model

A generic Speech-to-Text model for OpenAI-compatible /v1/audio/transcriptions APIs.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The model identifier (e.g., “whisper-1”). For details: https://platform.openai.com/docs/models/whisper

  • language (str, optional) – The language of the input audio (ISO-639-1 format, e.g., ‘en’, ‘fr’). Improves accuracy if known. Default is None (auto-detect).

  • temperature (float) – The sampling temperature (0-1). Lower values are more deterministic. Default is 0.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

stt = GenericSTT(
    name='openai_stt',
    checkpoint="whisper-1",
    language="en",
    temperature=0.2
)
get_init_params() Dict

Get init params from models

class agents.models.TransformersLLM

Bases: agents.models.LLM

An LLM that can be initialized with any LLM checkpoint available on HuggingFace Transformers. This model can be used with a roboml client.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The name of the pre-trained model’s checkpoint. Default is “Qwen/Qwen3-0.6B”. For available checkpoints consult HuggingFace LLM Models

  • quantization (str or None) – The quantization scheme used by the model. Can be one of “4bit”, “8bit” or None (default is “4bit”).

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

llm = TransformersLLM(name='llm', checkpoint="meta-llama/Meta-Llama-3.1-8B-Instruct")
get_init_params() Dict

Get init params from models

class agents.models.TransformersMLLM

Bases: agents.models.TransformersLLM

An MLLM that can be initialized with any MLLM checkpoint available on HuggingFace Transformers. This model can be used with a roboml client.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The name of the pre-trained model’s checkpoint. Default is “Qwen/Qwen2.5-VL-3B-Instruct”. For available checkpoints consult HuggingFace Image-Text to Text Models

  • quantization (str or None) – The quantization scheme used by the model. Can be one of “4bit”, “8bit” or None (default is “4bit”).

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

mllm = TransformersMLLM(name='mllm', checkpoint="Qwen/Qwen2.5-VL-3B-Instruct")
get_init_params() Dict

Get init params from models

class agents.models.OllamaModel

Bases: agents.models.LLM

An Ollama model that is initialized with an Ollama tag as its checkpoint.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The name of the pre-trained model’s checkpoint. For available checkpoints consult Ollama Models

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

  • options (dict, optional) –

    Optional dictionary to configure generation behavior. Options that conflict with component config options (such as num_predict and temperature) will be overridden if set in the component config. Only the following keys, with their specified value types, are allowed. For details, check the Ollama API documentation:

    • num_keep: int

    • seed: int

    • num_predict: int

    • top_k: int

    • top_p: float

    • min_p: float

    • typical_p: float

    • repeat_last_n: int

    • temperature: float

    • repeat_penalty: float

    • presence_penalty: float

    • frequency_penalty: float

    • penalize_newline: bool

    • stop: list of strings

    • numa: bool

    • num_ctx: int

    • num_batch: int

    • num_gpu: int

    • main_gpu: int

    • use_mmap: bool

    • num_thread: int

    • think: bool

Example usage:

llm = OllamaModel(
    name='ollama1',
    checkpoint="gemma2:latest",
    options={"temperature": 0.7, "num_predict": 50}
)
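Because only the listed keys and value types are accepted, the constraint can be illustrated with a small validator (a sketch under the documented key list; the library's actual validation code may differ):

```python
# Allowed Ollama option keys and their expected value types (from the list above)
ALLOWED_OPTIONS = {
    "num_keep": int, "seed": int, "num_predict": int, "top_k": int,
    "top_p": float, "min_p": float, "typical_p": float,
    "repeat_last_n": int, "temperature": float, "repeat_penalty": float,
    "presence_penalty": float, "frequency_penalty": float,
    "penalize_newline": bool, "stop": list, "numa": bool,
    "num_ctx": int, "num_batch": int, "num_gpu": int, "main_gpu": int,
    "use_mmap": bool, "num_thread": int, "think": bool,
}

def validate_options(options: dict) -> dict:
    """Reject unknown keys and values of the wrong type."""
    validated = {}
    for key, value in options.items():
        expected = ALLOWED_OPTIONS.get(key)
        if expected is None:
            raise ValueError(f"Unknown Ollama option: {key}")
        # bool is a subclass of int in Python, so reject bools explicitly
        # for options that expect a plain int
        if expected is not bool and isinstance(value, bool):
            raise TypeError(f"{key} must be {expected.__name__}, got bool")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be {expected.__name__}")
        validated[key] = value
    return validated

print(validate_options({"temperature": 0.7, "num_predict": 50}))
```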
get_init_params() Dict

Get init params from models

class agents.models.Whisper

Bases: agents.models.Model

Whisper is an automatic speech recognition (ASR) system by OpenAI, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. See OpenAI's Whisper documentation for details.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – Size of the model to use (tiny, tiny.en, base, base.en, small, small.en, distil-small.en, medium, medium.en, distil-medium.en, large-v1, large-v2, large-v3, large, distil-large-v2, distil-large-v3, large-v3-turbo, or turbo). For more information, consult the Whisper model documentation.

  • compute_type (str or None) – The compute type used by the model. Can be one of “int8”, “fp16”, “fp32”, None (default is “int8”).

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

whisper = Whisper(name='s2t', checkpoint="small") # Initialize with a different checkpoint
get_init_params() Dict

Get init params from models

class agents.models.TransformersTTS

Bases: agents.models.Model

Generic text-to-speech model from HuggingFace Transformers.

Supports all TTS models registered in Transformers including Bark, VITS, SpeechT5, SeamlessM4T, and more. The model automatically detects whether to use generative inference (Bark, SpeechT5) or forward-only inference (VITS).

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The HuggingFace model ID. Default is “facebook/mms-tts-eng” (VITS). Other examples: “suno/bark-small”, “microsoft/speecht5_tts” (SpeechT5).

  • voice (str, optional) – Voice preset. For Bark, use presets like “v2/en_speaker_6”. For other models, this may be unused. Default is “v2/en_speaker_6”.

  • vocoder_checkpoint (str, optional) – Vocoder model ID for spectrogram models (e.g. SpeechT5). If not provided, defaults to “microsoft/speecht5_hifigan” when needed.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

tts = TransformersTTS(name='t2s')  # Default VITS (facebook/mms-tts-eng)
tts = TransformersTTS(name='t2s', checkpoint="facebook/mms-tts-eng", voice=None)  # VITS
tts = TransformersTTS(name='t2s', checkpoint="microsoft/speecht5_tts", vocoder_checkpoint="microsoft/speecht5_hifigan")  # SpeechT5
get_init_params() Dict

Get init params from models

class agents.models.VisionModel

Bases: agents.models.Model

Object Detection Model with Optional Tracking.

This vision model provides a flexible framework for object detection and tracking using HuggingFace Transformers. It supports any HuggingFace detection model (RT-DETR, DETR, Grounding DINO, YOLOS, etc.) with optional ByteTrack tracking.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – HuggingFace model ID for object detection. Default is “PekingU/rtdetr_r50vd_coco_o365”. For available models see HuggingFace Object Detection Models.

  • setup_trackers (bool) – Whether to set up ByteTrack trackers. Default is False.

  • tracking_distance_threshold (int) – The IoU threshold (as percentage) for tracking association. Default is 30, with a minimum value of 1.

  • _num_trackers (int) – The number of trackers to use. This number depends on the number of input image streams being given to the component. It is set automatically if setup_trackers is True.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

model = VisionModel(name='detection1', setup_trackers=True, tracking_distance_threshold=20)
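The tracking_distance_threshold above is an IoU threshold expressed as a percentage: a detection is associated with an existing track only when their boxes overlap sufficiently. A minimal IoU sketch (illustrative only; ByteTrack's actual association logic also factors in detection scores):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

tracking_distance_threshold = 20  # percent, as in the example above

# Two heavily overlapping boxes: associate when IoU clears the threshold
overlap = iou((0, 0, 100, 100), (10, 10, 110, 110))
print(overlap >= tracking_distance_threshold / 100)
```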
get_init_params() Dict

Get init params from models

class agents.models.RoboBrain2

Bases: agents.models.Model

RoboBrain 2.0 by BAAI supports interactive reasoning with long-horizon planning and closed-loop feedback, spatial perception for precise point and bounding-box prediction from complex instructions, and temporal perception for future trajectory estimation.

@article{RoboBrain2.0TechnicalReport,
  title={RoboBrain 2.0 Technical Report},
  author={BAAI RoboBrain Team},
  journal={arXiv preprint arXiv:2507.02029},
  year={2025}
}

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The name of the pre-trained model’s checkpoint. Default is “BAAI/RoboBrain2.0-3B”. For available checkpoints consult RoboBrain2 Model Collection on HuggingFace.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

robobrain = RoboBrain2(name='robobrain', checkpoint="BAAI/RoboBrain2.0-32B")
get_init_params() Dict

Get init params from models