agents.models

The following model specification classes define a common interface for the initialization parameters of ML models across supported model serving platforms.

Module Contents

Classes

GenericLLM

A generic LLM configuration for OpenAI-compatible /v1/chat/completions APIs.

GenericMLLM

A generic Multimodal LLM configuration for OpenAI-compatible APIs.

GenericTTS

A generic Text-to-Speech model for OpenAI-compatible /v1/audio/speech APIs.

GenericSTT

A generic Speech-to-Text model for OpenAI-compatible /v1/audio/transcriptions APIs.

TransformersLLM

An LLM that can be initialized with any LLM checkpoint available on HuggingFace Transformers. This model can be used with a roboml client.

TransformersMLLM

An MLLM that can be initialized with any MLLM checkpoint available on HuggingFace Transformers. This model can be used with a roboml client.

OllamaModel

An Ollama model that is initialized with an Ollama tag as its checkpoint.

Whisper

Whisper is an automatic speech recognition (ASR) system by OpenAI, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. See OpenAI's Whisper documentation for details.

TransformersTTS

Generic text-to-speech model from HuggingFace Transformers.

VisionModel

Object Detection Model with Optional Tracking.

RoboBrain2

RoboBrain 2.0 by BAAI supports interactive reasoning with long-horizon planning and closed-loop feedback, spatial perception for precise point and bounding-box prediction from complex instructions, and temporal perception for future trajectory estimation.

@article{RoboBrain2.0TechnicalReport,
  title={RoboBrain 2.0 Technical Report},
  author={BAAI RoboBrain Team},
  journal={arXiv preprint arXiv:2507.02029},
  year={2025}
}

API

class agents.models.GenericLLM

Bases: agents.models.Model

A generic LLM configuration for OpenAI-compatible /v1/chat/completions APIs.

This class supports any model served via an OpenAI-compatible endpoint (e.g., vLLM, LMDeploy, DeepSeek, Groq, or OpenAI itself).

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The model identifier on the remote server (e.g., “gpt-4o”, “meta-llama/Llama-3-70b”). For OpenAI models, consult: https://platform.openai.com/docs/models

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

  • options (dict, optional) –

    Optional dictionary to configure default inference behavior. Options that conflict with component config options (such as max_tokens and temperature) will be overridden if set in the component config. Supported keys match standard OpenAI API parameters:

    • temperature (float): Sampling temperature (0-2).

    • top_p (float): Nucleus sampling probability.

    • max_tokens (int): Max tokens to generate.

    • presence_penalty (float): Penalty for new tokens (-2.0 to 2.0).

    • frequency_penalty (float): Penalty for frequent tokens (-2.0 to 2.0).

    • stop (str or list): Stop sequences.

    • seed (int): Random seed for deterministic sampling.

Example usage:

gpt4 = GenericLLM(
    name='gpt4',
    checkpoint="gpt-4o",
    options={"temperature": 0.7, "max_tokens": 500}
)
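The options above correspond to standard fields in the JSON body of an OpenAI-compatible /v1/chat/completions request. A minimal sketch of how model options and component config could combine into such a payload (illustrative only; the merge shown here is not the library's internal implementation):

```python
import json

# Inference options configured on the model (as in the GenericLLM example above)
model_options = {"temperature": 0.7, "max_tokens": 500}

# Per the docs, conflicting keys set in the component config take precedence
component_config = {"max_tokens": 200}

payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}],
    **model_options,
    **component_config,  # later keys win, so component config overrides
}

print(json.dumps(payload, indent=2))
```

The resulting payload uses the model's temperature (0.7) but the component's max_tokens (200), matching the documented override rule.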
get_init_params() Dict

Get init params from models

class agents.models.GenericMLLM

Bases: agents.models.GenericLLM

A generic Multimodal LLM configuration for OpenAI-compatible APIs.

Use this for models that accept image/audio inputs alongside text (e.g., GPT-4o, Claude 3.5 Sonnet via wrapper, Gemini via OpenAI adapter).

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The model identifier. Consult provider documentation.

  • options (dict, optional) – Optional dictionary for default inference parameters (see GenericLLM).

Example usage:

gpt4_vision = GenericMLLM(name='gpt4v', checkpoint="gpt-4o")
get_init_params() Dict

Get init params from models

class agents.models.GenericTTS

Bases: agents.models.Model

A generic Text-to-Speech model for OpenAI-compatible /v1/audio/speech APIs.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The model identifier (e.g., “tts-1”, “tts-1-hd”). For details: https://platform.openai.com/docs/models/tts

  • voice (str) – The voice ID to use. OpenAI standard voices: ‘alloy’, ‘echo’, ‘fable’, ‘onyx’, ‘nova’, ‘shimmer’. Other providers may have different IDs.

  • speed (float) – The speed of the generated audio. Select a value from 0.25 to 4.0. Default is 1.0.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

tts = GenericTTS(
    name='openai_tts',
    checkpoint="tts-1-hd",
    voice="nova",
    speed=1.2
)
get_init_params() Dict

Get init params from models

class agents.models.GenericSTT

Bases: agents.models.Model

A generic Speech-to-Text model for OpenAI-compatible /v1/audio/transcriptions APIs.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The model identifier (e.g., “whisper-1”). For details: https://platform.openai.com/docs/models/whisper

  • language (str, optional) – The language of the input audio (ISO-639-1 format, e.g., ‘en’, ‘fr’). Improves accuracy if known. Default is None (auto-detect).

  • temperature (float) – The sampling temperature (0-1). Lower values are more deterministic. Default is 0.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

stt = GenericSTT(
    name='openai_stt',
    checkpoint="whisper-1",
    language="en",
    temperature=0.2
)
get_init_params() Dict

Get init params from models

class agents.models.TransformersLLM

Bases: agents.models.LLM

An LLM that can be initialized with any LLM checkpoint available on HuggingFace Transformers. This model can be used with a roboml client.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The name of the pre-trained model’s checkpoint. Default is “Qwen/Qwen3-0.6B”. For available checkpoints consult HuggingFace LLM Models

  • quantization (str or None) – The quantization scheme used by the model. Can be one of “4bit”, “8bit” or None (default is “4bit”).

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

llm = TransformersLLM(name='llm', checkpoint="meta-llama/Meta-Llama-3.1-8B-Instruct")
get_init_params() Dict

Get init params from models

class agents.models.TransformersMLLM

Bases: agents.models.TransformersLLM

An MLLM that can be initialized with any MLLM checkpoint available on HuggingFace Transformers. This model can be used with a roboml client.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The name of the pre-trained model’s checkpoint. Default is “Qwen/Qwen2.5-VL-3B-Instruct”. For available checkpoints consult HuggingFace Image-Text to Text Models

  • quantization (str or None) – The quantization scheme used by the model. Can be one of “4bit”, “8bit” or None (default is “4bit”).

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

mllm = TransformersMLLM(name='mllm', checkpoint="Qwen/Qwen2.5-VL-3B-Instruct")
get_init_params() Dict

Get init params from models

class agents.models.OllamaModel

Bases: agents.models.LLM

An Ollama model that is initialized with an Ollama tag as its checkpoint.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The name of the pre-trained model’s checkpoint. For available checkpoints consult Ollama Models

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

  • options (dict, optional) –

    Optional dictionary to configure generation behavior. Options that conflict with component config options (such as num_predict and temperature) will be overridden if set in the component config. Only the following keys, with their specified value types, are allowed. For details, check the Ollama API documentation:

    • num_keep: int

    • seed: int

    • num_predict: int

    • top_k: int

    • top_p: float

    • min_p: float

    • typical_p: float

    • repeat_last_n: int

    • temperature: float

    • repeat_penalty: float

    • presence_penalty: float

    • frequency_penalty: float

    • penalize_newline: bool

    • stop: list of strings

    • numa: bool

    • num_ctx: int

    • num_batch: int

    • num_gpu: int

    • main_gpu: int

    • use_mmap: bool

    • num_thread: int

    • think: bool

Example usage:

llm = OllamaModel(
    name='ollama1',
    checkpoint="gemma2:latest",
    options={"temperature": 0.7, "num_predict": 50}
)
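Because only the listed keys and value types are accepted, the constraint can be illustrated with a small validator (a sketch under the documented key list; the library's actual validation code may differ):

```python
# Allowed Ollama option keys and their expected value types (from the list above)
ALLOWED_OPTIONS = {
    "num_keep": int, "seed": int, "num_predict": int, "top_k": int,
    "top_p": float, "min_p": float, "typical_p": float,
    "repeat_last_n": int, "temperature": float, "repeat_penalty": float,
    "presence_penalty": float, "frequency_penalty": float,
    "penalize_newline": bool, "stop": list, "numa": bool,
    "num_ctx": int, "num_batch": int, "num_gpu": int, "main_gpu": int,
    "use_mmap": bool, "num_thread": int, "think": bool,
}

def validate_options(options: dict) -> dict:
    """Reject unknown keys and values of the wrong type."""
    validated = {}
    for key, value in options.items():
        expected = ALLOWED_OPTIONS.get(key)
        if expected is None:
            raise ValueError(f"Unknown Ollama option: {key}")
        # bool is a subclass of int in Python, so reject bools explicitly
        # for options that expect a plain int
        if expected is not bool and isinstance(value, bool):
            raise TypeError(f"{key} must be {expected.__name__}, got bool")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be {expected.__name__}")
        validated[key] = value
    return validated

print(validate_options({"temperature": 0.7, "num_predict": 50}))
```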
get_init_params() Dict

Get init params from models

class agents.models.Whisper

Bases: agents.models.Model

Whisper is an automatic speech recognition (ASR) system by OpenAI, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. See OpenAI's Whisper documentation for details.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – Size of the model to use (tiny, tiny.en, base, base.en, small, small.en, distil-small.en, medium, medium.en, distil-medium.en, large-v1, large-v2, large-v3, large, distil-large-v2, distil-large-v3, large-v3-turbo, or turbo). For more information, consult the Whisper model documentation.

  • compute_type (str or None) – The compute type used by the model. Can be one of “int8”, “fp16”, “fp32”, None (default is “int8”).

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

whisper = Whisper(name='s2t', checkpoint="small") # Initialize with a different checkpoint
get_init_params() Dict

Get init params from models

class agents.models.TransformersTTS

Bases: agents.models.Model

Generic text-to-speech model from HuggingFace Transformers.

Supports all TTS models registered in Transformers including Bark, VITS, SpeechT5, SeamlessM4T, and more. The model automatically detects whether to use generative inference (Bark, SpeechT5) or forward-only inference (VITS).

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The HuggingFace model ID. Default is “facebook/mms-tts-eng” (VITS). Other examples: “suno/bark-small”, “microsoft/speecht5_tts” (SpeechT5).

  • voice (str, optional) – Voice preset. For Bark, use presets like “v2/en_speaker_6”. For other models, this may be unused. Default is “v2/en_speaker_6”.

  • vocoder_checkpoint (str, optional) – Vocoder model ID for spectrogram models (e.g. SpeechT5). If not provided, defaults to “microsoft/speecht5_hifigan” when needed.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

tts = TransformersTTS(name='t2s')  # Default VITS (facebook/mms-tts-eng)
tts = TransformersTTS(name='t2s', checkpoint="facebook/mms-tts-eng", voice=None)  # VITS
tts = TransformersTTS(name='t2s', checkpoint="microsoft/speecht5_tts", vocoder_checkpoint="microsoft/speecht5_hifigan")  # SpeechT5
get_init_params() Dict

Get init params from models

class agents.models.VisionModel

Bases: agents.models.Model

Object Detection Model with Optional Tracking.

This vision model provides a flexible framework for object detection and tracking using HuggingFace Transformers. It supports any HuggingFace detection model (RT-DETR, DETR, Grounding DINO, YOLOS, etc.) with optional ByteTrack tracking.

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – HuggingFace model ID for object detection. Default is “PekingU/rtdetr_r50vd_coco_o365”. For available models see HuggingFace Object Detection Models.

  • setup_trackers (bool) – Whether to set up ByteTrack trackers. Default is False.

  • tracking_distance_threshold (int) – The IoU threshold (as percentage) for tracking association. Default is 30, with a minimum value of 1.

  • _num_trackers (int) – The number of trackers to use. This number depends on the number of input image streams being given to the component. It is set automatically if setup_trackers is True.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

model = VisionModel(name='detection1', setup_trackers=True, tracking_distance_threshold=20)
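The tracking_distance_threshold above is an IoU threshold expressed as a percentage: a detection is associated with an existing track only when their boxes overlap sufficiently. A minimal IoU sketch (illustrative only; ByteTrack's actual association logic also factors in detection scores):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

tracking_distance_threshold = 20  # percent, as in the example above

# Two heavily overlapping boxes: associate when IoU clears the threshold
overlap = iou((0, 0, 100, 100), (10, 10, 110, 110))
print(overlap >= tracking_distance_threshold / 100)
```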
get_init_params() Dict

Get init params from models

class agents.models.RoboBrain2

Bases: agents.models.Model

RoboBrain 2.0 by BAAI supports interactive reasoning with long-horizon planning and closed-loop feedback, spatial perception for precise point and bounding-box prediction from complex instructions, and temporal perception for future trajectory estimation.

@article{RoboBrain2.0TechnicalReport,
  title={RoboBrain 2.0 Technical Report},
  author={BAAI RoboBrain Team},
  journal={arXiv preprint arXiv:2507.02029},
  year={2025}
}

Parameters:
  • name (str) – An arbitrary name given to the model.

  • checkpoint (str) – The name of the pre-trained model’s checkpoint. Default is “BAAI/RoboBrain2.0-3B”. For available checkpoints consult RoboBrain2 Model Collection on HuggingFace.

  • init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.

Example usage:

robobrain = RoboBrain2(name='robobrain', checkpoint="BAAI/RoboBrain2.0-32B")
get_init_params() Dict

Get init params from models