# Architecture Overview
This document describes the internal architecture of EmbodiedAgents for developers who want to extend the framework, contribute new components, or understand how the pieces fit together.
## Component Hierarchy
Every processing unit in EmbodiedAgents is a component. Components form a strict inheritance chain:
```
BaseComponent (ros_sugar)
└── Component (agents.components.component_base)
    └── ModelComponent (agents.components.model_component)
        ├── LLM
        ├── VLM / MLLM
        ├── VLA
        ├── Vision
        ├── SpeechToText
        ├── TextToSpeech
        ├── SemanticRouter
        ├── MapEncoding
        └── VideoMessageMaker
```
### BaseComponent
BaseComponent is provided by Sugarcoat and wraps a ROS2 Lifecycle Node. It manages the node lifecycle (configure, activate, deactivate, shutdown), subscriber/publisher creation, and the multiprocess execution model. You should never subclass BaseComponent directly.
### Component
Component (defined in agents.components.component_base) adds:
- Input/output validation via `allowed_inputs` and `allowed_outputs` dictionaries.
- A trigger system for deciding when `_execution_step()` fires.
- Event/Action wiring through `custom_on_configure()` and `activate_all_triggers()`.
### ModelComponent
ModelComponent (defined in agents.components.model_component) extends Component with:
- A `model_client` slot (a `ModelClient` instance) initialized during the configure lifecycle phase.
- Support for `additional_model_clients` and hot-swapping via `change_model_client()`.
- `_call_inference()` dispatching to HTTP or WebSocket clients.
- Output topic validation against `handled_outputs`.
- Warmup logic.
- Streaming support via a fast timer (`_handle_websocket_streaming()`).
All specialized components (LLM, VLM, Vision, etc.) subclass ModelComponent.
## The `_execution_step()` Pattern
Every concrete component must implement _execution_step(**kwargs). This is the core processing callback that runs each time the component is triggered. The general flow inside a ModelComponent._execution_step() is:
1. **Gather inputs** – read the latest data from all input callbacks.
2. **Create inference input** – call `_create_input()` to assemble a dict suitable for the model client.
3. **Run inference** – call `_call_inference(inference_input)`, which delegates to the `ModelClient`.
4. **Publish results** – call `_publish(result)` to send output to all registered publishers.
```python
def _execution_step(self, **kwargs):
    # 1. Read inputs from callbacks
    text = self.callbacks["text0"].get_output()

    # 2. Assemble inference dict
    inference_input = self._create_input(text)
    if inference_input is None:
        return

    # 3. Call the model
    result = self._call_inference(inference_input)
    if result is None:
        return

    # 4. Publish
    self._publish(result)
```
For non-model components (subclassing `Component` directly), `_execution_step()` performs custom logic without calling a model client.
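As a self-contained sketch of such a non-model step (the class, callback dictionary, and routing logic below are hypothetical stand-ins, not the framework's actual API):

```python
# Minimal sketch of a non-model component's processing step.
# `callbacks` and `_publish` are stand-ins for the framework's
# callback/publisher machinery, not its real interfaces.

class RouterSketch:
    def __init__(self):
        self.callbacks = {}   # input name -> latest payload
        self.published = []   # captured outputs

    def _publish(self, result):
        self.published.append(result)

    def _execution_step(self, **kwargs):
        # Read the latest input; bail out if nothing has arrived yet.
        text = self.callbacks.get("text0")
        if text is None:
            return
        # Custom logic instead of a model call: route by keyword.
        route = "vision" if "image" in text else "chat"
        self._publish({"input": text, "route": route})

router = RouterSketch()
router.callbacks["text0"] = "describe the image in front of you"
router._execution_step()
```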
## Input/Output Validation
Components declare what topic types they accept using two class-level dictionaries:
```python
self.allowed_inputs = {
    "Required": [Image, [String, Audio]],
    "Optional": [CompressedImage],
}
self.allowed_outputs = {
    "Required": [String],
}
```
- **Required**: each entry must have at least one matching topic. A list entry like `[String, Audio]` means "one of these types."
- **Optional**: accepted but not mandatory.
Validation runs in `Component.__init__()` via `_validate_topics()`. It checks that every provided topic's `msg_type` is a subclass of at least one allowed type, and that all required types are covered.
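The check can be sketched in plain Python. This is a simplified stand-in for `_validate_topics()` (with bare classes in place of real message types), not the actual implementation:

```python
# Simplified sketch of the required/optional type check. In the real
# framework, topic types are message-type wrappers, not bare classes.

class String: pass
class Audio: pass
class Image: pass
class CompressedImage: pass

def validate_inputs(topic_types, allowed_inputs):
    """True iff every topic type is allowed and all required slots are covered."""
    required = allowed_inputs.get("Required", [])
    optional = allowed_inputs.get("Optional", [])

    def matches(t, entry):
        # A list entry means "one of these types".
        choices = entry if isinstance(entry, list) else [entry]
        return any(issubclass(t, c) for c in choices)

    # Every provided type must match at least one allowed entry.
    for t in topic_types:
        if not any(matches(t, e) for e in required + optional):
            return False
    # Every required entry must be satisfied by at least one topic.
    return all(any(matches(t, e) for t in topic_types) for e in required)

allowed = {"Required": [Image, [String, Audio]], "Optional": [CompressedImage]}
ok = validate_inputs([Image, String], allowed)   # both required slots covered
bad = validate_inputs([Image], allowed)          # String-or-Audio slot unmet
```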
## Trigger System
The trigger parameter controls when _execution_step() fires:
| Trigger Type | Value | Behavior |
|---|---|---|
| Timed | … | Runs at a fixed frequency (Hz). Sets … |
| Topic | … | Fires when a message arrives on that topic. The topic must be one of the component's inputs. Sets … |
| Multi-topic | … | Fires when any of the listed topics receives a message. |
| Event | … | Fires when an external event is raised. Wired via an … |
| None | … | Only valid for … |
When a Topic trigger is set, the topic's callback is moved from `self.callbacks` to `self.trig_callbacks`. The trigger callback's `on_callback_execute()` is wired to call `_execution_step()` in `activate_all_triggers()`.
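The wiring can be sketched as follows; the `TriggerCallback` class here is an illustrative stand-in for the framework's callback machinery, not its real API:

```python
# Sketch of topic-trigger wiring: a new message on the trigger topic
# fires every hook registered via on_callback_execute(), which is how
# activate_all_triggers() connects the callback to _execution_step().

class TriggerCallback:
    def __init__(self):
        self._hooks = []

    def on_callback_execute(self, fn):
        self._hooks.append(fn)

    def receive(self, msg):
        # Message arrival fires all wired hooks.
        for fn in self._hooks:
            fn(msg=msg)

fired = []
trig = TriggerCallback()
# Stand-in for the component's _execution_step():
trig.on_callback_execute(lambda **kw: fired.append(kw["msg"]))
trig.receive("hello")
```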
## Streaming
Components that support streaming (LLM, TextToSpeech) use WebSocket-based model clients. The flow:
1. `ModelComponent.custom_on_configure()` detects `config.stream == True` with a `RoboMLWSClient`.
2. A fast timer (1 ms period) is created calling `_handle_websocket_streaming()`.
3. Inference requests go into `self.req_queue`; responses come back via `self.resp_queue`.
4. The child component's `_handle_websocket_streaming()` reads partial results from the queue and publishes them incrementally (e.g., token-by-token for LLM, audio chunks for TTS).
For HTTP clients, streaming is handled at the client level using generator-based responses.
## Configuration Chain
All configuration uses the attrs library with the `@define` decorator:
```
BaseComponentConfig (ros_sugar)
├── ModelComponentConfig (agents.config)
│   ├── LLMConfig
│   │   └── MLLMConfig (aliased as VLMConfig)
│   ├── VLAConfig
│   ├── VisionConfig
│   ├── SpeechToTextConfig
│   ├── TextToSpeechConfig
│   ├── SemanticRouterConfig
│   └── VideoMessageMakerConfig
└── MapConfig (extends BaseComponentConfig directly)
```
Each config class:
- Uses `@define(kw_only=True)` for explicit, keyword-only construction.
- Declares fields with `field()`, including defaults, validators (from `base_validators`), and converters.
- Implements `_get_inference_params() -> Dict` to extract the subset of parameters passed to the model client at inference time.
The config is deep-copied at component init so that multiple component instances sharing the same config object do not interfere with each other.
## Model / Client / Component Relationship
The three-layer pattern is central to the architecture:
```
Model (data class) --> ModelClient (connection logic) --> ModelComponent (ROS node)
```
- **Model** (`agents.models.Model`): an attrs `@define` class holding model metadata (name, checkpoint, platform-specific options). Its `_get_init_params()` returns a dict sent to the serving platform.
- **ModelClient** (`agents.clients.model_base.ModelClient`): manages the connection to a model serving platform. Implements `_check_connection()`, `_initialize()`, `_inference()`, `_deinitialize()`. Must be serializable for multiprocess execution.
- **ModelComponent**: holds a `ModelClient` instance, calls it during `_execution_step()`, and manages the ROS lifecycle around it.
This separation means the same model can be served by different clients (Ollama, RoboML, GenericHTTP), and the same client can be used across different component types.
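A compact sketch of the three layers working together; all class names and behavior below are illustrative stand-ins, not the real classes:

```python
# Sketch of the Model -> ModelClient -> ModelComponent layering.

class SketchModel:
    """Data-only layer: metadata sent to the serving platform."""
    def __init__(self, name, checkpoint):
        self.name, self.checkpoint = name, checkpoint

    def _get_init_params(self):
        return {"name": self.name, "checkpoint": self.checkpoint}

class SketchClient:
    """Connection layer: knows how to talk to one serving platform."""
    def __init__(self, model):
        self.model = model

    def _inference(self, inference_input):
        # A real client would POST to Ollama/RoboML/etc. here.
        return {"output": "echo:" + inference_input["query"]}

class SketchComponent:
    """ROS-facing layer: owns a client and calls it per execution step."""
    def __init__(self, model_client):
        self.model_client = model_client

    def _execution_step(self, query):
        return self.model_client._inference({"query": query})

model = SketchModel("tiny-llm", "q4.gguf")
component = SketchComponent(SketchClient(model))
result = component._execution_step("hi")
```

Because the component only holds a client, swapping serving platforms means swapping the middle layer; neither the model metadata nor the component changes.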
## Local Model Deployment
Components that subclass `ModelComponent` can optionally run without a remote model client by enabling a built-in local model. This is controlled via `enable_local_model=True` in the component's config.
### How It Works
When `enable_local_model` is set, the component's `custom_on_configure()` calls `_deploy_local_model()`, which instantiates a lightweight local inference wrapper. The `_call_inference()` dispatcher in `ModelComponent` automatically routes to the local model when no `model_client` is set:
```
ModelComponent._call_inference()
├── model_client (RoboML, Ollama, GenericHTTP, ...)
└── local_model (LocalLLM, LocalVLM, LocalSTT, LocalTTS)
```
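The dispatch rule can be sketched as follows (an illustrative stand-in for the actual dispatcher, with lambdas in place of real clients):

```python
# Sketch of the _call_inference() routing: prefer the remote client
# when one is configured, otherwise fall back to the local wrapper.

def call_inference(inference_input, model_client=None, local_model=None):
    if model_client is not None:
        return model_client(inference_input)   # remote HTTP/WS client
    if local_model is not None:
        return local_model(inference_input)    # built-in local wrapper
    raise RuntimeError("no model client or local model configured")

remote = lambda x: {"output": "remote:" + x["query"]}
local = lambda x: {"output": "local:" + x["query"]}

with_remote = call_inference({"query": "hi"}, model_client=remote, local_model=local)
local_only = call_inference({"query": "hi"}, local_model=local)
```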
### Runtime Backends
Each component type uses a runtime optimized for edge deployment:
| Component | Backend | Package | Default Model |
|---|---|---|---|
| LLM | llama.cpp | `llama-cpp-python` | Qwen3-0.6B (GGUF) |
| MLLM/VLM | llama.cpp + `MoondreamChatHandler` | `llama-cpp-python` | Moondream2 (GGUF) |
| Vision | ONNX Runtime | `onnxruntime` | DEIM (CVPR 2025) |
| SpeechToText | sherpa-onnx (Whisper) | `sherpa-onnx` | Whisper tiny.en |
| TextToSpeech | sherpa-onnx (Kokoro) | `sherpa-onnx` | Kokoro English |
These backends require no PyTorch, no Transformers, and no heavy ML frameworks – they are designed for robots and edge devices including NVIDIA Jetson.
### Local Model Wrappers
The local model wrappers live in agents/utils/ and follow a simple callable interface:
- `LocalLLM(model_path, device, ncpu)` – wraps `llama-cpp-python`; returns `{"output": str}` or `{"output": generator}` for streaming, with optional `"tool_calls"`.
- `LocalVLM(model_path, device, ncpu)` – wraps `llama-cpp-python` with `MoondreamChatHandler`; accepts images as RGB numpy arrays.
- `LocalVisionModel(model_path, device, ncpu)` – wraps `onnxruntime` for object detection; returns bounding boxes, labels, and scores.
- `LocalSTT(model_path, device, ncpu)` – wraps the `sherpa-onnx` `OfflineRecognizer`; accepts audio bytes or numpy arrays.
- `LocalTTS(model_path, device, ncpu)` – wraps the `sherpa-onnx` `OfflineTts`; returns WAV bytes.
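Assuming that callable interface, usage looks roughly like this. `StubLocalLLM` is a stub standing in for the real `LocalLLM` (which actually loads a GGUF model via llama.cpp); the `stream` keyword and output shapes mirror the interface described above:

```python
# Sketch of the wrappers' callable interface: construct with
# (model_path, device, ncpu), call with an inference dict, and read
# the "output" key -- a string, or a generator when streaming.

class StubLocalLLM:
    def __init__(self, model_path, device="cpu", ncpu=4):
        self.model_path, self.device, self.ncpu = model_path, device, ncpu

    def __call__(self, inference_input, stream=False):
        text = "answer to: " + inference_input["query"]
        if stream:
            # Streaming mode yields chunks instead of a full string.
            return {"output": (tok for tok in text.split())}
        return {"output": text}

llm = StubLocalLLM("Qwen3-0.6B.gguf", device="cpu", ncpu=2)
full = llm({"query": "battery level?"})
streamed = list(llm({"query": "battery level?"}, stream=True)["output"])
```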
### Customizing Local Models
Each config exposes a `local_model_path` field that accepts a Hugging Face repository ID or a local file path. Users can swap in any compatible model by setting this field:
```python
config = LLMConfig(
    enable_local_model=True,
    local_model_path="bartowski/Llama-3.2-1B-Instruct-GGUF",  # any GGUF model
)
```
For available STT and TTS models, see the sherpa-onnx pretrained models catalog.