12/22/2024
Dr. Eliahu Khalastchi, Research Scientist

Human-Robot Interaction: Harnessing the Power of Generative AI for Seamless Integration in an HRI Toolkit with NVIDIA Jetson.

The integration of Human-Robot Interaction (HRI) capabilities in a robot product not only aims to make the solution more natural and user-friendly but also addresses the inherent challenges of navigating dynamic environments, especially those involving humans.

In dynamic scenarios, it is nearly impossible to anticipate and prepare for every conceivable situation. HRI allows robots to bridge this gap by enabling them to seek assistance from humans when faced with uncertainties or missing information, enhancing their problem-solving abilities. The ability of a robot to interact with humans in real-time fosters collaborative problem-solving and ensures a more adaptive response to the unpredictable nature of dynamic environments. However, achieving seamless and effective HRI is no small feat; it requires overcoming technological, psychological, and societal challenges. The pursuit of these ambitious goals in the realm of HRI highlights Cogniteam’s dedication to creating advanced robotic systems that can not only coexist but actively collaborate with humans to accomplish complex tasks.

In the vast landscape of HRI, various implementations have emerged, each holding the potential to enhance the way humans and robots communicate. However, the key to a transformative HRI experience lies not just in the availability of diverse components but in their seamless integration. Until recently, efforts to create a cohesive system for HRI capable of completing the sense-think-act cycle of a robot while remaining extensible and modular for future innovations were scarce.

Recognizing this crucial gap, the Israeli HRI consortium, comprising leading robotics companies and esteemed university researchers in the field, is developing a global solution. At the forefront of this initiative, Cogniteam is leading the HRI toolkit group within the consortium. In the HRI Toolkit, all the products of the consortium are implemented and embedded in an extensive ROS2 graph, meticulously crafted to fulfill the sense-think-act cycle of robots through multiple layers of cognition.

Fig. 1: The main layers of cognition in the HRI Toolkit, forming a continuous cycle of natural interaction with a human.

The “Basic Input” layer is responsible for acquiring speech and 3D skeleton models from the robot sensors. The onboard data stream from an array of microphones, a depth camera, and other sensors allows this layer to fuse directional speech with 3D skeletons. In addition, this layer keeps track of the 3D bodies as they roam near the robot.
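
As a rough illustration of what such fusion might look like, the following sketch associates an utterance (tagged with its direction of arrival) with the nearest tracked 3D body. The topic names and JSON payloads are assumptions made for the example; they are not the toolkit's actual interfaces.

```python
# A minimal sketch of speech-to-skeleton fusion, assuming JSON-encoded
# std_msgs/String payloads; topic names and fields are illustrative only.
import json
import math

import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class SpeechSkeletonFusion(Node):
    def __init__(self):
        super().__init__('speech_skeleton_fusion')
        self.tracks = {}  # body id -> bearing (radians) of each tracked person
        self.create_subscription(String, '/hri/skeletons', self.on_skeletons, 10)
        self.create_subscription(String, '/hri/speech', self.on_speech, 10)
        self.pub = self.create_publisher(String, '/hri/attributed_speech', 10)

    def on_skeletons(self, msg):
        # Cache the bearing of every tracked 3D body relative to the robot.
        for body in json.loads(msg.data)['bodies']:
            self.tracks[body['id']] = math.atan2(body['y'], body['x'])

    def on_speech(self, msg):
        # Attribute the utterance to the body closest to the speech direction of arrival.
        speech = json.loads(msg.data)
        if not self.tracks:
            return
        doa = speech['doa']
        speaker = min(self.tracks,
                      key=lambda i: abs(math.atan2(math.sin(self.tracks[i] - doa),
                                                   math.cos(self.tracks[i] - doa))))
        out = String()
        out.data = json.dumps({'speaker_id': speaker, 'text': speech['text']})
        self.pub.publish(out)


def main():
    rclpy.init()
    rclpy.spin(SpeechSkeletonFusion())


if __name__ == '__main__':
    main()
```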

Given this information, the “Interpreter” layer is responsible for interpreting different social cues. A unique algorithm learns and detects, on the fly, different visual gestures that can later be linked to direct robot reactions or to interpretations about the person. An emotion detector analyzes the facial expressions of the interacting person. These emotions can later be used to reinforce specific robot gestures that receive implicit positive feedback. A text filtering mechanism enables the detection of vocal instructions, facts, and queries made by the person.
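
The snippet below is a deliberately simplified, rule-based stand-in for the text-filtering idea: it routes an utterance into one of three categories (query, instruction, fact). The categories and keyword rules are placeholders, not the interpreter's actual algorithm.

```python
# A toy illustration of routing utterances to "query", "instruction", or "fact".
QUERY_WORDS = ('who', 'what', 'where', 'when', 'why', 'how', 'is', 'are', 'can')
INSTRUCTION_WORDS = ('go', 'bring', 'stop', 'follow', 'take', 'find', 'please')


def classify_utterance(text: str) -> str:
    words = text.lower().strip('?!. ').split()
    if not words:
        return 'fact'
    if text.strip().endswith('?') or words[0] in QUERY_WORDS:
        return 'query'          # e.g. "Where is the charging station?"
    if words[0] in INSTRUCTION_WORDS:
        return 'instruction'    # e.g. "Bring me the red box"
    return 'fact'               # e.g. "The meeting room is occupied"


if __name__ == '__main__':
    for utterance in ('Where is Dana?', 'Go to the kitchen', 'The door is locked'):
        print(utterance, '->', classify_utterance(utterance))
```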

One of the most novel and interesting implementations of the HRI toolkit is the “Context Manager” layer, which is responsible for collecting all the social information, including current and past interactions, the social context of the environment, and the state of the robot and its tasks. This social information is provided by the person manager component, the ad-hoc scene detectors layer (e.g., intentional blockage detection), and additional information from the user code via an integration layer (e.g., the map, robot state and tasks, and detected objects). “Scene Understanding” is a key component that processes this information and extracts useful social insights such as user patterns and availability. As such, this layer can truly benefit from the release of novel generative AI components such as image transformers, LLMs, and more.
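
To make the idea concrete, here is a minimal sketch of the kind of aggregated social context such a layer could maintain and hand to a scene-understanding model. The class and field names are assumptions for illustration, not the toolkit's actual data model.

```python
# A minimal sketch of an aggregated social context; all names are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PersonRecord:
    person_id: str
    last_seen: float            # timestamp of the last observation
    emotion: str = 'neutral'    # latest emotion estimate from the Interpreter
    utterances: List[str] = field(default_factory=list)


@dataclass
class SocialContext:
    people: Dict[str, PersonRecord] = field(default_factory=dict)
    scene_events: List[str] = field(default_factory=list)      # ad-hoc scene detections
    robot_state: Dict[str, str] = field(default_factory=dict)  # from the integration layer

    def summarize(self) -> str:
        """Flatten the context into text a scene-understanding LLM could consume."""
        lines = [f'Robot state: {self.robot_state}',
                 f'Recent events: {self.scene_events[-5:]}']
        for p in self.people.values():
            lines.append(f'{p.person_id}: emotion={p.emotion}, said={p.utterances[-3:]}')
        return '\n'.join(lines)
```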

The “Reasoning” layer checks whether the collected social information is sound and consistent. It does so with the help of novel anomaly detection algorithms. Whenever a contradiction arises in the robot’s beliefs, required information is missing, or some other anomalous event occurs, this layer detects it and passes it onward, typically leading the robot to ask for help.
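
As a toy illustration of such a consistency check (not the actual anomaly detection algorithms), the sketch below flags missing required beliefs and one hard-coded contradiction.

```python
# A minimal consistency check over key/value beliefs; rules are placeholders.
from typing import Dict, List, Optional


def check_beliefs(beliefs: Dict[str, object], required: List[str]) -> Optional[str]:
    """Return a description of the first anomaly found, or None if consistent."""
    # Missing information: a required belief has no value yet.
    for key in required:
        if beliefs.get(key) is None:
            return f'missing information: {key}'
    # Simple contradiction: the robot believes it is both docked and navigating.
    if beliefs.get('docked') and beliefs.get('navigating'):
        return 'contradiction: docked while navigating'
    return None


if __name__ == '__main__':
    anomaly = check_beliefs({'docked': True, 'navigating': True}, ['battery_level'])
    if anomaly:
        print('Anomaly detected, asking a human for help:', anomaly)
```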

The “Social Planning” layer issues timed, personalized, socially aware plans for interaction. This layer contains components such as “Social Reinforcement Learner” and “Autonomy Adjuster,” which adapt to specific users and social contexts; “Natural Action Selector,” which acts as the decision maker for issuing proactive and interactive instructions; “Scheduler,” which decides when to “interrupt” the user by applying different interruption management heuristics; “Social Navigator,” which adds a social layer on top of typical navigation algorithms such as Nav2; and other relevant components.
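
For example, an interruption-management heuristic for the “Scheduler” might weigh urgency against the user's estimated availability, as in the illustrative sketch below; the thresholds and weights are invented for the example and are not the toolkit's actual heuristics.

```python
# A toy interruption-management heuristic; thresholds and weights are assumptions.
def should_interrupt(urgency: float, user_availability: float,
                     seconds_since_last_interaction: float) -> bool:
    """Decide whether to interrupt the user now.

    urgency and user_availability are values in [0, 1] supplied by upstream layers.
    """
    # Defer very soon after the previous interaction unless the matter is urgent.
    if seconds_since_last_interaction < 30 and urgency < 0.9:
        return False
    # Interrupt when the weighted need outweighs the cost of disturbing the user.
    return urgency * 0.7 + user_availability * 0.3 > 0.5


if __name__ == '__main__':
    print(should_interrupt(urgency=0.8, user_availability=0.6,
                           seconds_since_last_interaction=120))  # True
```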

Finally, the “Natural Action Generator” layer applies these instructions in a manner that humans will perceive as natural, constantly adjusting different dynamic parameters and applying visual and vocal gestures according to an HRI-Code-Book. For instance, it may apply movement gestures that convey gratitude or happiness, or play a message at a volume and pace that fits the given social context. If the human reacts naturally to these actions, the reaction is again perceived by the “Basic Input” layer, and thus the sense-think-act cycle continues.
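
A minimal sketch of the code-book idea is shown below: an abstract gesture is looked up and turned into concrete, tunable action parameters. The gesture names, parameters, and values here are illustrative assumptions, not entries from the actual HRI-Code-Book.

```python
# An illustrative code-book lookup; all entries and parameters are placeholders.
CODE_BOOK = {
    'gratitude': {'head_nod_deg': 15, 'led_color': 'warm_white', 'speech_pace': 0.9},
    'happiness': {'head_nod_deg': 5, 'led_color': 'yellow', 'speech_pace': 1.1},
}


def render_action(gesture: str, message: str, social_context: str) -> dict:
    """Turn an abstract instruction into concrete, tunable action parameters."""
    params = dict(CODE_BOOK.get(gesture, {}))
    # Quiet environments (e.g., an office at night) get a lower playback volume.
    params['volume'] = 0.4 if social_context == 'quiet' else 0.8
    params['message'] = message
    return params


if __name__ == '__main__':
    print(render_action('gratitude', 'Thank you for moving aside!', 'quiet'))
```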

Recently, NVIDIA released several generative AI tools that are very relevant for HRI. These include optimized tools and tutorials for deploying open-source large language models (LLMs), vision language models (VLMs), and vision transformers (ViTs) that combine vision AI and natural language processing to provide a comprehensive understanding of the scene. These tools are specifically relevant for the “Context Manager” layer and the “Scene Understanding” component of our HRI Toolkit. At the heart of this layer is the LLM, a pivotal component that significantly contributes to the nuanced understanding of social interactions.
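
As an example of how a scene-understanding component might query such a model, the sketch below posts a context summary to a locally deployed LLM. It assumes the dockerized model exposes an OpenAI-compatible HTTP endpoint; the URL, model name, and prompt are placeholders, not NVIDIA's or the toolkit's actual interfaces.

```python
# A hedged sketch of querying a locally deployed LLM for social scene insights.
import requests

LLM_URL = 'http://jetson-server.local:8000/v1/chat/completions'  # assumed address


def scene_insights(context_summary: str) -> str:
    """Ask the LLM to extract social insights (user patterns, availability)."""
    response = requests.post(LLM_URL, json={
        'model': 'local-llm',  # placeholder model identifier
        'messages': [
            {'role': 'system',
             'content': 'You analyze social context for a service robot. '
                        'Report user patterns and whether each user seems available.'},
            {'role': 'user', 'content': context_summary},
        ],
    }, timeout=30)
    response.raise_for_status()
    return response.json()['choices'][0]['message']['content']
```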

In our exploration of this critical layer, we delve into the integration of NVIDIA’s dockerized LLM component, which is part of NVIDIA’s suite of generative AI tools. Our experimental robot “Michelle” uses the NVIDIA Jetson AGX Orin developer kit with 8GB RAM. However, to run both the HRI toolkit and the LLM component on-site, we used the Jetson AGX Orin developer kit with 64GB RAM as an on-premises server that runs the LLM component and serves several robots at the same time through application programming interface (API) calls. The HRI clients, e.g., Michelle, use cache optimizations to reduce server load and response times.
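
One simple way to picture the client-side caching is a memoized wrapper around the server call, so repeated (normalized) queries never leave the robot. The endpoint and normalization below are assumptions for illustration, not Michelle's actual implementation.

```python
# A minimal sketch of client-side caching of LLM calls; names are assumptions.
import functools

import requests

LLM_URL = 'http://jetson-server.local:8000/v1/chat/completions'  # assumed address


def _normalize(query: str) -> str:
    # Collapse whitespace and case so trivially different queries share a cache entry.
    return ' '.join(query.lower().split())


@functools.lru_cache(maxsize=256)
def _ask_server(normalized_query: str) -> str:
    # Only a cache miss triggers a round trip to the on-premises Orin server.
    response = requests.post(LLM_URL, json={
        'model': 'local-llm',  # placeholder model identifier
        'messages': [{'role': 'user', 'content': normalized_query}],
    }, timeout=30)
    response.raise_for_status()
    return response.json()['choices'][0]['message']['content']


def ask(query: str) -> str:
    return _ask_server(_normalize(query))
```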

Our decision to define the HRI Toolkit as a collection of ROS interfaces makes individual implementations interchangeable. Hence, it is easy to switch between edge-based, fog-based, or cloud-based implementations of the LLM component according to the specific requirements of the domain. This not only showcases the flexibility of the HRI Toolkit but also highlights the adaptability of NVIDIA’s dockerized LLM component. The computational power of the Jetson AGX Orin helped ensure that the LLM could seamlessly integrate into the broader architecture, augmenting the context management layer’s capabilities.
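
One way such interchangeability can be realized is by selecting the LLM endpoint through a ROS 2 parameter, so the same node can point at an edge, fog, or cloud backend at launch time. The node, parameter, and topic names below are assumptions for the sketch, not the toolkit's actual interfaces.

```python
# A sketch of backend selection via a ROS 2 parameter; names are illustrative.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class SceneUnderstandingClient(Node):
    def __init__(self):
        super().__init__('scene_understanding_client')
        # Edge: a local endpoint on the robot; fog: the on-premises Orin server;
        # cloud: a hosted endpoint. All are assumed to expose the same interface.
        self.declare_parameter('llm_endpoint', 'http://jetson-server.local:8000')
        self.endpoint = self.get_parameter('llm_endpoint').value
        self.create_subscription(String, '/hri/context_summary', self.on_context, 10)
        self.pub = self.create_publisher(String, '/hri/scene_insights', 10)

    def on_context(self, msg):
        # A real node would call the configured endpoint (see the caching sketch
        # above); here we only echo which backend would be used.
        out = String()
        out.data = f'[{self.endpoint}] insights for: {msg.data[:60]}'
        self.pub.publish(out)


def main():
    rclpy.init()
    rclpy.spin(SceneUnderstandingClient())


if __name__ == '__main__':
    main()
```

With this pattern, a hypothetical launch could switch backends without touching the rest of the graph, e.g. by overriding the parameter with `--ros-args -p llm_endpoint:=<cloud-endpoint-url>`.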

The successful integration of an LLM into the HRI Toolkit not only addresses the challenges of social data processing but also sets the stage for future innovations. A collection of Cogniteam’s HRI components will be made available in May 2025.


Want to learn more about the HRI Toolkit for your application?