Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. While recent efforts in robotics have leveraged LLMs for both high-level and low-level planning, these approaches often face significant challenges, such as hallucinations in long-horizon tasks and limited adaptability due to generating plans in a single pass without real-time feedback. To address these limitations, we propose a novel multi-agent LLM framework, Multi-Agent Large Language Model for Manipulation (MALMM), that distributes high-level planning and low-level control code generation across specialized LLM agents, supervised by an additional agent that dynamically manages transitions. By incorporating observations from the environment after each step, our framework effectively handles intermediate failures and enables adaptive re-planning. Unlike existing methods, our approach does not rely on pre-trained skill policies or in-context learning examples and generalizes to a variety of tasks. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation tasks in a zero-shot setting, thereby overcoming key limitations of existing LLM-based manipulation methods.
An overview of our multi-agent system, MALMM, which consists of three LLM agents—Planner, Coder, and Supervisor—and a Code executor tool. Each agent operates with a specific system prompt defining its role: (1) the Planner generates high-level plans and re-plans in case of intermediate failures, (2) the Coder converts these plans into low-level executable code, and (3) the Supervisor coordinates the system by managing the transitions between the Planner, Coder, and Code executor.
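The agent roles and transitions described above can be sketched as a simple control loop. This is a minimal illustrative sketch, not the paper's implementation: all function names are hypothetical stand-ins, and the Planner/Coder stubs below stand in for LLM calls with role-specific system prompts.

```python
# Hypothetical sketch of a MALMM-style loop: Planner, Coder, and a Code
# executor, routed by a Supervisor that feeds back per-step observations.

def planner(observation, failure=None):
    """Stub Planner: returns a high-level plan; re-plans on failure."""
    if failure is not None:
        return ["reopen gripper", "re-grasp object"]
    return ["move to object", "grasp object", "move to target", "release"]

def coder(step):
    """Stub Coder: converts one high-level step into low-level code."""
    return f"execute({step!r})"

def code_executor(code, env):
    """Stub executor: runs the code and returns a new observation."""
    env.append(code)
    return {"last": code, "ok": True}

def supervisor(env, max_steps=10):
    """Supervisor: manages transitions between Planner, Coder, and the
    Code executor, triggering re-planning on intermediate failures."""
    plan = planner({})
    executed = []
    while plan and len(executed) < max_steps:
        step = plan.pop(0)
        obs = code_executor(coder(step), env)
        executed.append(step)
        if not obs["ok"]:                      # intermediate failure
            plan = planner(obs, failure=step)  # re-plan from current state
    return executed
```

The key design point this sketch mirrors is that the environment observation returned after each executed step is routed back through the Supervisor, so the Planner can revise the remaining plan instead of committing to a single-pass plan.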
The table shows the success rate (%) for zero-shot evaluation on nine RLBench tasks. For each task, the best-performing method is highlighted in bold and the second-best is underlined. We observe that MALMM outperforms existing baselines by a significant margin.
In this table, we compare the performance of the Single Agent baseline and MALMM in a setup with vision-based environment observations, evaluated on three RLBench tasks. We observe that MALMM outperforms the Single Agent on all tasks.