Author(s)

Mohammad Sohel Jalil Ahmad, Prof. Kalpesh Marathe

  • Manuscript ID: 140332
  • Volume: 2
  • Issue: 6
  • Pages: 719–731

Subject Area: Computer Science

Abstract

The proliferation of large language model (LLM)-based agentic systems has catalysed the development of dedicated multi-agent orchestration frameworks designed to coordinate multiple autonomous agents toward complex, multi-step goals. This paper presents a systematic empirical benchmarking study of three prominent open-source frameworks—LangGraph, CrewAI, and AutoGen—evaluated across six dimensions: task completion accuracy, end-to-end latency, token efficiency, scalability, developer experience, and fault tolerance. Experiments employ a standardised 80-task suite spanning code generation, research summarisation, data analysis, and multi-hop question answering, all backed by GPT-4o at a fixed temperature of T = 0.2. Results demonstrate that LangGraph excels in stateful pipeline control and fault resilience; CrewAI offers superior developer ergonomics and role-based collaboration; and AutoGen delivers competitive code-generation accuracy with an asynchronous, conversation-first architecture. A practitioner decision framework is provided to guide framework selection.

Keywords
multi-agent systems, LLM orchestration, LangGraph, CrewAI, AutoGen, benchmarking, autonomous agents, agentic AI