Alibaba Cloud has revealed a new GPU pooling system that cut the number of Nvidia accelerators needed for large-scale inference by more than 80%. The system, known as Aegaeon, was presented at the 2025 SOSP conference in Korea after being piloted in Alibaba's own production environment. By letting multiple large language models share a single GPU, it shrinks the hardware footprint for inference workloads to a fraction of what was previously required.
The company claims it served dozens of LLMs…
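The core idea of GPU pooling, serving several models from one accelerator by time-slicing it between them rather than dedicating a GPU per model, can be illustrated with a toy scheduler. This is a minimal sketch of the general technique, not Aegaeon's actual design or API; the class names, the fixed decode quantum, and the simulated weight swap are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    model: str            # which LLM this request targets
    tokens_left: int      # decode steps remaining
    output: list = field(default_factory=list)

class PooledGPU:
    """Toy single-GPU pool: interleaves decode steps of requests for
    different models on one device, instead of pinning one GPU per model.
    Illustrative only; real systems schedule at finer granularity and
    overlap weight loading with compute."""

    def __init__(self, quantum=2):
        self.quantum = quantum      # decode steps before yielding the GPU
        self.queue = deque()
        self.loaded_model = None
        self.swaps = 0              # how often model weights were switched

    def submit(self, req):
        self.queue.append(req)

    def run(self):
        completed = []
        while self.queue:
            req = self.queue.popleft()
            if req.model != self.loaded_model:
                self.loaded_model = req.model   # simulate a weight swap
                self.swaps += 1
            for _ in range(self.quantum):       # run a short decode burst
                if req.tokens_left == 0:
                    break
                req.output.append(f"{req.model}-tok")
                req.tokens_left -= 1
            if req.tokens_left > 0:
                self.queue.append(req)          # not done: requeue, yield GPU
            else:
                completed.append(req)
        return completed

# Three models' requests all complete on a single pooled GPU.
gpu = PooledGPU(quantum=2)
for name, n in [("model-a", 3), ("model-b", 2), ("model-c", 4)]:
    gpu.submit(Request(model=name, tokens_left=n))
done = gpu.run()
```

The trade-off the sketch makes visible is the swap counter: sharing one GPU among many models saves hardware but pays for each model switch, which is exactly the overhead a production pooling system must keep small.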