OpenTela
OpenTela is a decentralized orchestration platform that distributes machine learning workloads across peer-to-peer networks without a central coordinator. It uses CRDT-based state management and gossip protocols to maintain cluster health and service discovery. It targets HPC environments in particular: it integrates with batch schedulers such as Slurm and runs entirely in user space, bringing cloud-like serving infrastructure to supercomputing systems.
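To make the CRDT and gossip idea concrete, here is a minimal sketch rather than OpenTela's actual implementation, which this summary does not describe. It assumes a last-writer-wins map as the CRDT for service discovery and a simple push-pull gossip exchange; `LWWRegistry`, `register`, `lookup`, and `gossip_round` are hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class LWWRegistry:
    """Hypothetical last-writer-wins map: service -> (timestamp, node_id, address)."""
    node_id: str
    entries: dict = field(default_factory=dict)

    def register(self, service, address, timestamp):
        # A local write is applied through the same rule as a remote one.
        self._apply(service, (timestamp, self.node_id, address))

    def _apply(self, service, record):
        # Keep the larger record; ties on timestamp are broken by node_id.
        current = self.entries.get(service)
        if current is None or record > current:
            self.entries[service] = record

    def merge(self, other):
        # Merge is commutative, associative, and idempotent, so replicas
        # converge regardless of the order in which they exchange state.
        for service, record in other.entries.items():
            self._apply(service, record)

    def lookup(self, service):
        record = self.entries.get(service)
        return record[2] if record else None


def gossip_round(nodes, rng):
    """Each node exchanges state with one randomly chosen peer (push-pull)."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.merge(peer)
        peer.merge(node)


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = [LWWRegistry(f"node-{i}") for i in range(5)]
    nodes[0].register("llm", "10.0.0.1:8000", timestamp=1)
    nodes[3].register("llm", "10.0.0.4:8000", timestamp=2)
    for _ in range(50):
        gossip_round(nodes, rng)
    print([n.lookup("llm") for n in nodes])
```

No node acts as a coordinator here: every replica accepts writes locally and learns about the others only through pairwise merges, which is the property that lets this kind of design tolerate node failures.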
- ✓ Solves a real infrastructure problem: bridges HPC batch schedulers with interactive serving engines without requiring root privileges
- ✓ Implements established distributed-systems techniques, namely CRDTs and gossip protocols, for fault-tolerant decentralized orchestration
- ✓ Already in production use at the SwissAI Initiative, which provides real-world validation of its practical value
- → Add specific code examples and API usage patterns to the documentation to help developers understand implementation details
- → Include performance benchmarks and comparison metrics against centralized alternatives to quantify the benefits of the decentralized approach
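The first point above, running a serving engine under a batch scheduler without root, can be illustrated with a short sketch. This is not OpenTela's API: `render_sbatch_script` is a hypothetical helper and `my_serving_engine` is a placeholder module name. The only real interface assumed is Slurm's standard `#SBATCH` directives and `srun`.

```python
import shlex


def render_sbatch_script(job_name, serve_command, nodes=1,
                         gpus_per_node=1, time_limit="01:00:00"):
    """Render a Slurm batch script that launches a serving process.

    Hypothetical helper for illustration; serve_command is a list of
    arguments for whatever serving engine the user has installed.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        "# Runs as the submitting user; nothing here needs root.",
        "srun " + " ".join(shlex.quote(arg) for arg in serve_command),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder serving command, not a real package.
    script = render_sbatch_script(
        "llm-serve",
        ["python", "-m", "my_serving_engine", "--port", "8000"],
        nodes=2,
    )
    print(script)
```

The rendered script would be submitted with `sbatch` like any other user job, so the serving process lives inside an ordinary allocation and ends when the allocation's time limit expires.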