Public summary
Join an international AI research team based in Munich, Germany, to build and maintain distributed machine learning infrastructure supporting research and robotics teams. The role involves designing scalable training workflows, containerized ML environments, CI/CD pipelines, and monitoring systems, and collaborating cross-functionally to operationalize ML models.
Location and work setup
- Location: Munich
- Remote status: On-site
- German requirement signal: No German required detected
- Detected job language: English
Responsibilities
- Design and scale distributed training workflows for large ML models using PyTorch Distributed, DeepSpeed, and schedulers such as SLURM or Kubernetes.
- Build and maintain containerized ML environments for reproducible experimentation and benchmarking.
- Develop and maintain CI/CD pipelines for reliable testing, training, and deployment of ML models.
- Implement lifecycle management including experiment tracking and model versioning using tools like ClearML or Weights & Biases.
- Set up and manage observability systems such as Prometheus and Grafana to monitor model performance and detect drift.
- Collaborate closely with research, data, and robotics teams to integrate models into production systems.
Qualifications
- Degree in Computer Science, Software Engineering, or a related field, with professional experience building and operating ML or software infrastructure in production.
- Strong experience with distributed training systems using Kubernetes, Docker, PyTorch Distributed, DeepSpeed, and cluster schedulers such as SLURM.
- Proficiency in developing CI/CD pipelines tailored for ML workloads.
- Experience operating ML workloads on cloud platforms, preferably AWS.
- Hands-on experience with experiment tracking and model versioning tools like MLflow or Weights & Biases.
- Familiarity with monitoring tools such as Prometheus and Grafana.
- Strong Python programming and system design skills.
- Knowledge of infrastructure-as-code tools (e.g., Terraform) and ML orchestration tools is beneficial.
- Exposure to large-scale multimodal ML systems and high-performance distributed computing environments is a plus.