Public summary

Join an international AI research team based in Munich, Germany, to build and maintain distributed machine learning infrastructure supporting research and robotics teams. The role involves designing scalable training workflows, building containerized ML environments, CI/CD pipelines, and monitoring systems, and collaborating cross-functionally to operationalize ML models.

Location and work setup

Location: Munich
Remote status: On-site
German requirement: None detected
Detected job language: English

Responsibilities

Design and scale distributed training workflows for large ML models using PyTorch Distributed, DeepSpeed, and schedulers such as SLURM or Kubernetes.
Build and maintain containerized ML environments for reproducible experimentation and benchmarking.
Develop and maintain CI/CD pipelines for reliable testing, training, and deployment of ML models.
Implement lifecycle management, including experiment tracking and model versioning, using tools like ClearML or Weights & Biases.
Set up and manage observability systems such as Prometheus and Grafana to monitor model performance and detect drift.
Collaborate closely with research, data, and robotics teams to integrate models into production systems.

Qualifications

Degree in Computer Science, Software Engineering, or a related field, with professional experience building and operating ML or software infrastructure in production.
Strong experience with distributed training systems using Kubernetes, Docker, PyTorch Distributed, DeepSpeed, and cluster schedulers such as SLURM.
Proficiency in developing CI/CD pipelines tailored for ML workloads.
Experience operating ML workloads on cloud platforms, preferably AWS.
Hands-on experience with experiment tracking and model versioning tools like MLflow or Weights & Biases.
Familiarity with monitoring tools such as Prometheus and Grafana.
Strong Python programming and system design skills.
Knowledge of infrastructure-as-code tools (e.g., Terraform) and ML orchestration tools is beneficial.
Exposure to large-scale multimodal ML systems and high-performance distributed computing environments is a plus.

Skills

Distributed Training, PyTorch Distributed, DeepSpeed, SLURM, Kubernetes, Containerization, CI/CD Pipelines, Experiment Tracking, Model Versioning, Prometheus, Grafana, Cloud Infrastructure (AWS), Python, System Design, Infrastructure as Code (Terraform), ML Orchestration, High-Performance Computing

ML Platform Engineer
