Public summary
A global research technology company is seeking a Staff MLOps Engineer to lead the development and management of its AI/ML platform. The role owns the platform infrastructure that supports the company's AI initiatives, including synthetic data models and fraud detection. Responsibilities include building and enhancing training infrastructure, experiment tracking, and model serving and monitoring, as well as managing costs and mentoring engineers. Candidates should have deep ML platform expertise, broad engineering experience, and leadership skills. The position offers a high-autonomy, remote-friendly environment, primarily based in Europe, with a strong AI and engineering culture.
Responsibilities
Assess the current AI/ML training and serving infrastructure, decide whether to enhance or rebuild it, and own those decisions. Build shared AI/ML platform components such as training infrastructure, experiment tracking, a model registry, model serving, and monitoring for data and model drift. Oversee support for the full ML lifecycle, from data ingestion to annotation workflows. Manage training infrastructure on Databricks and Unity Catalog, ensuring fast, reproducible, and traceable model training. Implement model serving with low-latency APIs and batch scoring, integrated with existing services. Develop monitoring dashboards and alerts for model performance and drift. Optimize the cost and performance of ML compute resources and communicate ROI to stakeholders. Mentor and coach AI/ML engineers in best practices and lead AI tooling adoption for platform teams. Collaborate closely with AI/ML teams across locations.
Qualifications
Proven experience leading ML platform engineering at scale, with deep knowledge of feature stores, model registries, serving patterns, and ML observability. Strong engineering background across multiple disciplines, with senior leadership and mentorship experience. Skilled in platform product thinking, including API design, documentation, and user adoption. Expertise in Databricks, Unity Catalog, Kubernetes, AWS EKS, and cloud infrastructure. Proficiency in Python, Scala, Java, and Terraform. Commercial awareness, with the ability to justify platform investments to executive leadership. Comfortable working in a high-autonomy, remote-friendly environment with AI-native engineering tools and approaches.