Public summary
Join a remote-first company that develops an open-source performance testing platform used globally by engineering teams to ensure resilient and high-performing systems. The role focuses on advancing operational excellence and reliability engineering practices for a large-scale distributed SaaS product, with opportunities to influence architecture and lead product development. The position is remote in Germany's time zone and offers a transparent, innovative culture emphasizing collaboration, autonomy, and career growth.
Salary
EUR 109,709.00 - 131,651.00 per year
Responsibilities
Define and scale a culture of operational excellence by establishing reliability standards and coaching teams on ownership and availability. Drive advanced DevOps and SRE practices such as incident management, alerting, observability, runbooks, and release/change management. Implement reliability frameworks, including SLIs, SLOs, and error budgets, and use metrics to guide prioritization and engineering decisions. Guide the design, development, and operation of distributed cloud systems. Influence product and system architecture through collaboration and technical leadership. Provide clear documentation and technical communication internally and externally. Evolve the role into broader application and product leadership as the reliability foundation matures.
Qualifications
Strong experience in DevOps and SRE practices operating production systems at scale. Proficiency or strong background in programming languages (primarily Python or Go). Expertise in designing, building, and operating large-scale distributed cloud systems. Deep understanding of reliability engineering concepts, including incident response, observability, and failure modes. Experience with test automation for performance and functional testing. Ability to influence engineering practices via clear communication, code review, and collaboration. Familiarity with modern software engineering processes. Self-driven and comfortable with autonomy and ambiguity. Bonus points for experience with containerization (Docker, Kubernetes), cloud platforms (AWS), observability tools, event-driven or asynchronous systems, and defining/applying SLIs/SLOs or error budgets.