Public summary

A leading healthtech company in Europe is seeking a Director of Site Reliability Engineering to own and lead its core infrastructure: cloud, databases, networking, and observability. The role involves leading an organization of 25+ engineers, driving a multi-cloud strategy, improving incident response, and shaping a reliability culture that directly affects patient outcomes and the pace of innovation. The ideal candidate has 12+ years in software engineering, at least 5 of them leading managers and running infrastructure or SRE organizations, along with expertise in multi-cloud and large-scale database operations and strong leadership and communication skills. English fluency is required; German is not mandatory.

Location and work setup

Location: Berlin
Remote status: Hybrid
German requirement signal: No German requirement detected
Detected job language: English

Responsibilities

Lead and scale a Site Reliability Engineering organization of over 25 engineers across Cloud Infrastructure, Database & Storage, Network Infrastructure, Observability Tooling, and Operations Center.
Develop and execute infrastructure strategy and roadmap aligned with company objectives.
Manage incident response standards and drive mean time to recovery (MTTR) reductions through cultural and process improvements.
Architect and implement multi-cloud strategies to reduce vendor lock-in and support international growth.
Oversee large-scale network infrastructure including load balancing, CDN/WAF, virtual private clouds, peering, and zero-trust networking.
Champion observability products to provide system health visibility to engineering teams.
Serve as a senior technical leader and key voice on platform and technology leadership teams.

Qualifications

Minimum 12 years of experience in software engineering, with at least 5 years leading managers and running infrastructure or SRE organizations at scale.
Proven track record of transitioning SRE functions from reactive to proactive, with measurable improvements in incident frequency and MTTR.
Strong expertise in multi-cloud architectures and in network infrastructure at high traffic scale, including load balancing, CDN/WAF, VPCs, and peering.
Deep experience in large-scale database operations (PostgreSQL, Aurora), streaming/CDC technologies (Kafka), and data-layer financial operations.
Skilled in building observability platforms covering metrics, logs, traces, and alerting.
Proficient in implementing SLOs, error budgets, incident management, and blameless postmortems.
Excellent communication and influencing abilities at executive and technical levels.
Fluent in English; German language skills are a bonus but not required.
Familiarity with healthcare compliance standards is advantageous.

Skills

site reliability engineering, cloud infrastructure, database operations, network infrastructure, observability tooling, multi-cloud strategy, incident response, load balancing, CDN/WAF, VPCs, peering, zero-trust networking, PostgreSQL, Aurora, Kafka, metrics, logs, traces, alerting, SLOs, error budgets, blameless postmortems, cost efficiency, engineering culture, leadership