I&O Edge Site Reliability Engineer (SRE)
Under the general direction of the I&O SRE & Technical Process Manager, the Edge Site Reliability Engineer is responsible for driving continuous improvement in uptime, availability and reliability, for large-scale automation, and for evolving systems to improve the customer experience. He/She works in close collaboration with peers in Digital Engineering, Applications, Operations, Security, Enterprise Architecture, Process Management and Global SRE to drive service evolution. The outcome is robust operational capabilities combined with product evolution that delivers results for the business. The Edge SRE is also responsible for the availability of the digital applications in which he/she is involved, whether in the cloud or on premises. He/She may spend roughly 50% of his/her time on hands-on Operations within the Product Teams for SLO attainment, and the remaining time on:
Engaging in healthy design debate to achieve a suitable balance between cost optimization and reliability.
Providing expert consulting services within Product Teams to minimize manual tasks through large-scale automation and self-healing capabilities.
Leading the postmortem process and its outcomes, encouraging blameless review of defects and service impacts, and identifying ways to improve.
Influencing adjustments to end-to-end operations, release processes and technologies to attain SLO targets and increase product reliability.
Supporting the Problem Management process and Root Cause Analysis following P1 incidents by promoting:
o Error budget control.
o A postmortem culture of learning from errors.
o An incident protocol for responding to security breaches.
Maintaining a strong relationship with the Security and Operations teams to support continuous improvement of security assessments regarding:
o Compliance (agents/clients installed)
Defining the release strategy: identifying the parties involved, creating guidelines for version control and naming conventions, and recommending testing phases and release steps.
Contributing to the assessment of new demand by providing technical validation and taking charge of the reliability engineering component of the demand.
Carrying out continuous improvement functions such as eliminating toil, learning through Chaos Engineering testing, and creating and collaborating on improvement plans. Liaising with business continuity: helping with the assessment when required, leading or participating in the DR design, reviewing the runbooks, and helping to prepare for Chaos Engineering tests.
Participating in communication strategies: presenting technical trends for the zone, reporting on his/her function, and helping to prepare the training path for a new Edge SRE with recommended readings, practices and training where required.
Maintaining and reviewing the technology solutions catalog.
Providing early-engagement consulting to discuss specific architectures and design choices in detail, and to help validate assumptions through targeted prototypes.
Assisting in ensuring that Infrastructure & Operations practices and processes are aligned with:
o LafargeHolcim business objectives and priorities (Health & Safety, Communication, Distribution Model, Innovation, ...)
o LafargeHolcim IT infrastructure strategy
o LafargeHolcim Identity Management Systems
o LafargeHolcim Business Systems
o LafargeHolcim IT Security Policies and Directives
o LafargeHolcim Demand, Project Portfolio and Finance Management Policies and standards.