I&O Global Edge Site Reliability Engineer (SRE)
SUMMARY OF THE JOB
Under the general direction of the Regional I&O Zone Head and the technical direction of the Global Digital Engineering Manager, the Edge Site Reliability Engineer is responsible for driving continuous improvement in uptime, availability and reliability, mass automation, and the evolution of systems to improve customer experience. He/She works in close collaboration with peers in Digital Engineering, Applications, Operations, Security, Enterprise Architecture, Process Management and Global SRE to drive service evolution. The outcome is robust operational capabilities married with product evolution that deliver results for the business.
The Edge SRE is also responsible for the availability of the digital applications in which he/she is involved, whether in the cloud or on premises. He/She might spend ~50% of their time in hands-on Operations within the Product Teams for SLO attainment and the remaining time:
● Engaging in healthy design debate to achieve a suitable balance between cost optimization and reliability.
● Providing expert consulting services within Product Teams to drive the minimization of manual tasks through mass automation and self-healing capabilities.
● Leading the postmortem process and outcomes whilst encouraging blameless review of defects / service impacts and identifying ways to improve.
● Influencing the adjustment of the end-to-end operations, release processes and technologies to drive attainment of SLO targets and increase product reliability.
● Guaranteeing general system uptime, with a focus on availability, to comply with the defined SLA, SLO and SLI.
● Defining metrics. As applications evolve over time, the Edge SRE is responsible for adapting the SLIs and SLOs accordingly and identifying significant projects that result in substantial cost savings or revenue.
● Spending <=50% of their time on hands-on Operational run activities (toil). The remaining time should be focused on reliability, performance and efficiency improvements for Products.
● Supporting the Problem Management process and Root Cause Analysis following P1 incidents by promoting:
○ Error budget control.
○ A postmortem culture of learning from errors.
○ A prompt reaction to security breaches, backed by an incident protocol.
● Maintaining a strong relationship with the security and operations teams to support continuous improvement of security assessments regarding:
○ Compliance (Agents/clients installed)
● Release strategy. Defining the parts involved, creating guidelines for version control and naming conventions, and recommending testing phases and releases.
● Contributing to new demand assessment by providing technical validation of the demand and taking charge of its reliability engineering component.
● Continuous improvement functions such as eliminating toil, learning through chaos engineering testing, and creating and collaborating on improvement plans. Supporting business continuity: helping with assessments where required, carrying out or participating in DR design, reviewing runbooks and helping to prepare for chaos engineering tests.
● Participating in communication strategies, presenting zone technical trends and reports on his/her function, and helping to prepare the training path for new Edge SREs with recommended readings, practices and training where required.
● Maintaining and reviewing the technology solutions catalog.
● Providing early engagement consulting to discuss specific architectures and design choices in detail, and to help validate assumptions with the help of targeted prototypes
● Assisting in ensuring that the Infrastructure & Operations practices and processes are aligned with:
○ Holcim business objectives and priorities (Health & Safety, Communication, Distribution Model, Innovation, …)
○ Holcim IT infrastructure strategy
○ Holcim Identity Management Systems
○ Holcim Business Systems
○ Holcim IT Security Policies and Directives
○ Holcim Demand, Project Portfolio and Finance Management Policies and standards.
- Sales, number of people, budget, volumes etc.:
○ Global infrastructure to support the overall group with a turnover of ~35 bn EUR
○ ~60 countries, ~4 000 sites, ~70 000 IT users and ~80 000 employees
○ 3 zones to coordinate (Americas, EMEA, APAC)
○ Infrastructure services provided on a 24x7x365 basis
○ Number of servers: ~10 000
○ CAPEX + OPEX management (budget to be defined)
- List of direct reports: N/A
- Key interfaces, stakeholders and relationships:
○ Enterprise Architects: collaborative design, validation and communication of standards, strategic alignment of road maps
○ Edge SRE: contributing to the definition of general guidelines for designs, assessment of new products and facilitating knowledge transfer inside the SRE network organization.
○ Product owner/project managers: driving technical design recommendations regarding availability, supporting active product release and maintenance and the different project phases (in case of waterfall project requirements).
○ I&O Security, RSOs: Enforcement, compliance and effective risk management.
○ I&O ITSM and ITSM community: Collaboration ensures consistent and effective service processes in domains of expertise. Contribution to Product management process design and alignment in collaboration with ITSM.
○ Financial Manager: Design of cloud solutions financial models and planning (MTP support)
○ Procurement and Vendor Management: Alignment on requirements and operational needs and management of key vendor relationships.
○ Technical account managers from AWS and the other most important technology providers: obtaining the best assessment of new trends.
○ Edge engineers, automation engineers, release engineers and technical finance engineers: supporting and complementing these roles so that all work together in an orchestrated way.
- Level of education/qualifications normally required:
○ Bachelor’s Degree in Information Technology or related discipline.
○ Distinctive qualifications relating to his/her area of expertise.
○ Preferred: AWS Solutions Architect certification and/or certifications from other public cloud service providers.
○ Desired: GCP Professional Cloud Architect certification.
- Specific work experience:
○ Experience working in a DevOps team
○ Previous experience in this role is desirable
○ Proven experience collaborating in technical designs oriented to availability and reliability.
○ Experience in using and integrating cloud solutions.
○ Experience designing for scalability, capacity planning and resource management
○ At least 3 years of experience in Cloud and DevOps teams
○ At least 7 years of experience in Applications, Infrastructure, Storage, Platforms
- Required technical / functional skills:
○ Well versed and proficient in Automation Tools for IaaS / PaaS services such as:
■ Infrastructure as Code (e.g. CloudFormation, Terraform, Azure RM, etc.)
■ Most commonly used cloud markup languages (YAML, JSON)
■ Configuration Management Tools (e.g. AWS Systems Manager, Ansible, Chef, Puppet, etc.)
■ Scripting for Operations (e.g. Bash, PowerShell, Python, etc.)
■ Source Control Management (Git, Bitbucket, GitLab, GitHub)
■ CI/CD Orchestration Tools (e.g. Bitbucket Pipelines, Jenkins, CircleCI, GitHub Actions, AWS CodeDeploy, Azure DevOps, etc.)
○ Proficient in operating IaaS services on AWS
- Preferred technical/functional skills:
○ Knowledge of Network and Security technologies (SD-WAN (Meraki, VeloCloud), MPLS, Cisco and Nexus switches, Cisco routers, Cisco firewalls (ASA, FPR) and load balancers, e.g. F5/NetScaler).
○ Understanding of Converged Infrastructure (e.g. VMware + Cisco UCS + EMC Storage) and Hyperconverged Infrastructure (e.g. Nutanix)
○ Knowledge of Citrix solutions (namely XenApp)
○ Understanding of VoIP and call centre technologies / architectures / operations.
○ Identity and Access Management: understanding of identity lifecycle, access management, identity federation, provisioning, certification, governance, Active/Google Directory, MFA, anti-virus and security, SAP GRC, SailPoint, Ping, etc.
○ General understanding of distributed systems (e.g. DBaaS, Hadoop-based systems, Kafka, etc.)
○ Knowledge of relational databases (Oracle, MS SQL Server, PostgreSQL, MySQL) and non-relational databases (MongoDB, Redshift, Couchbase, etc.).
○ Virtualization and Containerization Technologies (e.g. Kubernetes, Docker, Tanzu, VMware on AWS, etc.)
○ SAP Systems (BASIS administrators)
○ Disaster recovery tools (Druva, CPM, etc.)
○ End-to-End Monitoring tools (AppDynamics, Dynatrace)
- Behavioural competencies:
○ Open-minded, collaborative and an effective team player.
■ Trusted to de-escalate conflicts inside the team.
■ Sets an example for the team with positive and inclusive leadership and fosters proactive discussion.
○ Able to manage and negotiate challenging decisions and influence change in a positive manner.
○ Capable in written, presentation and verbal communication.
○ Ability to work in a multicultural, multi-location team.
○ Driven for success and aspiring to a culture of service excellence, always putting the customer, our people and our business at the centre of everything he/she does.
○ Ability to deal with ambiguity.
○ Ability to communicate openly and effectively with many diverse stakeholders, and with external vendors and auditors to obtain commitments.
○ Ability to work proactively and under pressure considering the criticality required to ensure the right quality of service for the business.
○ Capacity for innovation.
○ Reliable in delivering on commitments.
○ Relies on end-to-end solution design to deliver and sustain quality customer experience.
- Leadership and managerial abilities:
○ Ability to be an effective member of a multicultural virtual team of both internal and external subject specialists.
○ Ability to drive transformation and change management.
○ Ability to build trust relationships with internal personnel and external providers.
○ Demonstrated ability to network in a complex matrix type organization structure.
○ Ability to build good working relationships with providers.
○ Capable of mentoring interns and intermediate SREs in all areas, and other SREs in their area of expertise.
- Linguistic skills:
○ Fluency in English, both verbal and written.
○ Other local languages are an asset.
- Location Flexibility:
- Mobility requirements:
○ Occasional international travel