bet365, one of the world's leading online gambling companies, is a driving force in the development of enterprise and Internet technology. We have rapidly grown into a global operation, delivering an unrivalled online experience to more than 45 million customers in 20 languages.
The Site Reliability team is looking for a Site Reliability Engineering Technical Lead to join this exciting and vital part of the bet365 family.
The Site Reliability Engineering (SRE) team has two focus areas:
- Develop software to automate the Operational activity required to keep the site running smoothly.
- Build dashboards, monitoring and alerting to ensure that System Health is understood in real time.
Given the scale, load, breadth and complexity of our estate, SRE is a key part of how we intend to continue our success. The SRE team at bet365 was formed in 2020, so this is a new venture, and joining now offers a real opportunity to help shape the team and its approaches, practices and technology choices.
We're looking for a Technical Lead with strong experience of Software Engineering principles and of how they can be applied in an Operational and Infrastructure context. As a Technical Lead you will ensure that the team achieves technical excellence in all its work through close adherence to contemporary software engineering principles.
We maximise the use of system data to help the business make more informed decisions regarding capacity requirements and application health in the production estate. The right candidate will have a strong understanding of how to analyse and ensure System Health and the ability to overcome obstacles that arise in bringing this to life in existing systems.
We hire people with a broad set of technical skills who are ready to tackle some of technology's greatest challenges and continue to break new ground in software innovation.
What will you be doing?
- Leading a team in utilising automation and orchestration platforms (e.g. Ansible, Jenkins) to automate manual activity
- Leading a team building sophisticated monitoring solutions using log data, metrics and events in modern monitoring and graphing technologies (e.g. Grafana, Splunk, ELK, Nimsoft)
- Running proofs of concept for new technology and innovation, and carrying out development and configuration activities to agreed timescales, in line with the agreed Software Development Lifecycle for the SRE team
- Taking accountability for the quality of team output
- Mentoring colleagues in the use of new technologies and practices
- Contributing to discussions about suitable architecture and technology choices for SRE software
- Evangelising SRE principles and practices to other areas of the Technology department
What do you need to excel in this role?
- Excellent knowledge of contemporary monitoring and analytics tooling and best practice
- Strong experience with Automation and orchestration platforms (e.g. Ansible, Jenkins)
- Strong experience of working directly with infrastructure, networking and application monitoring systems
- Strong experience working in a large-scale, 24/7 enterprise where system uptime and stability are of paramount importance to the business
- Ability to handle and thrive under pressure, often multitasking and dealing with reprioritisation of work
- Ability to hold yourself to account, to set high expectations of your work and to learn from mistakes in an open and transparent way
- Ability to work autonomously, but also to collaborate well and progress work as part of a cross-functional team