Distributed Systems Performance Engineer

12 Oct 2018
10 Nov 2018
Contract Type: Full Time

It seems that every problem today requires technology to solve it, and most of those solutions require massive-scale computing: genome editing, curing cancer, autonomous vehicles, simulating reality. Until now, the only ways to implement those solutions have been costly, complex, or both. Supercomputers, the cloud and edge computing all have their issues: they’re hard to operate and scale reliably, difficult to build and manage, and too unpredictable and expensive to budget for and fund. Fundamentally, these problems stem from a stack that was designed 30 to 40 years ago.

What We’re Looking For

You are familiar with the techniques used in decentralized/distributed systems to achieve Reliability, Availability and Consistency.

You have a varied development background, with understanding of, and contributions to, various levels of the lower part of the stack (kernel, network protocols, middleware).

You have strong development experience in ‘systems’ languages (C, C++, Rust, Assembly):

Have at least 5 years of performance-focused experience at the operating-system level

Have at least 3 years of experience focused on distributed workloads.

You have extensive optimisation and scaling experience in ‘systems’ languages (C, C++, Rust, Assembly). You should have at least one of the following (or comparable):

Spent time developing libraries for HPC environments, understanding the low-level limitations and bottlenecks that guide your design toward optimal performance

Profiled a distributed database, identified its bottlenecks and their underlying causes, and fixed them to achieve better performance than any single-machine database can claim

Been inspired by a popular series of blog posts that test software against its CAP claims, and analysed your own projects to identify their guarantees and performance in unusual situations

Spent a year at a top-500-traffic site, optimizing the JIT compiler of the custom in-house JVM the company relied on to meet its traffic guarantees

You’ve benchmarked actor systems using the popular ‘ring test’ and can discuss how you would improve the performance of existing systems, either through redesign or through more targeted changes

You have an evidence-based approach to performance. Speculative or premature optimization holds no interest for you.

You can point to concrete past examples of cross-cutting performance changes you've made in a large system.

You have experience in estimating performance characteristics and analysing designs for potential performance problems.

What You’ll Be Doing

Making things go fast! (YOLO)

Using performance and profiling tools to identify hotspots, bottlenecks and other areas for improvement, and addressing them by fixing the code or reconfiguring the system.

Writing tooling to automate performance analysis and capture results, integrated with our Continuous Integration system and reporting tools.

Becoming the Owner of Performance on the team and championing a performance-minded approach in design discussions, code reviews and engineering culture in general.

Documenting your code, as well as contributing to other internal documentation, external documentation, and our company blog.

Participating in the ideation phase of planning, providing your unique perspective on the prioritization, design, and implementation of engineering work.

The usual agile things, you know the drill: Scrum, Kanban, Scrumban, stand-ups, retros, pull requests, Gitflow, opening and closing Jira tickets, etc.

Nice To Haves

Experience (at least 5 years) building massive-scale data ingestion and manipulation systems at a global industry leader (e.g., Google, Facebook, Apple, Amazon).

Experience working with “industry standard” benchmarks (e.g., SPEC, TPC)

Experience with or affinity for the Rust programming language.

Experience with OS profiling (e.g., kernel, file systems, network stack)

Experience writing distributed system analysis tools, made up of components such as:

Network monitoring tools (e.g., Wireshark)

Distributed process profiling (e.g., Erlang's Percept)

Distributed debuggers

Distributed clock synchronization
