At Weights & Biases, our mission is to build the best developer tools for machine learning. Weights & Biases is a Series C company with $200 million in funding and a rapidly growing user base. Our platform is an essential part of the daily work of machine learning engineers, from academic research institutions like FAIR and UC Berkeley to massive enterprise teams including iRobot, OpenAI, Toyota Research Institute, Samsung, NVIDIA, Salesforce, Blue Cross Blue Shield, Lyft, and more.
As a Senior Site Reliability Engineer, you’ll own the monitoring and observability stack, working closely with the Infrastructure Team and other developers to scale wandb.ai in lockstep with our exponentially growing user base and fleet of customer deployments. You’ll be instrumental in building the foundations of an SRE team at a fast-growing startup, establishing the patterns and practices necessary to operate highly reliable services at scale.
What you'll achieve
- Scale a system trusted by leaders in the ML industry to ingest and query terabytes of data daily.
- Build a monitoring and observability platform to pinpoint issues across a fleet of customer deployments.
- Establish the foundations of an SRE team at a fast-growing startup.
- Advise and educate development teams on how to build observable, reliable services.
What's needed in this role
- In-depth knowledge of at least one cloud provider (AWS, GCP, Azure).
- Strong grasp of at least one high-level language and its ecosystem (Go, Python, TypeScript, etc.).
- A willingness to dive into and debug issues at any layer of the tech stack, from the application layer to the network.
- Deep experience managing, monitoring, and debugging distributed systems and databases (MySQL, Postgres, Bigtable, etc.) in production.
- A demonstrated ability to think critically under pressure.
- Excellent communication skills and an ability to explain deeply technical concepts simply.