As an SRE, you will play a crucial role in ensuring the reliability, performance, and scalability of our production systems.
8+ years of experience in production support or SRE roles.
Own the reliability of the central AI models and agent registry, deployment pipelines, AI SecOps products, and other products in our portfolio.
Ensure the quality, security, reliability, and compliance of solutions by applying SRE best practices.
Own incident management and root cause analysis, and implement preventive measures.
Support capacity planning, disaster recovery planning, and cost management.
Collect and analyse operational data and identify SLIs from key metrics to define achievable SLOs for the project set.
Collaborate with data scientists and other stakeholders to collect feedback and incorporate it into solutions.
Automate processes by leveraging predictive monitoring, auto-scaling, or self-healing.
Apply performance analysis, log analytics, and automated testing, and communicate areas for improvement.
Work in an agile way, foster strong collaboration and communication between development and operations, and contribute to the engineering culture in the team.