Dijitprogram - job | Service Reliability Engineers

Job Description

About Us: We are committed to building software that solves real-world problems. Our Site Reliability Engineers (SREs) play a crucial role in keeping our systems reliable, scalable, and efficient. We are looking for an experienced SRE to join our team and help us maintain and improve our infrastructure.

Responsibilities:

Monitor and Maintain Systems: Ensure the availability, performance, and reliability of our production environment by monitoring system health and responding to incidents.
Automation: Develop and implement automation tools to reduce manual intervention and improve system efficiency.
Collaboration: Work closely with development teams to design and implement scalable and reliable systems.
Performance Tuning: Analyze system metrics to identify bottlenecks and optimize performance.
Incident Management: Lead incident response efforts, conduct root cause analysis, and implement preventive measures.
Documentation: Create and maintain comprehensive documentation for system architecture, processes, and procedures.
Capacity Planning: Plan capacity so that systems can handle future growth.

Qualifications:

Experience: 6+ years in site reliability engineering, operations, or software engineering.
Education: Bachelor's degree in Computer Science, Engineering, or a related field.
Technical Skills: Proficiency in scripting languages (e.g., Python, Ruby), experience with containerization (Docker, Kubernetes), and familiarity with cloud platforms (AWS, GCP, Azure).
System Knowledge: Strong understanding of Linux/Unix systems, networking, and infrastructure components.
Problem-Solving: Excellent troubleshooting and problem-solving skills.
Communication: Strong communication and collaboration skills to work effectively with cross-functional teams.
Certifications: Relevant certifications (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator) are a plus.

Preferred Skills:

Experience with configuration management tools (e.g., Ansible, Chef, Puppet).
Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).

Why Join Us:

Innovative Environment: Work on cutting-edge technologies and projects.
Growth Opportunities: Access to professional development and career advancement.
Collaborative Culture: Join a team that values collaboration, diversity, and inclusion.
Competitive Benefits: Comprehensive benefits package including health insurance, retirement plans, and more.

• Experience: 6+ years in Site Reliability Engineering, operations, or software engineering.
• Education: Bachelor’s degree in Computer Science, Engineering, or a related field.
• Scripting: Proficiency in Python, Ruby, or similar scripting languages.
• Containerization: Docker, Kubernetes.
• Cloud Platforms: AWS, GCP, or Azure.
• Systems Knowledge: Strong understanding of Linux/Unix, networking, and infrastructure components.
• Problem-Solving: Advanced troubleshooting skills.
• Collaboration: Strong communication skills for cross-functional teamwork.

• Certifications: AWS Certified Solutions Architect, Certified Kubernetes Administrator.
• Configuration Management: Ansible, Chef, Puppet.
• CI/CD: Jenkins, GitLab CI, or similar tools.
• Monitoring & Logging: Prometheus, Grafana, ELK stack.
• Performance Optimization: Experience with capacity planning and tuning.