Job Description

A Site Reliability Engineer (SRE) plays a pivotal role in ensuring that an organization's IT services and infrastructure are highly available, scalable, and efficient. This position often involves a blend of development, operations, and troubleshooting tasks.

  • System Reliability and Availability: Ensure high availability and reliability of services and infrastructure. This includes proactive monitoring, incident response, and post-mortem analysis to prevent recurrence of incidents.
  • Performance Management: Monitor and optimize system performance to meet service level objectives (SLOs) and service level agreements (SLAs). This involves understanding and managing the capacity and scalability of services.
  • Incident Management and Response: Lead the response to system outages and performance issues, including on-call duties. Develop automation tools that speed up incident resolution and help prevent recurrence.
  • Automation and Tooling: Design and implement automation tools and frameworks to reduce manual operational work. This could include scripts for deployment, monitoring, and infrastructure management.
  • Cross-functional Collaboration: Work closely with development teams to design and implement scalable, reliable, and efficient systems. This involves providing input on architectural decisions, optimizing resource utilization, and ensuring system resilience.
  • Continuous Improvement: Continuously analyze current processes and systems for improvement opportunities. Implement best practices for system reliability and availability.
  • Disaster Recovery and Backup: Develop and maintain disaster recovery plans, including regular testing to ensure system resilience.
  • Documentation: Maintain detailed documentation of the system architecture, configurations, processes, and service records so that knowledge is shared and accessible within the team.

Requirements:

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
  • Proven experience in a site reliability engineering role or similar, with a strong background in software development and system administration.
  • Technical Skills:
      • Proficiency in programming languages.
      • Experience with cloud services, containerization, and container orchestration tools (Docker, Kubernetes).
      • Strong understanding of networking principles and protocols.
      • Experience with continuous integration and deployment (CI/CD) practices.
  • Problem-Solving Skills: Ability to troubleshoot and resolve complex technical issues under pressure.
  • Communication Skills: Excellent verbal and written communication skills, with the ability to effectively communicate technical concepts to non-technical stakeholders.
  • Teamwork: Ability to work collaboratively in a cross-functional team and interact effectively with developers, operations teams, and management.

Job Benefits:

  • Loans.
  • Health insurance.
  • Game room.
  • Snacks.
  • Breakfast.
  • Lunch.
  • Occasional packages and gifts.
  • Learning stipends.
  • Resting space.
