DevOps Engineer
Role: Senior DevOps Engineer
Job Type: Contract
Work Location: Remote
Start Date: Candidates available immediately or within 30 days are preferred
Salary & Benefits: Negotiable based on expertise
Job Description:
We are looking for a Senior DevOps Engineer with extensive experience in designing and managing cloud-based infrastructure for AI/ML applications, as well as the broader IT infrastructure of a growing organization. This role will be critical in supporting the deployment of generative AI models, while also overseeing the infrastructure that underpins all of our enterprise applications and services. The ideal candidate will have a strong background in Infrastructure as Code (IaC), cloud services, and DevOps best practices.
Key Responsibilities:
Infrastructure Design and Management:
- Architect, implement, and manage scalable, secure, and highly available cloud infrastructure that supports LLM and generative AI model deployment.
- Design and maintain the organization's overall IT infrastructure, ensuring seamless operation of all enterprise applications, databases, and services.
- Develop and maintain Infrastructure as Code (IaC) solutions using tools like Terraform, AWS CloudFormation, or similar, to automate infrastructure provisioning and management across the organization.
Performance and Scalability:
- Ensure that infrastructure supporting AI models and organizational systems is optimized for high performance and scalability to handle millions of requests and large-scale data processing.
- Implement auto-scaling strategies for both AI and organizational environments to ensure resource efficiency and cost-effectiveness.
Security and Compliance:
- Establish and enforce security best practices across all organizational infrastructure, ensuring compliance with industry standards and regulations.
- Implement and maintain robust security protocols, including network security, data encryption, and identity management.
Automation and CI/CD:
- Design and implement CI/CD pipelines for AI model deployments and enterprise application updates.
- Automate routine tasks such as software deployments, system updates, and infrastructure monitoring to reduce manual intervention and increase operational efficiency.
Monitoring and Incident Management:
- Set up and maintain monitoring, logging, and alerting systems to ensure the health and performance of infrastructure across the organization.
- Lead incident management and response efforts, diagnosing and resolving issues promptly to minimize downtime and impact on the organization.
Collaboration and Support:
- Work closely with AI/ML teams, software developers, and IT staff to understand infrastructure needs and provide technical support.
- Collaborate with cross-functional teams to ensure smooth integration and operation of AI solutions within the broader organizational infrastructure.
- Provide mentorship and leadership to junior DevOps engineers and IT staff, fostering a culture of continuous learning and improvement.
Innovation and Continuous Improvement:
- Stay current with the latest developments in cloud infrastructure, DevOps practices, and AI/ML technologies.
- Proactively identify opportunities to improve infrastructure, tools, and processes, driving innovation and efficiency within the organization.
Qualifications:
Experience:
- 7+ years of experience in DevOps, cloud infrastructure, or a related field.
- Proven experience in designing and managing infrastructure for both AI/ML applications and enterprise IT systems.
- Strong expertise in Infrastructure as Code (IaC) using Terraform, CloudFormation, or similar tools.
- Experience with containerization and orchestration tools (Docker, Kubernetes) and with managing hybrid cloud environments.
Technical Skills:
- Proficiency in cloud platforms (AWS, GCP, or Azure) with a deep understanding of their services, particularly those relevant to AI/ML workloads and enterprise applications.
- Strong knowledge of CI/CD pipelines, monitoring, and alerting tools (e.g., Jenkins, GitLab CI, Prometheus, Grafana).
- Experience with distributed systems, microservices architecture, and high-performance computing environments.
- Solid scripting and automation skills (Python, Bash, or similar).
Soft Skills:
- Excellent problem-solving skills with the ability to troubleshoot complex systems in a dynamic environment.
- Strong communication skills, capable of explaining technical concepts to diverse stakeholders.
- Leadership and mentoring experience, with the ability to guide and develop a high-performing DevOps team.
Preferred Qualifications:
- Experience in deploying and managing AI/ML models in production environments.
- Certification in cloud platforms (AWS Certified DevOps Engineer - Professional, Google Cloud Professional Cloud DevOps Engineer, etc.).
- Familiarity with MLOps practices and enterprise IT infrastructure management.
Summary
- Employment Status: Contract
- Job Location: Remote