Principal Site Reliability Engineer
Spectraforce
Raleigh, North Carolina
a day ago
Job Description
Job Title: Principal Site Reliability Engineer
Location: Remote
Duration: 12 Months
About this role:
The client’s Product Portfolio Marketing and Learning team is seeking a Principal Site Reliability Engineer. In this role, you will be central to ensuring the availability, reliability, and performance of the client’s Demo Platform AI services and OpenShift Virtualization infrastructure, which support the client’s major marketing events, demos, workshops, and proof-of-concept projects using Red Hat and Intel AI products. You will be responsible for managing complex cloud systems, implementing Model as a Service infrastructure, monitoring LLM performance, and solving intricate problems that have a significant impact on the Demo Platform’s service quality, stability, and cost-effective operations.
As a Principal Site Reliability Engineer, you are expected to have extensive experience in software and systems engineering to automate operations, reduce toil, and lead continuous improvement across the service lifecycle. Additionally, the role involves collaborating with the engineering, go-to-market, and technical marketing teams, through in-person and remote interactions, to align their requirements and use cases with the Demo Platform’s functional capabilities and services.
This is a remote position that can be located anywhere in the United States or Canada. Successful applicants must reside where the client is registered to do business.
What you will do:
- Design, develop, and implement robust, scalable, and secure IT infrastructure solutions, aligned with business objectives and industry best practices.
- Implement automation and DevOps processes to improve the cloud life cycle, including infrastructure and application uptime, availability, right-sizing, and time-to-market.
- Collaborate with teammates and project stakeholders to meet timelines, goals, and SLAs.
- Design and implement a Model as a Service platform utilizing Red Hat AI products and GPU-enabled Intel hardware systems.
- Perform architectural planning, deployment, and management of OpenShift Container Platform environments.
- Architect and optimize virtualization solutions using KVM/QEMU, including advanced capabilities offered by OpenShift Virtualization (KubeVirt).
- Design and implement advanced network architectures, particularly Software-Defined Networking (SDN) and Open Virtual Network (OVN), ensuring high performance and reliability.
- Develop comprehensive storage strategies, including the design and administration of physical storage solutions and distributed storage systems like Ceph / OpenShift Data Foundation (ODF).
- Oversee the administration and automation of bare-metal infrastructure, ensuring optimal performance and resource utilization.
- Drive automation initiatives using Ansible and Red Hat Advanced Cluster Management for Kubernetes (ACM) for infrastructure provisioning, configuration management, and operational tasks.
- Establish and optimize CI/CD pipelines for infrastructure and platform deployments, promoting agile and efficient delivery.
- Provide technical leadership, mentorship, and guidance to engineering teams on architectural patterns and best practices.
- Evaluate new technologies and trends, recommending solutions that enhance our IT landscape and provide competitive advantages.
- Collaborate cross-functionally with development, operations, and business teams to gather requirements and translate them into architectural designs.
- Create and maintain detailed architectural documentation, including design specifications, diagrams, and operational guides.
- Contribute to performance testing and tuning, quality assurance (QA), and ticket and incident management.
What you will bring:
- 8+ years of progressive experience in IT architecture, with a significant focus on infrastructure design and implementation.
- 5+ years of experience with public cloud, virtualization, and Linux technologies, specifically KVM/QEMU, and a strong understanding of OpenShift Virtualization (KubeVirt).
- 5+ years of experience with Red Hat OpenShift Container Platform or Kubernetes, including cluster operations, networking, storage integration, and security.
- 3+ years of experience with automation frameworks and tools like Ansible or Terraform.
- Hands-on experience with bare-metal administration, including hardware provisioning, firmware management, and operating system deployment.
- Solid understanding and practical experience with CI/CD methodologies and tools for automated deployments.
- Strong problem-solving abilities, analytical skills, and a strategic mindset.
- Excellent communication, presentation, and interpersonal skills, capable of articulating complex technical concepts to diverse audiences.
- Experience with AI/ML technologies and recent developments, including OpenShift AI, inference systems, and technologies like vLLM.
- Proven experience with enterprise-grade storage solutions and Software-Defined Storage technologies, especially Ceph and ODF.
- Advanced knowledge of Software-Defined Networking (SDN) principles and practical experience with Open Virtual Network (OVN).
- Extensive experience with Red Hat Enterprise Linux (RHEL) administration and design.
- Experience with AppDev automation and pipelines, including technologies like Jenkins, Tekton, and ArgoCD.
- Experience with networking technologies, including VLANs, routing protocols, IPAM solutions, and load balancers.
- Experience in designing and delivering implementations using various public and private cloud infrastructure technologies and providers.
- Experience in Python development for automation, scripting, and tool development.
- Experience in Go development for building high-performance applications or infrastructure components.
- Relevant certifications (e.g., Red Hat Certified Architect, Kubernetes certifications, industry cloud certifications).
Applicant Notices & Disclaimers
- For information on benefits, equal opportunity employment, and location-specific applicant notices, click here
At SPECTRAFORCE, we are committed to maintaining a workplace that ensures fair compensation and wage transparency in adherence with all applicable state and local laws. This position’s starting pay is $90.00/hr.