General information

Location
Cambridge, MA
Ref #
39560
Job Family
Administration
Date published
25-Apr-2024
Time Type
Full time

Description & Requirements

The Broad IT Services (BITS) group believes exceptional people produce exceptional products and services. We are committed to building the best team we can in service of the Institute's mission of "Accelerating the Understanding and Treatment of Disease." 


Our team of highly accomplished technical experts works with thousands of Broad researchers to create, scale, and run a wide range of technology solutions. We believe that a diverse and inclusive community is essential to achieving our mission. We are always looking for committed, mission-driven individuals to bring new viewpoints, experiences, and creativity to the team. We are seeking driven candidates who are motivated to learn new technologies and are willing to take on challenges with enthusiasm!


We are looking for an experienced Principal System Engineer to join our HPC/Cloud/Orchestration IT Infrastructure team. This role sits at the intersection of technology, engineering, and science. We’re looking for a teammate who thrives in a creative, opportunistic environment, adjusts well to change, and has the passion to make themselves, their colleagues, and our collaborators better.


The candidate is expected to collaborate regularly with other skilled engineers to design and administer the underlying IT infrastructure that enables both our research applications and scientific pipelines. A successful candidate will need to work effectively with a wide community of technologists, researchers, and external partners to contribute technical solutions that advance biomedical and genomic data science. The position supports the Product Owner by translating the needs of our researchers into technical requirements, designing corresponding solutions, implementing technical capabilities, and iterating with stakeholders to ensure requirements are met. Technical needs will vary with the research, so the selected candidate must be a generalist interested in High Performance Computing, cloud computing, Infrastructure as Code, and more. This is a hybrid role with two days per week on-site.


KEY RESPONSIBILITIES

  • Design, deploy, and manage the infrastructure that enables Hybrid Cloud HPC capabilities, blending both on-prem and cloud computational capabilities.

  • Maintain and optimize HPC cluster performance, ensuring high availability and scalability. 

  • Develop and implement automation tools for cluster provisioning, configuration management, and job scheduling.

  • Stay up to date on emerging HPC technologies and software trends to support genomic sequencing at scale with distributed scheduling technologies such as Slurm or GridEngine.

  • Stay updated on industry trends, best practices, and emerging technologies in SRE and DevOps domains, and apply them to enhance our systems and processes.

  • Provide technical consulting to researchers and scientists using the HPC cluster. This may include assisting users with job submission, optimization, and troubleshooting, and collaborating with researchers to understand their computational needs and tailor HPC/Cloud solutions accordingly.

  • Work well with others to drive technology adoption and best practices at scale, from early prototypes through reusable code and adoption playbooks.

  • Perform capacity planning and resource allocation for optimal cluster utilization, analyze and optimize the performance of HPC applications, and contribute to the development of HPC best practices and policies.

  • Be mission-driven; value trust, inclusion, teamwork, and continuous improvement; and like to have fun!

EXPERIENCE & QUALIFICATIONS 

  • 7+ years of experience in designing, deploying, and managing HPC clusters (experience with specific vendors such as Dell is a plus).

  • 5+ years of experience in designing, deploying, and working with cloud computing platforms (e.g., GCP, Azure, or AWS).

  • Deep knowledge of infrastructure as code (IaC) tools such as Terraform, Puppet, or Ansible.

  • Proven experience with at least one programming or scripting language (e.g., Bash, Python) for automating HPC cluster tasks and managing job schedulers (e.g., Slurm, GridEngine, LSF).

  • In-depth knowledge of HPC Linux environments (e.g., CentOS, RHEL) with experience in system administration, resource management, and performance optimization.

  • Familiarity with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or similar.

  • Experience with container orchestration tools such as Kubernetes and containerization technologies like Docker.

  • Comfortable with Agile methodologies and practices.

  • Previous experience in mentoring and coaching junior engineers.

  • BS/BA or MS in Computer Science, Computer Information Systems, Bioinformatics, Mathematics, or a similar discipline is required.