Job Information

SLAC National Accelerator Laboratory Senior Linux HPC Engineer in Menlo Park, California

Senior Linux HPC Engineer

Job ID: 5435

Location: SLAC - Menlo Park, CA

Full-Time

Regular


Position Overview:

Would you like to configure and troubleshoot modern high-performance Linux clusters? Does contributing to breakthrough discoveries in science and medicine excite you? SLAC National Accelerator Laboratory seeks an energetic, motivated developer-operations engineer who enjoys teamwork, learning cutting-edge technologies, and engaging with the user community.

SLAC is one of the world’s premier research laboratories, with internationally leading capabilities in photon science, accelerator physics, high energy physics (HEP), and energy sciences. The Controls & Data Systems (CDS) Division in the Technology Innovation Directorate (TID) is involved in many national and international projects, which, among others, include the Rubin Observatory, the Linac Coherent Light Source (LCLS) user facility, CryoEM user facilities, the LHC ATLAS detector at CERN, and accelerator controls for LCLS-II.

This position will play a critical role in deploying, maintaining, and monitoring the large-scale scientific computing infrastructure that supports SLAC’s data analysis and machine learning capabilities. Thousands of scientists worldwide rely on these systems to perform their research activities. We are seeking a creative, resourceful system administrator who can assist in the configuration, deployment, and ongoing maintenance of various platforms and services across hundreds of nodes, using best-in-class automation platforms. Our working environment is highly collaborative. Skills and responsibilities are shared across our department. We value strong communication and documentation. Host platforms include bare-metal servers, virtual machines, and Kubernetes.

We encourage free-thinking, open dialog and provide opportunities to explore and implement new technologies and ideas. There is huge potential for career growth. High-performance computing is recognized as a SLAC core competency.

Given the nature of this position, SLAC is open to on-site and hybrid work options.

Your specific responsibilities will be to:

  • Lead the management and system administration tasks across hundreds of Linux hosts

  • Maintain and extend our scientific software catalog - working closely with our scientific partners to build workflows and pipelines

  • Architect, administer and tune our batch scheduling systems

  • Lead contributions to developing and standardizing our configuration management platform

  • Help architect and support core monitoring and alerting (notification) capabilities to track health and performance

  • Support day-to-day operations and troubleshooting of scientific computing services and infrastructure

  • Help direct and perform end-user support via our incident platforms and communication channels

  • Maintain all relevant documentation for administration procedures

To be successful in this position you will bring:

  • Bachelor's degree in computer science, physics, or a related field and 8 years of relevant experience in information technology, systems administration, or high-performance computing

  • Proven ability to work effectively in a team environment with excellent organizational and communication skills

  • In-depth technical understanding and proven success partnering with scientific teams to understand and implement large- and small-scale computational and data-driven workflows

  • Demonstrated ability to lead projects and teams to completion and help drive our scientific mission

  • Extensive experience with Linux system management, monitoring, and open-source software

  • Expertise and experience in frameworks and scripting for large distributed systems (Ansible, Bash, and Python preferred)

  • Expertise with distributed compute and storage systems, high performance computing systems, and networking

  • Proven ability and experience in deploying and troubleshooting full-stack software and hardware systems in large, complex clustered environments

  • Experience leading and promoting best practices

In addition, preferred experience includes:

  • Expert knowledge of high-throughput and high-performance frameworks and techniques (MPI, containerization, low-level scientific software libraries)

  • Expert knowledge of configuration management systems (Ansible preferred)

  • Knowledge of Kubernetes primitives and architecture

  • Expert experience managing and configuring Linux and related applications

  • Extensive experience with system and service monitoring (Prometheus, InfluxDB, Grafana, Loki)

SLAC Employee Competencies:

  • Effective Decisions: Uses job knowledge and solid judgment to make quality decisions in a timely manner.

  • Self-Development: Pursues a variety of venues and opportunities to continue learning and developing.

  • Dependability: Can be counted on to deliver results with a sense of personal responsibility for expected outcomes.

  • Initiative: Pursues work and interactions proactively with optimism, positive energy, and motivation to move things forward.

  • Adaptability: Flexes as needed when change occurs, maintains an open outlook while adjusting and accommodating changes.

  • Communication: Ensures effective information flow to various audiences and creates and delivers clear, appropriate written, spoken, and presented messages.

  • Relationships: Builds relationships to foster trust, collaboration, and a positive climate to achieve common goals.

Physical Requirements and Working Conditions:

  • You are expected to reside locally and work onsite up to 3 days a week

  • Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of the job.

  • May work extended hours during peak business cycles.

Work Standards:

  • Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.

  • Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for environment, safety and security; communicates related concerns; uses and promotes safe behaviors based on training and lessons learned. Meets the applicable roles and responsibilities as described in the ESH Manual, Chapter 1—General Policy and Responsibilities: http://www-group.slac.stanford.edu/esh/eshmanual/pdfs/ESHch01.pdf

  • Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University's Administrative Guide, http://adminguide.stanford.edu

Classification Title: System Administrator 3

Grade: K

Job code: 4833

Duration: Regular Continuing

The expected pay range for this position is $129,000 to $157,000 per annum. SLAC National Accelerator Laboratory/Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location and external market pay for comparable jobs.

SLAC National Accelerator Laboratory is an Affirmative Action / Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All staff at SLAC National Accelerator Laboratory must be able to demonstrate the legal right to work in the United States. SLAC is an E-Verify employer.
