What we do
Our platforms serve as the foundation for digital R&D transformation across industries – helping teams innovate faster, collaborate securely, and operate efficiently across clouds.
HPC Systems Engineer JD (3–7 years in HPC)
Primary Responsibilities
Diagnose and resolve HPC issues (HPC applications, scheduler, storage).
Analyze job failures, performance bottlenecks, and system logs.
Manage and optimize job schedulers such as SLURM and PBS.
Perform cluster health checks and proactive monitoring.
Install and support HPC applications and user environments.
Troubleshoot networking issues (InfiniBand, Ethernet).
Identify recurring issues and implement permanent fixes.
Collaborate with L3 support on deep technical issues.
Automate routine operational tasks using scripts.
Update and improve standard operating procedures (SOPs), runbooks, and documentation.
Required Skills
Deep understanding of HPC architecture
Strong Linux administration skills
Experience with SLURM and PBS
Knowledge of parallel computing (MPI, OpenMP)
Storage systems (Lustre, NFS, GPFS)
Networking (InfiniBand preferred)
Scripting (Python, Bash)
Knowledge of AWS ParallelCluster and AWS Parallel Computing Service (PCS) is an advantage