GPU/ML Systems Engineer

Aivar Innovations
Bengaluru, Karnataka

About Aivar Innovations

Aivar is an AI-first technology partner where cutting-edge technology meets industry expertise to supercharge your projects.

Team: Accelerators

Experience: 3–7 years | Hands-on GPU optimisation required

Technical Focus: You are the specialist who takes AI deployments from "it works" to "sub-second latency at 40% lower cost." You will own vLLM/Triton configurations, model quantisation (INT8, FP16, 4-bit), tensor parallelism on multi-GPU instances, AWS Inferentia optimisation, and performance benchmarking. Proven results: 40% cost reduction on Whisper ASR, 0.41s time to first token (TTFT) on Llama 70B, and 85% throughput gain on YOLO via Inferentia.
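To give candidates a feel for the sizing arithmetic behind quantisation and tensor parallelism, here is a minimal sketch that estimates weight memory for a 70B-parameter model at each precision and the GPU count that follows from it. The function names, the 0.7 weight budget (the rest is left for KV cache and activations), and the power-of-two rule are illustrative assumptions, not Aivar's sizing method.

```python
# Bytes per parameter for the precisions named above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight footprint in GB (10^9 bytes); excludes KV cache and activations."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_tensor_parallel_size(num_params: float, precision: str,
                             gpu_memory_gb: float,
                             weight_budget_fraction: float = 0.7) -> int:
    """Smallest power-of-two GPU count whose weight budget holds the model.

    weight_budget_fraction is a hypothetical planning knob: the share of each
    GPU's memory reserved for weights.
    """
    needed = weight_memory_gb(num_params, precision)
    budget_per_gpu = gpu_memory_gb * weight_budget_fraction
    gpus = 1
    while gpus * budget_per_gpu < needed:
        gpus *= 2  # tensor-parallel sizes usually have to divide the head count
    return gpus


if __name__ == "__main__":
    for precision in ("fp16", "int8", "4bit"):
        gb = weight_memory_gb(70e9, precision)
        tp = min_tensor_parallel_size(70e9, precision, gpu_memory_gb=80)
        print(f"{precision}: {gb:.0f} GB of weights, tensor parallel size {tp} on 80 GB GPUs")
```

A real sizing exercise would also account for KV cache growth with context length and batch size, which this sketch deliberately leaves out.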

Key Responsibilities:

Deploy and tune vLLM with multi-GPU tensor parallelism, dynamic batching, PagedAttention, and KV cache optimisation for LLMs.
Configure NVIDIA Triton for production multi-model serving with custom backends and model ensembles.
Build TensorRT-LLM optimised model binaries for maximum throughput on L40S, A100, and H100 GPUs.
Implement AWS Inferentia deployments using Neuron SDK — model compilation, operator support, performance tuning.
Run comprehensive load testing (Locust) to map performance cliffs, optimal concurrency, and scaling thresholds.
Execute model quantisation (INT8, FP16, GPTQ, AWQ) with rigorous analysis of the quality and accuracy tradeoffs.
Produce detailed benchmark reports with instance selection, scaling strategy, and cost-per-token recommendations.
Optimise models for custom accelerators such as AWS Inferentia and Trainium using the Neuron SDK.
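As an illustration of the benchmarking responsibilities above, the following sketch turns load-test measurements into a concurrency recommendation and a cost-per-token figure. The data class, the function names, the latency target, the hourly price, and every number in the sample data are made up for the example; none of them come from Aivar's benchmarks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoadTestPoint:
    concurrency: int          # simultaneous simulated users
    tokens_per_second: float  # aggregate output throughput
    p95_latency_s: float      # 95th percentile end-to-end latency


def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Instance cost divided by hourly token output, scaled to one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


def best_point_within_slo(points: List[LoadTestPoint],
                          p95_slo_s: float) -> Optional[LoadTestPoint]:
    """Highest-throughput point that still meets the latency target, if any."""
    eligible = [p for p in points if p.p95_latency_s <= p95_slo_s]
    if not eligible:
        return None
    return max(eligible, key=lambda p: p.tokens_per_second)


# Hypothetical sweep: throughput flattens while latency climbs past 32 users,
# which is the kind of performance cliff a load test is meant to expose.
points = [
    LoadTestPoint(8, 400.0, 0.6),
    LoadTestPoint(16, 750.0, 0.9),
    LoadTestPoint(32, 1000.0, 1.8),
    LoadTestPoint(64, 1050.0, 4.5),
]

best = best_point_within_slo(points, p95_slo_s=2.0)
cost = cost_per_million_tokens(hourly_price_usd=3.60,
                               tokens_per_second=best.tokens_per_second)
print(f"Recommended concurrency: {best.concurrency}, "
      f"cost per million tokens: ${cost:.2f}")
```

In practice the measurements would come from a load-testing tool such as Locust, and the same calculation would be repeated per instance type to support instance selection.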

Must-Have Technical Skills:

GPU-accelerated ML workloads in production (3+ years).
LLM serving — vLLM, TensorRT-LLM, or Triton Inference Server (hands-on).
GPU architecture — memory hierarchy, tensor cores, NVLink, NCCL multi-GPU communication.
Model quantisation — INT8, FP16, mixed precision, GPTQ/AWQ.
CUDA ecosystem — drivers, cuDNN, NVIDIA container toolkit.
Performance engineering — profiling (Nsight, nvidia-smi, DCGM), bottleneck analysis, load testing.
AWS GPU instances — G-series (L40S), P-series (A100), instance selection methodology.
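For candidates who want a concrete reference point on the quantisation skill listed above, here is a minimal sketch of symmetric per-tensor INT8 quantisation with a simple error measurement. It is plain round-to-nearest, not GPTQ or AWQ, and the function names and sample weights are invented for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    """Average absolute difference, a crude proxy for quality loss."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Invented sample weights; real tensors have millions of entries.
weights = [0.02, -0.75, 0.33, 1.27, -1.10, 0.004]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
error = mean_abs_error(weights, restored)
print(f"scale={scale:.4f}, codes={codes}, mean abs error={error:.5f}")
```

Each value's rounding error is bounded by half the scale, so a single large outlier widens the scale and degrades every other weight; methods such as GPTQ and AWQ exist largely to manage that effect.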

Core Tech Stack:

vLLM, NVIDIA Triton, TensorRT-LLM, KServe, CUDA/cuDNN/NCCL/DCGM, AWS Inferentia/Neuron SDK, GPTQ/AWQ/bitsandbytes, Locust, Nsight Systems, Prometheus + DCGM Exporter, AWS (EC2 GPU, EKS, Capacity Blocks).
