Aivar is an AI-first technology partner where cutting-edge technology meets industry expertise to supercharge your projects.
Experience: 3–7 years | Hands-on GPU optimisation required
Technical Focus: The specialist who takes AI deployments from "it works" to "sub-second latency at 40% lower cost." Own vLLM/Triton configurations, model quantisation (INT8, FP16, 4-bit), tensor parallelism on multi-GPU instances, AWS Inferentia optimisation, and performance benchmarking. Proven results: 40% cost reduction on Whisper ASR, 0.41s TTFT on Llama 70B, 85% throughput gain on YOLO via Inferentia.
- Deploy and tune vLLM with multi-GPU tensor parallelism, dynamic batching, PagedAttention, and KV cache optimisation for LLMs.
- Configure NVIDIA Triton for production multi-model serving with custom backends and model ensembles.
- Build TensorRT-LLM optimised model binaries for maximum throughput on L40S, A100, and H100 GPUs.
- Implement AWS Inferentia deployments using Neuron SDK: model compilation, operator support, performance tuning.
- Run comprehensive load testing (Locust) to map performance cliffs, optimal concurrency, and scaling thresholds.
- Execute model quantisation (INT8, FP16, GPTQ, AWQ) with rigorous quality-accuracy tradeoff analysis.
- Produce detailed benchmark reports with instance selection, scaling strategy, and cost-per-token recommendations.
- Neuron: experience optimising models for custom accelerators such as AWS Inferentia and Trainium.
- GPU-accelerated ML workloads in production (3+ years).
- LLM serving: vLLM, TensorRT-LLM, or Triton Inference Server (hands-on).
- GPU architecture: memory hierarchy, tensor cores, NVLink, NCCL multi-GPU communication.
- Model quantisation: INT8, FP16, mixed precision, GPTQ/AWQ.
- CUDA ecosystem: drivers, cuDNN, NVIDIA container toolkit.
- Performance engineering: profiling (Nsight, nvidia-smi, DCGM), bottleneck analysis, load testing.
- AWS GPU instances: G-series (L40S), P-series (A100), instance selection methodology.
vLLM, NVIDIA Triton, TensorRT-LLM, KServe, CUDA/cuDNN/NCCL/DCGM, AWS Inferentia/Neuron SDK, GPTQ/AWQ/bitsandbytes, Locust, Nsight Systems, Prometheus + DCGM Exporter, AWS (EC2 GPU, EKS, Capacity Blocks).