Engineering experience; hands-on experience with Databricks in production.
- Apache Spark internals - Catalyst optimizer, Tungsten engine, AQE, DAG scheduler, shuffle
behavior, partitioning, broadcast/sort-merge joins, data skew handling, and Spark 4.0
capabilities.
- Databricks platform depth - Delta Lake (transaction log, OPTIMIZE, ZORDER, VACUUM, liquid
clustering, schema evolution, time travel, CDC/MERGE), Lakeflow Declarative Pipelines, Unity
Catalog (governance, lineage, fine-grained access), Photon engine, Databricks Workflows,
Lakebase, and all compute types (job, all-purpose, serverless SQL, serverless compute).
- Databricks REST API & SDK - programmatic management of clusters, jobs, permissions, and
workspace configuration.
- Performance tuning - Spark UI interpretation, physical plans, shuffle/skew/spill diagnosis,
join optimization, caching strategies, and Photon adoption decisions.
- Cost optimization - DBU forecasting, cluster sizing, autoscaling policies, spot vs. on-demand
trade-offs, instance pools, job vs. all-purpose compute decisions, predictive optimization, and
serverless economics (Performance vs. Standard mode, serverless GPU, egress, DBU trade-offs).
- Advanced Python & expert SQL; deep PySpark and Spark SQL internals.
- Cloud platforms (AWS/Azure/GCP) - IAM, networking, storage (S3/ADLS/GCS), and cloud-native services underpinning Databricks.
- Experience with Docker, Kubernetes, Terraform, and modern CI/CD pipelines.
- Strong fundamentals in data structures, algorithms, distributed systems, and large-scale
system design.
- Experience with MLflow, Mosaic AI ecosystem (Agent Framework, Agent Bricks, AI Gateway,
Vector Search), feature stores, Databricks SQL Warehouses, or Databricks Asset Bundles.
- FinOps practices and cost-attribution models for data platforms.
- Observability tools - Prometheus, Grafana, OpenTelemetry, Datadog.
- Contributions to open-source Spark/Delta/Databricks projects.