Company: Apna
Team: Data Platform / Engineering
Location: Bangalore
Experience: 5-7 years
Why Join Apna
At Apna, data is central to how we build products, understand users, improve employer outcomes, power recommendations, and scale decision-making. This role gives you the opportunity to build the backbone of Apna’s data platform and influence how data is used across the company.
You will work on real-world, high-scale problems across jobs, users, employers, communities, matching, growth, and AI-driven systems.
About the Role
Apna is looking for a Lead / Staff Data Engineer to build and scale our core data platform. In this role, you will work on large-scale data pipelines, lakehouse architecture, query platforms, workflow orchestration, and data reliability systems that power analytics, product intelligence, machine learning, business dashboards, experimentation, and operational decision-making across Apna.
We are looking for someone who can think deeply about data architecture, design reliable pipelines, improve data quality, and help build a platform that can scale with Apna’s growth.
What You’ll Own
You will be responsible for designing, building, and operating critical parts of Apna’s data platform, including:
- Building scalable batch and near-real-time data pipelines across product, business, growth, and ML use cases.
- Designing and improving our lakehouse architecture using technologies like Apache Hudi.
- Working with query engines such as Presto / Trino for large-scale analytical workloads.
- Building and maintaining orchestration workflows using Apache Airflow.
- Creating reusable data models, curated datasets, and reliable data marts for analytics and product teams.
- Improving data platform reliability, observability, SLA tracking, lineage, and data quality checks.
- Optimizing storage, compute, query performance, and pipeline costs.
- Partnering with product, analytics, ML, and backend engineering teams to understand data needs and convert them into scalable platform solutions.
- Driving engineering standards around data modeling, schema evolution, partitioning, deduplication, backfills, replayability, and pipeline ownership.
- Mentoring data engineers and influencing architecture decisions across teams.
What We’re Looking For
Must Have
- Strong experience in data engineering, preferably at scale.
- Hands-on experience with Apache Airflow or similar orchestration systems.
- Strong knowledge of Presto / Trino or other distributed query engines.
- Good understanding of Apache Hudi concepts such as:
  - Copy-on-write vs merge-on-read
  - Upserts and deletes
  - Incremental reads
  - Compaction
  - Clustering
  - Timeline and commits
  - Schema evolution
  - Partitioning strategy
- Strong knowledge of distributed data processing and storage systems.
- Ability to design and build reliable ETL / ELT pipelines.
- Strong SQL skills and ability to debug complex data issues.
- Good understanding of different data architectures, including:
  - Data warehouse
  - Data lake
  - Lakehouse
  - Lambda architecture
  - Kappa architecture
  - Medallion architecture
  - Event-driven data architecture
- Experience with data modeling for analytics and reporting.
- Strong programming skills in at least one language such as Python, Java, or Scala.
- Ability to reason about trade-offs between freshness, cost, reliability, latency, and complexity.
- Strong debugging and production ownership mindset.
Good to Have
- Experience with Kafka, Spark, Flink, Hive, Iceberg, Delta Lake, or BigQuery.
- Experience building internal data platforms or self-serve data infrastructure.
- Experience with data quality frameworks such as Great Expectations, Deequ, Soda, or custom validation systems.
- Exposure to ML feature pipelines or feature stores.
- Experience with metadata management, data catalogs, lineage, and governance.
- Experience with cloud infrastructure such as AWS, GCP, or Azure.
- Understanding of privacy, compliance, PII handling, and access control in data systems.
What Success Looks Like
In this role, success means:
- Critical business and product datasets are reliable, discoverable, and trusted.
- Pipelines are observable, recoverable, and have clear SLAs.
- Query performance improves across major analytical workloads.
- Data freshness and quality issues decrease significantly.
- Teams can build on top of the data platform faster without reinventing pipelines.
- The platform can scale with Apna’s user, job, employer, and engagement data.