Experience: 4-7 years
Location: Pune (India)
Primary Skills: Apache Spark (Java / Python / Scala), Apache Flink, Hive, Impala
What You’ll Do
- Design, build and optimize distributed data processing systems on CDP.
- Architect batch and stream data pipelines using Apache Spark.
- Build streaming pipelines leveraging Flink, Hive and modern table formats like Iceberg.
- Develop high-performance data pipelines using Spark (Java/Python/Scala) on YARN-based clusters.
- Ensure data quality, reliability, and performance tuning across large-scale distributed systems.
- Develop and maintain ETL/ELT workflows orchestrated via Airflow.
Data Quality & Reliability
- Define and enforce data quality checks, lineage tracking, and SLA monitoring across pipelines.
- Implement unit, integration, and end-to-end testing strategies for data pipelines.
- Troubleshoot performance bottlenecks in Spark jobs, Flink topologies, and Hive queries – applying techniques such as partition pruning, broadcast joins, and predicate pushdown.
Collaboration & Governance
- Partner with data architects, data scientists, and platform engineers to translate business requirements into robust data solutions.
- Participate in design reviews, technical documentation, and knowledge sharing within the team.
- Contribute to establishing engineering standards, coding guidelines, and best practices for the data engineering discipline.
- Provide technical leadership across teams, unblock complex projects, and mentor junior engineers.
- Translate product intent into technical plans, influence roadmaps with data-driven insights, and communicate trade-offs to executives and stakeholders.
Tech Stack
Frameworks: Apache Spark (Java / Python / Scala), Apache Flink
Query Engines: Hive, Impala
Storage & Formats: Apache Iceberg
Orchestration: Apache Airflow
Infrastructure: YARN-based clusters, CDP
What We’re Looking For
- 4-6 years of proven experience building distributed data systems with Apache Spark at scale.
- Strong proficiency in Python / Java / Scala for data engineering.
- Hands-on experience with streaming frameworks (Flink) and batch orchestration (Airflow).
- Deep understanding of data quality practices, SLA monitoring, and pipeline observability.
- Experience with modern table formats (Apache Iceberg preferred).
- Strong communication skills – ability to present trade-offs clearly to technical and non-technical stakeholders.