Job Title – Lead Data Engineer (Databricks)
Location – Gurugram/Chennai, India (Onsite, 5 days/week)
Employment Type – Full-time
Role Overview
We are seeking a highly experienced and motivated Principal / Lead Data Engineer to join our data platform team in either our Gurugram or Chennai office. The successful candidate will play a critical role in designing, building, and optimizing our next-generation data architecture and pipelines.
This role requires expert-level proficiency in PySpark and SQL, alongside a proven track record of architecting scalable, high-performance ETL/ELT processes. You will transform vast amounts of raw data into high-quality, actionable insights for analytics, reporting, and machine learning. Given the seniority of this role, we are looking for a seasoned leader who can join immediately and contribute significantly from day one.
Key Responsibilities
Data Pipeline Development & Optimization
- Design and Build: Architect, develop, and maintain robust, scalable, and fault-tolerant ETL/ELT pipelines for ingesting data from diverse sources (e.g., databases, APIs, streaming sources) into our data lake and data warehouse.
- PySpark Expertise: Write and optimize complex data transformation jobs using PySpark and the Spark DataFrame API to process petabytes of structured and unstructured data efficiently.
- SQL Mastery: Use advanced SQL for complex querying, data manipulation, stored procedures, performance tuning, and schema design optimization in relational and analytical databases.
- Data Quality & Governance: Implement data validation, cleansing, and monitoring routines to ensure high data quality, integrity, and adherence to security and governance standards.
Architecture and Infrastructure
- Data Modeling: Design and implement optimal data models (e.g., Dimensional Modeling, Data Vault, Snowflake Schema) for our data warehouse to support business intelligence and analytical needs.
- Cloud Integration: Build and deploy cloud-native data solutions, primarily on Azure Databricks, Azure Data Lake, and Azure Synapse (or comparable platforms such as AWS S3/Redshift and Google BigQuery).
- Automation: Implement orchestration tools like Apache Airflow, Azure Data Factory, or AWS Step Functions to automate data workflows and manage pipeline dependencies.
Collaboration and Operational Excellence
- Cross-Functional Leadership: Collaborate closely with Data Scientists, Data Analysts, Product Managers, and Business Stakeholders to understand data requirements and translate them into high-level technical specifications.
- Monitoring & Support: Monitor, troubleshoot, and resolve critical issues in production data pipelines, ensuring maximum uptime and timely data delivery.
- Best Practices & Mentorship: Lead code reviews, enforce strict coding standards, mentor junior engineers, and contribute to the continuous improvement of development and deployment practices (CI/CD, Git).
Required Technical Skills (Mandatory)
- Certification: Active Databricks certification (e.g., Databricks Certified Data Engineer Associate/Professional).
- PySpark: Expert-level, hands-on experience in developing, tuning, and optimizing large-scale data processing applications using PySpark (the Python API for Apache Spark).
- SQL: Mastery of advanced SQL (including window functions, complex joins, stored procedures, and query performance tuning) across various database systems (e.g., Snowflake, Redshift, PostgreSQL).
- Programming: Strong proficiency in Python for scripting, automation, and data manipulation with libraries such as pandas.
- Big Data Architecture: Deep understanding of Big Data concepts, distributed systems architecture, data lakes, and modern data warehousing principles.
- ETL/ELT: Proven experience in designing and implementing enterprise-grade ETL/ELT pipelines.
Preferred Qualifications (Good to Have)
- Hands-on experience with the wider Azure ecosystem (e.g., Azure Data Factory, Azure Synapse, Azure Key Vault).
- Familiarity with workflow orchestration tools like Apache Airflow.
- Experience with real-time/streaming data processing (e.g., Spark Structured Streaming, Kafka, or Event Hubs).
- Advanced knowledge of Data Governance, Data Cataloging, and Data Security best practices.
Pay: ₹1,200,093.57 - ₹1,956,668.30 per year
License/Certification:
- Databricks Certified Data Engineer Associate/Professional (Required)
Work Location: In person