Work Schedule: Other
Environmental Conditions: Office
Job Description
Summarized Purpose:
We are seeking a Lead Data Engineer to own the complete lifecycle of enterprise data pipelines, from development through production. The role covers roadmap planning, scalable ETL architecture, AWS data services, secure PHI/PII handling, healthcare data standards, AI-assisted mapping automation, data quality, transformation, catalog standards, and RAG-enabled data solutions.
Education/Experience:
- Bachelor's degree or equivalent in Computer Science, Information Technology, Data Engineering, or a related field
- 7+ years of experience in data engineering, ETL development, cloud data platforms, healthcare or regulated data environments, and production data pipeline delivery
Major Job Responsibilities:
- Design, develop, deploy, and operate scalable ETL and data pipelines using PySpark, Python, advanced SQL, and AWS data services
- Own the data pipeline lifecycle, from requirements and mapping through development, testing, deployment, monitoring, production support, and release management, to future roadmap planning
- Build ingestion and transformation pipelines for flat files, relational databases, APIs, data warehouses, healthcare data sources, and enterprise data platforms
- Implement mapping automation, preferably AI-assisted, along with LLM-assisted data cleaning, transformation, data quality checks, and RAG use cases
- Implement secure handling of PHI/PII data, including encryption, access controls, auditability, retention, masking, de-identification, governance, and operational readiness
Knowledge, Skills, and Abilities:
- Advanced expertise in PySpark, Python, SQL, ETL best practices, data modeling, and large-scale data processing
- Strong hands-on experience with AWS services including S3, Glue, Lambda, Step Functions, ECS, DynamoDB, Redshift, RDS/PostgreSQL, and related data services
- Experience with PostgreSQL, SQL Server, Redshift, flat files, complex source-to-target mappings, HL7, claims data, EMR extracts, and clinical trial data
- Knowledge of data cataloging, metadata management, transformation standards, orchestration, monitoring, data quality, CI/CD, automated testing, and production support practices
- Ability to lead technical design, mentor engineers, guide delivery decisions, troubleshoot complex issues, and communicate with cross-functional teams
Must Have Skills:
- Advanced expertise in PySpark, Python, SQL, ETL design, and data pipeline engineering
- Experience with AWS data services including S3, Glue, Lambda, Step Functions, ECS, DynamoDB, and Redshift, plus PostgreSQL and SQL Server integration
- Experience with secure PHI/PII handling, flat-file ingestion, source-to-target mapping, transformation, data catalogs, governance, and healthcare data standards
- Experience with CI/CD, GitHub workflows, automated testing, release management for data pipelines and database changes, and dev-to-prod pipeline ownership
Good to Have Skills:
- AI-assisted mapping automation and use of LLMs for data cleaning, data quality checks, transformation logic, documentation, and patient de-identification support
- Experience with RAG patterns, embeddings, vector databases, semantic search, or AI-enabled data discovery solutions
- Familiarity with infrastructure as code such as Terraform or CloudFormation, plus streaming, Databricks, Snowflake, observability, and DevOps practices
Working Hours:
- India: 05:30 PM to 02:30 AM IST
- Philippines: 08:00 PM to 05:00 AM PHT