Key Responsibilities
- Design and operate an AWS-native data lakehouse: Amazon S3 + Lake Formation (governance),
Glue/Athena (ELT), and optional Amazon Redshift for warehousing.
- Build high-throughput ingestion and CDC pipelines from partner APIs, files, and databases
using EventBridge, SQS/SNS, Kinesis/MSK, AWS DMS, and Lambda/ECS Fargate.
- Implement idempotent upserts, deduplication, and delta detection; define source-of-truth
governance and survivorship rules across authorities/insurers/partners.
- Model healthcare provider data (DDD) and normalize structured/semi-structured payloads
(JSON/CSV/XML, FHIR/HL7 if present) into curated zones.
- Engineer vector-aware datasets for clinician/patient matching; operate pgvector on Amazon
Aurora PostgreSQL or use OpenSearch k-NN for hybrid search.
- Establish data quality (freshness, accuracy, coverage, cost-per-item) with automated checks
(e.g., Great Expectations/Deequ) and publish KPIs/dashboards.
- Harden security & privacy: IAM least-privilege, KMS encryption, Secrets Manager, VPC
endpoints, audit logs, pseudonymised telemetry; enforce GDPR and right-to-erasure.
- Build observability-first pipelines using OpenTelemetry (ADOT), CloudWatch, and X-Ray; own DLQ
handling, replay tooling, resiliency/chaos tests, SLOs, and runbooks.
- Tune performance of Aurora PostgreSQL (incl. indexing, partitioning, vacuum/analyze)
and cost-aware Spark (EMR/Glue) jobs.
- Own CI/CD for data (Terraform/CDK, GitHub Actions/CodeBuild/CodePipeline), test automation
(pytest/dbt), and blue/green or canary releases for critical jobs.
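
To give a flavour of the idempotent upsert, delta detection, and survivorship work listed above, here is a minimal in-memory sketch. Everything in it is an illustrative assumption rather than a description of the production system: the source ranking in SOURCE_PRIORITY, the ProviderRecord shape, and the upsert helper are hypothetical, and a real pipeline would apply the same logic against a database or lakehouse table rather than a dict.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical survivorship ranking: a lower number wins on conflict.
SOURCE_PRIORITY = {"authority": 0, "insurer": 1, "partner": 2}


@dataclass(frozen=True)
class ProviderRecord:
    provider_id: str
    source: str
    payload: dict


def content_hash(payload: dict) -> str:
    """Hash a canonical JSON form so that key order does not affect the result."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def upsert(store: dict, record: ProviderRecord) -> str:
    """Apply a record to the store.

    Returns 'inserted', 'updated', 'unchanged', or 'rejected'.
    Replaying the same record leaves the store untouched (idempotent).
    """
    new_hash = content_hash(record.payload)
    current = store.get(record.provider_id)

    if current is None:
        store[record.provider_id] = {
            "source": record.source,
            "hash": new_hash,
            "payload": record.payload,
        }
        return "inserted"

    # Survivorship: a lower-ranked source may not overwrite a higher-ranked one.
    if SOURCE_PRIORITY[record.source] > SOURCE_PRIORITY[current["source"]]:
        return "rejected"

    # Delta detection: same source and same content means nothing to write.
    if current["source"] == record.source and current["hash"] == new_hash:
        return "unchanged"

    store[record.provider_id] = {
        "source": record.source,
        "hash": new_hash,
        "payload": record.payload,
    }
    return "updated"
```

In this sketch a replayed message is a no-op, and a partner feed cannot overwrite a value that an authority has already supplied; the actual ranking and tie-breaking rules would be defined as part of the governance work described in this role.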