Architect - Data Engineering
Job Overview: We are seeking a highly skilled Data Modeler with a strong background in big data technologies, particularly PySpark, and extensive experience in ETL processes. The ideal candidate will be responsible for designing, implementing, and maintaining robust data models that support our business needs and enhance our data analytics capabilities.
Key Responsibilities:
- Design, develop, and optimize conceptual, logical, and physical data models.
- Create and maintain data models to ensure data integrity and performance.
- Collaborate with business stakeholders to understand data requirements and translate them into data models.
- Utilize PySpark for large-scale data processing and transformation.
- Implement and manage big data solutions on platforms such as Hadoop, Spark, and Hive.
- Optimize and troubleshoot big data processing pipelines.
- Develop, implement, and maintain ETL processes to ingest data from various sources.
- Ensure ETL processes are efficient, scalable, and reliable.
- Monitor ETL processes to ensure data quality and consistency.
Collaboration and Communication:
- Work closely with data engineers, data scientists, and other stakeholders to ensure seamless data integration and utilization.
- Document data models, ETL processes, and data pipelines for future reference and knowledge sharing.
- Provide support and training to team members on data modeling and big data best practices.
Performance Tuning and Optimization:
- Identify and implement opportunities for performance improvements in data models and ETL processes.
- Monitor system performance and troubleshoot issues related to data processing and storage.
Qualifications:
- Education: Bachelor of Engineering degree in Computer Science, Information Technology, Data Science, or a related field.
- 3+ years of experience in data modeling and database design.
- 3+ years of experience with big data technologies, including PySpark, Hadoop, Spark, and Hive.
- Proven experience with ETL tools and processes.
- Proficiency in SQL and database management systems (e.g., MySQL, PostgreSQL, Oracle).
- Strong programming skills in Python, particularly with PySpark.
- Experience with data warehousing solutions (e.g., Redshift, Snowflake).
- Familiarity with cloud platforms (e.g., AWS, Azure, GCP) is a plus.
- Experience with continuous integration and automated testing tools such as PyTest, Jenkins, Artifactory, Git, Selenium, and Chef is desirable.
- Experience in graph processing technologies and graph databases such as GraphX and Neo4j is a plus.
- Strong analytical and problem-solving skills.
- Excellent communication and collaboration abilities.
- Detail-oriented with a commitment to data quality.
- 6 or more years of work experience with a bachelor’s degree, or 4 or more years of relevant experience with an advanced degree (e.g., Master’s, MBA, JD, MD), or up to 3 years of relevant experience with a PhD.
- Proven knowledge of successful design and development of data pipelines.
- Experience in creating data-driven business solutions and solving data problems using a wide variety of technologies such as Hadoop, Hive, Spark, MongoDB, and other NoSQL stores, as well as traditional data technologies (RDBMS).
- Experience developing large-scale, enterprise-class distributed pipelines that require high availability, low latency, and strong data consistency.
- Ability to program in one or more scripting languages such as Python and one or more programming languages such as Java or Scala.
- Design and development skills with big data technologies like Hadoop, Spark, Hive, Presto, and MapReduce.
- Experience in implementing AI and ML methods is preferred, specifically techniques used in identity verification, fraud detection, or risk prediction scenarios such as Identity Graph, Decision Trees, Random Forests, Logistic Regression, Neural Networks, SVM, or Anomaly Detection algorithms.