**Must-Have**
- Strong proficiency in Python programming.
- Hands-on experience with PySpark and Apache Spark.
- Knowledge of Big Data technologies (Hadoop, Hive, Kafka, etc.).
- Experience with SQL and relational/non-relational databases.
- Familiarity with distributed computing and parallel processing.
- Understanding of data engineering best practices.
- Experience with REST APIs, JSON/XML, and data serialization.
- Exposure to cloud computing environments.
- 5+ years of experience in Python and PySpark development.
- Experience with data warehousing and data lakes.
- Knowledge of machine learning libraries (e.g., MLlib) is a plus.
- Strong problem-solving and debugging skills.
- Excellent communication and collaboration abilities.
- Develop and maintain scalable data pipelines using Python and PySpark.
- Design and implement ETL (Extract, Transform, Load) processes.
- Optimize and troubleshoot existing PySpark applications for performance.
- Collaborate with cross-functional teams to understand data requirements.
- Write clean, efficient, and well-documented code.
- Conduct code reviews and participate in design discussions.
- Ensure data integrity and quality across the data lifecycle.
- Integrate with cloud platforms like AWS, Azure, or GCP.
- Implement data storage solutions and manage large-scale datasets.