3-5 years of experience
MNC Clients
Remote / Hybrid working
Job Description
- We are looking for a Data Engineer with hands-on knowledge of Spark/PySpark who will work on collecting, storing, processing, and analyzing large data sets.
- The primary focus will be on choosing optimal solutions for these purposes, then implementing, maintaining, and monitoring them.
- You will also be responsible for integrating them with the architecture used across the company.
Responsibilities
- Selecting and working with distributed computing technologies such as Spark, Hadoop, and Hive
- Writing PySpark scripts and functions to read, transform, and analyze large amounts of data
- Implementing ETL processes (where importing data from existing data sources is required)
- Defining data retention policies
- Writing SQL queries to fetch data from multiple sources
- Writing and scheduling pipelines using schedulers such as Airflow and Prefect
- Proactively researching, asking questions, and suggesting solutions at every step of the project
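As a rough illustration of the ETL and multi-source SQL responsibilities above, here is a minimal sketch of the extract-transform-load pattern. It uses Python's built-in `sqlite3` for self-containment; in this role the same pattern would be written in PySpark against production data sources, and all table names and data below are hypothetical.

```python
import sqlite3

# Minimal ETL sketch: extract rows from two hypothetical sources,
# join them with SQL, and load the result into a target table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract: two example "sources" (hypothetical schemas)
cur.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 25.0), (2, 10, 40.0), (3, 11, 15.0)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "EU"), (11, "US")])

# Transform: join the two sources and aggregate revenue per region
cur.execute("""
    CREATE TABLE region_totals AS
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")

# Load/verify: read the target table back as a dict
rows = dict(cur.execute("SELECT region, total FROM region_totals"))
print(rows)
conn.close()
```

The same join-and-aggregate step maps directly onto a PySpark DataFrame `join` followed by `groupBy().agg()`, which is the scale-out version of this transform.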
Skills and Qualifications
- Proficient understanding of distributed computing principles
- Proficiency with Spark and PySpark
- Proficiency with data warehouses such as BigQuery and Snowflake
- Proficiency with Hadoop, MapReduce, HDFS, and Hive
- Experience building stream-processing systems using solutions such as Storm or Spark Streaming
- Experience with integration of data from multiple data sources