3-5 years of experience
MNC Clients
Remote / Hybrid working
Job Description
- We are looking for a Data Engineer with hands-on knowledge of Spark/PySpark who will work on collecting, storing, processing, and analyzing large data sets.
- The primary focus will be on choosing optimal solutions for these purposes, then implementing, maintaining, and monitoring them.
- You will also be responsible for integrating them with the architecture used across the company.
Responsibilities
- Selecting and working with distributed computing technologies such as Spark, Hadoop, and Hive
- Writing PySpark scripts and functions to read, transform, and analyze large amounts of data
- Implementing ETL processes (where importing data from existing data sources is required)
- Defining data retention policies
- Writing SQL queries to fetch data from multiple sources
- Writing and scheduling pipelines using schedulers such as Airflow and Prefect
- Proactively researching, asking questions, and suggesting solutions at every step of the project
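As a rough illustration of the ETL and multi-source SQL responsibilities above, here is a minimal sketch of the extract-transform-load pattern. It uses Python's built-in `sqlite3` for self-containment; in this role the same pattern would be written in PySpark against production data sources, and all table names and data below are hypothetical.

```python
import sqlite3

# Minimal ETL sketch: extract rows from two hypothetical sources,
# join them with SQL, and load the result into a target table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract: two example "sources" (hypothetical schemas)
cur.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 25.0), (2, 10, 40.0), (3, 11, 15.0)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "EU"), (11, "US")])

# Transform: join the two sources and aggregate revenue per region
cur.execute("""
    CREATE TABLE region_totals AS
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")

# Load/verify: read the target table back as a dict
rows = dict(cur.execute("SELECT region, total FROM region_totals"))
print(rows)
conn.close()
```

The same join-and-aggregate step maps directly onto a PySpark DataFrame `join` followed by `groupBy().agg()`, which is the scale-out version of this transform.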
Skills and Qualifications
- Proficient understanding of distributed computing principles
- Proficiency with Spark and PySpark
- Proficiency with data warehouses such as BigQuery and Snowflake
- Proficiency with Hadoop, MapReduce, HDFS, and Hive
- Experience building stream-processing systems using solutions such as Storm or Spark Streaming
- Experience with integration of data from multiple data sources