Weather ETL Pipeline Using Spark and Cloud Storage

My project extracts weather data from the OpenWeatherMap API, processes it with PySpark, and uploads the results to three cloud storage services: AWS S3, Azure Blob Storage, and Google Cloud Storage. The pipeline runs every six hours on my personal computer, scheduled with Apache Airflow.

Pipeline Architecture Design

[Image: ETL pipeline architecture design]

I schedule the pipeline to run every six hours on my personal computer using Apache Airflow. It uploads the final results to three cloud storage services: AWS S3, Azure Blob Storage, and Google Cloud Storage.

Link: ETL Project
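
To illustrate the scheduling, here is a minimal sketch of what the Airflow DAG could look like. The task names and callables (extract_weather, transform_weather, upload_to_clouds) are hypothetical placeholders for the steps described below; the actual code is in the linked project.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_weather():
    """Placeholder: request weather data for the 45 cities and save it locally."""


def transform_weather():
    """Placeholder: run the PySpark job that joins the tables and ranks cities."""


def upload_to_clouds():
    """Placeholder: push the results to S3, Azure Blob Storage, and GCS."""


with DAG(
    dag_id="weather_etl",                      # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(hours=6),      # run every six hours
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_weather)
    transform = PythonOperator(task_id="transform", python_callable=transform_weather)
    upload = PythonOperator(task_id="upload", python_callable=upload_to_clouds)

    # Run the three steps in order: extract -> transform -> upload
    extract >> transform >> upload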

Extract Data From Weather API

[Image: extracted weather data tables]

I request weather data for 45 cities from the OpenWeatherMap API and save the resulting tables to a local folder.

Python File Link
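
As a rough sketch of this step (the real implementation is in the Python file linked above), the requests could look like the following, assuming an OPENWEATHER_API_KEY environment variable and a shortened, illustrative city list.

import json
import os
from pathlib import Path

import requests

API_KEY = os.environ["OPENWEATHER_API_KEY"]   # assumed to be set beforehand
CITIES = ["London", "Tokyo", "Nairobi"]       # the real pipeline uses 45 cities
OUTPUT_DIR = Path("raw_weather")
OUTPUT_DIR.mkdir(exist_ok=True)

for city in CITIES:
    # Current-weather endpoint of the OpenWeatherMap API
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"q": city, "appid": API_KEY, "units": "metric"},
        timeout=10,
    )
    resp.raise_for_status()
    # Save one raw JSON file per city for the Spark job to read later
    (OUTPUT_DIR / f"{city}.json").write_text(json.dumps(resp.json()))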

Transform Data

I use PySpark to transform the data: I join two tables and select the ten cities with the longest daytime hours.
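
A minimal sketch of this transformation, assuming two hypothetical input tables (a cities table and a sun-times table with Unix sunrise and sunset timestamps, as returned by OpenWeatherMap), could look like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weather_transform").getOrCreate()

# Hypothetical inputs: city details and sunrise/sunset times in Unix seconds
cities = spark.read.json("raw_weather/cities/*.json")
sun_times = spark.read.json("raw_weather/sun_times/*.json")

top_daytime = (
    cities.join(sun_times, on="city_id")
    # Daytime length in hours computed from the Unix timestamps
    .withColumn("daytime_hours", (F.col("sunset") - F.col("sunrise")) / 3600.0)
    .orderBy(F.col("daytime_hours").desc())
    .limit(10)
)

top_daytime.write.mode("overwrite").parquet("output/top_daytime_cities")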

Upload to Cloud Storage

After the transformation, I upload the results to all three cloud storage services.

Python File Link
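
As a rough sketch (the actual upload code is in the Python file linked above), the three uploads could look like the following, assuming credentials are already configured for each provider and using hypothetical bucket, container, and file names.

import os
from pathlib import Path

import boto3
from azure.storage.blob import BlobServiceClient
from google.cloud import storage

LOCAL_FILE = Path("output/top_daytime_cities.csv")  # hypothetical result file
BLOB_NAME = "weather/top_daytime_cities.csv"

# AWS S3 (credentials from the AWS CLI configuration or environment)
s3 = boto3.client("s3")
s3.upload_file(str(LOCAL_FILE), "my-weather-bucket", BLOB_NAME)

# Azure Blob Storage (connection string taken from an environment variable)
azure_client = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = azure_client.get_container_client("weather-results")
with LOCAL_FILE.open("rb") as f:
    container.upload_blob(name=BLOB_NAME, data=f, overwrite=True)

# Google Cloud Storage (uses application default credentials)
gcs = storage.Client()
gcs.bucket("my-weather-bucket-gcs").blob(BLOB_NAME).upload_from_filename(str(LOCAL_FILE))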
