
Weather ETL Pipeline Using Spark and Cloud Storage
My project extracts weather data from the OpenWeatherMap API. I then process it with PySpark and upload the results to three cloud storage services: AWS S3, Azure Blob Storage, and Google Cloud Storage. For automation, I schedule the pipeline with Apache Airflow on my personal computer.
Pipeline Architecture Design

I schedule my code to run every 6 hours on my personal computer using Apache Airflow. The pipeline uploads the final results to three cloud storage services: AWS S3, Azure Blob Storage, and Google Cloud Storage.
Link: ETL Project
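Below is a minimal sketch of how such an Airflow DAG could be wired together, assuming the extract, transform, and upload steps live in a module of their own. The module and function names (weather_pipeline, extract_weather, transform_weather, upload_to_clouds) are placeholders, not the project's real identifiers.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module with the three pipeline steps (placeholder names)
from weather_pipeline import extract_weather, transform_weather, upload_to_clouds

with DAG(
    dag_id="weather_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(hours=6),  # run every 6 hours
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_weather)
    transform = PythonOperator(task_id="transform", python_callable=transform_weather)
    upload = PythonOperator(task_id="upload", python_callable=upload_to_clouds)

    # Linear dependency: extract -> transform -> upload
    extract >> transform >> upload
```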
Extract Data From Weather API

I request weather data from the OpenWeatherMap API for 45 cities, then save the resulting tables into a local folder.
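A minimal sketch of the extraction step, assuming the API key is stored in an environment variable and each city's raw response is written as its own JSON file. The city list, folder path, and variable names are illustrative, not the project's actual values.

```python
import json
import os

import requests

API_KEY = os.environ["OPENWEATHER_API_KEY"]  # assumed to be set in the environment
CITIES = ["London", "Tokyo", "Reykjavik"]    # the real project queries 45 cities
OUTPUT_DIR = "data/raw"                      # assumed local folder for raw tables

def extract_weather():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for city in CITIES:
        # Current weather endpoint of the OpenWeatherMap API
        resp = requests.get(
            "https://api.openweathermap.org/data/2.5/weather",
            params={"q": city, "appid": API_KEY, "units": "metric"},
            timeout=10,
        )
        resp.raise_for_status()
        # One JSON file per city so Spark can read the whole folder later
        with open(os.path.join(OUTPUT_DIR, f"{city}.json"), "w") as f:
            json.dump(resp.json(), f)

if __name__ == "__main__":
    extract_weather()
```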
Transform Data

I use PySpark to join the two tables and extract the top ten cities with the longest daytime hours.
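A rough sketch of what this transformation could look like in PySpark, assuming one table of raw API responses and a hypothetical city metadata table joined on the city name. The column names, file paths, and the daytime calculation (sunset minus sunrise from the API's Unix timestamps) are assumptions, not the project's real schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weather_transform").getOrCreate()

weather = spark.read.json("data/raw")                       # raw API responses
metadata = spark.read.csv("data/cities.csv", header=True)   # hypothetical lookup table

top_daytime = (
    weather
    .select(
        F.col("name").alias("city"),
        # sunrise/sunset are Unix timestamps in the OpenWeatherMap response
        ((F.col("sys.sunset") - F.col("sys.sunrise")) / 3600).alias("daytime_hours"),
    )
    .join(metadata, on="city", how="inner")
    .orderBy(F.col("daytime_hours").desc())
    .limit(10)  # top ten cities with the longest daytime
)

top_daytime.write.mode("overwrite").parquet("data/output/top10_daytime")
```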
Upload to Cloud Storage

After the transformation step, I upload the final results to all three cloud storage services.
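A hedged sketch of the upload step using the official client libraries for each provider (boto3, azure-storage-blob, google-cloud-storage). The bucket and container names, the local file path, and the credential setup are placeholders; it assumes the transformed result has been exported as a single local file.

```python
import os

import boto3
from azure.storage.blob import BlobServiceClient
from google.cloud import storage

LOCAL_FILE = "data/output/top10_daytime.csv"  # assumed final artifact
REMOTE_NAME = "top10_daytime.csv"

def upload_to_clouds():
    # AWS S3 (credentials resolved by the default boto3 credential chain)
    boto3.client("s3").upload_file(LOCAL_FILE, "my-weather-bucket", REMOTE_NAME)

    # Azure Blob Storage (connection string assumed to be in an env var)
    blob_service = BlobServiceClient.from_connection_string(os.environ["AZURE_CONN_STR"])
    with open(LOCAL_FILE, "rb") as data:
        blob_service.get_blob_client(container="weather", blob=REMOTE_NAME).upload_blob(
            data, overwrite=True
        )

    # Google Cloud Storage (uses Application Default Credentials)
    storage.Client().bucket("my-weather-bucket").blob(REMOTE_NAME).upload_from_filename(
        LOCAL_FILE
    )
```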