The dataset, from the Global Weather Repository, provides comprehensive meteorological data collected from locations worldwide. It covers weather parameters including temperature, precipitation, and humidity, enabling detailed analysis of global climate patterns and trends. This dataset is valuable for understanding weather-related phenomena and their implications for environmental and societal systems.
Before any cleaning, the dataset contains 46967 rows.
The following columns are relevant to the analysis:
Column Name | Description |
---|---|
country | The name of the country where the weather data was recorded. |
latitude | The latitude of the location in decimal degrees. |
longitude | The longitude of the location in decimal degrees. |
last_updated | The last update time in a human-readable format. |
temperature_celsius | The temperature at the location in degrees Celsius. |
wind_kph | The wind speed in kilometers per hour. |
pressure_mb | The atmospheric pressure at the location in millibars. |
precip_mm | The amount of precipitation in millimeters. |
humidity | The percentage of atmospheric moisture at the location. |
cloud | The percentage of cloud cover at the location. |
uv_index | The UV index indicates the strength of ultraviolet radiation. |
gust_kph | The speed of wind gusts in kilometers per hour. |
air_quality_Carbon_Monoxide | The concentration of carbon monoxide in the air. |
air_quality_Ozone | The concentration of ozone in the air. |
air_quality_Nitrogen_dioxide | The concentration of nitrogen dioxide in the air. |
air_quality_Sulphur_dioxide | The concentration of sulfur dioxide in the air. |
air_quality_PM2.5 | The concentration of particulate matter smaller than 2.5 micrometers. |
air_quality_PM10 | The concentration of particulate matter smaller than 10 micrometers. |
air_quality_us-epa-index | The air quality index as per the US EPA standards. |
air_quality_gb-defra-index | The air quality index as per the UK DEFRA standards. |
The `last_updated` column was converted into a DateTime format for consistency and easier time-based analysis.
Cleaned Dataset:
Country | Latitude | Longitude | Last Updated | Temperature (°C) | Wind (kph) | Wind Degree | Pressure (mb) | Precip (mm) | Humidity | Cloud | Visibility (km) | UV Index | Gust (kph) | air_quality_Carbon_Monoxide | air_quality_Ozone | air_quality_Nitrogen_dioxide | air_quality_Sulphur_dioxide | air_quality_PM2.5 | air_quality_PM10 | air_quality_us-epa-index | air_quality_gb-defra-index |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan | 34.52 | 69.18 | 2024-05-16 13:15:00 | 26.6 | 13.3 | 338 | 1012 | 0 | 24 | 30 | 7 | 15.3 | 277 | 103 | 1.1 | 0.2 | 8.4 | 26.6 | 1 | 1 | 1 |
Albania | 41.33 | 19.82 | 2024-05-16 10:45:00 | 19.0 | 11.2 | 320 | 1012 | 0.1 | 94 | 75 | 5 | 18.4 | 193.6 | 97.3 | 0.9 | 0.1 | 1.1 | 2.0 | 1 | 1 | 1 |
Algeria | 36.76 | 3.05 | 2024-05-16 09:45:00 | 23.0 | 15.1 | 280 | 1011 | 0 | 29 | 0 | 5 | 22.3 | 540.7 | 12.2 | 65.1 | 13.4 | 10.4 | 18.4 | 1 | 1 | 1 |
Andorra | 42.5 | 1.52 | 2024-05-16 10:45:00 | 6.3 | 11.9 | 215 | 1007 | 0.3 | 61 | 100 | 2 | 13.7 | 170.2 | 64.4 | 1.6 | 0.2 | 0.7 | 0.9 | 1 | 1 | 1 |
Angola | -8.84 | 13.23 | 2024-05-16 09:45:00 | 26.0 | 13.0 | 150 | 1011 | 0 | 89 | 50 | 8 | 20.2 | 2964.0 | 19.0 | 72.7 | 31.5 | 183.4 | 262.3 | 5 | 10 | 1 |
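A minimal sketch of the `last_updated` conversion, assuming the raw column holds text timestamps such as `2024-05-16 13:15` (the raw format is an assumption, and the toy rows below are illustrative rather than the real data):

```python
import pandas as pd

# Toy rows standing in for the real dataset; the raw string format is assumed.
df = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "last_updated": ["2024-05-16 13:15", "2024-05-16 10:45"],
})

# Parse the text timestamps. errors="coerce" turns unparseable entries into NaT
# instead of raising, so they can be inspected or dropped afterwards.
df["last_updated"] = pd.to_datetime(df["last_updated"], errors="coerce")

print(df.dtypes)
```

Once the column has a datetime dtype, it can be used directly for sorting, resampling, and other time-based operations.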
Z-Score Method: Data points with `abs(z) > 3` are flagged as outliers.
IQR Method: Data points lying more than `1.5 * IQR` below the first quartile or above the third quartile are flagged as outliers.
The large difference in the number of outliers flagged by the two methods likely arises because the Z-Score method assumes that the data is normally distributed, while the IQR method makes no such assumption. It is thus important to assess the skewness of the data before choosing between them.
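The two rules can be sketched as follows. This is an illustrative implementation on synthetic right-skewed data, not the report's actual code; the function names and the exponential toy series are assumptions.

```python
import numpy as np
import pandas as pd

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where the absolute z-score exceeds the threshold."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies beyond k * IQR outside Q1 or Q3."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Synthetic right-skewed series (a stand-in for a precipitation-like variable).
rng = np.random.default_rng(0)
toy = pd.Series(rng.exponential(scale=1.0, size=5000))

z_mask = zscore_outliers(toy)
iqr_mask = iqr_outliers(toy)
print("z-score flags:", int(z_mask.sum()), "| IQR flags:", int(iqr_mask.sum()))
```

On skewed data like this, the IQR fences sit closer to the bulk of the distribution than the three-sigma cutoff, so the IQR rule tends to flag more points.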
Temperature Skewness: -0.86
Precipitation Skewness: 19.56
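As a rough illustration of how such skewness values can be obtained, the sketch below uses `pandas.Series.skew` on synthetic data. Whether the report used pandas or another library for this step is not stated, so treat the call as an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A roughly symmetric series and a strongly right-skewed one (both synthetic).
symmetric = pd.Series(rng.normal(loc=20.0, scale=8.0, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=2.0, size=10_000))

# Series.skew() returns the bias-corrected sample skewness:
# near 0 for symmetric data, positive for a long right tail.
sym_skew = symmetric.skew()
right_skew = right_skewed.skew()

print(f"symmetric: {sym_skew:.2f}, right-skewed: {right_skew:.2f}")
```

A skewness far from zero, as reported for precipitation, is a sign that the normality assumption behind the Z-Score rule does not hold.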
To further assess our anomaly detection, we will now use a data-driven method to evaluate our results.
Total number of outliers predicted for precipitation: 2337 out of 46967 data points.
Isolation Forest significantly overestimates outliers for temperature.
Conclusion: Moving forward, for any analysis concerning anomalies, we will use the Z-Score method for temperature and the Isolation Forest for precipitation to ensure reliable and accurate anomaly detection.
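A hedged sketch of Isolation Forest on a single skewed variable, using scikit-learn. The `contamination=0.05` setting is an assumption: the reported count (2337 of 46967, roughly 5%) would be consistent with it, but the report does not state the parameters used. The data below is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic right-skewed values standing in for precip_mm.
rng = np.random.default_rng(0)
precip = rng.exponential(scale=0.5, size=2000)
X = precip.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# contamination is the assumed share of anomalies; 0.05 is a guess, not a
# value taken from the report.
model = IsolationForest(contamination=0.05, random_state=0)
labels = model.fit_predict(X)  # -1 marks anomalies, 1 marks normal points
is_anomaly = labels == -1

print("flagged:", int(is_anomaly.sum()), "of", len(precip))
```

Because `contamination` fixes the share of points flagged, the method will report about that fraction of outliers whatever the shape of the data, which may explain an overestimate on a well-behaved variable such as temperature.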
Note: The extreme variations in `Temperature` and `Precipitation` are depicted in the Spatial Analysis section of the report.
The goal of this analysis is to evaluate: How accurately can time series forecasting models forecast temperature and precipitation, and how robust is the model’s performance in specifically identifying and predicting anomalous patterns in the time series?
This is primarily a time-series forecasting and anomaly detection problem. The goal is twofold:
Forecasting `temperature` and `precipitation` over time, which requires a regression-based approach tailored to time-series data.
Evaluating how well the forecasts hold up on anomalies flagged by statistical (`Z-Score_Temp_Anomalies`) and machine learning-based anomaly detection methods (`IsolationForest_Prec_anomalies`).
The `last_updated` column is normalized and set as the index to facilitate time-series operations. The data is then grouped by index, averaged, resampled to a daily frequency, and sorted by date in ascending order. Missing values are handled through linear interpolation.
The `lag_1`, `lag_2`, and `lag_3` columns represent the values from the prior day, two days ago, and three days ago, respectively.
Index | Variable | Model | MAE | MSE | Anomaly MAE | Anomaly MSE |
---|---|---|---|---|---|---|
0 | Temperature | SARIMA | 2.91 | 13.64 | 7.22 | 58.80 |
1 | Precipitation | SARIMA | 0.29 | 0.27 | 1.61 | 3.54 |
2 | Temperature | Random Forest | 2.04 | 8.24 | 6.51 | 46.18 |
3 | Precipitation | Random Forest | 0.23 | 0.23 | 1.71 | 3.66 |
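The preprocessing and lag features described before this table can be sketched as below. The toy observations are invented for illustration, and the exact chain of pandas calls is an assumption about how the steps were implemented.

```python
import pandas as pd

# Toy irregular observations: several readings per day and one missing day.
raw = pd.DataFrame({
    "last_updated": pd.to_datetime([
        "2024-05-16 09:45", "2024-05-16 13:15",
        "2024-05-17 10:00",
        "2024-05-19 08:30", "2024-05-19 18:00",
        "2024-05-20 12:00", "2024-05-21 12:00",
    ]),
    "temperature_celsius": [20.0, 24.0, 22.0, 26.0, 28.0, 25.0, 23.0],
})

daily = (
    raw.assign(last_updated=raw["last_updated"].dt.normalize())  # drop time of day
       .set_index("last_updated")
       .groupby(level=0).mean()        # average readings sharing a date
       .resample("D").mean()           # daily frequency; missing days become NaN
       .sort_index()
       .interpolate(method="linear")   # fill the gaps linearly
)

# Lag features: values from one, two, and three days earlier.
for k in (1, 2, 3):
    daily[f"lag_{k}"] = daily["temperature_celsius"].shift(k)

features = daily.dropna()  # the first three days have incomplete lags
print(features)
```

Dropping the first three rows is one simple way to handle the incomplete lags; other choices, such as back-filling, are possible.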
Index | Variable | Model | MAE | MSE | Anomaly MAE | Anomaly MSE |
---|---|---|---|---|---|---|
0 | Temperature | SARIMA | 2.91 | 13.64 | 7.22 | 58.80 |
1 | Precipitation | SARIMA | 0.29 | 0.27 | 1.61 | 3.54 |
2 | Temperature | Random Forest | 2.04 | 8.24 | 6.51 | 46.18 |
3 | Precipitation | Random Forest | 0.23 | 0.23 | 1.71 | 3.66 |
4 | Temperature | Meta model | 1.81 | 5.87 | 3.30 | 13.94 |
5 | Precipitation | Meta model | 0.25 | 0.24 | 1.76 | 3.83 |
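One plausible way to obtain the overall and anomaly-only error columns is to compute the same metrics twice, once on all test points and once on the subset flagged as anomalous. The sketch below uses invented numbers and a hypothetical `is_anomaly` mask; it is not the report's evaluation code.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

# Invented test values, forecasts, and anomaly flags.
y_true = np.array([20.0, 21.0, 35.0, 22.0, 5.0])
y_pred = np.array([20.5, 20.0, 28.0, 22.5, 11.0])
is_anomaly = np.array([False, False, True, False, True])

overall = {"MAE": mae(y_true, y_pred), "MSE": mse(y_true, y_pred)}
anomalous = {
    "MAE": mae(y_true[is_anomaly], y_pred[is_anomaly]),
    "MSE": mse(y_true[is_anomaly], y_pred[is_anomaly]),
}
print(overall, anomalous)
```

Because MSE squares each error, a few badly missed anomalies raise the anomaly MSE much more than the anomaly MAE, a pattern also visible in the table.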
Temperature: The MAE and MSE values for `temperature` forecasting are higher across all models compared to `precipitation`, which is expected due to the more complex and dynamic nature of `temperature` fluctuations.
Precipitation: `Precipitation` forecasts consistently yield lower MAE and MSE values across models compared to `temperature`, reflecting the comparatively more stable nature of `precipitation` data.
`Temperature` trends are more complex, involving subtle seasonal patterns and continuous variability, as seen in the earlier plots. The stacked ensemble’s combination of linear and non-linear models provides the flexibility to model these patterns effectively. `Precipitation` data, however, is more discrete and less prone to subtle trends, and thus benefits from Random Forest’s direct handling of feature splits and non-linearity. The additional complexity of the meta-model does not add significant benefits and slightly increases error for normal precipitation data.
Although the meta-model is slightly worse on `Precipitation` anomalies, its handling of `Temperature` anomalies is vastly better, making it superior overall. This is another reflection of the simplicity of `precipitation` data, which does not require the complex ensemble model.
For `temperature`, while Random Forest and SARIMA perform reasonably well for our data, they lag behind in handling anomalies, highlighting the need for ensemble techniques to address such challenges.
The `temperature` results emphasize the importance of leveraging ensemble techniques where simple models like SARIMA or Random Forest might fail to capture the complexity of rare patterns effectively, while the `precipitation` evaluation results showcase the benefits of simple machine learning models for non-complex data, which expend less computational energy.
Note: Temperature trends from 2024-2025 are depicted in the Climate Analysis section of the report.
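The stacking idea can be illustrated with a small sketch. Note the substitutions: a linear autoregression on lag features stands in for SARIMA, the meta-model is a plain `LinearRegression`, and the series is synthetic. None of these choices are confirmed by the report; they only show the general shape of a stacked ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic daily series with a seasonal cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(400)
series = 15 + 10 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1.0, size=t.size)

# Lag features (lag_1, lag_2, lag_3) and the value to predict.
X = np.column_stack([series[2:-1], series[1:-2], series[:-3]])
y = series[3:]

# Chronological split: base-model training, meta-model training, final test.
n_base, n_meta = 250, 330

# Base learners: a linear model (stand-in for SARIMA) and a Random Forest.
base_models = [
    LinearRegression(),
    RandomForestRegressor(n_estimators=100, random_state=0),
]
for model in base_models:
    model.fit(X[:n_base], y[:n_base])

def base_predictions(X_part: np.ndarray) -> np.ndarray:
    """Stack each base model's predictions as one column per model."""
    return np.column_stack([m.predict(X_part) for m in base_models])

# The meta-model learns how to weight the base forecasts on held-out data.
meta_model = LinearRegression()
meta_model.fit(base_predictions(X[n_base:n_meta]), y[n_base:n_meta])

y_test = y[n_meta:]
stacked_pred = meta_model.predict(base_predictions(X[n_meta:]))
stacked_mae = float(np.mean(np.abs(y_test - stacked_pred)))
print(f"stacked MAE: {stacked_mae:.2f}")
```

Training the meta-model on a period the base models never saw keeps it from simply rewarding whichever base model overfits the training data most.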
Using this plot, we can explore how weather conditions vary across regions. Variations for each variable are depicted differently:
Variable | Encoding |
---|---|
Temperature | x-axis |
Humidity | y-axis |
Precipitation | size |
Wind | color |
Observations:
Note: When grouping by country, the aggregation metric used was max, to highlight extreme weather conditions and allow comparison with the anomaly detection schemes explored earlier.
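A minimal sketch of the per-country max aggregation, using invented rows. The column names follow the dataset, but which columns were actually aggregated is an assumption.

```python
import pandas as pd

# Invented observations for two countries.
df = pd.DataFrame({
    "country": ["Iceland", "Iceland", "Angola", "Angola"],
    "temperature_celsius": [4.0, 9.5, 26.0, 31.0],
    "precip_mm": [0.3, 2.1, 0.0, 0.4],
})

# max keeps each country's most extreme reading rather than its typical one.
by_country = df.groupby("country")[["temperature_celsius", "precip_mm"]].max()
print(by_country)
```

Using max instead of mean makes the maps sensitive to single extreme readings, which suits a comparison with anomaly detection but says less about typical conditions.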
Notably, Iceland is at the lower extreme in both choropleths. This also helps explain why Iceland was a near-outlier when analyzing country-wise geographical patterns.
I trained a `RandomForestRegressor` with:
X = df[['humidity', 'precip_mm', 'wind_kph', 'visibility_km']].dropna()
y = df['air_quality_us-epa-index'].loc[X.index]
The data was split to obtain `X_train`, on which I ran `shap.TreeExplainer`.
The correlation analysis provides insights into the relationships between meteorological variables and different air quality indicators.
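A correlation analysis of this kind might look like the sketch below, which computes Pearson correlations between weather columns and an air quality column. The data is synthetic, with a wind and PM2.5 relationship built in by construction, so the numbers say nothing about the real dataset; the column selection is also an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic data: PM2.5 is constructed to fall as wind speed rises,
# while humidity is unrelated to it.
wind = rng.uniform(0, 40, n)
humidity = rng.uniform(10, 100, n)
pm25 = 60 - 1.0 * wind + rng.normal(0, 5, n)

df = pd.DataFrame({
    "wind_kph": wind,
    "humidity": humidity,
    "air_quality_PM2.5": pm25,
})

weather_cols = ["wind_kph", "humidity"]
air_cols = ["air_quality_PM2.5"]

# Keep only the weather-versus-air-quality block of the correlation matrix.
corr = df.corr(method="pearson").loc[weather_cols, air_cols]
print(corr)
```

Pearson correlation only captures linear association, so a tree-based importance method such as SHAP can complement it where relationships are non-linear.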
Demonstration of changing temperature and precipitation trends across all countries in the dataset from 2024-2025:
Note: To view the trend for a single country on its own, double-click it in the legend!
“By making industry-leading tools and education available to individuals from all backgrounds, we level the playing field for future PM leaders. This is the PM Accelerator motto, as we grant aspiring and experienced PMs what they need most – Access. We introduce you to industry leaders, surround you with the right PM ecosystem, and discover the new world of AI product management skills.”