Motivation & Objective

Since 1980, the US has experienced 26 major droughts, at an average cost of $9.6 billion per event. It is also widely expected that drought frequency and intensity will increase in the US and globally. It is therefore worthwhile to build drought prediction models that generalize well and predict droughts with the longest possible lead times.

Using weather time series, previous drought scores and static soil data, I compare the performance of Linear Regression, Gradient Boosted Trees, a Dense Neural Network, and an LSTM for drought forecasting.

Data

The dataset used for this project combines daily weather observations in the US from the NASA POWER (Prediction of Worldwide Energy Resources) Project, soil data from the Harmonized World Soil Database, and drought scores from the U.S. Drought Monitor.

Weather data: The weather data contains daily observations of 18 weather variables (temperature, wind, humidity, precipitation, etc.) in the US at the county level (each county is identified by a FIPS code). It is already split into three sets, a train set, a validation set and a test set, which all cover 3058 counties and span 2000 to 2016, 2016 to 2018 and 2018 to 2020 respectively.
Soil data: The soil data is static and contains around 30 variables describing the agricultural characteristics of the soil in each county (distribution of slope levels, aspects, land use, etc.). This data is available for every county that appears either in the train set or in all time series.
Target variable: The U.S. Drought Monitor (USDM) is produced through a partnership between the National Drought Mitigation Center at the University of Nebraska-Lincoln, the United States Department of Agriculture, and the National Oceanic and Atmospheric Administration. Once per week, it provides a continuous drought score ranging from 0 to 5 for each of the 3058 studied counties. The score is often presented in 6 classes (from No Drought to Exceptional Drought). Its particularity is that it combines multiple drought indices and indicators into a single metric, which makes it a more objective measure of drought. It is a combination of the Palmer Drought Severity Index (PDSI), the CPC Soil Moisture Model, and the Standardized Precipitation Index (SPI). As the score is provided on a weekly basis, I interpolate it for convenience in the model development.
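
The interpolation of the weekly score can be illustrated with a minimal sketch. The interpolation method is not stated above, so linear interpolation between consecutive weekly observations is an assumption here, and the function name and sample values are hypothetical.

```python
from datetime import date, timedelta


def interpolate_weekly_to_daily(weekly_scores):
    """Linearly interpolate (date, score) pairs to one score per day.

    weekly_scores: list of (date, float) pairs sorted by date.
    Returns a dict mapping every day from the first to the last
    observation to a score. Linear interpolation is an assumption.
    """
    daily = {}
    for (d0, s0), (d1, s1) in zip(weekly_scores, weekly_scores[1:]):
        span = (d1 - d0).days
        for offset in range(span):
            # The value moves linearly from s0 (offset 0) towards s1.
            daily[d0 + timedelta(days=offset)] = s0 + (s1 - s0) * offset / span
    # The loop stops one day before the final observation, so add it here.
    last_day, last_score = weekly_scores[-1]
    daily[last_day] = last_score
    return daily


# Hypothetical weekly scores for a single county.
weekly = [
    (date(2019, 1, 1), 0.0),
    (date(2019, 1, 8), 1.4),
    (date(2019, 1, 15), 0.7),
]
daily = interpolate_weekly_to_daily(weekly)
```

With three weekly observations, the result holds 15 daily values, and the original weekly scores are preserved on their own dates.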

Approach

Regression task

After exploring, analyzing and cleaning the data, I compare multiple models. The target variable is a continuous score, so the task is framed as a regression. However, to match USDM's 6 score categories, I bin the predicted drought scores into the 6 drought severity categories and compute the average macro F1 score of the forecasts. Together with the forecasting MSE, this is one of the two main performance metrics that are monitored.
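
The two metrics can be sketched as follows for a single forecast horizon. The binning rule is not specified above, so rounding to the nearest integer and clipping to the 0 to 5 range is an assumption, and all function names and sample values are hypothetical.

```python
import math

N_CLASSES = 6  # scores 0..5, from No Drought up to Exceptional Drought


def to_category(score):
    """Round a continuous score to the nearest class, clipped to 0..5.

    Assumed binning rule; the actual bin edges may differ.
    """
    return min(N_CLASSES - 1, max(0, math.floor(score + 0.5)))


def mse(y_true, y_pred):
    """Mean squared error on the continuous scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in truth or prediction."""
    true_cat = [to_category(t) for t in y_true]
    pred_cat = [to_category(p) for p in y_pred]
    f1_scores = []
    for c in sorted(set(true_cat) | set(pred_cat)):
        tp = sum(t == c and p == c for t, p in zip(true_cat, pred_cat))
        fp = sum(t != c and p == c for t, p in zip(true_cat, pred_cat))
        fn = sum(t == c and p != c for t, p in zip(true_cat, pred_cat))
        # The denominator is non-zero because class c occurs at least once.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)


# Hypothetical true and predicted scores.
y_true = [0.0, 1.2, 2.6, 4.9, 0.3]
y_pred = [0.4, 0.8, 3.4, 4.2, 1.6]
print(f"MSE: {mse(y_true, y_pred):.3f}, macro F1: {macro_f1(y_true, y_pred):.3f}")
```

Macro averaging gives each severity category the same weight, which matters here because the rare, severe categories would otherwise be dominated by the frequent no-drought class.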

Modeling

I compare multiple forecasting methods:
Naive models: Two naive models that serve as the performance baseline: an identity model that uses current drought scores as predictions, and a model that interpolates droughts based on the last two drought scores.
ML models: Classic ML techniques (Ridge regression, Gradient Boosted Trees). Weather observations are flattened or averaged to take into account long-term dependencies between weather and droughts. Different combinations of regularization, time window (how far we look in the past), and features are tested to best solve the problem.
LSTM: Using the best set of features found while devising ML models.
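
The naive baselines and the two ways of feeding a weather window to the ML models can be sketched as follows. This is an illustration under assumptions: the second baseline is read here as a linear extension of the last two scores, clipped to the 0 to 5 range, and all function names and sample values are hypothetical.

```python
def identity_forecast(past_scores):
    """Persistence baseline: predict the most recent observed score."""
    return past_scores[-1]


def two_point_forecast(past_scores, steps_ahead=1):
    """Extend the line through the last two scores, clipped to 0..5.

    Assumed reading of the second naive model.
    """
    slope = past_scores[-1] - past_scores[-2]
    return min(5.0, max(0.0, past_scores[-1] + slope * steps_ahead))


def flatten_window(weather_rows, window):
    """Concatenate the last `window` daily rows into one feature vector."""
    return [value for row in weather_rows[-window:] for value in row]


def average_window(weather_rows, window):
    """Average each weather variable over the last `window` daily rows."""
    recent = weather_rows[-window:]
    return [sum(column) / len(recent) for column in zip(*recent)]


# Hypothetical drought scores and daily weather rows (two variables per day).
scores = [1.0, 1.5, 2.0]
weather = [[10.0, 0.0], [12.0, 2.0], [14.0, 4.0]]

persistence = identity_forecast(scores)
trend = two_point_forecast(scores, steps_ahead=2)
flat_features = flatten_window(weather, window=2)
mean_features = average_window(weather, window=2)
```

Flattening keeps the day-by-day ordering at the cost of a feature vector that grows with the window, while averaging keeps the vector size fixed but discards the ordering within the window.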