In classification problems, it is possible to produce a probability distribution of a set of classes. In regression problems, however, machine learning models *always* predict a single value without any way to measure the certainty of that value. Most of the time it is very important to be able to match a prediction to its likelihood. For example, ParkPredict (https://qucit.com/parkpredict_en/), a solution developed here at Qucit, predicts the time it takes to find a parking space up to 12 hours in advance. There is clearly a critical difference between a prediction with a 95% chance of correctness up to five minutes, and a potential error of several hours! Therefore, it is essential to be able to measure the certainty level of a prediction. Hence prediction intervals: intervals inside which it is almost certain that the true value of the target will be.

## Quantile regression, a standard way to build prediction intervals

In the machine learning domain, confidence intervals are generally built with quantile regression (https://en.wikipedia.org/wiki/Quantile_regression). This approach aims at estimating the conditional quantiles (the most common is the median) of the response variable, in contrast to the method of least squares that estimates the conditional mean. Given a random variable *Y* (such as the predicted parking time) and a value *p *in [0, 1], the associated quantile *q*, is the value such that P(Y <= q) = p. As an example, the median is the 0.5 quantile. For instance, let say that a predictive model that computes the parking time for tomorrow morning in downtown Bordeaux estimates that 22 minutes is the 0.95 quantile. This means that the probability that the parking time will be less or equal to 22 minutes is 95% (obviously this is just a fictional example, especially if you live in a city like Paris and you read this post on Sunday!).

*Fig. 1 Example of quantile regression: the probability that the predicted value (red line) is greater or equal to the actual samples (blue dots) is 70% *

Using quantile regression to compute prediction intervals is quite straightforward. By combining two quantile regressors, it is possible to build an interval that is surrounded by the two sets of predictions produced by these two models.

The following figure (Fig 2) illustrates how the 0.05 and 0.95 quantiles are used to compute the 0.9 prediction interval. Using the predictions of a 0.05 quantile regressor as a lower boundary and the predictions of a 0.95 quantile regressor as an upper one, by construction the probability that a value belongs to the interval between the upper and lower boundary is: P( l<=X <= u) = P(X <= u) – P(X <=l)=0.95-0.05=0.9

Roughly speaking, this means the prediction interval will contain approximately 90% of samples.

*Fig. 2 Example of a 0.9 prediction interval: the probability that the actual function’s observations (blue dost) belongs to the prediction interval (blue filled area) is 90%.*

## Quantile regression in practice

Quantile regression is a classical technique and some widespread machine learning package already implement it, such as scikit-learn in python.

Indeed scikit-learn has a Gradient Boosting Regressor already available that allows quantile regression and can produce excellent results. Here you can find an example of its usage: http://scikitlearn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html

However, Gradient Boosting Regressors, and quantile regression, in general, have some practical drawbacks.

When presented with extreme quantiles (very low or very high), the Gradient Boosting Regressor tends to produce intervals that are very distant from the true values. You can observe this in figure 2, where the 0.05 and 0.95 predictions are computed by two Gradient Boosting Regressors. As you see in the picture, this doesn’t mean that the quantiles are wrong, but in practice, we prefer quantile that are the closest possible to actual data. Let’s look at the following picture (Fig 3). The orange line represents the median value of the underling function f(x) used to generate the observation. The red line represents the prediction obtained by a 0.5 Gradient Boosted Quantile Regressor. Both lines separate the observation in half, i.e. the probability that an observation is greater than the predicted value on the whole data set is 50%. However, in practice, a static median value is less useful. In fact, it has a higher variance and underfits the data. Let’s take again the parking example and let’s say that the median parking time in your city is 10 minutes. A model that always predict 10 minutes, either in the night where the parking time close to 0 or in the morning where the parking time can get very huge, is quite useless compared to a model that is able to adapt itself to the context.

*Figure 3 Median value of f(x) (orange line) vs. 0.5 Gradient Boosted Regressor (red line). Both of them separate the observation in half, i.e. the probability that an observation is greater than the predicted value is 50%, but the median is less useful in real life applications.*

Another drawback is that each quantile needs its own regressor. Depending on the final application, the need for precision may vary and sometimes we want to estimate the entire distribution of the predictions. A possible solution is to train a model for each possible quantile, but this a clumsy and expensive solution. Let’s say we want an estimation for each percent quantile (0.01, 0.02, 0.03, etc. until 0.99). These solutions imply the training and management of 99 models. On the contrary, having a single model capable to estimate the whole underlying distribution implies training and managing only a single model.

Finally, quantile regression is not available for all types of regression models. In scikit-learn, the only model that implements it is the Gradient Boosted Regressor. Sometimes, such as in the case of XGBoost, you can customize the model’s cost-function to obtain quantile regressor. You can read the details of how to do it here:

https://medium.com/bigdatarepublic/regression-prediction-intervals-with-xgboost-428e0a018b

In general this is a complicated process that needs deep understanding of the algorithms to customize and their implementations. To address these issues, here at Qucit, we built a Distribution Estimator.

## Distribution estimator

A distribution estimator is a trained model that can compute quantile regression for any given probability without the need to do any re-training or recalibration.

Contrary to standard quantile regression which predicts one quantile per probability value (0.1, 0.2, 0.5 and so on), this estimator predicts the entire distribution of the predictions. This approach reduces the whole process to training a single model per target thus making it less computationally expensive and easier to maintain.

## How can we predict a distribution?

The main assumption behind Qucit’s distribution estimator is that the desired prediction follows a normal distribution. This seems like a strong assumption at first, but as we show next, in practice this assumption leads to satisfying results.

Therefore a distribution estimator works by producing a prediction and an estimate error for that prediction. From those two values, it estimates the distribution as a gaussian function centered on the prediction with a standard deviation equal to the estimated error. In other words, a model (hereafter referred as the base model) predicts the mean of the gaussian distribution, whereas the estimated error give us the standard deviation of the distribution.

There are two main strategies to produce values and error predictions:

### Training a machine learning model to predict values, and using its RMSE to compute the error

The first strategy assumes that the standard deviation of the normal distribution is constant. A first regression model, the base model, is trained on a training set and its root mean square error is evaluated on a validation set and used as an estimation for the standard deviation.

Here is a sample code snippet to do it:

### Training a machine learning model to predict values and errors

In practice, the error is not always constant (it depends on the features). Therefore, as an improvement, we can fit a model to learn the error itself.

As before, the base model is learnt from the training data. Then, a second model (the error model) is trained on a validation set to predict the squared difference between the predictions and the real values.

The standard deviation of the distribution is computed by taking the root of the error predicted by the error model. Again, here is a sample code to do it:

### Results

To test our models and compare them to the standard Gradient Boosted Quantile Regressor, we trained several distribution estimators on a toy data set.

However, for a more meaningful evaluation we changed some details on how the data set is created. To crash test our assumption that the prediction error is Gaussian, we have replaced the gaussian noise with an uniform one. Moreover, to have a more realistic data set, we trained the models on 50 000 samples instead of just 1000.

The models used for the evaluation are:

- A scikit-learn Gradient Boosted Quantile Regressor
- A fixed error Distribution Estimator based on a standard Gradient Boosted Regressor (referred as RMSE GBR Distribution Estimator)
- A Distribution Estimator composed of a Gradient Boosted Regressor base model and a Gradient Boosted Regressor error model (referred as GBR Distribution Estimator)
- A Distribution Estimator composed of a K Nearest Neighbor base model and a K Nearest Neighbor error model (referred as KNN Distribution Estimator)

By definition, the distribution estimators take less resources to train and to use than the Gradient Boosted Quantile Regressor (only two models instead of one per quantile).

More interestingly, they tend to produce more narrow prediction (lower variance) intervals than standard quantile regressors. The following picture shows the 90% prediction interval estimated by a distribution estimator whose base and error models are two gradient boosted regressors:

*Figure 4: 90% prediction interval estimated by a distribution estimator whose base and error models are two gradients boosted regressors*

Another advantage of the distribution estimator is the possibility to build them with any machine learning model. For instance, the distribution estimator based on KNN models produces quite surprising good results, as you can see in the picture below.

*Figure 5: 90% prediction interval estimated by a distribution estimator whose base and error models are two KNN estimators*

To conclude the evaluation, we compared the actual performance of the various models. We used the pinball loss. This is the standard loss function used in the context of quantile regression. Here is the formula for a given probability q and predictions p_{k}:

The crucial point is that the model that minimizes the pinball loss outputs the optimal quantile. If you are curious, you can read more about this loss and its relation to quantile regression here.

We computed the loss for each quantile with our four models, displayed on the following graph:

*Figure 6: Comparison of losses for various quantile regression models. The x-coordinate corresponds to the probability (in %) and the y-coordinate to the corresponding loss.*

Notice that the models using distribution estimators (red, orange, and green plots) have lower losses than the simple regression one (the blue one in the graph above) for the different probabilities. Distribution Estimator models using two trained models (one with the training data the other one with the error) are also better than the model estimating the error with the RMSE.

## Conclusion

Despite their apparent simplicity and the Gaussian assumption, the Distribution Estimator proved to be a very powerful tool that can be used effectively in the framework of quantile regression.

We showed that, when applied to a toy example, it outperforms the standard Gradient Boosted Quantile Regressor implemented in scikit learn. We have to admit that such performance gains coupled with model simplicity could look a bit magical. However, we use it in production on some of our products with excellent results. If in the future, we find more evidence of the goodness of this approach or some use cases where our distribution estimator fails, we’ll publish a new article about it.

Nice technique to predict error of prediction.

Thank you Vineet! Do not hesitate to share it. Have a nice day, Qucit’s Team

interesting article. Do you have any reference to this approach? or it is just a simple hack out of head? I wonder the general approach to fitting mean and conditional STD has existed in other types of machine learning task?

We do not have any academic reference, this approach was born more as an engineering hack to solve the quantile regression shortcomings described in the article.

However, if you are interested, Airbnb pusblished an article with a similar approach using machine learning to predict the parametes of a Gamma distribution:

https://medium.com/airbnb-engineering/learning-market-dynamics-for-optimal-pricing-97cffbcc53e3

it‘s somewhat like an article i’ve read named ‘Deep belief network based deterministic and probabilistic wind speed forecasting approach’，but the author uses cubic-curve to fit the distribution of error. Nice job.

Thank you!

Hey, great technique !

You probably wanna update your code to show the square root of the predicted error as the stdev..

st_dev = error_model.predict(X_test)**0.5

You’ve mentioned it in the explanation but probably missed the same in the code! 🙂

Good catch, thank you !

Very nice technique.

In repeating it, I ran into an interesting problematic. I am using XGBoost to predict a continuous variable. For some reason, hard for me to figure out, I am predicting some negative errors. Squaring the validation residuals should make the continuum only positive for the response variable. However, my model is extrapolating some negative values. If one takes the SQR of these values, one does not get a number.

At first, I thought supplanting negative numbers with the maximum value of the prediction might be a good idea since the negative prediction implied a sort of “out of control” or “not seen” input variables. On a second thought, supplanting these values with the validation RMSE seem a better idea.

Hi Ben

It’s quite normal for an XGBoost regressor to extrapolate some negative values, even if they were never seen in the training data. For an explanation on why this happens, you can read this:

https://github.com/dmlc/xgboost/issues/1581

Basically, you can interpretate this as XGBoost doing some extrapolation errors in predicting too small values. Therefore you can replace the negative value with 0, or, better with some epsilon value to still have a normal distribution instead of a Dirac distribution.

Can you please explain how did you derive prediction intervals

I assume you were adding -/+ Standard deviation(which you got from error model) to the base prediction to get 68% confidence interval.

but i when i see graphs the base prediction line is not bisecting the prediction interval, I think we were using different approach to find the prediction intervals