Machine learning is not a panacea. When facing a problem with a huge amount of data, the first step is to take the time to deeply understand the scope of the problem and what the data really contains. A Machine Learning model is a powerful tool to help you solve a problem, but it is rarely the whole answer. In many cases, a solution based on pure Machine Learning alone will fail. In this article, we show some techniques for modeling a problem in a way that smartly combines a predictive Machine Learning model with post-processing of the model's output.
Machine Learning has developed very rapidly in recent years and its use has become widely popular. It is sometimes seen as a magic formula that solves any problem involving large volumes of data. The effectiveness of Machine Learning models no longer needs to be demonstrated, but it is often forgotten that their performance depends on a detailed understanding of the problem to be solved.
Some problems are difficult to express as a pure machine learning problem. In particular, it may be too costly or even impossible to obtain labelled data, ruling out supervised learning techniques. Unsupervised learning methods may not be applicable to the problem either. In this case, it may be advisable to rephrase the problem as an intermediate problem, often more general, for which Machine Learning techniques are relevant and effective, and then apply a post-processing step to the model's predictions to produce a solution to the original problem.
Post-processing consists of applying wisely chosen transformations to the predictions produced by a model. It is often overlooked, even though it is just as important as feature engineering or model development. Post-processing ensures that predictions are consistent and that the model's prediction error remains reasonable. It also corrects biases inherent in a model. Intelligent post-processing can even automatically transform the predictions to make them more meaningful and usable.
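As a toy illustration (not our actual pipeline, and the numbers are made up), a minimal post-processing step might clip and round raw regression outputs so that they stay physically consistent, since a station cannot hold a negative or fractional number of bikes:

```python
import numpy as np

capacity = 20  # hypothetical number of docks at the station
raw_predictions = np.array([-1.3, 4.6, 12.2, 23.8])  # hypothetical raw model output

# Round to whole bikes and clip to the station's physical capacity.
post_processed = np.clip(np.round(raw_predictions), 0, capacity).astype(int)
print(post_processed)
```

Even this trivial transformation already makes the predictions directly usable by an operator, whereas the raw outputs are not.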
These are the techniques we often use at Qucit to solve the urban mobility problems our clients bring to us. These are very often complex problems where the volume of data to be processed is considerable, so Machine Learning techniques appear to be attractive solutions. However, they are often problems for which we do not know the answer in advance, and we therefore need to do real modeling work on the problem itself.
Cycling is a fast, economical and popular means of everyday transport for city-dwellers, and an efficient bikeshare system is becoming essential for cities. In recent months, a new type of bikeshare known as free floating has also developed: there are no stations, and users can drop off their bikes almost anywhere.
Anyone who is used to bikesharing has certainly already experienced the difficulty of finding a bike or a parking spot at peak times of the day: in the morning no spot close to your workplace, and in the evening no bike to go home. The new free floating bikes partly address this problem, since the user no longer needs to look for a spot. However, free floating is still affected by the same phenomenon: bikes concentrate in some areas and are consequently absent from others, which can harm the reliability of this mode of transport without action on the part of system operators.
One of the key challenges for operators of these systems is to address this availability problem and ensure that stations never remain empty or full for too long. Operators therefore move bikes between stations, emptying full stations and filling empty ones. For small systems with fewer than sixty stations, a good knowledge of the city and of the dynamics of transport flows can be sufficient to carry out these operations efficiently. Beyond a hundred or so stations, this becomes impossible. As we showed in a previous article, the number of trips in a system increases exponentially with the number of stations. In a city like Paris, which will have 1,400 stations in 2018, rebalancing the stations becomes a real headache.
Figure 1: Occupancy of two stations in Bordeaux over a whole week
Rebalancing operations correspond to discontinuity points (some of them marked in red)
The problem is therefore as follows: which operations should be carried out on the stations, and how, to ensure maximum availability of the stations over the next few hours while minimizing the logistical costs these operations incur?
At Qucit, we have been working for several years on the issues of bikeshare systems. We constantly collect data on station occupancy, and our clients provide us with data on the journeys made by users. Today, we store more than 2 TB of data in our database, covering more than 400 cities over several months. But we do not have the operations that would have been necessary to guarantee the availability of the stations. Moreover, it is not possible to estimate this quantity, because a change in the number of bikes at a station also alters the visible demand: when a station is empty, there may be a latent demand for bikes that cannot be observed, and adding bikes is the only way to observe it. It is therefore very difficult, and not very relevant, to develop a Machine Learning model that predicts the ideal number of bikes directly.
So we took a totally different approach. Rather than trying to directly predict the ideal number of bikes at a station, we chose to predict the probability that a station will be full or empty in the next few hours. This estimates the likelihood that the demand for bikes will be such that it fills or empties the station completely. Estimating a probability rather than a number of bikes is all the more relevant because the evolution of a station's occupancy is a time series with a large random component, which can only be accounted for by a probabilistic approach. Besides, we are not trying to determine whether a station will have 12 or 13 bikes in an hour, but whether the station is likely to be empty or full in an hour. This risk is quantified by the probability that the station will encounter an availability problem.
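As a rough sketch of how such a problem can be framed, one could train a standard classifier to output the probability that a station empties within the next hour. The features, labels and model below are purely hypothetical toy choices for illustration, not our actual setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy features: hour of day, day of week, current fill ratio (0..1).
n = 2000
X = np.column_stack([
    rng.integers(0, 24, n),  # hour of day
    rng.integers(0, 7, n),   # day of week
    rng.random(n),           # current fill ratio
])
# Toy label: nearly empty stations tend to empty out in the morning.
y = ((X[:, 2] < 0.2) & (X[:, 0] < 12)).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# Probability that a nearly empty station (fill ratio 0.1)
# empties at 9 am on a Tuesday.
p_empty = model.predict_proba([[9, 1, 0.1]])[0, 1]
```

The point is the framing: the model outputs a risk between 0 and 1 rather than a bike count, which is exactly what the post-processing step then consumes.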
Then comes the post-processing stage which allows us to answer our clients’ questions. Based on these probabilities, it is easy to determine:
- A lower bound on the number of bikes ensuring that the station will not become empty, with a certain probability;
- An upper bound on the number of bikes ensuring that the station will not become full, with a certain probability.
We thus obtain an interval of eligible bike counts that guarantees the availability of the station with a certain probability. This is an all the more relevant answer because, in general, the ideal number of bikes has no reason to be unique.
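The two bounds above can be derived with a few lines of post-processing. The sketch below uses hypothetical probability curves rather than real predictions: for each candidate bike count, we keep only the counts whose predicted risk of emptying or filling stays under an accepted threshold:

```python
# Hypothetical example: derive the eligible interval of bikes for a
# 10-dock station. p_empty[b] is the predicted probability that the
# station empties in the next few hours if it currently holds b bikes;
# p_full[b] is the same for filling up completely.
capacity = 10
p_empty = [0.60, 0.35, 0.18, 0.09, 0.04, 0.02, 0.01, 0.00, 0.00, 0.00, 0.00]
p_full = [0.00, 0.00, 0.00, 0.00, 0.01, 0.02, 0.04, 0.09, 0.18, 0.35, 0.60]
threshold = 0.05  # accepted risk of an availability problem

# Lower bound: the fewest bikes keeping the risk of emptying acceptable.
lower = min(b for b in range(capacity + 1) if p_empty[b] <= threshold)
# Upper bound: the most bikes keeping the risk of filling up acceptable.
upper = max(b for b in range(capacity + 1) if p_full[b] <= threshold)

print(f"eligible interval: [{lower}, {upper}] bikes")
```

Any count inside the interval keeps both risks under the threshold; raising the accepted risk widens the interval, while looking further into the future, where both probabilities grow, narrows it.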
Figure 2: Eligible interval of bikes to guarantee availability – yellow: upper bound / blue: lower bound
Looking further into the future gives a narrower eligible interval
Our approach also allows us to fine-tune the solution and adapt it to our clients' wishes. With a model that directly outputs the final prediction, it is still possible to take the client's needs into account, but it has to be done before the model is trained, which makes any change a posteriori more difficult.
Our approach led us to develop a Machine Learning model whose output still requires some post-processing steps before it answers the problem. What may appear to be additional work is in fact the strength of our approach: the model's predictions are easily interpretable and verifiable. In a context where our clients' expectations are high and the relevance of our recommendations is critical, it is essential to be able to verify and interpret our results.
Machine Learning should not be seen as a standard answer to every predictive problem, but as a useful and potentially high-performance building block that helps solve them. In a context where models are used by clients, it is important to be able to quickly assess the correctness of the predictions.
In the example discussed, we chose to use Machine Learning models to address a more general, lower-level problem than the one our clients face. The models we set up require only a few post-processing steps to answer the original problem.
A simple model that addresses part of the problem is usually much more relevant. It is more flexible in its use and easier to maintain, as it is less tied to one particular formulation of the problem. It also allows us to answer multiple questions, not only our clients' original one. Finally, it enriches our understanding of the phenomenon.
The difficulty in this approach is often finding an intermediate problem that a Machine Learning model solves effectively. The key to achieving this is an excellent mastery of the data and a thorough understanding of what we are trying to model.