# Beyond L2 Loss – How We Experiment with Loss Functions

Estimating expected time of arrival (ETA) is crucial to what we do at Lyft. Estimates go directly to riders and drivers using our apps, as well as to different teams within Lyft. The Pricing team, for example, uses our ETA estimates in order to determine the rider’s fare. The Dispatch team uses them to find the best matches between drivers and groups of riders.

To provide reliable and accurate ETA estimates, the Mapping team built a sophisticated system that includes a machine learning ETA prediction model. The main component in any prediction model is the “loss” (or “objective”) function, and choosing the right loss function for the data can improve prediction accuracy. This post discusses a recently developed method that generates and compares loss functions that are suitable for calculating averages, such as the ETA.

Loss functions can be thought of as trails through a mountain range that can lead to different peaks. Each trail varies in difficulty. The amount of training data is the amount of energy one has to hike up the mountain. The optimization method is the trail guide. The result of all this is a prediction model or a trained hiker. Therefore, to take any trail we need to consider the energy and have the right guide, just as for each loss function we need to consider the amount of training data and the optimization method we’ll use.

The common approach when experimenting with loss functions is to create training, validation, and test datasets. Models are trained with different loss functions and evaluated on the test dataset with some evaluation metrics, in the hope of finding a winner. To use the metaphor: each day a hiker trains by mustering all their energy, choosing a trail and a guide to take them up the mountain, and stopping when they run out of energy. Trained hikers are compared based on how far they make it up the mountain. However, and this is a big “however”, how can we compare hikers that trained on different trails? In other words, are estimates from different loss functions comparable?

In evaluating competing models, Gneiting (2011) argued that one should either:

1. Fix and communicate a single evaluation loss to modelers; or
2. Communicate the desired target (i.e., the mean), so that loss functions consistent for that target can be used in training.

Consistency is like identifying all trails that lead to the same peak, and thus the peak is the target. Formally, consistency with respect to a target means that the expected loss achieves its minimum value at the target. Mathematically, if the target is the mean, μ = E[Y], of some univariate random variable Y, a loss function L is consistent for μ if

$$\mathbb{E}[L(\mu, Y)] \le \mathbb{E}[L(z, Y)]$$

for any other value z ∈ ℝ.
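Consistency can be checked numerically. The sketch below is illustrative only: it draws hypothetical right-skewed "travel times" (the log-normal distribution and its parameters are assumptions, not Lyft data), evaluates the average loss over a grid of candidate predictions z, and compares the minimizer of the squared error with the sample mean and the minimizer of the absolute error with the sample median.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed "travel times" in minutes (assumed distribution).
y = rng.lognormal(mean=2.0, sigma=0.75, size=20_000)

# Candidate point predictions z on a regular grid.
grid = np.linspace(y.min(), np.quantile(y, 0.99), 1001)

# Empirical expected loss for each candidate z.
l2_risk = np.array([np.mean((z - y) ** 2) for z in grid])
l1_risk = np.array([np.mean(np.abs(z - y)) for z in grid])

best_l2 = grid[np.argmin(l2_risk)]
best_l1 = grid[np.argmin(l1_risk)]

print(f"sample mean   = {y.mean():.2f}, L2 minimizer = {best_l2:.2f}")
print(f"sample median = {np.median(y):.2f}, L1 minimizer = {best_l1:.2f}")
```

Up to the grid resolution, the squared-error minimizer should land on the sample mean and the absolute-error minimizer on the sample median, which is the sense in which the two losses lead to different peaks.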

Why the different approaches? The two approaches enable us to be clear on what is being evaluated, since trails can lead to different peaks. This relates to the statistical concept of interpretability of the value being estimated: Are we estimating the mean, or the median, or some other property of the predictive distribution?

If we know exactly the desired evaluation loss, then regardless of interpretation, it suffices to communicate it to modelers. This facilitates automated rank-based decisions, since modelers have already agreed to the evaluation loss. Modelers can choose to use this loss in training if desired.

If interpretation is important, then it suffices to communicate the target (i.e., the destination peak) to modelers. For example, if prediction is used in derivative or secondary applications, then one might want to estimate the mean, since it is additive. Lyft’s ETA team trains our models to estimate the expected travel time, as opposed to the median, to preserve interpretability in stacked models.
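The additivity argument can be illustrated with a small simulation. This is a sketch under assumed inputs: two hypothetical route segments with independent, right-skewed travel times (the distributions are made up for illustration). The mean of the total equals the sum of the segment means, while the medians do not add up.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical route segments with right-skewed travel times (minutes).
seg_a = rng.lognormal(mean=1.5, sigma=0.6, size=100_000)
seg_b = rng.lognormal(mean=2.0, sigma=0.9, size=100_000)
total = seg_a + seg_b

# Means are additive: the gap is zero up to floating-point rounding.
mean_gap = abs(total.mean() - (seg_a.mean() + seg_b.mean()))

# Medians are not additive for skewed distributions.
median_gap = abs(np.median(total) - (np.median(seg_a) + np.median(seg_b)))

print(f"mean gap   = {mean_gap:.2e}")
print(f"median gap = {median_gap:.2f}")
```

A downstream model that sums segment-level predictions therefore keeps a clear interpretation when those predictions target the mean, which is harder to guarantee when they target the median.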

Since we already know the peak we want to reach, the most famous trail is the squared-error (L2) loss, defined for a prediction x and an observation y as:

$$L(x, y) = (x - y)^2$$

That raises two questions. The first is intuitive: What are the other loss functions that are also consistent for the mean, and can they be better in some way? The other question is slightly non-intuitive. If two hikers trained on different trails leading to the same peak, we clearly cannot evaluate them on how well they perform on a single trail. The evaluators’ choice of trail might influence their ranking. If a hiker knows beforehand which trail they will be evaluated against, they might as well train on that trail to increase their competitiveness, going back to the first approach by fixing the evaluation loss.

In practice, if we have a loss function other than L2 that is also consistent for the mean, how do we compare the two? If we use a specific metric to evaluate their outputs, for example, the root mean square error (RMSE), can we ensure that RMSE is impartial to the choice of the training loss?

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2}$$

Here n is the number of observations, yᵢ is the i-th observation, and xᵢ is the corresponding prediction.
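As a minimal sketch, RMSE can be computed as follows (the function name `rmse` and the sample numbers are illustrative, not taken from Lyft's codebase). Because RMSE is a monotone transformation of the average L2 loss, ranking models by RMSE is the same as ranking them by L2, which is why its impartiality toward models trained on other losses is in question.

```python
import numpy as np

def rmse(x, y):
    """Root mean square error between predictions x and observations y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((x - y) ** 2)))

# Hypothetical predictions and observations, in minutes.
predictions = [10.0, 12.0, 7.0]
observations = [11.0, 10.0, 7.0]
print(rmse(predictions, observations))
```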

In this view, moving beyond the L2 loss while preserving the target (i.e., the mean) not only requires a method to generate consistent losses for the mean, but also an impartial diagnostic tool for such losses; an evaluation metric that does not depend on any specific trail.

The recent results of Ehm et al. (2016) provide exactly that: a loss-generating method and an impartial diagnostic tool called the Murphy’s Diagram.

In the next section I will briefly discuss the Ehm et al. (2016) framework and show some consistent losses for the mean, and then briefly introduce the Murphy’s Diagram as a diagnostic tool. Later I’ll discuss the optimization method and some practical implementation issues of the generated losses, before finally presenting an example from our data.

For a point prediction x of some y ∈ ℝ, Ehm et al. (2016) showed that any loss function that is consistent for the mean admits a mixture representation of the form:

$$L(x, y) = \int_{-\infty}^{\infty} l_u(x, y)\, h(u)\, du$$

for a continuous non-negative function h(u), where:

$$l_u(x, y) = \begin{cases} |y - u|, & \min(x, y) \le u < \max(x, y) \\ 0, & \text{otherwise} \end{cases}$$

It is called a mixture since we’re integrating over some value u that is between model prediction x and the true observation y, where each u has a weight defined by h(u). lᵤ(x,y) is called the simple loss, since it characterizes the family of all losses that are consistent for the mean.
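The mixture representation can be sketched numerically. The code below assumes the simple loss takes the form |y − u| for u between x and y and zero elsewhere, which is the form that reproduces the L2 loss when h(u) = 2. The function names `simple_loss` and `mixture_loss`, and the trapezoidal integration, are illustrative choices rather than the implementation used in our models.

```python
import numpy as np

def simple_loss(u, x, y):
    """Assumed simple loss: |y - u| if min(x, y) <= u < max(x, y), else 0."""
    inside = (min(x, y) <= u) and (u < max(x, y))
    return abs(y - u) if inside else 0.0

def mixture_loss(x, y, h, num=20_001):
    """Approximate the integral of h(u) * l_u(x, y) du with the trapezoidal rule.

    The simple loss vanishes outside [min(x, y), max(x, y)), so it is enough
    to integrate over that interval (the excluded endpoint has measure zero).
    """
    lo, hi = min(x, y), max(x, y)
    if lo == hi:
        return 0.0
    u = np.linspace(lo, hi, num)
    f = h(u) * np.abs(y - u)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(u)))

# Constant weight h(u) = 2, which should recover the squared error.
l2_weight = lambda u: np.full_like(u, 2.0)

print(simple_loss(5.0, x=3.0, y=8.0))     # u lies between x and y
print(mixture_loss(3.0, 8.0, l2_weight))  # compare with (8 - 3) ** 2
print(mixture_loss(8.0, 3.0, l2_weight))  # same value when x > y
```

Swapping in a different non-negative weight generates a different loss that is still consistent for the mean. For instance, with positive x and y, `mixture_loss(5.0, 8.0, lambda u: 1.0 / u)` should agree with y·ln(y/x) − (y − x), which is half the Poisson deviance; that example is an addition here, not one taken from the text.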

Let’s look at some known examples:

L2 loss. Assume that x < y (the converse follows similarly), and let h(u) = 2. We have:

$$L(x, y) = \int_x^y 2\,(y - u)\, du = (y - x)^2$$
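For completeness, the converse case x > y can be worked out the same way. This assumes the simple loss equals |y − u| for y ≤ u < x and zero elsewhere, so that with h(u) = 2:

$$L(x, y) = \int_y^x 2\,|y - u|\, du = \int_y^x 2\,(u - y)\, du = \Big[(u - y)^2\Big]_{u=y}^{u=x} = (x - y)^2$$

Both cases therefore give the same squared-error loss, as expected from its symmetry in x and y.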

Binary cross-entropy. Let h(u) = 1/{u(1-u)} with 0 < u < 1. By co