Weighted scoring rules#

Most scoring rules used in practice evaluate the entire forecast distribution. However, it is often the case that some outcomes are of more interest than others. For example, extreme outcomes often have a large impact on forecast users, making forecasts for these outcomes particularly valuable. One could therefore argue that these outcomes should be assigned a higher weight when evaluating forecast performance. For example, Savage (1971) remarks that “the scoring rule should encourage the respondent to work hardest at what the questioner most wants to know.” For this purpose, weighted scoring rules have been introduced, which extend conventional scoring rules to incorporate a weight function into the score. This weight function can then be chosen to emphasise the outcomes that are of most interest to the forecast users, facilitating flexible, user-oriented evaluation.

The forecaster’s dilemma#

Given a proper scoring rule $S$ on $\Omega$, one may try to emphasise particular outcomes by multiplying the score by a weight depending on what outcome occurs. That is, to evaluate the forecast $F$ using

\[ w(y)S(F, y), \]

for some non-negative weight function $w : \Omega \to [0, \infty)$. In this case, the score will be scaled depending on what outcome occurs.

However, Gneiting and Ranjan (2011) demonstrate that this is not a proper scoring rule. In particular, if $Y \sim G$, then the expected score $\mathbb{E}[w(Y)S(F, Y)]$ is minimised by a weighted version of $G$ rather than $G$ itself. This weighted version of $G$ assigns higher probability density to the outcomes that are assigned higher weight by $w$. For example, if $G$ emits a density $g$, then the weighted score is minimised by the distribution $G_{w}$ with density

\[ g_{w}(x) = \frac{w(x)g(x)}{\int_{\Omega} w(z) g(z) dz}. \]

If the weight function is of the form $w(x) = \mathbb{1}\{x \in A\}$ for some subset of the outcomes $A \subset \Omega$, then $G_{w}$ is the conditional distribution of $Y$ given that $Y \in A$. Note that this easily extends to distributions that do not emit a density function.

For more intuition, consider the case where the weight function restricts attention to extreme outcomes. If we only evaluate forecasts made for extreme outcomes, then it becomes an attractive strategy to always predict that an extreme event will occur, regardless of how likely such an event is. However, such a forecast is useless in practice: if we always predict that an extreme event will occur, then we cannot use this forecast to distinguish whether or not an extreme event will actually occur. Lerch et al. (2017) term this the forecaster’s dilemma. Instead, alternative, more theoretically-desirable methods are required to emphasise particular outcomes when evaluating forecast performance.

Outcome-weighting (conditional scores)#

Holzmann and Klar (2017) remarked that if the expected scoring rule is minimised by a weighted version of the true distribution of $Y$, then we can circumvent the forecaster’s dilemma by evaluating forecasts via their weighted version. That is, given a proper scoring rule $S$, the weighted scoring rule

\[ \mathrm{ow}S(F, y) = w(y)S(F_{w}, y) \]

is minimised when $F_{w} = G_{w}$, which is obviously the case (though generally not uniquely) when $F = G$. This scoring rule is therefore generally proper but not strictly proper.

This approach was initially applied to the Log score by Diks et al. (2011), but can be implemented using any proper scoring rule, providing a very general framework with which to weight particular outcomes during forecast evaluation whilst retaining the propriety of the scoring rule. We refer to these scoring rules as outcome-weighted scoring rules, since they weight the score depending on the outcome. They have also been referred to as conditional scores, since, when the weight function is of the form $w(x) = \mathbb{1}\{x \in A\}$, they evaluate the forecast via its conditional distribution on $A$.

In evaluating the conditional forecast distribution, these outcome-weighted scoring rules do not take into account the forecast probability that an outcome of interest will occur. For example, if $w(x) = \mathbb{1}\{x \in A\}$, then the score will not depend on the forecast probability that $\mathbb{1}\{Y \notin A\}$. Two forecasts that have the same conditional distribution will therefore receive the same score.

Holzmann and Klar (2017) proposed to remedy this by adding a proper scoring rule for binary outcomes, such as the Brier score, to the outcome-weighted scoring rule. However, there is no guarantee that the scale of the two scores are compatible, and there is often no canonical choice for the scoring rule for binary outcomes; the Log score is an exception, since this approach can recover the censored likelihood score introduced by Diks et al. (2011), which is discussed later (Holzmann and Klar, 2017, Example 4). Hence, these complemented outcome-weighted scores are not available in scoringrules. However, scoringrules does offer outcome-weighted versions of some popular scoring rules, which are listed in the following sections.

Threshold-weighting (censored scores)#

When $\Omega \subseteq \mathbb{R}$, Matheson and Winkler (1976) propose evaluating probabilistic forecasts by considering the probability forecast that a threshold will be exceeded, evaluating this forecast using a proper scoring rules for binary outcomes, and then integrating over all thresholds. For example, they introduce the Continuous Ranked Probability Score (CRPS) as

\[ \mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} (F(z) - \mathbb{1}\{y \le z\})^{2} dz, \]

which is the integral of the Brier score. Matheson and Winkler (1976) note that the integral over the binary scoring rule could be defined more generally with respect to any probability distribution, essentially incorporating a non-negative weight $w(z)$ into the integrand. They then remark that “if certain regions of values of the variable are of particular interest, the experimenter might make [$w(z)$] higher in these regions than it is elsewhere.”

In the case of the CRPS, Gneiting and Ranjan (2011) refer to this weighted version as the threshold-weighted CRPS, since the weight function emphasises particular thresholds at which the binary scoring rules are evaluated. Todter and Ahrens (2012) replaced the Brier score with the Log score and similarly introduced a Continuous Ranked Logarithmic Score (CRLS), as well as a threshold-weighted CRLS.

Allen et al. (2024) demonstrate that the threshold-weighted CRPS corresponds to transforming the forecast and observation, before evaluating the transformed forecast against the transformed observation using the standard (unweighted) CRPS; the transformation is determined by the weight function. For example, when $w(x) = \mathbb{1}\{x > t\}$, for $t \in \mathbb{R}$, one possible transformation is $v(x) = \max\{x, t\}$. In this case, the forecast distribution and the observation are essentially censored at the threshold $t$, which is akin to the censored likelihood score proposed by Diks et al. (2011).

Moreover, while the original idea of threshold-weighting was implicitly univariate and specific to scoring rules constructed using the approach of Matheson and Winkler (1976) described above, the idea of transforming forecasts and observations prior to evaluation can easily be implemented on arbitrary outcome spaces $\Omega$, and with arbitrary scoring rules. This facilitates the introduction of threshold-weighted versions of any proper scoring rule. Some examples constructed using popular scoring rules are listed in the following sections.

Vertical re-scaling#

Outcome-weighted and threshold-weighted scoring rules involve transforming the forecast distributions and outcomes (e.g. via conditioning or censoring) before calculating the original unweighted scores. Alternatively, since many scoring rules are defined in terms of (dis-)similarity metrics, we could instead define weighted scoring rules by transforming the output of these metrics, rather than the inputs. In particular, kernel scores are scoring rules of the form

\[ S_{k}(F, y) = \frac{1}{2} k(y, y) + \frac{1}{2} \mathbb{E} k(X, X^{\prime}) - \mathbb{E} k(X, y), \]

where $k$ is a positive definite kernel on $\Omega$. A positive definite kernel can be interpreted as a generalised inner product, so that $k(x, x^{\prime})$, $x, x^{\prime} \in \Omega$, loosely represents a measure of similarity between $x$ and $x^{\prime}$.

Rasmussen and Williams (2006) remark that if $k$ is a positive definite kernel on $\Omega$, then so is the weighted kernel $\check{k}(x, x^{\prime}) = k(x, x^{\prime})w(x)w(x^{\prime})$, for some non-negative weight function $w$ on $\Omega$. This can therefore be implemented within the definition of kernel scores above to obtain weighted versions of any kernel score. Allen et al. (2024) term these vertically re-scaled kernel scores, since they involve re-scaling the output of the similarity measure.

For an interpretation of these vertically re-scaled kernel scores, consider how $\check{k}$ adapts $k$. The similarity between $x$ and $x^{\prime}$ (as measured using $k$) is weighted depending on their values. For example, if $w(x) = \mathbb{1}\{x \in A\}$ for some subset of the outcomes $A \subset \Omega$, then

\[\begin{split} \check{k}(x, x^{\prime}) = \begin{cases} k(x, x^{\prime}), & x \in A, x^{\prime} \in A, \\ 0, & \text{otherwise.} \end{cases} \end{split}\]

In this case, the measure of similarity is restricted to instances where both inputs are in the region $A$ of interest. In the context of kernel scores, consider the term $\mathbb{E} \check{k}(X, y)$. When the outcome $y \notin A$, this term is equal to zero, regardless of the forecast. When $y \in A$, the term is maximised (resulting in a lower score) when $F$ assigns a higher probability to the region $A$. Conversely, the second term $\mathbb{E} \check{k}(X, X^{\prime})$ does not depend on the observation, and is minimised (resulting in a lower score) when $F$ assigns lower probability to $A$. Hence, this vertically re-scaled scoring rule rewards forecasts that issue a high probability to $A$ when $y \in A$, and a low probability to $A$ otherwise.

Similarly to the threshold-weighted scores, the behaviour of the forecast outside of $A$ does not contribute to the score beyond the probability that is assigned to $A$. In many relevant cases, vertical re-scaling is equivalent to threshold-weighting (Allen et al., 2024, Proposition 4.10).

Examples#

In the following, let $w : \Omega \to [0, \infty)$ be a weight function that assigns a non-negative weight to each possible outcome, unless specified otherwise.

Log score#

Assuming the forecast distribution $F$ emits a density function $f$ on $\Omega$, the Log score is defined as

\[ \mathrm{LS}(F, y) = -\log f(y). \]

The conditional likelihood score is defined as

\[\begin{split} \begin{aligned} \mathrm{coLS}(F, y; w) &= - w(y) \log \left( \frac{f(y)}{\int_{\Omega} w(z)f(z) dz} \right) \\ &= - w(y) \log f(y) + w(y) \log \left[ \int_{\Omega} w(z)f(z) dz \right]. \end{aligned} \end{split}\]

This can be interpreted as an outcome-weighted version of the Log score (Holzmann and Klar, 2017, Example 1).

For a weight function $w : \Omega \to [0, 1]$, the censored likelihood score is defined as

\[ \mathrm{ceLS}(F, y; w) = - w(y) \log f(y) - (1 - w(y)) \log \left[ 1 - \int_{\Omega} w(z)f(z) dz \right]. \]

The censored likelihood score evaluates the forecast via the censored density function, and can therefore be interpreted as a threshold-weighted version of the Log score (see the discussion in de Punder et al., 2023).

Since the Log score is not a kernel score, no vertically re-scaled version of the score exists.

Kernel scores#

The kernel score corresponding to a positive definite kernel $k$ on $\Omega$ is defined as

\[ S_{k}(F, y) = \frac{1}{2} k(y, y) + \frac{1}{2} \mathbb{E} k(X, X^{\prime}) - \mathbb{E} k(X, y), \]

where $X, X^{\prime} \sim F$ are independent. Allen et al. (2024) discuss how weighted versions of kernel scores can be constructed. Since many popular scoring rules are kernel scores, including the CRPS, energy score, and Variogram score, this facilitates the introduction of weighted versions of these popular scores. These examples are given in the following sections.

The outcome-weighted version of a kernel score can be written as

\[ \mathrm{ow}S_{k}(F, y; w) = \frac{1}{2} k(y, y)w(y) + \frac{1}{2 \bar{w}_{F}^{2}} \mathbb{E} k(X, X^{\prime})w(X)w(X^{\prime})w(y) - \frac{1}{\bar{w}_{F}}\mathbb{E} k(X, y)w(X)w(y), \]

where $\bar{w}_{F} = \mathbb{E}[w(X)]$.

The threshold-weighted version of a kernel score can be written as

\[ \mathrm{tw}S_{k}(F, y; v) = \frac{1}{2} k(v(y), v(y)) + \frac{1}{2} \mathbb{E} k(v(X), v(X^{\prime})) - \mathbb{E} k(v(X), v(y)), \]

where $v : \Omega \to \Omega$ is termed the chaining function. When $\Omega \subseteq \mathbb{R}$, the chaining function can be defined as the anti-derivative of the weight function. That is, $v(x) - v(x^{\prime}) = \int_{x^{\prime}}^{x} w(z) dz$ for all $x, x^{\prime} \in \mathbb{R}$. However, more generally, there is no canonical way to map a weight function $w$ to a chaining function $v$.

The vertically re-scaled version of a kernel score can be written as

\[ \mathrm{vr}S_{k}(F, y; w) = \frac{1}{2} k(y, y)w(y)^{2} + \frac{1}{2} \mathbb{E} k(X, X^{\prime})w(X)w(X^{\prime}) - \mathbb{E} k(X, y)w(X)w(y). \]

For any positive definite kernel $k$ on $\Omega$, the functions $\tilde{k}(x, x^{\prime}) = k(v(x), v(x^{\prime}))$ and $\check{k}(x, x^{\prime}) = k(x, x^{\prime})w(x)w(x^{\prime})$ are also positive definite kernels on $\Omega$ (Rasmussen and Williams, 2006). Hence, the threshold-weighted and vertically re-scaled versions of kernel scores are themselves kernel scores.

CRPS#

The CRPS is defined as

\[\begin{split} \begin{aligned} \mathrm{CRPS}(F, y) &= \int_{-\infty}^{\infty} (F(z) - \mathbb{1}\{y \le z\})^{2} dz \\ &= \mathbb{E}|X - y| - \frac{1}{2} \mathbb{E}|X - X^{\prime}|, \end{aligned} \end{split}\]

where $y \in \Omega \subseteq \mathbb{R}$, and $X, X^{\prime} \sim F$ are independent (Matheson and Winkler, 1976; Gneiting and Rafery, 2007).

Holzmann and Klar (2017) introduce the outcome-weighted CRPS as

\[\begin{split} \begin{aligned} \mathrm{owCRPS}(F, y; w) &= w(y) \int_{-\infty}^{\infty} (F_{w}(z) - \mathbb{1}\{y \le z\})^{2} dz \\ &= \frac{1}{\bar{w}_{F}}\mathbb{E}|X - y|w(X)w(y) - \frac{1}{2\bar{w}_{F}^{2}} \mathbb{E}|X - X^{\prime}|w(X)w(X^{\prime})w(y), \end{aligned} \end{split}\]

where $\bar{w}_{F} = \mathbb{E}[w(X)]$.

Gneiting and Ranjan (2011) introduce the threshold-weighted CRPS as

\[\begin{split} \begin{aligned} \mathrm{twCRPS}(F, y; w) &= \int_{-\infty}^{\infty} (F(x) - 1\{y \le x\})^{2} w(x) dx \\ &= \mathbb{E} | v(X) - v(y) | - \frac{1}{2} \mathbb{E} |v(X) - v(X^{\prime}) |, \end{aligned} \end{split}\]

where $v : \mathbb{R} \to \mathbb{R}$ is an anti-derivative of the weight function $w$ (Allen et al., 2024).

Allen et al. (2023) introduce the vertically re-scaled CRPS as

\[ \mathrm{vrCRPS}(F, y; w, x_{0}) = \mathbb{E} | X - y |w(X)w(y) - \frac{1}{2} \mathbb{E} | X - X^{\prime} | w(X)w(X^{\prime}) + \left( \mathbb{E}|X - x_{0}|w(X) - |y - x_{0}|w(y) \right) \left( \mathbb{E}w(X) - w(y) \right), \]

for some $x_{0} \in \mathbb{R}$. The canonical choice is $x_{0} = 0$, though if the weight function is of the form $w(x) = \mathbb{1}\{x > t\}$ or $w(x) = \mathbb{1}\{x < t\}$, for some threshold $t \in \mathbb{R}$, then setting $x_{0} = t$ recovers the threshold-weighted CRPS (Allen et al., 2024, Proposition 4.10).

Energy score#

The energy score is defined as

\[ \mathrm{ES}(F, \boldsymbol{y}) = \mathbb{E} \| \boldsymbol{X} - \boldsymbol{y} \| - \frac{1}{2} \mathbb{E} \| \boldsymbol{X} - \boldsymbol{X}^{\prime} \|, \]

where $\| \cdot \|$ is the Euclidean distance on $\mathbb{R}^{d}$, $\boldsymbol{y} \in \Omega \subseteq \mathbb{R}^{d}$, and $\boldsymbol{X}, \boldsymbol{X}^{\prime} \sim F$ are independent, with $F$ a multivariate predictive distribution on $\Omega$ (Gneiting and Raftery, 2007).

The outcome-weighted energy score is similarly defined as

\[ \mathrm{owES}(F, \boldsymbol{y}; w) = \frac{1}{\bar{w}_{F}} \mathbb{E} \| \boldsymbol{X} - \boldsymbol{y} \| w(\boldsymbol{X})w(\boldsymbol{y}) - \frac{1}{2\bar{w}_{F}^{2}} \mathbb{E} \| \boldsymbol{X} - \boldsymbol{X}^{\prime} \|w(\boldsymbol{X})w(\boldsymbol{X}^{\prime})w(\boldsymbol{y}), \]

where $\boldsymbol{X}, \boldsymbol{X}^{\prime} \sim F$ are independent, and $\bar{w}_{F} = \mathbb{E}[w(\boldsymbol{X})]$ for some weight function $w : \Omega \to [0, \infty)$ (Holzmann and Klar, 2017).

The threshold-weighted energy score constitutes a multivariate extension of the threshold-weighted CRPS,

\[ \mathrm{twES}(F, y\boldsymbol{y}; v) = \mathbb{E} | v(\boldsymbol{X}) - v(\boldsymbol{y}) | - \frac{1}{2} \mathbb{E} |v(\boldsymbol{X}) - v(\boldsymbol{X}^{\prime}) |, \]

where $v : \mathbb{R} \to \mathbb{R}$,

and the vertically re-scaled energy score constitutes a multivariate extension of the vertically re-scaled CRPS,

\[ \mathrm{vrES}(F, \boldsymbol{y}; w, \boldsymbol{x}_{0}) = \mathbb{E} | \boldsymbol{X} - \boldsymbol{y} |w(X)w(\boldsymbol{y}) - \frac{1}{2} \mathbb{E} | \boldsymbol{X} - \boldsymbol{X}^{\prime} | w(\boldsymbol{X})w(\boldsymbol{X}^{\prime}) + \left( \mathbb{E}|\boldsymbol{X} - \boldsymbol{x}_{0}|w(\boldsymbol{X}) - |\boldsymbol{y} - \boldsymbol{x}_{0}|w(\boldsymbol{y}) \right) \left( \mathbb{E}w(\boldsymbol{X}) - w(\boldsymbol{y}) \right) \]

(Allen et al., 2024). As with the vertically re-scaled CRPS, the canonical choice is $\boldsymbol{x}_{0} = \boldsymbol{0}$, though if the weight function is of the form $w(\boldsymbol{x}) = \mathbb{1}\{\boldsymbol{x} > \boldsymbol{x}\}$ or $w(\boldsymbol{x}) = \mathbb{1}\{\boldsymbol{x} < t\}$, for some multivariate threshold $\boldsymbol{x} \in \mathbb{R}$ (with $<$ and $>$ understood componentwise), then setting $\boldsymbol{x}_{0} = \boldsymbol{t}$ recovers the threshold-weighted energy score (Allen et al., 2024, Proposition 4.10).

Variogram score#

The Variogram score is defined as

\[ \mathrm{VS}_{p}(F, \boldsymbol{y}) = \sum_{i=1}^{d} \sum_{j=1}^{d} h_{i,j} \left( \mathbb{E} | X_{i} - X_{j} |^{p} - | y_{i} - y_{j} |^{p} \right)^{2}, \]

where $p > 0$, $\boldsymbol{y} = (y_{1}, \dots, y_{d}) \in \Omega \subseteq \mathbb{R}^{d}$, $\boldsymbol{X} = (X_{1}, \dots, X_{d}) \sim F$, with $F$ a multivariate predictive distribution on $\Omega$, and $h_{i,j} \ge 0$ are weights assigned to different pairs of dimensions (Scheuerer and Hamill, 2015).

Allen (2024) writes the outcome-weighted Variogram score as

\[ \mathrm{owVS}_{p}(F, \boldsymbol{y}; w) = w(\boldsymbol{y}) \sum_{i=1}^{d} \sum_{j=1}^{d} h_{i,j} \left( \frac{1}{\bar{w}_{F}} \mathbb{E} | X_{i} - X_{j} |^{p}w(\boldsymbol{X}) - | y_{i} - y_{j} |^{p} \right)^{2}. \]

Since the Variogram score is a kernel score, the threshold-weighted Variogram score is defined as

\[ \mathrm{twVS}_{p}(F, \boldsymbol{y}; v) = \sum_{i=1}^{d} \sum_{j=1}^{d} h_{i,j} \left(\mathbb{E} | v(\boldsymbol{X})_{i} - v(\boldsymbol{X})_{j} |^{p} - | v(\boldsymbol{y})_{i} - v(\boldsymbol{y})_{j} |^{p} \right)^{2}, \]

where $v(\boldsymbol{y}) = (v(\boldsymbol{y})_{1}, \dots, v(\boldsymbol{y})_{d}) \in \Omega \subseteq \mathbb{R}^{d}$ for some chaining function $v : \Omega \to \Omega$ (Allen et al., 2024).

The vertically-rescaled Variogram score is

\[ \mathrm{vrVS}_{p}(F, \boldsymbol{y}; w, x_{0}) = \mathbb{E} \left[ w(\boldsymbol{X})w(\boldsymbol{y}) \sum_{i=1}^{d} \sum_{j=1}^{d} h_{i,j} \left(| X_{i} - X_{j} |^{p} - | y_{i} - y_{j} |^{p} \right)^{2} \right] - \frac{1}{2} \mathbb{E} \left[ w(\boldsymbol{X})w(\boldsymbol{X^{\prime}}) \sum_{i=1}^{d} \sum_{j=1}^{d} h_{i,j} \left(| X_{i} - X_{j} |^{p} - | X_{i}^{\prime} - X_{j}^{\prime} |^{p} \right)^{2} \right] + \left( \mathbb{E} \left[ w(\boldsymbol{X}) \sum_{i=1}^{d} \sum_{j=1}^{d} h_{i,j} \left(| X_{i} - X_{j} |^{p} - | x_{0,i} - x_{0,j} |^{p} \right)^{2} \right] - w(\boldsymbol{y}) \sum_{i=1}^{d} \sum_{j=1}^{d} h_{i,j} \left(| y_{i} - y_{j} |^{p} - | x_{0,i} - x_{0,j} |^{p} \right)^{2} \right) \left( \mathbb{E}[w(\boldsymbol{X})] - w(\boldsymbol{y}) \right), \]

for some $\boldsymbol{x}_{0} = (x_{0,1}, \dots, x_{0,d}) \in \mathbb{R}^{d}$. Analogously to the vertically re-scaled energy score, the canonical choice is $\boldsymbol{x}_{0} = \boldsymbol{0}$, unless the weight function is of the form $w(\boldsymbol{x}) = \mathbb{1}\{\boldsymbol{x} > \boldsymbol{t}\}$ or $w(\boldsymbol{x}) = \mathbb{1}\{\boldsymbol{x} < t\}$, for some multivariate threshold $\boldsymbol{t} \in \mathbb{R}$ (with $<$ and $>$ understood componentwise), in which case setting $\boldsymbol{x}_{0} = \boldsymbol{t}$ recovers the threshold-weighted variogram score (Allen et al., 2024, Proposition 4.10).

Weight functions#

The weighted scoring rules introduced above are all recovered when the weight function is constant, e.g. $w(z) = 1$ for all $z \in \Omega$. Hence, weighted scoring rules increase the flexibility offered to practitioners when evaluating forecasts. The weight function can be chosen to direct the scoring rules to particular outcomes, so that poor forecasts for these outcomes are penalised more heavily than poor forecasts for other outcomes. In practice, it is often not trivial to decide which outcomes are “of interest”, and how much they should be emphasised when calculating the score; that is, what weight function should be used. This decision is very application-specific and it is therefore difficult to provide any general guidance regarding which weight function should be employed in practical applications. Nonetheless, we list some common weight functions below.

Univariate forecasts#

Suppose $\Omega \subseteq \mathbb{R}$. The most common weight function used in practice is an indicator weight function of the form

\[ w(z) = \mathbb{1}\{z > t\} \quad \text{or} \quad \mathbb{1}\{z < t\} \]

which restricts attention to outcomes above or below some relevant threshold $t \in \mathbb{R}$. The threshold $t$ is typically chosen to be a fairly extreme quantile of previously observed outcomes, though it could also correspond to a value used for policy making; a warning threshold, for example.

More generally, we can define indicator weights on arbitrary regions of $\Omega$. For example, we may be interested in values within a certain range,

\[ w(z) = \mathbb{1}\{a < z < b\} \quad \text{for some $-\infty \le a < b \le \infty$}, \]

which nests the above cases when $a = t$ and $b = \infty$ or when $a = -\infty$ and $b = t$; or in values that either fall below a low threshold or exceed a high threshold,

\[ w(z) = \mathbb{1}\{z < a\} + \mathbb{1}\{z > b\} \quad \text{for some $-\infty \le a < b \le \infty$}. \]

Alternatively, Gneiting and Ranjan (2011) suggested using normal density and distribution functions to define continuous weight functions that change smoothly over $\mathbb{R}$. Let $\phi_{\mu,\sigma}$ and $\Phi_{\mu, \sigma}$ denote the normal density and distribution functions respectively, with location parameter $\mu$ and scale parameter $\sigma$. The weight function

\[ w(z) = \Phi_{\mu, \sigma}(z) \]

assigns a higher weight to higher outcomes, with the weight increasing gradually from 0 to 1. This can be interpreted as a smoothed step function, where $\mu$ defines a threshold of interest, and $\sigma$ determines the speed at which the weight changes from 0 before the threshold to 1 afterwards; as $\sigma \to 0$, the weight function tends towards the indicator weight function $w(z) = \mathbb{1}\{z > \mu\}$, while a larger value of $\sigma$ increases the weight more slowly. Similarly, the weight function

\[ w(z) = 1 - \Phi_{\mu, \sigma}(z) \]

assigns a higher weight to smaller outcomes, with analogous interpretation of $\mu$ and $\sigma$.

Using the normal density function allows central values to be emphasised,

\[ w(z) = \phi_{\mu, \sigma}(z). \]

In this case, $\sigma$ controls the concentration of the weight function around $\mu$, with the weight function reducing to a point mass at $\mu$ as $\sigma \to 0$. Conversely, the lower and upper tails can be targeted simultaneously using

\[ w(z) = 1 - \frac{\phi_{\mu, \sigma}(z)}{\phi_{\mu, \sigma}(\mu)}. \]

While these all employ the normal density and distribution functions, the normal distribution could readily be replaced by any other location-scale family distribution. Allen (2024), for example, consider the same weights defined using the logistic distribution.

Multivariate forecasts#

Multivariate weight functions (when $\Omega \subseteq \mathbb{R}^{d}$) can be defined similarly. For example, it is common to employ an indicator weight function of the form

\[ w(\boldsymbol{z}) = \mathbb{1}\{z_{1} > t_{1}, \dots, z_{d} > t_{d}\} \]

to emphasise values above a threshold $t_{i} \in \mathbb{R}$ in each dimension. Different regions of the outcome space can again be targeted by interchanging $<$ and $>$ signs. More generally, this can be expressed as

\[ w(\boldsymbol{z}) = \mathbb{1}\{a_{1} < z_{1} < b_{1}, \dots, a_{d} < z_{d} < b_{d}\}, \]

for some $-\infty \le a_{i} < b_{i} \le \infty$ for $i = 1, \dots, d$.

Smooth weight functions can again be constructed using multivariate probability distributions. For example, let $\phi_{\boldsymbol{\mu}, \Sigma}$ and $\Phi_{\boldsymbol{\mu}, \Sigma}$ denote the density and distribution functions of a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. Then, high outcomes (in all dimensions) could be targeted using a weight function of the form

\[ w(\boldsymbol{z}) = \Phi_{\boldsymbol{\mu}, \Sigma}(\boldsymbol{z}). \]

The interpretation of $\boldsymbol{\mu}$ and $\Sigma$ is similar to in the univariate case: $\boldsymbol{\mu}$ can be thought of as a vector of thresholds, with $\Sigma$ controlling the rate at which the weight function increases from 0 to 1 along each dimension. Low outcomes can similarly be emphasised using

\[ w(\boldsymbol{z}) = 1 - \Phi_{\boldsymbol{\mu}, \Sigma}(\boldsymbol{z}), \]

while a high outcome in one dimension and a low outcome in another dimension could similarly be targeted using appropriate manipulations of the multivariate normal distribution. For example, when $d = 2$, the following weight function would target high values along the first dimension and low values along the second,

\[ w(\boldsymbol{z}) = \mathrm{P}(X_{1} \le z_{1}, X_{2} > z_{2}), \]

where $\boldsymbol{X} = (X_{1}, X_{2})$ follows a multivariate normal distribution.

The multivariate normal density function can similarly be used to target central values,

\[ w(\boldsymbol{z}) = \phi_{\boldsymbol{\mu}, \Sigma}(\boldsymbol{z}), \]

while tails (in all dimensions) can be emphasised using

\[ w(\boldsymbol{z}) = 1 - \phi_{\boldsymbol{\mu}, \Sigma}(\boldsymbol{z})/\phi_{\boldsymbol{\mu}, \Sigma}(\boldsymbol{\mu}). \]

As in the univariate case, these are just particular examples, and by no means an exhaustive list of the weight functions that should be employed in practice.

Chaining functions#

While most weighted scoring rules depend directly on a weight function, threshold-weighted scoring rules can be defined in terms of a transformation function, or chaining function $v : \Omega \to \Omega$. In the univariate case, a chaining function can be derived from a weight function. However, in the multivariate case, there is generally no canonical way to obtain a chaining function from a weight function. In the following, we discuss the choice of chaining function in more detail.

Univariate forecasts#

Due to the two representations of the threshold-weighted CRPS, it is straightforward to map a weight function to a chaining function in the univariate case ($\Omega \subseteq \mathbb{R}$). In particular, given a weight function $w : \Omega \to [0, \infty)$, the chaining function is simply an anti-derivative of the weight function. That is, the chaining function $v$ satisfied $v(x) - v(x^{\prime}) = \int_{x^{\prime}}^{x} w(z) dz$ for all $x, x^{\prime} \in \Omega$.

When $w$ is constant, we recover the identity function $v(z) = z$, for all $z \in \Omega$, and for all weight functions commonly used in practice, the chaining function is typically straightforward to calculate. Several weight and chaining function combinations are listed in Table 1 of Allen (2024).

For example, when $w(z) = \mathbb{1}\{z > t\}$, $t \in \mathbb{R}$, we get (up to an unimportant constant)

\[ v(z) = \max\{z, t\}, \]

and more generally for $w(z) = \mathbb{1}\{a < z < t\}$, $- \infty \le a < b \le \infty$ we have

\[ v(z) = \min\{\max\{z, a\}, b\}. \]

For $w(z) = \Phi_{\mu, \sigma}(z)$,

\[ v(z) = (z - \mu)\Phi_{\mu, \sigma}(z) + \sigma^{2} \phi_{\mu, \sigma}(z); \]

for $w(z) = 1 - \Phi_{\mu, \sigma}(z)$,

\[ v(z) = z - (z - \mu)\Phi_{\mu, \sigma}(z) - \sigma^{2} \phi_{\mu, \sigma}(z); \]

and for $w(z) = \phi_{\mu, \sigma}(z)$,

\[ v(z) = \Phi_{\mu, \sigma}(z). \]

Multivariate forecasts#

In the multivariate case $(\Omega \subseteq \mathbb{R}^{d})$, there is no default way to obtain a chaining function given a multivariate weight function. One approach is to find the anti-derivative of the weight function along each dimension separately. For example, for the weight function $ w(\boldsymbol{z}) = \mathbb{1}\{a_{1} < z_{1} < b_{1}, \dots, a_{d} < z_{d} < b_{d}\}, $ a possible chaining function is

\[ v(\boldsymbol{z}) = (\min\{\max\{z_{1}, a_{1}\}, b_{1}\}, . . . , \min\{\max\{z_{d}, a_{d}\}, b_{d}\}). \]

In this case, the weight function represents a box in multivariate space, and $v$ projects points not in the orthant onto its perimeter; the points inside the box, i.e., for which the weight function is equal to one, remain unchanged.

Similarly, for the smooth weight functions based on multivariate Gaussian distribution and density functions, a chaining function can be derived from a component-wise extension of the chaining functions corresponding to univariate Gaussian weight functions. For example, for $w(\boldsymbol{z}) = \Phi_{\boldsymbol{\mu}, \Sigma}(\boldsymbol{z})$, the chaining function would become

\[ v(\boldsymbol{z}) = ((z_{1} - \mu_{1})\Phi_{\mu_{1}, \sigma_{1}}(z_{1}) + \sigma_{1}^{2} \phi_{\mu_{1}, \sigma_{1}}(z_{1}), \dots, (z_{d} - \mu_{d})\Phi_{\mu_{d}, \sigma_{d}}(z_{d}) + \sigma_{d}^{2} \phi_{\mu_{d}, \sigma_{d}}(z_{d})), \]

where $\sigma_{1}, \dots, \sigma_{d}$ are the standard deviations along each dimension. This does not depend on the off-diagonal terms of the covariance matrix $\Sigma$, thereby implicitly assuming that it is diagonal. However, when the weight function is $w(\boldsymbol{z}) = \phi_{\boldsymbol{\mu}, \Sigma}(\boldsymbol{z})$, we can readily use

\[ v(\boldsymbol{z}) = \Phi_{\boldsymbol{\mu}, \Sigma}(\boldsymbol{z}). \]

A more detailed discussion on multivariate chaining functions is available in Allen et al. (2023).

Weighted scoring rules

Contents

Weighted scoring rules#

The forecaster’s dilemma#

Outcome-weighting (conditional scores)#

Threshold-weighting (censored scores)#

Vertical re-scaling#

Examples#

Log score#

Kernel scores#

CRPS#

Energy score#

Variogram score#

Weight functions#

Univariate forecasts#

Multivariate forecasts#

Chaining functions#

Univariate forecasts#

Multivariate forecasts#