Open access

Measuring statistical evidence and multiple testing

Publication: FACETS
25 May 2018

Abstract

The measurement of statistical evidence is of considerable current interest in fields where statistical criteria are used to determine knowledge. The most commonly used approach to measuring such evidence is through the use of p-values, even though these are known to possess a number of properties that lead to doubts concerning their validity as measures of evidence. It is less well known that there are alternatives with the desired properties of a measure of statistical evidence. The measure of evidence given by the relative belief ratio is employed in this paper. A relative belief multiple testing algorithm was developed to control for false positives and false negatives through bounds on the evidence determined by measures of bias. The relative belief multiple testing algorithm was shown to be consistent and to possess an optimal property when considering the testing of a hypothesis randomly chosen from the collection of considered hypotheses. The relative belief multiple testing algorithm was applied to the problem of inducing sparsity. Priors were chosen via elicitation, and sparsity was induced only when justified by the evidence and there was no dependence on any particular form of a prior for this purpose.

Introduction

The need for the measurement of statistical evidence arises as an issue in science as follows. The scientific problem under consideration concerns some quantity of interest for which an investigator either wants to know its value or has a hypothesis that the quantity takes a specific value and wants to know if this is true or false. To answer such a question data x are collected. It is rare that the data provide a definitive answer but it is believed that the data contain evidence concerning this. The purpose of statistical reasoning or inference is to use this evidence to estimate the quantity of interest and provide an assessment of the accuracy of the estimate or indicate whether there is evidence either in favor of or against the hypothesized value, and provide an assessment of the strength of this evidence.
To implement statistical inference, additional ingredients are required. First, it is presumed that the data x can be thought of as having arisen from a probability distribution as represented by the density f. Provided the data were collected properly, this assumption is reasonable and this is assumed here. A consequence of this is that the data can be thought of as being objective in the sense that f fully describes how the data were produced from the set X of possible data values. Of course, f is generally not known so it is assumed that f ∈ {fθ : θ ∈ Θ}, a family of probability densities on X referred to as the statistical model. Here θ is called the parameter and Θ the parameter space of the model. The quantity of interest is then represented as ψ = Ψ(θ), where Ψ:Θ → Ψ, and we don’t distinguish between the function and its range to save notation.
A natural approach to constructing a theory of inference is to determine a measure of the evidence in the data x that ψ is the true value for each ψ ∈ Ψ. The value ψ(x) ∈ Ψ that maximizes this measure of evidence is, then, the obvious estimate and a subset C(x) ⊂ Ψ of values with evidence measures above some threshold would, through a measure of its size, serve to give an assessment of the accuracy of ψ(x). For a null hypothesis H0 : Ψ(θ) = ψ0 the measure of evidence at ψ0 indicates whether there is evidence in favor of or against H0 and a measure of the strength of this evidence is then obtained by comparing the evidence at ψ0 with the evidence at each of the other possible values for ψ. A theory that accomplishes this, based on the relative belief ratio RB(ψ | x) as the measure of evidence, is described by Evans (2015) and outlined in the section “Statistical analysis based on relative belief”.
Even though p-values are commonly used to measure evidence, it has long been recognized that there are serious issues associated with their use (for example, see Royall (1997)). This can be readily observed by considering what the cutoff is to determine when there is evidence against or for a hypothesis. Cutoffs like 5% are not only arbitrary; moreover, many treatments of p-values insist that it is not possible for a p-value to give evidence in favor of a null hypothesis. Whether or not that position is accepted, it seems like a serious defect for a supposed measure of evidence. Even when a very small p-value is observed this does not mean that a result of scientific interest has been obtained. For, given the finite accuracy with which measurements are taken, it is rarely the case that the truth of H0 practically corresponds to an exact value ψ0. Rather, there is a region about ψ0 such that if the true value lies in this region, for all practical purposes, H0 is true. Using relative belief ratios, evidence can be obtained either for or against H0, there is a clear measure of the strength of the evidence, and the essential discreteness involved in assessing H0 is easily handled.
The theory of relative belief requires an additional ingredient, namely a prior probability distribution π must be specified on Θ that reflects the beliefs concerning what values of θ are more or less likely. The prior is determined by an elicitation algorithm that is an argument as to why the prior in question is to be considered suitable. The prior π is subjective in nature and that seems contrary to the dictates of science, which properly has objectivity as the goal. Although it doesn’t justify the use of priors, it is to be noted that the model {fθ : θ ∈ Θ} is also subjective as it is chosen by the investigator. One could argue in favor of this subjectivity, however, particularly when the choices are being made by an expert, as informed input should result in a better analysis, but doubts linger. Part of our approach to dealing with this concern is to check that any ingredient chosen is not contradicted by the objective data. Therefore, model checking and checking for prior–data conflict are necessary. Also, it is possible to choose a prior such that a desired result is obtained but such bias can be measured and controlled a priori by design. Some discussion on assessing prior–data conflict and bias is provided in the section “Statistical analysis based on relative belief”.
The focus of this paper is the following problem. Suppose Ψ is an open subset of Rk and we wish to assess the individual hypotheses H0i = {θ : Ψi(θ) = ψ0i}, namely H0i is the hypothesis that the i-th coordinate of ψ equals ψ0i. Considering these hypotheses separately is the multiple testing problem and the concern is to ensure that while controlling the individual error rate, the overall error rate is not too large. An error means either the acceptance of H0i when it is false (a false negative) or the rejection of H0i when it is true (a false positive). One approach is to make an inference about the number of H0i that are true (or false) and then use this to control the number of H0i that are accepted (or rejected). In the section “Inferences for multiple tests”, this is shown to work for small k but to fail for large k. As a remedy for this, a relative belief multiple testing algorithm is developed that controls for false positives and false negatives through the use of bounds on the evidence that are determined by the measurement of bias. This approach is shown to be consistent and to possess an optimal property when considering the assessment of a randomly selected hypothesis from the set of hypotheses.
In the section “Applications”, an application is made of the relative belief multiple testing algorithm to the problem of inducing sparsity. If it is known that ψi = Ψi(θ) = ψ0i, then the effective dimension of the quantity of interest is k − 1, which is a simplification of the model. Sometimes there is a belief that many of the hypotheses H0i are true, but there is little prior knowledge about which are true and it is, therefore, not clear how to choose a prior that reflects this belief. A common approach is to use a prior that, together with a particular estimation procedure, forces many of the ψi to take the corresponding value ψ0i. For example, the use of a Laplace prior together with using the maximum value of the posterior as the estimate, known as maximum a posteriori (MAP) estimation, is known to accomplish this for certain problems. Problems with this approach include the possibility that such an assignment is simply an artifact of the prior and the estimation procedure and that sparsity requires an overly concentrated prior that leads to prior–data conflict with the coordinates for which H0i is rejected. It would be preferable to have a procedure that was not dependent on a specific form for the prior, avoided prior–data conflict, and was based on the statistical evidence contained in the data, and this is the approach taken here. Practical applications are presented, with special emphasis on regression problems including the situation where the number of predictors exceeds the number of observations.
Evans (2015) noted that there are connections between relative belief and the pure likelihood approach to inference, as both consider statistical evidence as the core concept. This is also reflected in the approach to multiple testing developed in the current paper and that discussed by Strug and Hodge (2006a, 2006b). There have been several priors proposed for the sparsity problem through MAP estimation; for example, the spike-and-slab prior discussed by George and McCulloch (1993) and Rockova and George (2014), the Laplace prior discussed by Park and Casella (2008), and the horseshoe prior of Carvalho et al. (2009). Any prior can be used with the approach taken here, but logically an elicited prior is preferred over one possessing certain properties.

Statistical analysis based on relative belief

Suppose that interest is in inference about the quantity Ψ(θ) = ψ. Let ΠΨ denote the prior measure of ψ, with density πΨ, and let ΠΨ(⋅ | x) denote the posterior measure of ψ, with density πΨ(⋅ | x). Evidence is measured by change in belief (for example, see Salmon (1973) or Howson and Urbach (2006)), thus if belief in ψ increases there is evidence in favor of this value and evidence against it if belief decreases. Evans (2015) argued for the relative belief ratio RBΨ(ψ | x) = limδ→0ΠΨ(Nδ(ψ) | x)/ΠΨ(Nδ(ψ)) as a measure of evidence, where Nδ(ψ) is a sequence of neighborhoods of ψ converging (nicely, as defined by Rudin (1974)) to {ψ} as δ → 0. When πΨ and πΨ(⋅ | x) are continuous at ψ, then
$$RB_\Psi(\psi \mid x) = \pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) \qquad (1)$$
So RBΨ(ψ | x) > 1 indicates evidence in favor of ψ, RBΨ(ψ | x) < 1 indicates evidence against, and RBΨ(ψ | x) = 1 gives no evidence either way. Any 1−1 increasing function of RBΨ(⋅ | x) is an equivalent measure of evidence and RBΨ(⋅ | x) is invariant under smooth reparameterizations, thus relative belief inferences are invariant to these choices.
The best estimate of ψ is the value that maximizes the evidence, namely ψ(x) = arg sup RBΨ(ψ | x). Associated with this is a γ-credible region CΨ,γ(x) = {ψ : RBΨ(ψ | x) ≥ cΨ,γ(x)} containing those values whose evidence is above the threshold cΨ,γ(x) = inf{k : ΠΨ(RBΨ(ψ | x) > k | x) ≤ γ}. As ψ(x) ∈ CΨ,γ(x) for every γ ∈ [0, 1], the “size” of CΨ,γ(x) for a selected γ is a measure of the accuracy of ψ(x). A calibration of RBΨ(ψ0 | x) is given by the strength
$$\Pi_\Psi(RB_\Psi(\psi \mid x) \le RB_\Psi(\psi_0 \mid x) \mid x) \qquad (2)$$
When RBΨ(ψ0 | x) < 1, a small value of eq. (2) indicates a large posterior belief that the true value has a relative belief ratio greater than RBΨ(ψ0 | x), and therefore there is strong evidence against ψ0 but only weak evidence against it if eq. (2) is big. If RBΨ(ψ0 | x) > 1, a large value of eq. (2) indicates a small posterior probability that the true value has a relative belief ratio greater than RBΨ(ψ0 | x), and therefore there is strong evidence in favor of ψ0, whereas a small value of eq. (2) only indicates weak evidence in favor of ψ0. A variety of optimality and consistency results have been established for these inferences (see Evans (2015)).
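To make the preceding definitions concrete, the following minimal sketch computes the relative belief estimate, a γ-credible region, and the strength in eq. (2) on a discretized parameter. It is an illustration only; the array names psi_grid, prior, post and the index i0 of the hypothesized value are assumptions, not notation from the paper.

```python
import numpy as np

# A minimal sketch of relative belief inferences over a discretized parameter:
# psi_grid, prior, and post are hypothetical arrays holding the grid of psi
# values and the prior and posterior probabilities of each grid point.
def relative_belief_summary(psi_grid, prior, post, i0, gamma=0.95):
    rb = post / prior                      # relative belief ratios, eq. (1)
    psi_hat = psi_grid[np.argmax(rb)]      # the relative belief estimate
    # gamma-credible region: values taken in decreasing order of evidence
    # until their posterior content reaches gamma
    order = np.argsort(-rb)
    cum = np.cumsum(post[order])
    region = np.sort(psi_grid[order[: np.searchsorted(cum, gamma) + 1]])
    strength = post[rb <= rb[i0]].sum()    # strength of evidence at psi_0, eq. (2)
    return psi_hat, region, strength
```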
When H0 : Ψ(θ) = ψ0 is false both eqs. (1) and (2) converge to 0, and when H0 is true then eq. (1) converges to the maximum possible value, which is always >1. When H0 is true and there are only a finite number of possible values for ψ then eq. (2) converges to 1, but in the continuous case eq. (2) can converge in distribution to a U(0, 1) random variable. The view is taken here, however, that any time continuous probability is used this is an approximation to a finite, discrete context. For example, if ψ is a mean and the response measurements are to the nearest centimeter, then of course the true value of ψ cannot be known to an accuracy better than 0.5 cm, no matter how large the sample is. Furthermore, there are implicit bounds associated with any measurement process. As such, the restriction can be made to discretized parameters that take only a finite number of values. Thus, when ψ is a continuous, real-valued parameter, it is discretized to the intervals …, (ψ0 − 3δ/2, ψ0 − δ/2], (ψ0 − δ/2, ψ0 + δ/2], (ψ0 + δ/2, ψ0 + 3δ/2], … for some choice of δ > 0, and there are only a finite number of such intervals covering the range of possible values. With this discretization, then H0 = (ψ0 − δ/2, ψ0 + δ/2] and eq. (2) is consistent. Thus, δ needs to be specified as part of the application, at least when the goal is assessing the evidence concerning H0. The value of δ is simply the smallest difference from ψ0 that matters in the application and presumably a knowledgeable scientist knows what this is and designs the measurement process that produces the data accordingly.
Let A ⊂ X be such that H0 is accepted whenever x ∈ A; thus, with M(⋅ | H0) denoting the prior predictive measure given that H0 is true, M(A | H0) is the prior probability of accepting H0 when it is true. The relative belief acceptance region is Arb(ψ0) = {x : RBΨ(ψ0 | x) > 1}. Let R ⊂ X be such that H0 is rejected whenever x ∈ R; the relative belief rejection region is Rrb(ψ0) = {x : RBΨ(ψ0 | x) < 1}. Letting M denote the unconditional prior predictive measure, the following result was proved by Evans (2015).
Theorem 1: (i) The acceptance region Arb(ψ0) minimizes M(A) among all acceptance regions A, satisfying M(A | H0) ≥ M(Arb(ψ0) | H0). (ii) The rejection region Rrb(ψ0) maximizes M(R) among all rejection regions R, satisfying M(R | H0) ≤ M(Rrb(ψ0) | H0).
The implication of this is that, when ΠΨ({ψ0}) = 0, then Arb(ψ0) minimizes the prior probability that H0 is accepted given that it is false among all acceptance regions A satisfying the condition in (i) and Rrb(ψ0) maximizes the prior probability that H0 is rejected given that it is false among all rejection regions R satisfying the condition in (ii). The same result holds for the case when ΠΨ({ψ0}) > 0 with the inequalities in (i) and (ii) replaced by equalities. Under independent identically distributed (IID) sampling, M(Arb(ψ0) | H0) → 1 and M(Rrb(ψ0) | H0) → 0 as sample size increases, so these quantities can be controlled by design. Theorem 1 can be generalized to obtain optimality results for the acceptance region Arb, q(ψ0) = {x : RBΨ(ψ0 | x) > q} and the rejection region Rrb, q(ψ0) = {x : RBΨ(ψ0 | x) < q}. The following inequality is useful in the section “Inferences for multiple tests” in controlling error rates.
Theorem 2: M(Rrb, q(ψ0) | ψ0) ≤ q.
Proof: By the Savage–Dickey result (see proposition 4.2.7 in Evans (2015)), RBΨ(ψ0 | x) = m(x | ψ0)/m(x). Now E_{M(⋅ | ψ0)}(m(x)/m(x | ψ0)) = 1, and therefore, by Markov’s inequality, M(Rrb,q(ψ0) | ψ0) = M(m(x)/m(x | ψ0) > 1/q | ψ0) ≤ q.
One of the key concerns with Bayesian inference methods is that the prior can bias the analysis. Given a measure of evidence, however, it is possible to measure and control bias. The bias against H0 is given by M(RBΨ(ψ0 | x) ≤ 1 | ψ0) = 1 − M(Arb(ψ0) | ψ0) as this is the prior probability that evidence will be obtained against H0 when it is true. If the bias against H0 is large, subsequently reporting, after seeing the data, that there is evidence against H0 is not convincing. The bias in favor of H0 is given by M(RBΨ(ψ0 | x) ≥ 1 | ψ0*) for values ψ0* ≠ ψ0 such that the difference between ψ0* and ψ0 represents the smallest difference of practical importance; note that this bias tends to decrease as ψ0* moves farther away from ψ0. When the bias in favor is large, subsequently reporting, after seeing the data, that there is evidence in favor of H0 is not convincing. For a fixed prior, both biases decrease with sample size and thus, in design situations, they can be used to set sample size and thereby control bias.
It is never known that the ingredients chosen for a statistical analysis are correct, but hopefully these serve as useful approximations in the sense that inferences drawn from them are reasonably accurate. If x lies in the tails of fθ for every θ ∈ Θ, then it can be concluded that there is a problem with the model and it needs to be modified. It is clear that checking the prior is a meaningless activity if the model is to be discarded, thus model checking is carried out first. If the model passes, then the prior is checked and the approach of Evans and Moshonov (2006) is adopted here. For this let T be a minimal sufficient statistic (MSS) for the model with density mT, and if the probability MT(mT(t) ≤ mT(T(x))) is small, then conclude a prior–data conflict exists as this says that T(x) lies in the tails of the prior-predictive. The consistency of this procedure was established by Evans and Jang (2011a) as, under weak conditions this probability converges to Π(π(θ) ≤ π(θtrue)), and a methodology for modifying a prior that fails its checks was developed by Evans and Jang (2011b).

Inferences for multiple tests

Consider now the multiple testing problem. The typical approach to this problem relies on the use of p-values that, for the reasons discussed, are not adopted here. Rather, the relative belief ratio as a valid measure of statistical evidence is used as the basis for all inferences.
To see what the problem is with multiple testing suppose that Ψi is finite for each i, perhaps arising via a discretization as discussed in the section “Statistical analysis based on relative belief”, and let $\xi = \Xi(\theta) = k^{-1}\sum_{i=1}^{k} I_{H_{0i}}(\Psi_i(\theta))$ be the proportion of the hypotheses H0i that are true. Note that the discreteness is essential, otherwise, under a continuous prior on Ψ, the prior distribution of Ξ(θ) is degenerate at 0. In an application it is desirable to make inference about the true value of ξ ∈ Ξ = {0, 1/k, 2/k,…, 1} and this is based on the relative belief ratio RBΞ(ξ | x) = Π(Ξ(θ) = ξ | x)/Π(Ξ(θ) = ξ). The appropriate estimate of ξ is ξ(x) = arg supξ RBΞ(ξ | x) and its accuracy is assessed using the size of CΞ,γ(x) for some choice of γ ∈ [0, 1]. A hypothesis such as H0 = {θ : Ξ(θ) ∈ [ξ0, ξ1]}, namely that the proportion true is at least ξ0 and no greater than ξ1, is assessed using the relative belief ratio RB(H0 | x) = Π(ξ0 ≤ Ξ(θ) ≤ ξ1 | x)/Π(ξ0 ≤ Ξ(θ) ≤ ξ1), which equals RBΞ(ξ0 | x) when ξ0 = ξ1.
The estimate ξ(x) can be used to control how many hypotheses are potentially accepted. For this, select kξ(x) of the H0i as being true from among those for which RBΨi(ψ0i | x) > 1. Note that it does not make sense to accept H0i when RBΨi(ψ0i | x) < 1 as there is evidence against H0i. Thus, if there are fewer than kξ(x) satisfying RBΨi(ψ0i | x) > 1, then fewer than this number should be accepted. If there are more than kξ(x) of the relative belief ratios satisfying RBΨi(ψ0i | x) > 1, then some method will have to be used to select the kξ(x) that are potentially accepted. It is clear, however, that the logical way to do this is to order the H0i, for which RBΨi(ψ0i | x) > 1, based on their strengths ΠΨi(RBΨi(ψi | x) ≤ RBΨi(ψ0i | x) | x), from largest to smallest, and accept at most the kξ(x) for which the evidence is strongest, as sketched in the code below. If control is desired of the number of false positives then the relevant parameter of interest is υ = ϒ(θ) = 1 − Ξ(θ), the proportion of false hypotheses. Note that Π(ϒ(θ) = υ) = Π(Ξ(θ) = 1 − υ), and therefore the relative belief estimate of υ satisfies υ(x) = 1 − ξ(x). Following the same procedure, the H0i with RBΨi(ψ0i | x) < 1 are ranked via their strengths and at most kυ(x) are rejected. This procedure will be referred to as the multiple testing algorithm.
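The following minimal sketch illustrates the acceptance step; the arrays rb and strength for the k hypotheses are hypothetical inputs, and rejection is analogous, using the hypotheses with RBi(ψ0i | x) < 1 ordered by increasing strength with n_accept replaced by kυ(x).

```python
# Sketch of the multiple testing algorithm: among the hypotheses with evidence
# in favor (rb[i] > 1), accept at most n_accept = k * xi(x) of them, taking
# those with the largest strengths (strongest evidence in favor) first.
def select_accepted(rb, strength, n_accept):
    in_favor = [i for i in range(len(rb)) if rb[i] > 1.0]
    in_favor.sort(key=lambda i: strength[i], reverse=True)  # strongest first
    return in_favor[:n_accept]
```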
The consistency of the multiple testing algorithm follows from results proved by Evans (2015) (see section 4.7.1 therein) under IID sampling. In other words, as the amount of data increases, ξ(x) converges to the proportion of H0i that are true, each RB(ψ0i | x) converges to the largest possible value (always >1) when H0i is true and converges to 0 when H0i is false, and the evidence in favor or against converges to the strongest possible, depending on whether the hypothesis in question is true or false.
The following example demonstrates the characteristics of the algorithm.
Example 1. Location normal.
Suppose that there are k independent samples xij for 1 ≤ i ≤ k, 1 ≤ j ≤ n, where the i-th sample is from a N(μi, σ2) distribution with μi unknown and σ2 known. It is desired to assess the evidence as to whether or not H0i : μi = μ0 is true for i = 1,…, k. It is easy to modify our development to allow the sample sizes to vary, and the case where σ2 is unknown is considered in the section “Applications”. This context is relevant to the analysis of microarray data. The statistic T(x) = (x̄1,…, x̄k) is an MSS for this model, and thus a natural model checking procedure is to compare the observed value of the statistic $\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_i)^2/\sigma^2$ to the chi-squared(k(n − 1)) distribution.
For the prior, the μ1,…, μk are taken to be IID from a N(μ0, λ0²σ²) distribution. The value of λ0² is determined via elicitation. For this it is supposed that it is known with virtual certainty that each μi ∈ (ml, mu) for specified values ml ≤ mu. Here, virtual certainty is interpreted to mean that the prior probability of this interval is at least γ, where γ is a large probability like 0.99. It is also supposed that μ0 = (ml + mu)/2. This implies that λ0 = (mu − ml)/(2σΦ−1((1 + 0.99)/2)). Following Evans and Jang (2011b), increasing the value of λ0 implies a more weakly informative prior in this context and, as such, decreases the possibility of prior–data conflict, and this indicates how the prior is to be modified in case of prior–data conflict. Note that this elicitation argument also specifies μ0 when this is not predetermined. The prior distribution of T is Nk(μ0 1k, σ²(λ0² + 1/n)Ik), where 1k is the k-dimensional vector of 1s and Ik is the k × k identity matrix, and therefore the check on the prior becomes the probability $P(\chi_k^2 \ge \sum_{i=1}^{k}(\bar{x}_i - \mu_0)^2/\sigma^2(\lambda_0^2 + 1/n))$, where χk² ∼ chi-squared(k).
The posteriors of the μi are independent with μi | x ∼ N(μi(x), (nλ0² + 1)−1λ0²σ²), where μi(x) = (n + 1/λ0²)−1(nx̄i + μ0/λ0²). Given that the measurements are taken to finite accuracy it is not realistic to test μi = μ0. A value δ > 0 is specified so that H0i = (μ0 − δ/2, μ0 + δ/2] in a discretization of R1 into a finite number of intervals, each of length δ, as well as two tail intervals. For some D ∈ N there are 2D + 1 intervals Id = (μ0 + (d − 1/2)δ, μ0 + (d + 1/2)δ] for d ∈ {−D, −D + 1,…, D} that span (ml, mu), together with the tail intervals (−∞, μ0 − (D + 1/2)δ] and (μ0 + (D + 1/2)δ, ∞). Then

$$RB_i(I_d \mid x) = \{\Phi((d + 1/2)\delta/\lambda_0\sigma) - \Phi((d - 1/2)\delta/\lambda_0\sigma)\}^{-1} \times \{\Phi((n\lambda_0^2 + 1)^{1/2}(\mu_0 + (d + 1/2)\delta - \mu_i(x))/\lambda_0\sigma) - \Phi((n\lambda_0^2 + 1)^{1/2}(\mu_0 + (d - 1/2)\delta - \mu_i(x))/\lambda_0\sigma)\}$$

with a similar formula for the tail intervals. When δ is small this is approximated by the ratio of the posterior to prior densities of μi evaluated at μ0 + dδ. Then RBi(I0 | x) = RBi(H0i | x) gives the evidence for or against H0i and the strength of this evidence is computed using the discretized posterior distribution. Notice that RBi(H0i | x) converges to ∞ as λ0 → ∞ and this is characteristic of other measures of evidence such as Bayes factors. As discussed by Evans (2015), this is one of the reasons why calibrating eq. (1) via eq. (2) is necessary.
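These closed forms are straightforward to compute. The following is a rough sketch, with the default λ0 = 1.94 taken from the simulation below; all argument names are assumptions.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the discretized relative belief ratio RB_i(I_d | x) in Example 1:
# the ratio of the posterior to prior probability of the interval
# I_d = (mu0 + (d - 1/2)delta, mu0 + (d + 1/2)delta]; sigma is known.
def rb_interval(xbar, n, d, delta, mu0=0.0, sigma=1.0, lam0=1.94):
    post_mean = (n + 1 / lam0**2) ** (-1) * (n * xbar + mu0 / lam0**2)
    post_sd = lam0 * sigma / np.sqrt(n * lam0**2 + 1)
    lo = mu0 + (d - 0.5) * delta
    hi = mu0 + (d + 0.5) * delta
    prior_prob = norm.cdf(hi, mu0, lam0 * sigma) - norm.cdf(lo, mu0, lam0 * sigma)
    post_prob = norm.cdf(hi, post_mean, post_sd) - norm.cdf(lo, post_mean, post_sd)
    return post_prob / prior_prob

# the evidence concerning H0i is rb_interval(xbar_i, n, 0, delta)
```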
Now, consider the bias in the prior. To simplify matters, the continuous approximation is used as this makes little difference here (see Tables 3 and 4). The bias against μi = μ0 equals
$$M(RB_i(\mu_0 \mid x) \le 1 \mid \mu_0) = 2(1 - \Phi(a_n(1))) \qquad (3)$$

where

$$a_n(q) = \begin{cases} \{(1 + 1/n\lambda_0^2)\log((n\lambda_0^2 + 1)/q^2)\}^{1/2}, & q^2 \le n\lambda_0^2 + 1 \\ 0, & q^2 > n\lambda_0^2 + 1 \end{cases}$$
Note that eq. (3) converges to 2(1 − Φ(1)) = 0.32 as λ0 → 0 and to 0 as λ0 → ∞ and, for fixed λ0, converges to 0 as n → ∞. Thus, there is never strong bias against μi = μ0; this is as expected because the prior is centered on μ0. The bias in favor of μi = μ0 is measured by
$$M(RB_i(\mu_0 \mid x) \ge 1 \mid \mu_0 \pm \delta/2) = \Phi(\sqrt{n}\,\delta/2\sigma + a_n(1)) - \Phi(\sqrt{n}\,\delta/2\sigma - a_n(1)) \qquad (4)$$
As λ0 → ∞ eq. (4) converges to 1, thus there is bias in favor of μi = μ0 and this reflects what was obtained for the limiting value of RBi(H0i | x). Also, eq. (4) decreases with increasing δ and goes to 0 as n → ∞; thus, bias of both types can be controlled by sample size. Perhaps the most important takeaway from this discussion, however, is that by using a supposedly noninformative prior with λ0 large, bias in favor of the H0i is being induced.
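Both bias measures have the closed forms above and can be evaluated directly. A minimal sketch, assuming the continuous approximation:

```python
import numpy as np
from scipy.stats import norm

# Sketch of the bias computations in Example 1: a_n(q) as defined in the text,
# the bias against H0i (eq. (3)), and the bias in favor of H0i (eq. (4)).
def a_n(q, n, lam0):
    bound = n * lam0**2 + 1
    if q**2 > bound:
        return 0.0
    return np.sqrt((1 + 1 / (n * lam0**2)) * np.log(bound / q**2))

def bias_against(n, lam0):
    return 2 * (1 - norm.cdf(a_n(1.0, n, lam0)))       # eq. (3)

def bias_in_favor(n, lam0, delta, sigma=1.0):
    z = np.sqrt(n) * delta / (2 * sigma)
    a = a_n(1.0, n, lam0)
    return norm.cdf(z + a) - norm.cdf(z - a)           # eq. (4)
```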
Consider, first, a simulated data set x when k = 10, n = 5, σ = 1, δ = 1, μ0 = 0, (ml, mu) = (−5, 5), so that λ0 = 10/(2Φ−1(0.995)) = 1.94, and suppose μ1 = μ2 = … = μ7 = 0, with the remaining μi = 2. The relative belief ratio function RBΞ(⋅ | x) is plotted in Fig. 1. In this case, the relative belief estimate ξ(x) = 0.70 is exactly correct. Table 1 gives the values of the RBi(0 | x) together with their strengths. It is clear that the multiple testing algorithm leads to 0 false positives and 0 false negatives. Therefore, the algorithm works perfectly on these data, but of course it can’t be expected to perform as well when the three nonzero means move closer to 0. Also, it is worth noting that the strength of the evidence in favor of μi = 0 is very strong for i = 1, 2, 3, 5, 6, 7, but only moderate when i = 4. The strength of the evidence against μi = 0 is very strong for i = 8, 9, 10. The maximum possible value of RBi((μ0 − δ/2, μ0 + δ/2] | x) is (2Φ(δ/2λ0σ) − 1)−1 = 4.92, thus some of the relative belief ratios are relatively large.
Fig. 1. A plot of the relative belief ratio of Ξ when n = 5, k = 10, and 7 means equal 0 with the remaining means equal to 2 in Example 1 with δ = 1.
Table 1. Relative belief ratios and strengths for the μi in Example 1 with k = 10, δ = 1.0.

i             1      2      3      4      5
μi            0      0      0      0      0
RBi(0 | x)    3.27   3.65   2.98   1.67   3.57
Strength      1.00   1.00   1.00   0.37   1.00

i             6      7      8            9            10
μi            0      0      2            2            2
RBi(0 | x)    3.00   3.43   2.09 × 10−4  3.99 × 10−4  8.80 × 10−3
Strength      1.00   1.00   4.25 × 10−5  8.11 × 10−5  1.83 × 10−3
To investigate sensitivity to the choice of δ several smaller values were considered. Table 2 gives the relevant entries for the same sample as Table 1 when δ = 0.5. The relative belief ratios do not change by much and still give evidence in the right direction. Some of the strengths do change, particularly for i = 1 and i = 6, which now indicate a bit weaker evidence in favor. In this case, ξ(x) = 0.60. Repeating these calculations with δ = 0.1 gives similar results, with the relative belief ratios staying about the same but the strengths getting weaker, and now ξ(x) = 0.50. The insensitivity of the RBi to δ is expected, as the data should increase belief in the interval (μ0 − δ/2, μ0 + δ/2] when H0i is true and decrease it when it is false. It is to be noted, however, that δ is not a tuning parameter of the algorithm but is determined by scientific knowledge in the application as the smallest difference from μ0 of practical importance.
Table 2. Relative belief ratios and strengths for the μi in Example 1 with k = 10, δ = 0.5.

i             1      2      3      4      5
μi            0      0      0      0      0
RBi(0 | x)    3.58   4.17   3.15   1.43   4.64
Strength      0.62   1.00   0.59   0.26   1.00

i             6      7      8            9            10
μi            0      0      2            2            2
RBi(0 | x)    3.18   3.83   3.25 × 10−5  6.76 × 10−5  2.37 × 10−3
Strength      0.59   1.00   3.30 × 10−6  7.00 × 10−6  2.47 × 10−4
Now, consider basically the same context but with k = 1000, μ1 = … = μ700 = 0 and the remaining μi = 2. In this case, ξ(x) = 0.47, which is a serious underestimate. As such, the multiple testing algorithm will not record enough acceptances and will fail. This problem arises due to the independence of the μi, since the prior distribution of kΞ(θ) is binomial(k, 2Φ(δ/2λ0σ) − 1) and the prior distribution of kϒ(θ) is binomial(k, 2(1 − Φ(δ/2λ0σ))). Thus, the a priori expected proportion of true hypotheses is 2Φ(δ/2λ0σ) − 1 and the expected proportion of false hypotheses is 2(1 − Φ(δ/2λ0σ)). When δ/2λ0σ is small, as when the amount of sampling variability or the diffuseness of the prior is large, then the prior on Ξ suggests a belief in many false hypotheses. When k is small, the data can override this to produce accurate inferences about ξ or υ, but otherwise, large amounts of data are needed that may not be available. Contrary to what is sometimes claimed, testing multiple hypotheses is also a problem in a Bayesian framework.
Example 1 makes it clear that, in general, accurate inference about ξ and υ is not feasible in high-dimensional contexts without large amounts of data. Rather than focus on estimating the proportion of true or false hypotheses, however, we consider an approach designed to protect against false positives or false negatives. It is often the case that when evidence against a hypothesis is obtained it prompts some kind of action, and a user may wish to prevent too many that are spurious. Alternatively, the user may be concerned with too many false negatives, as this may conceal a discovery of real value.
The entries in Tables 1 and 2 point to a feasible approach to these problems by focusing instead on the evidence concerning the individual μi, as these parameters do not depend on high-dimensional aspects of the full model parameter like ξ and υ do. To control the actions taken based on the evidence, constants qR and qA, where 0 < qR ≤ 1 ≤ qA, are used as follows: classify H0i as accepted when RBi(ψ0i | x) > qA and as rejected when RBi(ψ0i | x) < qR. Note that those accepted always have evidence in favor, whereas those rejected always have evidence against. The strengths can also be quoted to assess the reliability of these inferences. Provided qR is greater than the minimum possible value of RBi(⋅ | x), and this is typically 0, and the qA chosen is less than the maximum possible value of RBi(ψ0i | x), and this is 1 over the prior probability of H0i, then this procedure is consistent as the amount of data increases. In fact, the related estimates of ξ and υ are also consistent. The price paid for this is that a hypothesis will not be classified whenever qR ≤ RBi(ψ0i | x) ≤ qA. Not classifying a hypothesis implies that there is not enough evidence for this purpose and more data are required. This approach is referred to as the relative belief multiple testing algorithm.
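A minimal sketch of this classification rule follows; rb is a hypothetical array of the k relative belief ratios RBi(ψ0i | x).

```python
# Sketch of the relative belief multiple testing algorithm: each hypothesis is
# accepted, rejected, or left unclassified according to the cutoffs
# 0 < q_R <= 1 <= q_A on its relative belief ratio.
def classify(rb, q_R, q_A):
    labels = []
    for value in rb:
        if value > q_A:
            labels.append("accept")          # evidence in favor beyond q_A
        elif value < q_R:
            labels.append("reject")          # evidence against beyond q_R
        else:
            labels.append("unclassified")    # more data are required
    return labels
```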
It remains to determine qA and qR. Consider, first, protecting against too many false positives. The a priori conditional prior probability, given that H0i is true, of finding evidence against H0i less than qR satisfies M(RBi(ψ0i | X) < qR | ψ0i) ≤ qR by Theorem 2. Naturally, we want the probability of a false positive to be small, and choosing qR small accomplishes this. The a priori probability that a randomly selected hypothesis produces a false positive is
$$\frac{1}{k}\sum_{i=1}^{k} M(RB_i(\psi_{0i} \mid X) < q_R \mid \psi_{0i}) \qquad (5)$$
which is bounded above by qR and thus converges to 0 as qR → 0. Also, for fixed qR, eq. (5) converges to 0 as the amount of data increases. More generally qR can be allowed to depend on i, but when the ψi are similar in nature this does not seem necessary. Furthermore, it is not necessary to weight the hypotheses equally, so a hypothesis chosen randomly with unequal probabilities could be relevant in certain circumstances. In any case, controlling the value of eq. (5), whether by sample size or by the choice of qR, is clearly controlling for false positives. Suppose there is a proportion pFP of false positives that is just tolerable in a problem. Then, qR can be chosen so that eq. (5) is less than or equal to pFP; note that qR = pFP satisfies this.
Similarly, if ψ0i* ≠ ψ0i, then M(RBi(ψ0i | X) > qA | ψ0i*) is the prior probability of accepting H0i when ψ0i* is the true value. For a given effect size δ of practical importance it is natural to take ψ0i* = ψ0i ± δ/2. In typical applications this probability decreases the “farther” ψ0i* is from ψ0i, and choosing qA to make this probability small will make it small for all meaningful alternatives. Under these circumstances the a priori probability that a randomly selected hypothesis produces a false negative is bounded above by
$$\frac{1}{k}\sum_{i=1}^{k} M(RB_i(\psi_{0i} \mid X) > q_A \mid \psi_{0i}^*) \qquad (6)$$
As qA → ∞, or as the amount of data increases with qA fixed, eq. (6) converges to 0 and the number of false negatives can be controlled. If there is a proportion pFN of false negatives that is just tolerable in a problem, then qA can be chosen so that eq. (6) is less than or equal to pFN.
The following optimality result holds for relative belief multiple testing.
Corollary 1: (i) Among all procedures for which the prior probability of accepting H0i, when it is true, is at least M(RBi(ψ0i | X) > qA | ψ0i) for i = 1,…, k, the relative belief multiple testing algorithm minimizes the prior probability that a randomly chosen hypothesis is accepted. (ii) Among all procedures for which the prior probability of rejecting H0i, when it is true, is less than or equal to M(RBi(ψ0i | X) < qR | ψ0i) for i = 1,…, k, the relative belief multiple testing algorithm maximizes the prior probability that a randomly chosen hypothesis is rejected.
Proof: For (i) consider a procedure for multiple testing and let Ai be the set of data values where H0i is accepted. Then, by hypothesis M(RBi(ψ0i | X) > qA | ψ0i) ≤ M(Ai | ψ0i) and by the analog of Theorem 1, M(Ai) ≥ M(RBi(ψ0i | X) > qA). Applying this to a randomly chosen H0i gives the result. The proof of (ii) is basically the same.
Applying the same discussion as after Theorem 1, it is seen that, under reasonable conditions, the relative belief multiple testing algorithm minimizes the prior probability of accepting a randomly chosen H0i when it is false and maximizes the prior probability of rejecting a randomly chosen H0i when it is false. This establishes an optimality result for the relative belief multiple testing algorithm.
Consider now the application of the relative belief multiple testing algorithm in the previous example.
Example 2. Location normal example, continued.
In this context, M(RBi(μ0 | x) < qR | μ0) = 2(1 − Φ(an(qR))) for all i, and this is therefore the value of eq. (5), so qR is chosen to make this number suitably small. Table 3 records values of eq. (5) for both the continuous and discretized cases. From this it is seen that for small n there can be some bias against H0i when qR = 1, and thus the prior probability of obtaining false positives is perhaps too large. Table 3 demonstrates that choosing a smaller value of qR can adequately control the prior probability of false positives.
Table 3. Prior probability that a randomly chosen hypothesis produces a false positive when δ/σ = 1, continuous and discretized (in parentheses) versions, in Example 2.

n   λ0   qR     eq. (5)          n   λ0   qR     eq. (5)
1   1    1      0.239 (0.228)    5   1    1      0.143 (0.097)
1   1    1/2    0.041 (0.030)    5   1    1/2    0.051 (0.022)
1   1    1/10   0.001 (0.000)    5   1    1/10   0.006 (0.001)
1   2    1      0.156 (0.146)    5   2    1      0.074 (0.041)
1   2    1/2    0.053 (0.045)    5   2    1/2    0.031 (0.013)
1   2    1/10   0.005 (0.004)    5   2    1/10   0.005 (0.001)
1   10   1      0.031 (0.026)    5   10   1      0.013 (0.004)
1   10   1/2    0.014 (0.011)    5   10   1/2    0.006 (0.002)
1   10   1/10   0.002 (0.002)    5   10   1/10   0.001 (0.001)
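The continuous entries of Table 3 follow from the closed form for eq. (5) given above; a brief sketch, reusing a_n from the bias sketch in Example 1:

```python
from scipy.stats import norm

# Closed-form sketch of eq. (5) in Example 2: the prior probability of a
# false positive at cutoff q_R, reusing a_n from the earlier bias sketch.
def false_positive_prob(n, lam0, q_R):
    return 2 * (1 - norm.cdf(a_n(q_R, n, lam0)))

# false_positive_prob(1, 1.0, 1.0) is roughly 0.239, the first entry of Table 3.
```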
For false negatives, consider eq. (6), where

$$M(RB_i(\mu_0 \mid x) > q_A \mid \mu_0 \pm \delta/2) = \begin{cases} \Phi(\sqrt{n}\,\delta/2\sigma + a_n(q_A)) - \Phi(\sqrt{n}\,\delta/2\sigma - a_n(q_A)), & 1 \le q_A^2 \le n\lambda_0^2 + 1 \\ 0, & q_A^2 > n\lambda_0^2 + 1 \end{cases}$$
for all i. It is easy to show that this is monotone decreasing in δ, and therefore it is an upper bound on the expected proportion of false negatives among those hypotheses that are actually false. The cutoff qA can be chosen to make this number as small as desired. When δ/σ → ∞, eq. (6) converges to 0, and it increases to 2Φ(an(qA)) − 1 as δ/σ → 0. Table 4 records values for eq. (6) when δ/σ = 1 so that the μi differ from μ0 by one half of a standard deviation. There is clearly some improvement but the bias in favor of false negatives is still readily apparent. It would seem that taking qA = (nλ0² + 1)1/2 gives the best results, but this could be considered quite conservative. It is also worth remarking that all the entries in Table 4 can be considered very conservative when large effect sizes are expected.
Table 4. Prior probability that a randomly chosen hypothesis produces a false negative when δ/σ = 1, continuous and discretized (in parentheses) versions, in Example 2.

n   λ0   qA     eq. (6)          n   λ0   qA     eq. (6)
1   1    1.0    0.704 (0.715)    5   1    1.0    0.631 (0.702)
1   1    1.2    0.527 (0.503)    5   1    2.0    0.302 (0.112)
1   1    1.4    0.141 (0.000)    5   1    2.4    0.095 (0.000)
1   2    1.0    0.793 (0.805)    5   2    1.0    0.747 (0.822)
1   2    2.0    0.359 (0.304)    5   2    3.0    0.411 (0.380)
1   2    2.2    0.141 (0.000)    5   2    4.5    0.084 (0.000)
1   10   1.0    0.948 (0.955)    5   10   1.0    0.916 (0.961)
1   10   5.0    0.708 (0.713)    5   10   10.0   0.552 (0.588)
1   10   10.0   0.070 (0.000)    5   10   22.0   0.080 (0.000)
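Similarly, the continuous entries of Table 4 can be sketched as follows, again reusing a_n from Example 1:

```python
import numpy as np
from scipy.stats import norm

# Closed-form sketch of eq. (6) in Example 2 at the alternative mu0 +/- delta/2.
def false_negative_prob(n, lam0, delta, q_A, sigma=1.0):
    if q_A**2 > n * lam0**2 + 1:
        return 0.0
    a = a_n(q_A, n, lam0)
    z = np.sqrt(n) * delta / (2 * sigma)
    return norm.cdf(z + a) - norm.cdf(z - a)

# false_negative_prob(1, 1.0, 1.0, 1.0) is roughly 0.704, the first entry of Table 4.
```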
Now, consider the situation when k = 1000, n = 5, δ = 1 and λ0 = 1.94 is the elicited value. From Table 3 with qR = 1.0 about 8% false positives are expected a priori, and from Table 4 with qA = 1.0 a worst case upper bound on the a priori expected percentage of false negatives is about 75%. The top part of Table 5 indicates that with qR = qA = 1.0, then 4.9% (34 of 700) false positives and 0.1% (3 of 300) false negatives were obtained. With these choices of the cutoffs all hypotheses are classified. Certainly the upper bound 75% seems far too pessimistic in light of the results, but recall that Table 4 is computed at the false values μ = ±0.5. The relevant a priori expected percentage of false negatives when μ = ±2.0 is about 3.5%. The bottom part of Table 5 gives the relevant values when qR = 0.5 and qA = 3.0. In this case, there are 2.1% (9 of 428) false positives and 0% false negatives, but 39.9% (272 of 700) of the true hypotheses and 4.3% (13 of 300) of the false hypotheses were not classified as the relevant relative belief ratio lay between qR and qA. Thus, being more conservative has reduced the error rates, but with the drawback that a large proportion of the true hypotheses don’t get classified. The procedure has worked well in this example, but of course the error rates can be expected to rise when the false values move towards the null and improve when they move away from the null.
Table 5. Confusion matrices for Example 2 with k = 1000 when 700 of the μi equal 0 and 300 of the μi equal 2.

Decision                       μ = 0   μ = 2
Accept μ = 0 using qA = 1.0    666     3
Reject μ = 0 using qR = 1.0    34      297
Not classified                 0       0
Accept μ = 0 using qA = 3.0    419     0
Reject μ = 0 using qR = 0.5    9       287
Not classified                 272     13
What is implemented in an application depends on the goals. If the primary purpose is to protect against false positives, then Table 3 indicates that this is accomplished fairly easily. Protecting against false negatives is more difficult; as the actual effect sizes are not known a decision has to be made. Note that choosing a cutoff is equivalent to saying that one will only accept H0i if the belief in the truth of H0i has increased by a factor at least as large as qA. Computations such as those in Table 4 can be used to provide guidance, but there is no avoiding the need to be clear about what effect sizes are deemed to be important or the need to obtain more data when this is necessary. With the relative belief multiple testing algorithm error rates are effectively controlled, but there may be many true hypotheses not classified.
The idea of controlling the prior probability of a randomly chosen hypothesis yielding a false positive or a false negative via eq. (5) or eq. (6), respectively, can be extended. For example, consider the prior probability that a random sample of l from k hypotheses yields at least one false positive
$$\binom{k}{l}^{-1}\sum_{\{i_1,\ldots,i_l\}\subset\{1,\ldots,k\}} M(\text{at least one of } RB_{i_j}(\psi_{0i_j} \mid X) < q_R \text{ for } j = 1,\ldots,l \mid \psi_{0i_1},\ldots,\psi_{0i_l}) \qquad (7)$$
In the context of the examples in this paper, and many others, the term in eq. (7) corresponding to {i1,…, il} equals M(at least one of RB_{i_j}(ψ0i_j | X) < qR for j = 1,…, l | ψ0). The following result leads to an interesting property for eq. (7).
Lemma 1: Let (Ω, F, P) be a probability model and B = {A1,…, Ak} ⊂ F. The probability that at least one of l ≤ k randomly selected events from B occurs is increasing in l.
Proof: Let Δ(i) be the event that exactly i of A1,…, Ak occur, so that $\cup_{i=1}^{k} A_i = \cup_{i=1}^{k} \Delta(i)$; note that the Δ(i) are mutually disjoint. When l < k,

$$S_{l,k} = \sum_{\{i_1,\ldots,i_l\}\subset\{1,\ldots,k\}} I_{A_{i_1}\cup\cdots\cup A_{i_l}} = \binom{k}{l}\sum_{i=0}^{l-1} I_{\Delta(k-i)} + \sum_{i=l}^{k-1}\left[\binom{k}{l} - \binom{i}{l}\right] I_{\Delta(k-i)} = \binom{k}{l}\sum_{i=0}^{k-1} I_{\Delta(k-i)} - \sum_{i=l}^{k-1}\binom{i}{l} I_{\Delta(k-i)}$$

and $S_{k,k} = I_{A_1\cup\cdots\cup A_k}$. Now, consider $\binom{k}{l}^{-1}S_{l,k} - \binom{k}{l-1}^{-1}S_{l-1,k}$, which equals

$$\binom{k}{l}^{-1}\sum_{\{i_1,\ldots,i_l\}\subset\{1,\ldots,k\}} I_{A_{i_1}\cup\cdots\cup A_{i_l}} - \binom{k}{l-1}^{-1}\sum_{\{i_1,\ldots,i_{l-1}\}\subset\{1,\ldots,k\}} I_{A_{i_1}\cup\cdots\cup A_{i_{l-1}}} \qquad (8)$$

If l = k, then eq. (8) equals $I_{A_1\cup\cdots\cup A_k} - \sum_{i=0}^{k-1} I_{\Delta(k-i)} + k^{-1}I_{\Delta(1)} = k^{-1}I_{\Delta(1)}$, which is nonnegative. If l < k, then eq. (8) equals $\binom{k}{l-1}^{-1}I_{\Delta(k-l+1)} + \sum_{i=l}^{k-1}\left[\binom{i}{l-1}\binom{k}{l-1}^{-1} - \binom{i}{l}\binom{k}{l}^{-1}\right]I_{\Delta(k-i)}$, which is nonnegative because an easy calculation shows that each term in the second sum is nonnegative. The expectation of eq. (8) is then nonnegative and this establishes the result.
It follows, by taking Ai = {x : RBi(ψ0i | x) < qR}, that eq. (7) is an upper bound on the prior probability that a random sample of l′ hypotheses yields at least one false positive whenever l′ ≤ l. Thus, eq. (7) leads to a more rigorous control over the possibility of false positives. A similar result is obtained for false negatives.

Applications

We now consider the sparsity problem.
Example 3. Testing for sparsity.
Consider the context of Example 1. A natural approach to inducing sparsity is to estimate μi by μ0 whenever RBi(μ0 | x) > qA. From the simulation it is seen that this works extremely well when qA = 1 for both k = 10 and k = 1000. It also works when k = 1000 and qA = 3, in the sense that the error rate is low, but it is conservative in the amount of sparsity it induces in that case. Again, the goals of the application will dictate what is appropriate.
Another Bayesian method for inducing sparsity is to use the Bayesian Lasso as per Park and Casella (2008) and based on Tibshirani (1996). The prior here is a product of independent Laplace distributions, with density $(\sqrt{2}\lambda_0\sigma)^{-k}\exp\{-(\sqrt{2}/\lambda_0\sigma)\sum_{i=1}^{k}|\mu_i - \mu_0|\}$, where σ is assumed known and μ0, λ0 are hyperparameters. Note that each Laplace prior has mean μ0 and variance λ0²σ². Using the elicitation algorithm provided in Example 1 but replacing the normal prior with a Laplace prior leads to the assignment μ0 = (ml + mu)/2, λ0 = (mu − ml)/(2σG−1(0.995)), where G−1 denotes the quantile function of a Laplace distribution with mean 0 and variance 1, so that G−1(p) = 2−1/2 log(2p) when p ≤ 1/2 and G−1(p) = −2−1/2 log(2(1 − p)) when p ≥ 1/2. With the specifications used in the simulations of Example 1, this leads to μ0 = 0 and λ0 = 1.54, which implies a smaller variance than the value λ0 = 1.94 used with the normal prior, and therefore the Laplace prior is more concentrated about 0.
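The elicitation of λ0 under the Laplace prior only requires the quantile function G−1. A minimal sketch:

```python
import numpy as np

# Sketch of the Laplace elicitation in Example 3: G_inv is the quantile
# function of a Laplace distribution with mean 0 and variance 1.
def G_inv(p):
    if p <= 0.5:
        return np.log(2 * p) / np.sqrt(2)
    return -np.log(2 * (1 - p)) / np.sqrt(2)

def elicit_laplace_lambda0(ml, mu, sigma):
    return (mu - ml) / (2 * sigma * G_inv(0.995))

# elicit_laplace_lambda0(-5, 5, 1) returns roughly 1.54, as in the text.
```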
The posteriors for the μi are independent with the density for μi proportional to $\exp\{-n(\bar{x}_i - \mu_i)^2/2\sigma^2 - \sqrt{2}|\mu_i - \mu_0|/\lambda_0\sigma\}$, giving the MAP estimator

$$\mu_i^{MAP}(x) = \begin{cases} \bar{x}_i + \sqrt{2}\sigma/\lambda_0 n, & \bar{x}_i < \mu_0 - \sqrt{2}\sigma/\lambda_0 n \\ \mu_0, & \mu_0 - \sqrt{2}\sigma/\lambda_0 n \le \bar{x}_i \le \mu_0 + \sqrt{2}\sigma/\lambda_0 n \\ \bar{x}_i - \sqrt{2}\sigma/\lambda_0 n, & \bar{x}_i > \mu_0 + \sqrt{2}\sigma/\lambda_0 n \end{cases}$$
The MAP estimate of μi is thus sometimes forced to equal μ0, although this effect is negligible whenever $\sqrt{2}\sigma/\lambda_0 n$ is small.
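A sketch of this soft-thresholding rule:

```python
import numpy as np

# Sketch of the Laplace-prior MAP (soft-thresholding) estimator above.
def map_estimate(xbar, n, lam0, sigma=1.0, mu0=0.0):
    t = np.sqrt(2) * sigma / (lam0 * n)   # half-width of the thresholding region
    if xbar < mu0 - t:
        return xbar + t
    if xbar > mu0 + t:
        return xbar - t
    return mu0                            # the estimate is forced to mu0
```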
The Lasso induces sparsity through estimation by taking λ0 to be small. By contrast, the evidential approach, based on the normal prior and the relative belief ratio, induces sparsity through taking λ0 large. The advantage to this latter approach is that by taking λ0 large, prior–data conflict is avoided. When taking λ0 small, the potential for prior–data conflict increases, as the true values can be deep into the tails of the prior. For example, for the simulations of Example 1, $\sqrt{2}\sigma/\lambda_0 n = 0.183$, which is smaller than the δ/2 = 0.5 used in the relative belief approach with the normal prior. Therefore, it can be expected that the Lasso will do worse here, and this is reflected in Table 6 in which there are far too many false positives. To improve this, the value of λ0 needs to be reduced; however, note that this is determined by an elicitation and there is the risk of then encountering prior–data conflict. Another possibility is to implement the evidential approach with the elicited Laplace prior and the discretization as done with the normal prior, and then results similar to those obtained in Example 1 can be expected.
Table 6. Confusion matrices using Lasso with k = 1000 when 700 of the μi equal 0 and 300 of the μi equal 2 in Example 3.

Decision                       μ = 0   μ = 2
Accept μ = 0 using qA = 1.0    227     0
Reject μ = 0 using qA = 1.0    473     300
It is also interesting to compare the MAP estimation approach and the relative belief approach with respect to the conditional prior probabilities of μi being assigned the value μ0 when the true value actually is μ0. It is easily seen that, based on the Laplace prior, $M(\mu_i^{MAP}(x) = \mu_0 \mid \mu_0) = 2\Phi(\sqrt{2}/\lambda_0\sqrt{n}) - 1$, and this converges to 0 as n → ∞ or λ0 → ∞. For the relative belief approach M(RBi(μ0 | x) > qA | μ0) is the relevant probability. With either the normal or Laplace prior M(RBi(μ0 | x) > qA | μ0) converges to 1 both as n → ∞ and as λ0 → ∞. Therefore, with enough data the correct assignment is always made using relative belief but not with MAP based on the Laplace prior.
The Laplace and normal priors work equally well with the relative belief multiple testing algorithm, but there are no advantages to using the Laplace prior. One could also argue that the singularity of the Laplace prior at its mode makes it an odd choice, as there doesn’t seem to be a good justification for this feature. Furthermore, the computations are harder with the Laplace prior, particularly with more complex models, and therefore using a normal prior is preferable overall.
An example with considerable practical significance is now considered.
Example 4. Full rank regression.
Suppose the basic model is given by y = β0 + β1x1 + … + βkxk + z = β0 + x′β1:k + z, where the xi are predictor variables, z ∼ N(0, σ2), and β and σ2 are unknown. The problem of interest is testing H0i : βi = 0 for i = 1,…, k to establish which variables have any effect on the response. It is assumed that the observed values of the predictor variables have been standardized, so that for observations (y, X) ∈ Rn × Rn×(k+1), where X = (1, x1,…, xk) is of rank k + 1, then 1′xi = 0 and ‖xi‖2 = 1 for i = 1,…, k. Note that (b, s), where b = (X′X)−1X′y and s = ‖y − Xb‖, is an MSS for this model, and model checking can be carried out by considering functions of the standardized residuals r = (y − Xb)/s as this has a distribution independent of (β, σ2). The skewness and kurtosis statistics are such functions and it is straightforward to simulate from their distributions to determine if their observed values are surprising.
The prior distribution of (β, σ2) is taken to be
$$\beta \mid \sigma^2 \sim N_{k+1}(0, \sigma^2\Sigma_0), \quad 1/\sigma^2 \sim \text{gamma}_{rate}(\alpha_1, \alpha_2) \qquad (9)$$
for some hyperparameters Σ0 and (α1, α2). Note that this may entail subtracting a known, fixed constant from each y value so that the prior for β0 is centered at 0. Taking 0 as the central value for the priors on the remaining βi seems appropriate when the primary concern is whether or not each xi is having any effect. The marginal prior for βi is then $\{(\alpha_2/\alpha_1)\sigma_{0ii}^2\}^{1/2}\, t_{2\alpha_1}$, where $t_{2\alpha_1}$ denotes the t distribution on 2α1 degrees of freedom, for i = 0,…, k. Hereafter, we will take Σ0 = λ0²Ik+1, although it is easy to generalize to more complicated choices.
The elicitation of the hyperparameters is carried out via an extension of a method developed by Cao et al. (2014) for the multivariate normal distribution. Suppose that it is known with virtual certainty, based on our knowledge of the measurements being taken, that β0 + x′β1:k will lie in the interval (−m0, m0) for some m0 > 0 for all x ∈ R, where R is a compact set centered at 0. On account of the standardization, R ⊂ [−1, 1]k. Again “virtual certainty” is interpreted as probability greater than or equal to γ, where γ is some large probability like 0.99. Therefore, the prior on β must satisfy $2\Phi(m_0/\sigma\lambda_0\{1 + x'x\}^{1/2}) - 1 \ge \gamma$ for all x ∈ R, and this implies that
$$\sigma \le m_0/\lambda_0\tau_0 z_{(1+\gamma)/2} \qquad (10)$$
where $\tau_0^2 = 1 + \max_{x\in R}\|x\|^2 \le 1 + k$ with equality when R = [−1, 1]k.
An interval that will contain a response value y with virtual certainty, given predictor values x, is β0 + x′β1:k ± σz(1+γ)/2. Suppose that we have lower and upper bounds s1 and s2 on the half-length of this interval so that s1 ≤ σz(1+γ)/2 ≤ s2 or, equivalently,
$$s_1/z_{(1+\gamma)/2} \le \sigma \le s_2/z_{(1+\gamma)/2} \qquad (11)$$
holds with virtual certainty. Combining eq. (11) with eq. (10) implies λ0 = m0/(s2τ0).
To obtain the relevant values of α1 and α2 let G(α1, α2, ⋅) denote the cdf of the gammarate(α1, α2) distribution, and note that G(α1, α2, z) = G(α1, 1, α2z). Therefore, the interval for 1/σ2 implied by eq. (11) contains 1/σ2 with virtual certainty when α1, α2 satisfy $G^{-1}(\alpha_1, \alpha_2, (1+\gamma)/2) = s_1^{-2} z_{(1+\gamma)/2}^2$ and $G^{-1}(\alpha_1, \alpha_2, (1-\gamma)/2) = s_2^{-2} z_{(1-\gamma)/2}^2$, or equivalently
$$G(\alpha_1, 1, \alpha_2 s_1^{-2} z_{(1+\gamma)/2}^2) = (1+\gamma)/2 \qquad (12)$$

$$G(\alpha_1, 1, \alpha_2 s_2^{-2} z_{(1-\gamma)/2}^2) = (1-\gamma)/2 \qquad (13)$$
It is a simple matter to solve these equations for (α1, α2). For this, choose an initial value for α1 and, using eq. (12), find z such that G(α1, 1, z) = (1 + γ)/2, which implies $\alpha_2 = z s_1^2/z_{(1+\gamma)/2}^2$. If the left side of eq. (13) is then less (or greater) than (1 − γ)/2, decrease (or increase) the value of α1 and repeat. Continue iterating this process until satisfactory convergence is attained, as in the sketch below.
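A sketch of this elicitation, here with a root finder standing in for the manual adjustment of α1; the bracketing interval is an assumption that covers the examples considered.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gamma, norm

# Sketch of the iteration solving eqs. (12) and (13) for (alpha1, alpha2);
# s1, s2 bound the half-length of the prediction interval and gam is gamma.
def elicit_error_prior(s1, s2, gam=0.99):
    z2 = norm.ppf((1 + gam) / 2) ** 2
    def excess(a1):
        a2 = gamma.ppf((1 + gam) / 2, a1) * s1**2 / z2         # eq. (12) pins down alpha2
        return gamma.cdf(a2 * z2 / s2**2, a1) - (1 - gam) / 2  # residual of eq. (13)
    a1 = brentq(excess, 0.1, 100.0)
    a2 = gamma.ppf((1 + gam) / 2, a1) * s1**2 / z2
    return a1, a2

# elicit_error_prior(75, 200) approximately recovers the hyperparameters
# (alpha1, alpha2) = (7.29, 13641.35) used for the diabetes data below.
```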
Evans and Moshonov (2006) showed that when checking for prior–data conflict in such a context it is better to check the components of the prior sequentially as this helps to pinpoint where any failure in the prior occurs. First, the prior on σ2 is checked using the tail probability based on the prior predictive for s and, if this component passes, then the prior on β is checked based on the conditional prior-predictive of b given s. If conflict is found, the methods discussed by Evans and Jang (2011b) are available to modify the prior.
Assuming that X is of rank k+1, the posterior of (β, σ2) is given by
$$\beta \mid y, \sigma^2 \sim N_{k+1}(\beta(X, y), \sigma^2\Sigma(X)), \quad 1/\sigma^2 \mid y \sim \text{gamma}_{rate}((n + 2\alpha_1)/2, \alpha_2(X, y)/2) \qquad (14)$$
where β(X, y) = Σ(X)X′Xb, Σ(X) = (X′X + Σ0−1)−1, and α2(X, y) = ‖y − Xb‖2 + (Xb)′(In − XΣ(X)X′)Xb + 2α2. Then the marginal posterior for βi is given by βi(X, y) + {α2(X, y)σii(X)/(n + 2α1)}1/2 tn+2α1 and the relative belief ratio for βi at 0 equals
$$RB_i(0 \mid X, y) = \frac{\Gamma\left(\frac{n + 2\alpha_1 + 1}{2}\right)\Gamma(\alpha_1)}{\Gamma\left(\frac{2\alpha_1 + 1}{2}\right)\Gamma\left(\frac{n + 2\alpha_1}{2}\right)}\left(1 + \frac{\beta_i^2(X, y)}{\alpha_2(X, y)\sigma_{ii}(X)}\right)^{-\frac{n + 2\alpha_1 + 1}{2}} \times \left(\frac{\alpha_2(X, y)\sigma_{ii}(X)}{2\alpha_2\lambda_0^2}\right)^{-1/2} \qquad (15)$$
Rather than using eq. (15), however, the distributional results are used to compute the discretized relative belief ratios as in Example 1. For this δ > 0 is required to determine an appropriate discretization and it will be assumed here that this is the same for all the βi, although the procedure can be easily modified if this is not the case in practice. Note that such a δ is effectively determined by the amount that xiβi will vary from 0 for x ∈ R. As xi ∈ [−1, 1], then |xiβi| ≤ δ provided that |βi| ≤ δ. When this variation is suitably small as to be immaterial, then such a δ is appropriate for saying βi is effectively 0. Determination of the hyperparameters and δ is dependent on the application.
Again inference can be made concerning ξ = Ξ(β, σ2), the proportion of the βi effectively equal to 0. As in Example 1, however, we can expect bias when the amount of variability in the data is large relative to δ or the prior is too diffuse. To implement the relative belief multiple testing algorithm, the quantities eqs. (5) and (6) need to be computed to determine qR and qA, respectively. The conditional prior distribution of (b, ‖y − Xb‖2), given (β, σ2), is b ∼ Nk+1(β, σ2(X′X)−1), statistically independent of ‖y − Xb‖2 ∼ gamma((n − k − 1)/2, σ−2/2). Thus, computing eqs. (5) and (6) can be carried out by generating (β, σ2) from the relevant conditional prior, generating (b, ‖y − Xb‖2) given (β, σ2), and using eq. (15); a sketch of this computation is given below.
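The following Monte Carlo sketch outlines that computation for a single hypothesis H0i; rb_at_zero is a hypothetical placeholder for an implementation of eq. (15), and the other names are assumptions.

```python
import numpy as np

# Monte Carlo sketch of the computation just described: draw (beta, sigma^2)
# from the conditional prior given H0i, then (b, ||y - Xb||^2) given
# (beta, sigma^2), and evaluate the relative belief ratio on each draw.
def prior_false_positive_rate(X, i, lam0, a1, a2, q_R, rb_at_zero, reps=1000):
    rng = np.random.default_rng(1)
    n, p = X.shape                               # p = k + 1
    L = np.linalg.cholesky(np.linalg.inv(X.T @ X))
    hits = 0
    for _ in range(reps):
        sigma2 = 1 / rng.gamma(a1, 1 / a2)       # 1/sigma^2 ~ gamma_rate(a1, a2)
        beta = rng.normal(0.0, np.sqrt(sigma2) * lam0, p)
        beta[i] = 0.0                            # condition on H0i being true
        b = beta + np.sqrt(sigma2) * (L @ rng.standard_normal(p))
        ss = sigma2 * rng.chisquare(n - p)       # ||y - Xb||^2 given (beta, sigma^2)
        hits += rb_at_zero(i, b, ss) < q_R
    return hits / reps
```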
To illustrate these computations the diabetes data set discussed by Efron et al. (2004) and Park and Casella (2008) is now analyzed. With γ = 0.99, the values m0 = 100, s1 = 75, s2 = 200 were used to determine the prior together with τ0 = 1.05 determined from the X matrix. This led to the values λ0 = 0.48, α1 = 7.29, α2 = 13641.35 being chosen for the hyperparameters. Using the methods developed by Evans and Moshonov (2006), a first check was made on the prior on σ2 against the data, and a tail probability equal to 0.19 was obtained indicating there is no prior–data conflict with this prior. Given no prior–data conflict at the first stage, the prior on β was then checked and the relevant tail probability of 0.00 was obtained indicating a strong degree of conflict. Following the argument of Evans and Jang (2011b), the value of λ0 was increased to choose a prior that was weakly informative with respect to our initial choice. This led to choosing the value λ0 = 5.00, and the relevant tail probability equals 0.32, so there is no conflict.
Using this prior, the relative belief estimates, ratios, and strengths are recorded in Table 7. This shows that there is strong evidence against βi = 0 for the variables sex, bmi, map, and ltg and no evidence against βi = 0 for any other variables. There is strong evidence in favor of βi = 0 for age and ldl, moderate evidence in favor of βi = 0 for the constant, tc, tch, and glu and perhaps only weak evidence in favor of βi = 0 for hdl.
Table 7. Relative belief estimates, relative belief ratios, and strengths for assessing no effect for the diabetes data in Example 4.

Variable   Estimate   RBi(0 | X, y)   Strength
Constant   224        54.86           0.44
age        −4         153.62          0.95
sex        −224       0.13            0.00
bmi        511        0.00            0.00
map        314        0.00            0.00
tc         162        33.23           0.36
ldl        −20        57.65           0.90
hdl        167        27.53           0.15
tch        114        49.97           0.37
ltg        496        0.00            0.00
glu        77         66.81           0.23
As previously discussed, it is necessary to consider the issue of bias, namely to compute the prior probability of getting a false positive for different choices of qR and the prior probability of getting a false negative for different choices of qA. The value of eq. (5) is 0.0003 when qR = 1, and therefore there is virtually no bias in favor of false positives and one can feel confident that the predictors identified as having an effect do so. The story is somewhat different, however, when considering the possibility of false negatives via eq. (6). For example, with qA = 1, eq. (6) equals 0.9996, and when qA = 100, eq. (6) equals 0.7998. Thus, there is substantial bias in favor of the null hypotheses and undoubtedly this is due to the diffuseness of the prior. The implication is that we cannot be entirely confident concerning those βi assigned to be equal to 0. Recall that the first prior proposed led to prior–data conflict, and thus a much more diffuse prior obtained by increasing λ0 was substituted. The bias in favor of false negatives with this prior could be reduced by making the prior less diffuse by lowering λ0, but we know that if it is lowered too much prior–data conflict arises. Thus, there is a trade-off between lowering the bias in favor and avoiding prior–data conflict. In any case, determining a value of λ0 in such a fashion seems inappropriate because then the prior becomes too dependent on the data and we do not advocate this. The real cure for the bias in an application is to collect more data, and the amount necessary can be determined by the bias calculations.
Next we consider the application to regression with k + 1 > n.
Example 5. Non-full rank regression.
In a number of applications k + 1 > n and thus X is of rank l < n. In this situation, suppose {x1,…, xl} forms a basis for L(x1,…, xk), perhaps after relabeling the predictors, and write X = (1 X1 X2), where X1 = (x1 … xl). For given r = (X1 X2)β1:k there will be many solutions β1:k. A particular solution is given by $\beta_{1:k}^{*} = \binom{(X_1'X_1)^{-1}X_1'}{0} r$. The set of all solutions is then given by $\beta_{1:k}^{*} + \ker(X_1\, X_2)$, where $\ker(X_1\, X_2) = \{(-B'\; I_{k-l})'\eta : \eta \in R^{k-l}\}$, $B = (X_1'X_1)^{-1}X_1'X_2$, and the columns of C = (−B′ Ik−l)′ give a basis for ker(X1 X2). As sparsity is expected for β1:k, it is natural to consider the solution that minimizes ‖β1:k‖2 for β1:k ∈ β1:k* + L(C). Using β1:k*, and applying the Sherman–Morrison–Woodbury formula to C(C′C)−1C′, this is given by the Moore–Penrose solution
$$\beta_{1:k}^{MP} = (I_k - C(C'C)^{-1}C')\beta_{1:k}^{*} = \binom{I_l}{B'}\omega_{1:l} \qquad (16)$$

where ω1:l = (Il + BB′)−1(β1:l + Bβl+1:k).
From eq. (9) with Σ0 = λ0²Ik+1, the conditional prior distribution of (β0, ω1:l) given σ2 is β0 | σ2 ∼ N(0, σ2λ0²), independent of ω1:l | σ2 ∼ Nl(0, σ2λ0²(Il + BB′)−1), which, using eq. (16), implies β1:kMP | σ2 ∼ Nk(0, σ2Σ0(B)), conditionally independent of β0, where
$$\Sigma_0(B) = \lambda_0^2 \begin{pmatrix} (I_l + BB')^{-1} & (I_l + BB')^{-1}B \\ B'(I_l + BB')^{-1} & B'(I_l + BB')^{-1}B \end{pmatrix}$$
With 1/σ2 ∼ gammarate(α1, α2), this implies that the unconditional prior of the i-th coordinate of β1:kMP is $(\lambda_0^2\alpha_2\sigma_{ii}^2(B)/\alpha_1)^{1/2}\, t_{2\alpha_1}$.
Putting X* = (1  X1 + X2B′) gives the full rank model y | β0, ω1:l, σ2 ∼ Nn(X*(β0, ω1:l)′, σ2In). As in Example 4, then (β0, ω1:l) | y, σ2 ∼ Nl+1(ω(X*, y), σ2Σ(X*)), 1/σ2 | y ∼ gammarate((n + 2α1)/2, α2(X*, y)/2), where ω(X*, y) = Σ(X*)X*′X*b*, b* = (X*′X*)−1X*′y, and
Σ1(X*)=(n00(X1+X2B)(X1+X2B))+λ02(100(Il+BB)),α2(X*,y)=yX*b*2+(X*b*)(InX*Σ(X*)X*)X*b*+2α2
Now, noting that (X1 + X2B′)′(X1 + X2B′) = (Il + BB′)X1X1(Il + BB′), this implies b*=(y¯,(Il+BB)1b1) , where b1=(X1X1)1X1y is the least-squares estimate of β1:l, and
Σ(X*)=(n+λ0200(Il+BB)X1X1(Il+BB)+λ02(Il+BB))1,ω(X*,y)=Σ(X*)X*X*b*=(ny¯/(n+λ02)(Il+BB+λ02(X1X1)1)1b1)
Using eq. (16), then $\beta_0\mid y,\sigma^2\sim N(n(n+\lambda_0^{-2})^{-1}\bar{y},\sigma^2(n+\lambda_0^{-2})^{-1})$, independent of $\beta_{1:k}^{MP}\mid y,\sigma^2\sim N_k(\beta^{MP}(X,y),\sigma^2\Sigma^{MP}(X))$, where
$$\beta^{MP}(X,y)=\begin{pmatrix}Db_1\\B'Db_1\end{pmatrix},\qquad\Sigma^{MP}(X)=\begin{pmatrix}E&EB\\B'E&B'EB\end{pmatrix}$$
with $D=(I_l+BB'+\lambda_0^{-2}(X_1'X_1)^{-1})^{-1}$ and $E=((I_l+BB')(X_1'X_1)(I_l+BB')+\lambda_0^{-2}(I_l+BB'))^{-1}$. The marginal posterior for $\beta_i^{MP}$ is then given by $\beta_i^{MP}(X,y)+\{\alpha_2(X^*,y)\sigma_{ii}^{MP}(X)/(n+2\alpha_1)\}^{1/2}t_{n+2\alpha_1}$. Relative belief inferences for the coordinates of $\beta_{1:k}^{MP}$ can now be implemented just as in Example 4.
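Given the scaled-t prior and posterior marginals above, the relative belief ratio for the discretized hypothesis βi = 0 can be computed by comparing the prior and posterior mass of |βi| ≤ δ/2. The sketch below illustrates this; rb_zero is a hypothetical helper name and all numeric inputs are placeholders, not values from the paper.

```python
# Sketch: RB_i(0 | x, y) as posterior over prior probability content of the
# discretized null |beta_i| <= delta/2, using the two scaled-t marginals.
from scipy.stats import t

def rb_zero(loc_post, scale_post, df_post, scale_prior, df_prior, delta):
    """Relative belief ratio at 0 from t marginals (prior centered at 0)."""
    post = (t.cdf(delta / 2, df_post, loc_post, scale_post)
            - t.cdf(-delta / 2, df_post, loc_post, scale_post))
    prior = (t.cdf(delta / 2, df_prior, 0.0, scale_prior)
             - t.cdf(-delta / 2, df_prior, 0.0, scale_prior))
    return post / prior

# e.g., with n = 22 and alpha1 = 11 the posterior has n + 2*alpha1 = 44 df;
# the scales here are placeholder values for illustration only
print(rb_zero(loc_post=0.1, scale_post=0.3, df_post=44,
              scale_prior=2.0, df_prior=22, delta=0.5))
```

A returned value above 1 indicates evidence in favor of βi = 0 and a value below 1 evidence against, matching the classification rule used below.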
We consider a numerical example in which there is considerable sparsity. For this let $X_1\in R^{n\times l}$ be formed by taking the second through (l + 1)-th columns of the (l + 1)-dimensional Helmert matrix, repeating each row m times, and then normalizing. Thus, n = m(l + 1) and the columns of X1 are orthonormal and orthogonal to 1. It is supposed that the first l1 ≤ l of the variables giving rise to the columns of X1 have βi ≠ 0, whereas the last l − l1 have βi = 0, and that the variables corresponding to the first l2 ≤ k − l columns of $X_2=X_1B\in R^{n\times(k-l)}$ have βi ≠ 0, whereas the last k − l − l2 have βi = 0. The matrix B is obtained by generating B = diag(B1, B2), where $B_1=(z_1/\|z_1\|\ \cdots\ z_{l_2}/\|z_{l_2}\|)$ with $z_1,\ldots,z_{l_2}$ i.i.d. $N_{l_1}(0,I)$, independent of $B_2=(z_{l_2+1}/\|z_{l_2+1}\|\ \cdots\ z_{k-l-l_2}/\|z_{k-l-l_2}\|)$ with $z_{l_2+1},\ldots,z_{k-l-l_2}$ i.i.d. $N_{l-l_1}(0,I)$. Note that this ensures that the columns of X2 are all standardized. Furthermore, because it is assumed that the last l − l1 variables of X1 and the last k − l − l2 variables of X2 don't have an effect, B is necessarily of the diagonal form given. For, if it were allowed that the last k − l − l2 columns of X2 were linearly dependent on the first l1 columns of X1, then this would induce a dependence on the corresponding variables, and this is not the intention in the simulation. Similarly, if the first l2 columns of X2 were dependent on the last l − l1 columns of X1, then this would imply that the variables associated with these columns of X1 have an effect, and this is not the intention.
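A sketch of this design construction follows, using scipy.linalg.helmert (which returns the Helmert rows, so the transpose supplies the columns referred to above); make_design and the specific seed are illustrative assumptions.

```python
# Sketch of the simulation design: orthonormal X1 from a Helmert matrix,
# B = diag(B1, B2) with unit-norm Gaussian columns, and X2 = X1 B.
import numpy as np
from scipy.linalg import helmert

def make_design(l, l1, l2, k, m, rng):
    # rows 2..(l+1) of the Helmert construction are l orthonormal vectors
    # orthogonal to the constant vector; repeat each row m times, renormalize
    H = helmert(l + 1, full=True)            # (l+1) x (l+1), first row constant
    X1 = np.repeat(H[1:].T, m, axis=0)       # n x l with n = m(l+1)
    X1 /= np.linalg.norm(X1, axis=0)
    # block-diagonal B with standardized Gaussian columns, as in the text
    B1 = rng.standard_normal((l1, l2))
    B2 = rng.standard_normal((l - l1, k - l - l2))
    B1 /= np.linalg.norm(B1, axis=0)
    B2 /= np.linalg.norm(B2, axis=0)
    B = np.block([[B1, np.zeros((l1, k - l - l2))],
                  [np.zeros((l - l1, l2)), B2]])
    return X1, X1 @ B, B                     # X1, X2 = X1 B, and B

X1, X2, B = make_design(l=10, l1=5, l2=2, k=20, m=2,
                        rng=np.random.default_rng(7))
```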
The sampling model is then prescribed by setting l = 10, l1 = 5, and l2 = 2, with βi = 4 for i = 1, …, 5, 11, 12 (only i = 1, …, 5 when k = 10) and the remaining βi = 0, σ2 = 1, and m = 2, so that n = 22; we consider various values of k ≥ l. Note that a different data set was generated for each value of k. The prior is specified as in Example 4, where the values λ02 = 4, α1 = 11, α2 = 12 were chosen so that there would be no prior–data conflict with the generated data. Several values were also considered for the discretization parameter δ. A hypothesis βi = 0 was classified as true if the relative belief ratio was >1 and classified as false if it was <1. Table 8 gives the confusion matrices with δ = 0.1. The value δ = 0.5 was also considered, but there was no change in the results.
Table 8. Confusion matrices for the numerical example in Example 5.

k = 10          Classified positive   Classified negative   Total
True positive   5                     0                     5
True negative   1                     4                     5
Total           6                     4                     10

k = 20          Classified positive   Classified negative   Total
True positive   7                     0                     7
True negative   0                     13                    13
Total           7                     13                    20

k = 50          Classified positive   Classified negative   Total
True positive   7                     0                     7
True negative   0                     43                    43
Total           7                     43                    50

k = 100         Classified positive   Classified negative   Total
True positive   7                     0                     7
True negative   0                     93                    93
Total           7                     93                    100
One fact stands out immediately: in all of these examples only one misclassification was made, and this was in the full rank (k = 10) case, where one true null hypothesis was classified as a positive. The effect sizes present are reasonably large, so the same performance cannot be expected with much smaller effect sizes, but it is clear that the approach is robust to the number of hypotheses considered. It should also be noted, however, that the amount of data is relatively small, and the success of the procedure will only improve as this increases. This result can, in part, be attributed to the fact that a logically sound measure of evidence is being used.

Conclusions

The relative belief approach to inference has been applied to problems of practical significance. The central feature is that the inferences are based on a proper measure of evidence. This approach avoids many of the problems that arise with p-values; for example, there is a natural cutoff for determining when there is evidence in favor of or against a hypothesis. Given a measure of evidence, a key concern with Bayesian methodology can be addressed, namely, determining whether or not the ingredients bias the results. Bias calculations play a key role in the multiple testing algorithm and in its application to sparsity through the a priori control of false positives and false negatives.
There are a number of ingredients that need to be selected to implement the relative belief multiple testing algorithm. Perhaps the most important of these is the model, and the most controversial is the prior. For the prior, elicitation algorithms have been provided for each example based on the user being able to specify bounds on parameters that hold with virtual certainty. Given that a measurement process was used in the data collection, restrictions on the values of the parameters are implied. For example, suppose interest is in the mean of a response variable corresponding to some kind of length. Each length is measured to a certain accuracy, and there is an upper bound on what length can be obtained using a particular measurement technology. Thus, such bounds on the mean response are definitely available, and how tight they are depends on what additional contextual information is available. It is also worth noting that there is no reason why some other elicitation algorithm cannot be used if this is felt to be appropriate. There is also the choice of (qR, qA), but these are chosen based on the bias calculations to control for false positives and false negatives, and the user will have to select them after considering what proportions of errors are tolerable.
The value of δ in hypothesis assessment problems is seemingly another choice, but practical aspects of the measurement process involved in data collection dictate what values make sense. For example, there is no point in considering differences from a mean of less than 0.5 cm if the measurements producing the data are only taken to this accuracy. This provides a lower bound on δ, and the application may allow for a larger value. It is comforting, however, that results are reasonably robust to this choice. Determining δ for an arbitrary parameter of interest ψ is not necessarily straightforward, but some guidance, when ψ is a probability and δ is either an absolute or a relative error, can be found in the work of Al-Labadi et al. (2017).
No mention has been made in the paper of the false discovery rate (FDR) approach to multiple testing. Current approaches base this on p-values, but presumably there is no reason why a valid measure of evidence such as the relative belief ratio couldn't be used instead. It should also be noted that the FDR approach is somewhat different, as it does not imply control over both false positives and false negatives, which has been our intent here. The relationship between the approach of this paper and controlling something like the FDR is a topic for further investigation.

Acknowledgements

Michael Evans was supported by a grant from the Natural Sciences and Engineering Research Council of Canada. The authors thank two reviewers for a number of helpful comments.

References

Al-Labadi L, Baskurt Z, and Evans M. 2017. Goodness of fit for the logistic regression model using relative belief. Journal of Statistical Distributions and Applications, 4: 17.
Cao Y, Evans M, and Guttman I. 2014. Bayesian factor analysis via concentration. In Current trends in Bayesian methodology with applications. Edited by SK Upadhyay, U Singh, DK Dey, and A Loganathan. CRC Press, Boca Raton, Florida, USA. pp. 181–201.
Carvalho CM, Polson NG, and Scott JG. 2009. Handling sparsity via the horseshoe. Journal of Machine Learning Research, 5: 73–80.
Efron B, Hastie T, Johnstone I, and Tibshirani R. 2004. Least angle regression. The Annals of Statistics, 32: 407–499.
Evans M. 2015. Measuring statistical evidence using relative belief. Monographs on Statistics and Applied Probability 144. CRC Press, Boca Raton, Florida, USA.
Evans M, and Jang GH. 2011a. A limit result for the prior predictive applied to checking for prior-data conflict. Statistics & Probability Letters, 81: 1034–1038.
Evans M, and Jang GH. 2011b. Weak informativity and the information in one prior relative to another. Statistical Science, 26(3): 423–439.
Evans M, and Moshonov H. 2006. Checking for prior-data conflict. Bayesian Analysis, 1(4): 893–914.
George EI, and McCulloch RE. 1993. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88: 881–889.
Howson C, and Urbach P. 2006. Scientific reasoning: the Bayesian approach. 3rd edition. Open Court, Chicago, Illinois, USA.
Park T, and Casella G. 2008. The Bayesian Lasso. Journal of the American Statistical Association, 103: 681–686.
Rockova V, and George EI. 2014. EMVS: the EM approach to Bayesian variable selection. Journal of the American Statistical Association, 109(506): 828–846.
Royall R. 1997. Statistical evidence: a likelihood paradigm. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, CRC Press, Boca Raton, Florida, USA.
Rudin W. 1974. Real and complex analysis. 2nd edition. McGraw-Hill, New York, New York, USA.
Salmon W. 1973. Confirmation. Scientific American, 228(5): 75–83.
Strug LJ, and Hodge SE. 2006a. An alternative foundation for the planning and evaluation of linkage analysis. I. Decoupling ‘error probabilities’ from ‘measures of evidence’. Human Heredity, 61: 166–188.
Strug LJ, and Hodge SE. 2006b. An alternative foundation for the planning and evaluation of linkage analysis. II. Implications for multiple test adjustments. Human Heredity, 61: 200–209.
Tibshirani R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1): 267–288.

Published In

FACETS
Volume 3, Number 1, October 2018
Pages: 563–583
Editor: Patrick Ingram

History

Received: 20 November 2017
Accepted: 6 February 2018
Version of record online: 25 May 2018

Data Availability Statement

All relevant data are within the paper.

Key Words

  1. multiple testing
  2. sparsity
  3. statistical evidence
  4. relative belief ratios
  5. priors
  6. checking for prior–data conflict
  7. relative belief multiple testing algorithm
  8. testing for sparsity

Authors

Michael Evans [email protected]
Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada
Jabed Tomal
Department of Computer and Mathematical Sciences, University of Toronto Scarborough, 1265 Military Trail, Toronto, ON M1C 1A4, Canada

Author Contributions

All conceived and designed the study.
All performed the experiments/collected the data.
All analyzed and interpreted the data.
All contributed resources.
All drafted or revised the manuscript.

Competing Interests

ME is currently serving as a Subject Editor for FACETS, but was not involved in review or editorial decisions regarding this manuscript.
