Pearl’s and Jeffrey’s Update as Modes of Learning in Probabilistic Programming

The concept of updating a probability distribution in the light of new evidence lies at the heart of statistics and machine learning. Pearl’s and Jeffrey’s rule are two natural update mechanisms which lead to different outcomes, yet the similarities and differences remain mysterious. This paper clarifies their relationship in several ways: via separate descriptions of the two update mechanisms in terms of probabilistic programs and sampling semantics, and via different notions of likelihood (for Pearl and for Jeffrey). Moreover, it is shown that Jeffrey’s update rule arises via variational inference. In terms of categorical probability theory, this amounts to an analysis of the situation in terms of the behaviour of the multiset functor, extended to the Kleisli category of the distribution monad.


Introduction
Suppose you test for a certain disease, say Covid. You take three consecutive tests, because you wish to be sure; two of them come out positive but one is negative. How do you compute the subsequent (posterior) probability that you actually have the disease? In a medical setting one starts from a prevalence, that is, an a priori disease probability, which is assumed to hold for the whole population. Medical tests are typically not perfect: one has to take their sensitivity and specificity into account. The sensitivity is the probability that the test is positive if someone has the disease; the specificity is the probability that the test is negative if someone does not have the disease.
When all these probabilities (prevalence, sensitivity, specificity) are known, one can apply Bayes' rule and obtain the posterior probability after a single test. But what if we do three tests? And what if we do a thousand tests?
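For a single test, Bayes' rule gives this posterior directly. The following is a minimal Python sketch (an illustration of ours, not the paper's code), using the example numbers introduced in Section 2: prevalence 5%, sensitivity 90%, specificity 95%.

```python
# Posterior disease probability after one positive test, via Bayes' rule.
prevalence = 0.05
sensitivity = 0.90    # P(test positive | disease)
specificity = 0.95    # P(test negative | no disease)

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior = sensitivity * prevalence / p_positive
print(round(posterior, 4))  # → 0.4865
```

Note how a single positive test raises the disease probability from 5% to roughly 49%, but not higher, because false positives among the large healthy population are almost as likely as true positives.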
It turns out that things become fuzzy when tests are repeated multiple times. One can distinguish two approaches, associated with Pearl and Jeffrey. They agree on single tests. But they may disagree wildly on multiple tests, see the example in Section 2 below. This is disconcerting, certainly in the current age of machine learning, in which so many decisions are based on statistical learning and decision making.
Earlier work (of one of the authors) [6,8] analysed the approaches of Pearl and Jeffrey. The difference there was formulated in terms of learning from 'what is right' and from 'what is wrong'. As will be recalled below, Pearl's update rule involves increasing validity (expected value), whereas Jeffrey's rule involves decreasing (Kullback-Leibler) divergence. The contributions of this paper are threefold.
• It adds the perspective of probabilistic programming. Pearl's and Jeffrey's approaches to updating are formulated, for the medical test example, in a standard probabilistic programming language, namely WebPPL [4,5], see Section 2. Pearl's update is straightforwardly expressible using built-in conditioning constructs, while Jeffrey's update involves nested inference, a simple form of reasoning about reasoning [13]. We further explore the different dynamics behind the two update techniques operationally, using rejection samplers, in Section 6.
• The paper also offers a new perspective on the Pearl/Jeffrey distinction in terms of different underlying generative models and their associated likelihoods: with Pearl's update rule one increases one form of 'Pearl' likelihood, whereas with Jeffrey's update rule one increases another form of 'Jeffrey' likelihood. These two likelihoods are described in terms of different forms of evaluating data (as a multiset of data points) with respect to a multinomial distribution. These two forms of likelihood are directly related to the respective update mechanisms, see Section 7. Pearl likelihood occurs in practice, for example as the basis of the multinomial naive Bayes classifier [12], while Jeffrey likelihood, and its difference to Pearl's, is new, as far as we know.
• Pearl's likelihood directly leads to the associated update rule, see Theorem 4.
For Jeffrey's likelihood the connection is more subtle and involves variational inference [10,11]: it is shown that Jeffrey's update is least divergent from the update rule for Jeffrey likelihood, in a suitable sense, see Theorem 6. This likelihood update rule is described categorically in terms of the extension of the multiset functor to the Kleisli category of the (discrete) distribution monad, see [3,7]. This analysis clarifies the mathematical situation, for instance in Equation (11), where it is shown that this extended multiset functor commutes with the 'dagger' reversal of channels. This is a new result, with a certain aesthetic value.
This paper develops the idea that Pearl's and Jeffrey's rules involve a difference in perspective: are we trying to learn something about an individual or about a population?

A Motivating Example
Consider some disease with an a priori probability (or 'prevalence') of 5%. There is a test for the disease with the following characteristics:
• ('sensitivity') If someone has the disease, then the test is positive with probability 90%.
• ('specificity') If someone does not have the disease, there is a 95% chance that the test is negative.
We are told that someone takes three consecutive tests and sees two positive and one negative outcome. These test outcomes are our observed data that we wish to learn from. The question is: what is the posterior probability that this person has the disease, in the light of this test data? You may wish to stop reading here and calculate this probability yourself. Outcomes, using Pearl's and Jeffrey's rule, will be provided in Examples 1 and 2 below.
Below we present several possible implementations of the medical test situation in the probabilistic programming language WebPPL [4,5], giving three different solutions to the above question. The code starts by defining a function test which models the test outcome, incorporating the above sensitivity and specificity. Here, flip(p) tosses a biased coin with bias p.

var test = function (dis) {
  return dis ? (flip(0.9) ? 'pos' : 'neg')
             : (flip(0.95) ? 'neg' : 'pos');
}
We then define three inference functions which we simply label as prog1, prog2, prog3. At this stage we do not wish to connect them to Pearl/Jeffrey. We invite the reader to form a judgement about what is the 'right' way to model the above situation with three test outcomes ('pos', 'pos', 'neg').

All functions make use of the condition command to instruct WebPPL to compute a conditional probability distribution. prog1 uses three successive conditions, while the other two use a single condition on a randomly chosen target. prog3 additionally makes use of nested inference, that is, it wraps the Infer function around part of its code. Nested inference is a form of reasoning about reasoning [13] and has been applied for example to the study of social cognition, linguistics and theory of mind [5, Ch. 6]. We give a short overview of WebPPL's semantics and usage in Section 10. All programs can be run using exhaustive enumeration or rejection sampling as inference algorithms, which we elaborate further in Section 4.
The three functions can be executed in WebPPL and the posteriors visualized using the command viz(Infer(prog1)). The posterior disease probabilities of each of the programs are respectively:
• prog1: 64%
• prog2: 9%
• prog3: 33%
The same probabilities appear in the mathematical analysis in Examples 1 and 2 below. An interesting question to ask is: suppose we do not have 3 tests (2 positive, 1 negative), but 3000 tests (2000 positive, 1000 negative). Does that change the outcome of the above computations? Not so for the second and third program, which only require a statistical sample of the data. The first program however quickly converges to 100% disease probability when the number of tests increases (still assuming the same ratio of 2 positive to 1 negative). But this first program becomes increasingly difficult to compute, because each test result emits further conditioning instructions that the inference engine needs to take into account. The two other programs on the other hand scale almost trivially. We return to this scaling issue at the end of Section 7.
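For reference, the three posterior probabilities can be computed exactly by elementary enumeration. The following sketch is plain Python rather than WebPPL (our own illustration of what exact enumeration computes, not the paper's code):

```python
# Exact posteriors for the three programs, by hand-rolled enumeration.
# Model: prevalence 5%, sensitivity 90%, specificity 95%; data: 2 positive, 1 negative.
prior = {'dis': 0.05, 'no_dis': 0.95}
p_pos = {'dis': 0.9, 'no_dis': 0.05}    # P(test = 'pos' | state)

def normalise(weights):
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# prog1: condition on each of the three outcomes (repeated Pearl updates)
pearl = normalise({x: prior[x] * p_pos[x]**2 * (1 - p_pos[x]) for x in prior})

# prog2: one Pearl update with the soft predicate q = 2/3*1_pos + 1/3*1_neg
soft = normalise({x: prior[x] * (2/3 * p_pos[x] + 1/3 * (1 - p_pos[x])) for x in prior})

# prog3: Jeffrey's update: mix the two Bayesian inversions with weights 2/3 and 1/3
post_pos = normalise({x: prior[x] * p_pos[x] for x in prior})
post_neg = normalise({x: prior[x] * (1 - p_pos[x]) for x in prior})
jeffrey = {x: 2/3 * post_pos[x] + 1/3 * post_neg[x] for x in prior}

print(round(pearl['dis'], 2), round(soft['dis'], 2), round(jeffrey['dis'], 2))
# → 0.64 0.09 0.33
```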
The three implementations will be revisited throughout the paper and related to Pearl's and Jeffrey's updates, in particular in Section 6, where we also make their semantics explicit using rejection samplers.

Multisets, Distributions, and Channels
Sections 3-5 introduce the mathematics underlying the update situations that we are looking at. This material is in essence a recap from [6,8]. We write M and D for the multiset and distribution monads on the category Sets of sets and functions. For a set X, multisets φ ∈ M(X) can equivalently be written as a function φ: X → N with finite support, or as a finite formal sum Σ_i n_i|x_i⟩, where n_i ∈ N is the multiplicity of element x_i ∈ X. Similarly, a distribution ω ∈ D(X) is written either as a function ω: X → [0,1] with finite support and Σ_x ω(x) = 1, or as a finite formal convex combination Σ_i r_i|x_i⟩ with r_i ∈ [0,1] satisfying Σ_i r_i = 1.
Functoriality of M (and D) works in the following manner. For a function f: X → Y one defines M(f): M(X) → M(Y) by M(f)(φ) = Σ_x φ(x)|f(x)⟩, and similarly for D. For a multiset φ ∈ M(X) we write ∥φ∥ ∈ N for its size, defined as the sum of its multiplicities: ∥φ∥ := Σ_x φ(x). When this size is not zero, we can define an associated distribution flrn(φ) ∈ D(X), via frequentist learning (normalisation), as flrn(φ) := Σ_x (φ(x)/∥φ∥)|x⟩. For K ∈ N we write M[K](X) = {φ ∈ M(X) | ∥φ∥ = K} for the set of multisets of size K. There is an accumulation function acc: X^K → M[K](X), given by acc(x_1, …, x_K) := Σ_i 1|x_i⟩.

A distribution ω ∈ D(X) may be seen as an urn with coloured balls, where X is the set of colours. The number ω(x) ∈ [0,1] is the probability of drawing a ball of colour x. We are interested in K-sized draws, formalised as multisets φ ∈ M[K](X). The multinomial distribution mn[K](ω) ∈ D(M[K](X)) assigns probabilities to such draws:

mn[K](ω)(φ) := (φ) · Π_x ω(x)^{φ(x)},   where   (φ) := ∥φ∥! / Π_x φ(x)!.   (1)

A Kleisli map c: X → D(Y) for the distribution monad D is often called a channel, and written as c: X → Y. For instance, the above accumulation map acc: X^K → M[K](X) can be reversed to a channel arr: M[K](X) → X^K, where arr stands for arrangement, see [7] for details. This arrangement is defined as:

arr(φ) := Σ_{x ∈ X^K with acc(x) = φ} (1/(φ)) |x⟩,   with (φ) as defined in (1).

Kleisli extension gives a pushforward operation along a channel: a distribution ω ∈ D(X) can be turned into a distribution c =≪ ω ∈ D(Y) via the formula:

(c =≪ ω)(y) := Σ_x ω(x) · c(x)(y).

This new distribution c =≪ ω is often called the prediction. One can prove:

flrn =≪ mn[K](ω) = ω.   (3)

The following two programs are equivalent ways of sampling from a prediction c =≪ ω:

y ← c =≪ ω        x ← ω
                  y ← c(x)

It shows that such sampling can be done in two steps: first sample x from ω, then sample y from c(x). The notation x ← ω is used for sampling a random element x ∈ X from a distribution ω ∈ D(X), where the randomness takes the probabilities in ω into account. This is a standard construct in probabilistic programming. If multiple samples x_i ← ω are taken, and accumulated in a multiset φ ∈ M(X), then the normalisation flrn(φ) of φ approaches the original distribution ω.
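The identity flrn =≪ mn[K](ω) = ω can be checked by exhaustive enumeration of all draws. The following Python sketch uses a small two-element example of our own choosing (not from the paper):

```python
from itertools import combinations_with_replacement
from math import factorial, prod

def multisets(xs, K):
    # all multisets of size K over xs, each represented as a dict of multiplicities
    for combo in combinations_with_replacement(xs, K):
        yield {x: combo.count(x) for x in xs}

def mn(omega, K):
    # the multinomial distribution mn[K](omega), as a list of (multiset, probability)
    result = []
    for phi in multisets(list(omega), K):
        coeff = factorial(K) // prod(factorial(n) for n in phi.values())
        result.append((phi, coeff * prod(omega[x] ** phi[x] for x in omega)))
    return result

def flrn(phi):
    # frequentist learning: normalise a multiset to a distribution
    size = sum(phi.values())
    return {x: n / size for x, n in phi.items()}

omega = {'a': 0.3, 'b': 0.7}
K = 4
# flrn =<< mn[K](omega): mix the normalised draws, weighted by their probability
mixed = {x: sum(p * flrn(phi)[x] for phi, p in mn(omega, K)) for x in omega}
print({x: round(v, 9) for x, v in mixed.items()})
```

The mixed distribution recovers omega exactly (up to floating point), for any K.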
Lastly, the tensor product ⊗ extends pointwise to channels: (c ⊗ d)(x, y) := c(x) ⊗ d(y).

Validity, Conditioning, and Pearl's Update Rule

The validity (or expected value) of a predicate p: X → [0,1] in a distribution ω ∈ D(X) is written as ω |= p and defined as:

ω |= p := Σ_x ω(x) · p(x).

When this validity is non-zero we can define the updated distribution ω|_p ∈ D(X) as:

ω|_p(x) := (ω(x) · p(x)) / (ω |= p).

For a channel c: X → Y and a predicate q: Y → [0,1] on its codomain, we can define a pullback predicate c ≫= q on X via the formula:

(c ≫= q)(x) := Σ_y c(x)(y) · q(y) = c(x) |= q.

The following result contains the basic facts that we need here. Proofs can be found for instance in [6,8].
□
The last result shows that a predicate p is 'more true' in an updated distribution ω|_p than in the original ω. The next result from [6,8] contains both the formulation of Pearl's update and the associated validity increase.
Theorem 1 Let c: X → Y be a channel with a prior distribution ω ∈ D(X) on its domain and a predicate q: Y → [0,1] on its codomain. The posterior distribution ω_P ∈ D(X) of ω, via Pearl's update rule, with the evidence predicate q, is defined as:

ω_P := ω|_{c ≫= q},   satisfying   (c =≪ ω_P) |= q  ≥  (c =≪ ω) |= q.

The proof follows from an easy combination of points (i) and (iii) of Lemma 1. The increase in validity that is achieved via Pearl's rule means that the validity of predicate q is higher in the predicted distribution obtained from the posterior distribution ω_P than in the prediction obtained from the original, prior distribution ω.
The following are two rejection samplers that allow sampling from a posterior distribution. On the left below we show how to obtain an updated distribution ω|_p via sampling, and on the right how to get a Pearl update ω|_{c ≫= q}:

while True:                    while True:
  x ← ω                          x ← ω
  y ← flip(p(x))                 y ← c(x)
  if y: return x                 if flip(q(y)): return x

The probabilistic program prog1 at the end of Section 2 computes the Pearl update. How this update works in detail will be described next.

Example 1
We are now in a situation to explain the 64% posterior disease probability claimed in Section 2. It is obtained via repeated Pearl updates. We first translate the information given there into mathematical structure.
We use X = {d, d⊥} for the set with elements d for disease and d⊥ for no-disease. The given prevalence of 5% for the disease corresponds to a prior distribution ω ∈ D(X) given by ω = 1/20|d⟩ + 19/20|d⊥⟩. The sensitivity and specificity of the test determine a channel c: X → Y to the set Y = {p, n} of test outcomes, with c(d) = 9/10|p⟩ + 1/10|n⟩ and c(d⊥) = 1/20|p⟩ + 19/20|n⟩. There are two obvious point predicates 1_p: Y → [0,1] and 1_n: Y → [0,1] on Y. We are told that there are two positive and one negative test. This translates into the conjunction (c ≫= 1_p) & (c ≫= 1_p) & (c ≫= 1_n). Since conjunction is commutative, the order does not matter. Updating with this conjunction is equivalent to three successive updates, see Lemma 1 (ii), and gives the claimed outcome:

ω|_{c ≫= 1_p}|_{c ≫= 1_p}|_{c ≫= 1_n}(d) ≈ 0.64.

This is the probability computed in prog1 in Section 2.
The validity increase associated with Pearl's update rule takes the following form.

Dagger channels and Jeffrey's update rule
First we recall that the difference (divergence) between two distributions ω, ρ ∈ D(X) is commonly expressed as Kullback-Leibler divergence, defined as:

D_KL(ω, ρ) := Σ_x ω(x) · ln(ω(x)/ρ(x)),

where ln is the natural logarithm.
The main ingredient that we need for Jeffrey's rule is the dagger of a channel c: X → Y with respect to a prior distribution ω ∈ D(X). This dagger is a channel c†_ω: Y → X in the opposite direction. It is also called Bayesian inversion, see [2,1], and it is defined on y ∈ Y as:

c†_ω(y)(x) := (ω(x) · c(x)(y)) / (c =≪ ω)(y).   (7)

We again combine Jeffrey's rule with its main divergence reduction property, from [8]. The set-up is very much as for Pearl's rule in Theorem 1, but with evidence now in the form of a distribution instead of a predicate.
Theorem 2 Let c: X → Y be a channel with a prior distribution ω ∈ D(X) and an evidence distribution τ ∈ D(Y). The posterior distribution ω_J ∈ D(X) of ω, obtained via Jeffrey's update rule, with the evidence distribution τ, is defined as:

ω_J := c†_ω =≪ τ,   satisfying   D_KL(τ, c =≪ ω_J) ≤ D_KL(τ, c =≪ ω).

The proof of this divergence decrease is remarkably hard, see [8] for details. The result says that the prediction from ω_J is less wrong than the one from ω, when compared to the 'target' distribution τ.
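The divergence decrease can be checked numerically on the medical example from Section 2. A Python sketch of ours (variable names are our own):

```python
from math import log

omega = {'d': 0.05, 'nd': 0.95}                                # prior
c = {'d': {'p': 0.9, 'n': 0.1}, 'nd': {'p': 0.05, 'n': 0.95}}  # test channel
tau = {'p': 2/3, 'n': 1/3}                                     # evidence distribution

def push(chan, dist):
    # Kleisli pushforward c =<< omega
    return {y: sum(dist[x] * chan[x][y] for x in dist) for y in ('p', 'n')}

def kl(w, r):
    # Kullback-Leibler divergence D_KL(w, r)
    return sum(w[y] * log(w[y] / r[y]) for y in w)

pred = push(c, omega)
# Bayesian inversion (dagger) and Jeffrey's update omega_j = c†_omega =<< tau
dagger = {y: {x: omega[x] * c[x][y] / pred[y] for x in omega} for y in ('p', 'n')}
omega_j = {x: sum(tau[y] * dagger[y][x] for y in tau) for x in omega}

print(round(kl(tau, push(c, omega_j)), 3), '<=', round(kl(tau, pred), 3))
# → 0.24 <= 0.983
```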
Example 2 We build on the test channel c: X → Y and prevalence distribution ω ∈ D(X) from Example 1. The first task is to compute the dagger channel f := c†_ω: Y → X. It yields:

f(p) ≈ 0.4865|d⟩ + 0.5135|d⊥⟩   and   f(n) ≈ 0.0055|d⟩ + 0.9945|d⊥⟩.

The fact that there are two positive and one negative test translates into the 'empirical' evidence distribution τ = 2/3|p⟩ + 1/3|n⟩. The posterior, updated disease distribution, obtained from this evidence, gives the 33% probability mentioned in Section 2:

(f =≪ τ)(d) = 2/3 · f(p)(d) + 1/3 · f(n)(d) ≈ 0.33.

This probability is computed by prog3 in Section 2.
The divergence decrease from Theorem 2 takes the following form:

D_KL(τ, c =≪ (f =≪ τ)) ≤ D_KL(τ, c =≪ ω).

Having seen this, we may ask: why not use the evidence distribution τ = 2/3|p⟩ + 1/3|n⟩ as a predicate q = 2/3·1_p + 1/3·1_n, and then do a single Pearl update:

ω|_{c ≫= q}(d) ≈ 0.09.   (8)

This is the distribution computed by program prog2 in Section 2.
For future use we record the following standard properties of the dagger of a channel (7).
Lemma 2 (i) Daggers preserve sequential composition: for two successive channels c: X → Y and d: Y → Z one has (d • c)†_ω = c†_ω • d†_{c =≪ ω}.

An Operational Understanding of Jeffrey's Rule

We return to the probabilistic programs of Section 2. As discussed in Section 4, prog1 expresses repeated Pearl updates. It remains to understand the difference between prog2 and prog3. As shown in (8), prog2 corresponds to a single Pearl update with the target distribution as predicate. Further, prog3 is Jeffrey's update, with the nested inference corresponding to the computation of the dagger channel c†_ω. The difference between the two programs prog2 and prog3 is surprisingly subtle, so we begin by illustrating it using a different kind of metaphor, and derive a rejection sampler for each case in turn.
Consider a large queue of people waiting in front of a club. Each person prefers either rock or pop. The club's management wants to achieve a target ratio of 75% rock fans on the inside. To that end, they equip their doorman with a special ticker device, see Figure 1. We may also wonder how the door policy influences other statistical properties of the audience (such as age or gender) which may correlate with music preference: if the prior distribution in the queue is ω, what will the resulting distribution inside the club be? For the Jeffrey policy, this is precisely described by Jeffrey's update. We summarize this section with a concrete description of rejection samplers for Pearl's update with a random target (left) and Jeffrey's update (right), corresponding to the semantics of the probabilistic programs prog2 and prog3:

while True:                      t ← τ
  t ← τ                          while True:
  x ← ω                            x ← ω
  y ← c(x)                         y ← c(x)
  if y == t: return x              if y == t: return x
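The two samplers can be simulated to make the difference tangible. A Python sketch of ours of the prog2 and prog3 semantics (approximate, sampling-based; function names are our own):

```python
import random

random.seed(1)
prior = 0.05                    # P(disease)
tau = {'p': 2/3, 'n': 1/3}      # target distribution over test outcomes

def channel(dis):
    # test channel: sensitivity 0.9, specificity 0.95
    p_pos = 0.9 if dis else 0.05
    return 'p' if random.random() < p_pos else 'n'

def draw_target():
    return 'p' if random.random() < tau['p'] else 'n'

def pearl_random_target():
    # prog2: a fresh target is drawn on every attempt
    while True:
        t = draw_target()
        x = random.random() < prior
        if channel(x) == t:
            return x

def jeffrey_sample():
    # prog3: the target is drawn once and fixed for the whole run
    t = draw_target()
    while True:
        x = random.random() < prior
        if channel(x) == t:
            return x

n = 30000
p2 = sum(pearl_random_target() for _ in range(n)) / n
p3 = sum(jeffrey_sample() for _ in range(n)) / n
print(round(p2, 2), round(p3, 2))   # close to 0.09 and 0.33, respectively
```

The only difference between the two samplers is whether the target is resampled on every attempt or kept fixed until a sample is accepted; this alone separates the 9% and 33% posteriors.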

Likelihoods and Generative Models for Pearl and Jeffrey
This section first identifies two forms of likelihood of data in a situation with a statistical model given by a channel X → Y and a distribution on X. It then relates these two forms of likelihood to the two update rules (repeated Pearl and Jeffrey) of Theorems 1 and 2.
Definition 1 Let ψ ∈ M[K](Y ) be a multiset of data, of size K = ∥ψ∥ ∈ N. Let c : X → Y be a channel with a distribution ω ∈ D(X) on its domain.
(i) The Jeffrey likelihood of the multiset ψ is given by the number mn[K](c =≪ ω)(ψ).
(ii) The Pearl likelihood of ψ in the same model is the first expression below, which has several alternative formulations; it uses the abbreviation mn[K](c) := mn[K] • c:

(mn[K](c) =≪ ω)(ψ) = Σ_x ω(x) · mn[K](c(x))(ψ) = (ψ) · (ω |= &_{y∈Y} (c ≫= 1_y)^{ψ(y)}).

Associated to these two likelihoods are different generative models, i.e. distributions over multisets, in D(M[K](Y)), which we evaluate on the dataset ψ. For the Jeffrey likelihood in item (i) we first take the Kleisli extension c =≪ (−) of c and then the multinomial, as in the composite mn[K] • (c =≪ (−)). We can concisely illustrate this with string diagrams, using an informal 'plate' notation to copy parts of the string diagram (inspired by the use of plates in graphical models), see Figure 2 on the left. In contrast, for the Pearl likelihood in item (ii) we use the composite mn[K](c) := mn[K] • c in the pushforward mn[K](c) =≪ ω. Here, the plate does not extend over the distribution ω, whose output is copied instead of resampled, see Figure 2 on the right. The Pearl likelihood is used in the multinomial naive Bayes classifier [12]. For the likelihood of Jeffrey we shall see alternative formulations in Section 8 below.
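On the running example (K = 3, ψ = 2|p⟩ + 1|n⟩) the two likelihoods can be computed directly. A Python sketch of ours:

```python
from math import comb

omega = {'d': 0.05, 'nd': 0.95}    # prior
c = {'d': 0.9, 'nd': 0.05}         # P(p | x); P(n | x) = 1 - c[x]

def mn3(p_pos, n_pos):
    # multinomial probability of n_pos positive outcomes among 3 tests
    return comb(3, n_pos) * p_pos**n_pos * (1 - p_pos)**(3 - n_pos)

# Jeffrey: first form the prediction c =<< omega, then draw the whole multiset from it
pred_pos = sum(omega[x] * c[x] for x in omega)
jeffrey_lik = mn3(pred_pos, 2)

# Pearl: draw a single x, then draw the whole multiset from c(x)
pearl_lik = sum(omega[x] * mn3(c[x], 2) for x in omega)

print(round(jeffrey_lik, 5), round(pearl_lik, 5))  # → 0.02329 0.01892
```

The two numbers differ because Jeffrey resamples the individual for every test, whereas Pearl fixes one individual for all three tests.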
Our first result says that minimising the Kullback-Leibler divergence that occurs in Theorem 2 (and that is actually reduced by Jeffrey's update rule) corresponds to maximising the Jeffrey likelihood of Definition 1 (i).
Theorem 3 (i) For distributions ω, ω′ ∈ D(X) and channels c, c′: X → Y, with data ψ ∈ M(Y), we have that Jeffrey likelihood is oppositely ordered to Kullback-Leibler divergence:

mn[K](c =≪ ω)(ψ) ≤ mn[K](c′ =≪ ω′)(ψ)   iff   D_KL(flrn(ψ), c′ =≪ ω′) ≤ D_KL(flrn(ψ), c =≪ ω).

(ii) Fix a channel c: X → Y. Then:

argmax_{ω ∈ D(X)} mn[K](c =≪ ω)(ψ) = argmin_{ω ∈ D(X)} D_KL(flrn(ψ), c =≪ ω).

The expression on the right is the divergence between the data distribution and the prediction c =≪ ω. This divergence can be reduced via Jeffrey's rule, see Theorem 2. The above result thus says that Jeffrey's rule increases the Jeffrey likelihood.
Proof. We only prove the first item, since the second one is a direct consequence. We use that the natural logarithm ln: R_{>0} → R preserves and reflects the order: a ≤ b iff ln(a) ≤ ln(b); this is used in the first step below. We additionally use that the logarithm sends multiplications to sums.
We also relate Pearl likelihood to Pearl's update rule.
Theorem 4 Consider a channel c: X → Y with distribution ω ∈ D(X) and data ψ ∈ M[K](Y). The validity increase of Theorem 1, applied to the last formulation of Pearl likelihood in Definition 1 (ii), gives an increase of Pearl likelihood via a repetition of Pearl's rule. The resulting posterior ω_P ∈ D(X) can be described via repeated Pearl updates as:

ω_P = ω|_{&_{y∈Y} (c ≫= 1_y)^{ψ(y)}},

that is, as one update of ω for each data point in ψ. We have used such successive updates in the calculation of the disease probabilities according to Pearl in Example 1.
Proof. We first note that we can write Pearl's likelihood as:

(mn[K](c) =≪ ω)(ψ) = (ψ) · Σ_x ω(x) · Π_y c(x)(y)^{ψ(y)} = (ψ) · (ω |= &_{y∈Y} (c ≫= 1_y)^{ψ(y)}). □

The conjunction predicate &_{y∈Y} (c ≫= 1_y)^{ψ(y)} used in the above Theorem 4 loses its value in practice as soon as we have much data, that is, when the multiset ψ is big. The conjunction involves multiplication of probabilities and thus quickly becomes unmanageably small. Thus, Pearl's update works (in practice) only for small amounts of data.
There is an exception however, which is beyond the scope of the current paper. When there is a conjugate prior situation, Pearl updates may happen via updates of the hyperparameters. This does scale to big multisets of data.
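Another practical mitigation is to compute the repeated Pearl update in log space. A Python sketch of ours for 3000 tests (2000 positive, 1000 negative), confirming the convergence to 100% disease probability mentioned in Section 2:

```python
from math import exp, log

omega = {'d': 0.05, 'nd': 0.95}
p_pos = {'d': 0.9, 'nd': 0.05}
n_pos, n_neg = 2000, 1000

# log of the unnormalised posterior weight omega(x) * P(pos|x)^2000 * P(neg|x)^1000
log_w = {x: log(omega[x]) + n_pos * log(p_pos[x]) + n_neg * log(1 - p_pos[x])
         for x in omega}
m = max(log_w.values())                      # log-sum-exp normalisation
total = sum(exp(v - m) for v in log_w.values())
posterior = {x: exp(log_w[x] - m) / total for x in omega}
# 1.0 up to floating point; a direct product like 0.9**2000 * 0.1**1000 underflows to 0.0
print(posterior['d'])
```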

Jeffrey's Update Rule via Variational Inference
In this section we would like to make precise the idea that Jeffrey's update rule involves a 'population' perspective, in contrast to the individual perspective in Pearl's rule. We show how Jeffrey's rule emerges from updating a multinomial distribution mn[K](ω). There are two challenges.
• The data ψ ∈ M[K](Y) first has to be turned into evidence about draws in M[K](X); this is done via the point predicate 1_ψ, pulled back along the channel M[K](c), giving the following update of the multinomial, abbreviated as σ ∈ D(M[K](X)):

σ := mn[K](ω)|_{M[K](c) ≫= 1_ψ}.   (10)
We would like to think of this σ as a distribution of the form mn[K](ω′). The obvious way to obtain such a distribution ω′ is via frequentist learning, as flrn =≪ σ. Indeed, as we have seen before in (3), flrn =≪ mn[K](ρ) = ρ. The first of our two main results in this section is Theorem 5; it says that flrn =≪ σ is the Jeffrey update c†_ω =≪ flrn(ψ). This is a technically non-trivial result.
• Next we use techniques from variational inference [10,11]: we wish to determine the 'best' distribution ω′ such that mn[K](ω′) approximates the above distribution σ in (10). We thus look for the distribution with minimal Kullback-Leibler divergence. There again we find Jeffrey's update:

argmin_{ω′ ∈ D(X)} D_KL(σ, mn[K](ω′)) = c†_ω =≪ flrn(ψ).

This is the content of our second main result below, Theorem 6.
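The claim flrn =≪ σ = c†_ω =≪ flrn(ψ) can be verified numerically on the medical example by computing σ explicitly over all multisets of size 3. A Python sketch of ours (our own encoding: a multiset over X = {d, d⊥} is coded by its number k of d-elements):

```python
from math import comb

omega_d, K = 0.05, 3               # prior disease probability, number of tests
sens, fp = 0.9, 0.05               # P(p | d) and P(p | d⊥)

def binom(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# (M[K](c) >>= 1_psi)(phi): probability that applying the channel to the K
# elements of phi yields exactly 2 positive tests (psi = 2|p> + 1|n>),
# summing over how the positives split between diseased and healthy elements.
def lik(k):
    return sum(binom(k, i, sens) * binom(K - k, 2 - i, fp)
               for i in range(3) if 0 <= 2 - i <= K - k)

# sigma = mn[K](omega) updated with the predicate above (unnormalised weights)
weights = {k: binom(K, k, omega_d) * lik(k) for k in range(K + 1)}
total = sum(weights.values())

# flrn =<< sigma: expected fraction of d-elements in the updated draw
flrn_sigma = sum(w * k / K for k, w in weights.items()) / total

# Jeffrey's update c†_omega =<< flrn(psi), computed directly
pred = omega_d * sens + (1 - omega_d) * fp
jeffrey = (2/3) * (omega_d * sens / pred) + (1/3) * (omega_d * (1 - sens) / (1 - pred))
print(round(flrn_sigma, 6), round(jeffrey, 6))  # → 0.326161 0.326161
```

The two routes agree exactly, as Theorem 5 predicts.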

Jeffrey's rule via Frequentist Learning
Taking multisets of a particular size K ∈ N forms a functor M[K]: Sets → Sets. This functor can be extended to the Kleisli category Kℓ(D) of the distribution monad D. This works via a distributive law, see [3,7]. The extension can also be written via accumulation and arrangement, see Lemma 3 (i) below; we shall use it in that form. The resulting extension is still written as M[K]: Kℓ(D) → Kℓ(D). It sends a set/object X in Kℓ(D) to the set M[K](X) of multisets of size K. On a channel/morphism c: X → Y one defines a channel M[K](c): M[K](X) → M[K](Y) by applying the distributive law after M(c). Notice that we have written M(c) for the application of the multiset functor M: Sets → Sets, in order to distinguish it from the extension M[K]: Kℓ(D) → Kℓ(D).
Lemma 3 (i) For a channel c: X → Y and a number K ∈ N the following diagram commutes, i.e. M[K](c) = acc • c^K • arr as channels M[K](X) → M[K](Y).

Proof. This follows from the results in [7]. □

A crucial observation is that the formulation of the extension M[K](c) in Lemma 3 (i) also works for daggers. It demonstrates that 'multisets' and 'daggers' commute, see (11) below.
Proposition 4 Consider a channel c : X → Y with a distribution ω ∈ D(X) and a number K ∈ N. Then the following diagram of daggers commutes.
This means that the extended multiset functor M[K]: Kℓ(D) → Kℓ(D) commutes with daggers, where the original prior distribution ω is replaced by the multinomial distribution mn[K](ω), that is:

M[K](c)†_{mn[K](ω)} = M[K](c†_ω).   (11)

Proof. We concentrate on proving commutation of the diagram, since it implies (11) via Lemma 3 (i). We use Lemma 2 (i) as first step in the calculation. The last equation in that calculation is justified by the following three steps.
• The dagger channel arr†_{mn[K](ω)}: X^K → M[K](X) is determined on x ∈ X^K as the point distribution arr†_{mn[K](ω)}(x) = 1|acc(x)⟩.
• We again use arr =≪ mn[K](ω) = ω^K, so that we can apply Lemma 2 (ii).

For the variational characterisation we expand the divergence D_KL(σ, mn[K](ω)) into Σ_φ σ(φ) · ln(σ(φ)/(φ)) - Σ_φ σ(φ) · Σ_x φ(x) · ln ω(x), where the first summand is a constant that depends only on σ, not on ω.
Thus, in order to minimise the original divergence D_KL(σ, mn[K](ω)) we have to maximise the latter log expression Σ_φ σ(φ) · Σ_x φ(x) · ln ω(x). This is a familiar maximum likelihood estimation (MLE) problem, see e.g. [9, Ex. 17.5].

Conclusions
The difference in outcomes of Pearl's and Jeffrey's update rules remains an intriguing topic. The paper does not offer the definitive story about when to use which rule, but it does enrich the field with several new ingredients (such as the different likelihoods and variational inference) and offers a wider perspective (including probabilistic programming). The main points that we have made explicit are that, when we learn from data,
• repeated application of Pearl's rule, for each data point, corresponds to an update of the prior distribution along a multinomial channel, see Theorem 4;
• Jeffrey's rule is best understood as an update of all the multinomial draws from the prior, of which the formulation in Jeffrey's rule is a best approximation, see Theorem 6.
In these two update mechanisms there seem to be different perspectives at stake: the Pearlian posterior disease probability for an individual can be computed from a couple of tests, whereas the Jeffreyan posterior probability for a population requires many tests.
posterior. The inference algorithm can be customized using a method argument. The default algorithm for discrete problems, such as those in this paper, is exact enumeration. That is, Infer exhaustively tracks all random choices made within fn, discards those that violate the conditions, and computes the exact posterior. This strategy is only feasible for small problem instances. A typical call to Infer looks like

var posterior = Infer({method: 'enumerate'}, function () {
  var x = bernoulli({p: 0.3})
  var y = bernoulli({p: 0.9})
  condition(x == y)
  return x
})
The result is a distribution object posterior, which we can for example visualize using the command viz(posterior). We can also sample from the posterior using sample(posterior). Because Infer and sample are first-class operations in WebPPL, inference code can be nested without issue, expressing inference about inference. We use this pattern in our explanation of Jeffrey's update.
If the inference problems are no longer tractable using exact enumeration, approximate or sampling-based inference techniques can be used. The simplest is Monte Carlo simulation using rejection sampling, which will simply generate many execution traces of fn(), discard those whose conditions are not satisfied, and aggregate the results. More sophisticated algorithms are importance sampling, particle filters, variational inference and Markov chain Monte Carlo. Internally, WebPPL is compiled into continuation-passing style, which allows the Infer method a large amount of control over what happens at individual sample and condition commands [4].

Fig. 2. Graphical representation of Jeffrey likelihood on the left, and Pearl likelihood on the right, see Definition 1.
Accumulation acc and frequentist learning flrn are natural transformations between functors extended to Kleisli categories: acc: (−)^K ⇒ M[K] and flrn: M[K] ⇒ D. The functor (−)^K: Kℓ(D) → Kℓ(D) is the K-fold tensor product, and D: Kℓ(D) → Kℓ(D) is the standard extension of the monad D to its Kleisli category, given on c: X → Y by D(c) := η • (c =≪ (−)): D(X) → D(Y), where η is the unit of the monad D.

The log expression is maximal for ω = flrn =≪ σ. □

With this lemma we can get our 'variational' characterisation of Jeffrey's update rule.

Theorem 6 Consider a channel c: X → Y with distribution ω ∈ D(X) and data ψ ∈ M(Y). Jeffrey's update c†_ω =≪ flrn(ψ) is the distribution ω′ ∈ D(X) such that mn[K](ω′) diverges minimally from the multinomial update mn[K](ω)|_{M[K](c) ≫= 1_ψ}, that is:

argmin_{ω′ ∈ D(X)} D_KL( mn[K](ω)|_{M[K](c) ≫= 1_ψ}, mn[K](ω′) ) = c†_ω =≪ flrn(ψ).

Proof. By Lemma 5 this minimal distribution is flrn =≪ (mn[K](ω)|_{M[K](c) ≫= 1_ψ}). By Theorem 5 this equals Jeffrey's update c†_ω =≪ flrn(ψ). □

The ticker displays a current target (either 'Rock' or 'Pop'), and the doorman admits the next person if and only if they prefer the targeted style. The doorman can click the device to obtain a new target (either by cycling sequentially through the targets, or picking one randomly), but there remains a choice of when to click.
(i) Single Pearl Policy: pick a new target after every person.
(ii) Jeffrey Policy: pick a new target only after a person has been admitted.
It may be clear that only the Jeffrey Policy is suitable to achieve the management's goal: approximately 75% of the people who are admitted are rock fans. This is in line with the key property of Jeffrey's update rule: reducing the divergence with the target distribution τ, see Theorem 2. It is unclear what the single Pearl policy achieves in this context.