A model of stochastic memoization and name generation in probabilistic programming: categorical semantics via monads on presheaf categories

Stochastic memoization is a higher-order construct of probabilistic programming languages that is key in Bayesian nonparametrics, a modular approach that allows us to extend models beyond their parametric limitations and compose them in an elegant and principled manner. Stochastic memoization is simple and useful in practice, but semantically elusive, particularly regarding dataflow transformations. As the naive implementation resorts to the state monad, which is not commutative, it is not clear whether stochastic memoization preserves the dataflow property -- i.e., whether we can reorder the lines of a program without changing its semantics, provided the dataflow graph is preserved. In this paper, we give an operational and categorical semantics to stochastic memoization and name generation in the context of a minimal probabilistic programming language, for a restricted class of functions. Our contribution is a first model of stochastic memoization of constant Bernoulli functions with a non-enumerable type, which validates dataflow transformations, bridging the gap between traditional probability theory and higher-order probability models. Our model uses a presheaf category and a novel probability monad on it.


Introduction
Bayesian nonparametric models are a powerful approach to statistical learning. Unlike parametric models, which have a fixed number of parameters, nonparametric models can have an unbounded number of parameters that grows as needed to fit complex data. This flexibility allows them to capture subtle patterns in data that parametric models may miss, and it makes them more composable, because they are not arbitrarily truncated. Prominent examples of nonparametric models include Dirichlet process models for clustering similar data points, and the Infinite Relational Model for automatically discovering latent groups and features, amongst others. These infinite-dimensional models can accommodate an unbounded number of components, clusters, or other features in order to fit observed data as accurately as possible.

Probabilistic programming is a powerful method for programming nonparametric models. Stochastic memoization [47,57] has been identified as a particularly useful technique in this. This paper is about semantic foundations for stochastic memoization. In deterministic memoization [38], the idea is to compute a function the first time it is called with a particular argument, and store the result in a memo-table. When the function is called again with the same argument, the memo-table is used, resulting in a performance improvement but no semantic difference. Stochastic memoization is this memoization applied to functions that involve random choices; a memoized function is then semantically different from a non-memoized one, because the random choices are only made once for each argument (see [14] in the statistics literature, or [2,9,51] in the semantics literature, and references therein). The simple example of a memoized constant Bernoulli function is easy to implement using a memo-table, but already semantically complicated. If we put A = R, the real numbers, for the base measure, as is common in statistical modelling, then the memoized constant Bernoulli distribution on (A → 2) is one-dimensional white noise: intuitively, for every x ∈ R we toss a coin to pick true or false, making an uncountable number of independent random choices. (As an aside, we note that we could combine steps (i) and (ii), using a complicated base measure for the Dirichlet process that includes all the attributes. This model would not be compositional, and in any case, some kind of memoization would still be needed to implement the Dirichlet process.)

Challenge.
In this paper, we address the challenge of showing that the following items are consistent:
(1) a type A with a diffuse probability distribution (Def. 2.2);
(2) a type bool of Booleans with Bernoulli probability distributions (i.e. tossing coins, including biased coins);
(3) a type of functions [A → bool], with function application (4);
(4) stochastic memoization of the constant Bernoulli functions (3);
(5) the language supports the dataflow property (Def. 2.3).
These items are together inconsistent with traditional measure theory, as we discuss in Section 2.3, where we also make the criteria precise. Nonetheless, (1)-(4) are together easy to implement in a probabilistic programming language, and useful for Bayesian modelling. Item (5) is a very useful property for program reasoning and program optimization. Item (5) is also a fundamental conceptual aspect of axiomatic probability theory, since in the measure-theoretic setting it amounts to Fubini's theorem [32] and the fact that probability measures have mass 1, and in the categorical abstraction of Markov categories [13] it amounts to the interchange law of affine monoidal categories. There are measure-theoretic models where some of these items are relaxed (§2.1-2.3). For example, if we drop the requirement of a diffuse distribution, then there are models using Kolmogorov extension (§2.2). A grand challenge is to further generalize these items, for example to allow memoization of functions A → B for yet more general A and B, and to allow memoization of all definable expressions. Since the above five items already represent a significant challenge, and our semantic model is already quite complicated, we chose to focus on a 'minimal working example' for this paper. To keep things simple and minimal, in this paper we side-step measure-theoretic issues by noticing that the equations satisfied by a diffuse probability distribution are exactly the equations satisfied by name generation (e.g. [50, §VB]). Because of this, we can use categorical models for name generation (following e.g. [41, §4.1.4], [49, §3.5]) instead of traditional measure theory. Name generation can certainly be implemented using randomness, and there are no clashes of fresh names if and only if the names come from a diffuse distribution (see also e.g. [48]).
On the other hand, if we keep things simple by regarding the generated names as pure names [40], we avoid any other aspects of measure theory, such as complicated manipulations of the real numbers.

Contributions.
To address the challenge of the consistency of items (1)-(5) above, our main contributions are as follows.
(i) We first provide an operational semantics for a minimal toy probabilistic programming language that supports stochastic memoization and name generation (§4).
(ii) We then (§5) construct a cartesian closed (for function spaces) categorical model of this language, endowed with an affine commutative monad (Theorem 5.5). In common with other work on local state (e.g. [28,44]), we use a functor category semantics, indexing sets by possible worlds. In this paper, those worlds are finite fragments of a memo-table.
(iii) We prove that our denotational semantics is sound with respect to the operational semantics, ensuring the correctness of our approach and validating that lines can be reordered in the operational semantics (Theorem 5.10). The class of functions that can be memoized includes constant Bernoulli functions; we call these functions freshness-invariant (Definition 5.7). The soundness theorem (5.10) is not trivial because the timing of the random choices differs between the operational and denotational semantics. In the operational semantics, the memo-table is partial, and populated lazily as needed, when functions are called with arguments. This is what happens in all implementations. However, this timing is intensional; by contrast, in the denotational semantics, the memo-table is totally populated as soon as the current world is extended with any functions or arguments.
(iv) Finally, we present a practical Haskell implementation [26] which compares the small-step operational, big-step operational, and denotational semantics, demonstrating the applicability of our results (§6).

Stochastic memoization by example
As noted at the beginning of this section, we will pass between an internal metalanguage for strong monads, and an ML-like programming language that would be interpreted using strong monads. In Section 3 we introduce this programming language precisely, but for now we note that it has a special syntax λמ x. u, meaning mem (\x → u), since this is a common idiom. The law of Definition 2.1 requires equations such as (2). The examples in the introduction use memoization of a constant Bernoulli function, i.e. λמ x. bernoulli p, where bernoulli p :: Prob Bool is a Bernoulli probability distribution (biased coin toss) with bias p. An intuition is that this is binary white noise: every point in a has an independently chosen random Boolean value.
Notice that for the laws we have also needed function application (4). In summary, memoized constant Bernoulli functions (3) and function application (4) are a bare minimum for discussing semantic issues around stochastic memoization. We now consider interpretations where the domain a is finite (§2.1), countable (§2.2), and uncountable (§2.3).

Memoization with finite domain
For finite domains a, memoization is straightforward: we simply sample a value of f(x) for every inhabitant x of a and return the assignment as a finite mapping. For example, when a = bool, memoization can be implemented directly in Haskell.

Semantic interpretation with finite domain.
Memoization with finite domains is supported by a denotational semantics using any strong monad, for example the category of sets and the monad of finitely supported probability distributions (e.g. [23]). For a = bool, this is nothing but the double-strength P(b) × P(b) → P(b × b). For other finite a, it is defined from the double-strength by induction.
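The finite-domain memoization just described can be sketched in Haskell. This is a hedged reconstruction, not the paper's own code; the name `memBool` is ours. For the domain bool we sample f at both inhabitants up front and return a pure lookup function.

```haskell
import Data.Functor.Identity  -- only used in the usage example below

-- Hedged sketch of finite-domain memoization: sample f at every
-- inhabitant of the (two-element) domain, then return a pure function.
memBool :: Monad m => (Bool -> m b) -> m (Bool -> b)
memBool f = do
  vTrue  <- f True    -- one monadic sample per inhabitant of the domain
  vFalse <- f False
  return (\x -> if x then vTrue else vFalse)
```

For instance, `runIdentity (memBool (Identity . not))` is just the pure function `not`. With a probability monad in place of `Identity`, the two samples are drawn once and for all, which is exactly the double-strength P(b) × P(b) → P(b × b) mentioned above.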

Memoization with countable/enumerable domain
When a is enumerable, such as a = Int, memoization is useful for defining point processes. Memoization can be regarded as providing an infinite stream of random choices, since the streams over b are isomorphic to the functions a → b.
Infinite streams of random choices are crucial examples of statistical processes [14]. For an example of an application, recall the one-dimensional Poisson point process: a random sequence of real numbers in which the gaps between consecutive numbers are exponentially distributed. We implement memoization with enumerable a in the Haskell LazyPPL library [10] without using state, instead using Haskell's laziness and tries, following [22]. We use the Poisson process extensively in the demonstrations for LazyPPL [52].
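The laziness idea behind this state-free memoization can be illustrated in pure Haskell. This is our own simplified stand-in (LazyPPL uses tries over arbitrary enumerable types; `memoNat` covers only non-negative Ints):

```haskell
-- Hedged illustration: memoization via a lazy table. Each entry of
-- `table` is computed at most once, the first time it is demanded.
memoNat :: (Int -> b) -> (Int -> b)
memoNat f = (table !!)
  where table = map f [0..]

-- In a lazy probability monad, the same trick memoizes random choices:
-- the stream of samples is generated lazily, and each element is forced
-- (i.e. sampled) only the first time it is demanded.
```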

Semantic interpretation with enumerable domains.
Memoization with enumerable domains is supported by a denotational semantics using the category of measurable spaces and the Giry monad [15]. Although the category is not cartesian closed, the function space B^N does exist for all standard Borel B, and is given by the countable product ∏_N B of B with itself. Memoization amounts to using Kolmogorov's extension theorem to define a map (G B)^N → G(B^N) (see [45, §4.8] and [9, Thm. 2.5]).

Memoization with non-enumerable/diffuse domain
We now move beyond enumerable domains, to formalize the challenge from Section 1, which we illustrated there with a clustering model. See [52] for the full implementation in our Haskell library, LazyPPL, along with other models that also use memoization, including a feature extraction model that uses the Indian Buffet Process, and relational inference with the infinite relational model (following [18]).
Rather than axiomatizing uncountability, we consider diffuse distributions.

Definition 2.2 [Diffuse distribution]
Let a be an object with an equality predicate ((a,a) → bool). A diffuse distribution is a term p such that

  let x = p in let y = p in return (x = y)

is semantically equal to return(false).
For example, in a probabilistic programming language over the real numbers, we can let a be the type of real numbers and let p be a uniform distribution on [0, 1], or a normal distribution, or an exponential distribution.These are all diffuse in the above sense.The Bernoulli distribution on the booleans is not diffuse, because there is always a chance that we may get the same result twice in succession.
For the reader familiar with traditional measure theory, we recall that if p is diffuse then a is necessarily an uncountable space, since any probability distribution on a countable discrete space must give non-zero measure to at least one singleton set. The implementation trick using tries from Section 2.2 will not work for diffuse measures, because we cannot enumerate the domain of a diffuse distribution. It is still possible to implement memoization using state and a memo-table (e.g. [52]). Unlike a fully stateful effect, however, in this paper we argue that stochastic memoization is still compatible with commutativity/dataflow program transformations:

Definition 2.3 [Dataflow property]
A programming language is said to have the dataflow property if program lines can be reordered (commutativity) and discarded (discardability, or affineness) provided that the dataflow is preserved. In other words, the language satisfies the following commutativity (5) and discardability (6) equations:

  let x = t in let y = u in v  =  let y = u in let x = t in v   (x ∉ fv(u), y ∉ fv(t))   (5)
  let x = t in u  =  u   (x ∉ fv(u))   (6)

The dataflow property expresses the fact that, to give a meaning to programs, the only thing that matters is the topology of dataflow diagrams. These transformations are very useful for inference algorithms and program optimization. But above all, on the foundational side, dataflow is a fundamental concept that corresponds to monoidal categories and is crucial for a model of probability. As for monoidal categories, a strong monad is commutative (5) if and only if its Kleisli category is monoidal (commutativity is the monoidal interchange law), and affine (6) if the monoidal unit is terminal. In synthetic probability theory, dataflow is regarded by various authors as a fundamental aspect of the abstract axiomatization of probability: Kock [31] argues that any monad that is strong, commutative and affine can be abstractly viewed as a probability monad, and affine monoidal categories are used as a basic setting for synthetic probability by several authors [7,13,55,56]. The reader familiar with measure-theoretic probability will recall that the proof that the Giry monad satisfies (5) amounts to Fubini's theorem for reordering integrals (e.g. [51]).
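The dataflow property can be checked concretely on a toy example. The following is a hedged, self-contained finite-distribution monad of our own (not the paper's presheaf monad), used to verify that reordering two independent lines of a probabilistic program preserves the distribution over outcomes:

```haskell
import Data.List (sort)

-- A minimal finite-distribution monad: a list of (outcome, weight) pairs.
newtype Dist a = Dist { runDist :: [(a, Rational)] }

instance Functor Dist where
  fmap f (Dist xs) = Dist [ (f a, p) | (a, p) <- xs ]

instance Applicative Dist where
  pure a = Dist [(a, 1)]
  Dist fs <*> Dist xs = Dist [ (f a, p * q) | (f, p) <- fs, (a, q) <- xs ]

instance Monad Dist where
  Dist xs >>= k = Dist [ (b, p * q) | (a, p) <- xs, (b, q) <- runDist (k a) ]

bernoulli :: Rational -> Dist Bool
bernoulli p = Dist [(True, p), (False, 1 - p)]

-- Collapse duplicated outcomes so distributions can be compared
-- up to reordering and merging of weights.
weight :: Ord a => Dist a -> [(a, Rational)]
weight (Dist xs) = merge (sort xs)
  where
    merge ((a, p) : (b, q) : rest) | a == b = merge ((a, p + q) : rest)
    merge (x : rest) = x : merge rest
    merge [] = []
```

Because `Dist` is commutative, the programs `do { x <- bernoulli 0.5; y <- bernoulli 0.3; return (x, y) }` and `do { y <- bernoulli 0.3; x <- bernoulli 0.5; return (x, y) }` denote the same distribution, exactly as equation (5) demands.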

Semantic interpretations for diffuse domains
The point of this paper is to provide the first semantic interpretation for memoization of the constant Bernoulli functions (3) with diffuse domain (Def. 2.2). We emphasize that although other models can support some aspects of this, no prior work supports everything.
• With countable domain, there is a model in measurable spaces, as discussed in Section 2.2.But there can be no diffuse distribution on a countable space.
• In measurable spaces, we can form the uncountable product space ∏_R 2 of R-many copies of 2. We can then define a white noise probability measure on ∏_R 2 via Kolmogorov extension (e.g. [45, 4.9(31)]). Moreover, there are diffuse distributions on R, such as the uniform distribution on [0, 1]. However, it is known that there is no measurable evaluation map R × (∏_R 2) → 2 (see [1]), and so we cannot interpret function application (4).
• In quasi-Borel spaces [21], there is a quasi-Borel space [R → 2] of measurable functions, and a measurable evaluation map R × [R → 2] → 2. However, white noise does not exist as a probability measure on [R → 2] ([9, Thm. 2.5]).
• There are domain-theoretic treatments of probability theory that support Kolmogorov extension, uniform distributions on R, and function spaces [20,25]. However, these treatments regard the real numbers R as constructive, and hence there are no non-trivial continuous morphisms R → 2, and there is no equality test on R, so we cannot regard R with a diffuse distribution as formalized equationally in Definition 2.2. The same concern seems to apply to recent approaches using metric monads [36].
• The semantic model of beta-bernoulli in [53] is a combinatorial model that includes aspects of the beta distribution, which is diffuse in measure theory. That model does not support stochastic memoization, but as a presheaf-based model it is a starting point for the model in this paper.
• There is a straightforward implementation of stochastic memoization that uses local state, as long as the domain supports equality testing [52]. The informal idea is to make the random choices as they are needed, and remember them in a memo-table.

There are other models of higher-order probability (e.g. [6,8,12]). These do not necessarily fit into the monad-based paradigm, but there may be other ways to use them to address the core challenge in Section 1.
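The stateful memo-table idea mentioned in the last bullet can be sketched as follows. This is a hedged reconstruction of our own (the names `Table` and `memoApply` are not from the paper), threading the table explicitly rather than using a state monad; the domain only needs an equality test, so it applies to diffuse domains too.

```haskell
-- A memo-table as an association list of already-sampled results.
type Table k v = [(k, v)]

-- Apply a probabilistic function through the memo-table: sample f at x
-- only the first time; afterwards, reuse the stored result.
memoApply :: (Monad m, Eq k) => (k -> m v) -> k -> Table k v -> m (v, Table k v)
memoApply f x tbl = case lookup x tbl of
  Just v  -> return (v, tbl)              -- seen before: reuse stored sample
  Nothing -> do v <- f x                  -- first call: sample once, record
                return (v, (x, v) : tbl)
```

Running this in the list monad (nondeterminism standing in for a coin flip) shows the memoization semantics directly: two calls on the same argument always agree within each trace, i.e. the choice is made once.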

A language for stochastic memoization and name generation
Our probabilistic programming language has a minimal syntax, emphasizing the following key features:
• name generation: we can generate fresh names (referred to as atomic names or atoms, in the sense of Pitts' nominal set theory [43]) with constructs such as let x = fresh() in ···. In the terminology of Def. 2.2, this is like a generic diffuse probability measure, since fresh names are distinct.
• basic probabilistic effects: for illustrative purposes, the only distribution we consider, as a first step, is the Bernoulli distribution (but it can easily be extended to other discrete distributions). Constructs like let b = flip(θ) in ··· amount to flipping a coin with bias θ and storing its result in a variable b.
• stochastic memoization: if a probabilistic function f, defined with the new λמ operator, is called twice on the same argument, it should return the same result (eq. (2)).
We have the following base types: bool (booleans), A (atomic names), and F (which can be thought of as the type of memoized functions A → bool). For the sake of simplicity, we do not have arbitrary function types. In fine-grained call-by-value fashion [33], there are two kinds of judgments: typed values and typed computations. The grammar and typing rules of our language are given in Table 1. The typing rules are standard, except for the λמ operator, which is the key novelty of our language; its typing rule is explained in the next section. (Also, equality v = w and memoized function application v@w are pure computations, i.e. in the categorical semantics (Section 5.3) they will be composed with the unit of the monad.)

Table 1: Grammar and typing rules of the language.

Typing judgements: the rules for typed values and typed computations are given in Table 1.
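As a hedged illustration of the grammar and judgement forms just described, the syntax can be rendered as Haskell data types. The constructor names are ours, not the paper's, and this is only a plausible reading of the constructs named in the text:

```haskell
-- Base types of the language: bool, atomic names A, memoized functions F.
data Ty = TBool | TAtom | TFunc
  deriving (Eq, Show)

-- Values, in fine-grained call-by-value style.
data Val = Var String | BTrue | BFalse | Pair Val Val
  deriving (Eq, Show)

-- Computations.
data Comp
  = Return Val                -- return(v)
  | Let String Comp Comp      -- let x = u in t
  | Fresh                     -- fresh(): generate an atom
  | Flip Double               -- flip(theta): biased coin
  | Eq Val Val                -- v = w, equality of atoms
  | App Val Val               -- v@w, memoized function application
  | MemLam String Comp        -- the memoizing abstraction (lambda-mem)
  deriving (Eq, Show)
```

For example, `Let "x" Fresh (Return (Var "x"))` renders the program "let x = fresh() in return(x)".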

Operational Semantics
We now present a small-step operational semantics for our language. The operational semantics defines the rules for reducing program expressions, which form the basis for understanding the behavior of programs written in the language. Henceforth, we fix a countable set of variables x, y, z, . . . ∈ Var, and consider terms up to α-equivalence for the λמ operator. Since we focus on functions with boolean codomain, our partial memo-tables are represented as partial bigraphs (bipartite graphs).

Definition 4.1 [Partial bigraph] A partial bigraph g = (g_L, g_R, E) is a finite bipartite graph where the edge relation E : g_L × g_R → {true, false, ⊥} is either true, false or undefined (⊥) on each pair of left and right nodes (ℓ, a) ∈ g_L × g_R. In the following, left nodes will be thought of as function labels and right nodes as atom labels. By abuse of notation, syntactic truth values will be conflated with semantic ones. For a partial graph g, E(ℓ, a) = β ∈ {true, false, ⊥} will be written ℓ →β a when g is clear from the context.
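A partial bigraph admits a direct Haskell rendering; this is a hedged encoding of our own (type and function names are not from the paper), in which ⊥ is represented by the absence of an entry:

```haskell
-- Left nodes are function labels, right nodes are atom labels.
type FunLabel  = Int
type AtomLabel = Int

data PartialBigraph = PartialBigraph
  { leftNodes  :: [FunLabel]
  , rightNodes :: [AtomLabel]
  , edges      :: [((FunLabel, AtomLabel), Bool)]  -- absent pair = undefined
  } deriving (Eq, Show)

-- E(l, a): Just True, Just False, or Nothing (undefined).
edge :: PartialBigraph -> FunLabel -> AtomLabel -> Maybe Bool
edge g l a = lookup (l, a) (edges g)

-- Populate a missing entry, as the small-step semantics does lazily
-- when a memoized function is first called on an atom.
record :: PartialBigraph -> FunLabel -> AtomLabel -> Bool -> PartialBigraph
record g l a b = g { edges = ((l, a), b) : edges g }
```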

Extended expressions
We introduce extended expressions e, by extending the grammar of computations (1) with an extra construct {{u}}^{ℓ,a}_γ, where u is a computation, (ℓ, a) is a pair of function and atom labels to memoize, and γ is the environment to restore after the result of ℓ at a has been computed and stored. Intuitively, the decoration {{−}}^{ℓ,a}_γ is thought of as a memoization context, indicating expressions where memoization should happen: {{u}}^{ℓ,a}_γ is a computation that memoizes the result of u, and then restores the environment to the state it was in before u was evaluated. In the following, ∆ ∈ ⋃_{n≥0} (g_L × g_R)^n is a finite stack of function-atom label pairs, indicating that we are in the process of computing the result of these functions at these atoms for the first time. Each newly introduced function-atom label pair is assumed not to already belong to the memoization stack.

Configurations
We now define the set-theoretic interpretation of contexts. Context values are built by combining booleans, atomic names and functions using pairing. Thus a context value is a tree, where the branches are understood as pairing.
We now present terminal computations, redexes, reduction contexts, and configurations (Table 3). A configuration (γ, u, g, λ) encapsulates the computation state, which helps keep track of the different parts of the program as the computation proceeds:
• γ is a context value;
• u is an extended expression Γ | ∆ ⊢_c u : A;
• g = (g_L, g_R, E) is a partial bigraph;
• λ is a map from the partial graph to closures.

Reduction rules
Let (−)_γ be the function evaluating an expression value in a context value γ (e.g. x_γ = γ(x), true_γ = true, etc.). We can define the operational semantics of the language using reduction rules. They provide a step-by-step description of how expressions are evaluated and transformed during execution, following a left-most outer-most strategy, with lexical binding. Given a configuration (γ, u, g, λ) (note that if u is of the form {{u′}}^{(ℓ,a)}_γ, then it is assumed that the function-atom label pair (ℓ, a) ∈ g_L × g_R), we apply the reduction rules of Table 4.

Table 4: Reduction rules.

Reduction Rules
For example, (γ, flip(θ), g, λ) reduces with probability θ to (γ, return(true), g, λ), and with probability 1 − θ to (γ, return(false), g, λ).

Example 4.6 We now give an example showcasing how these reduction rules apply to a program combining name generation, a coin flip, function abstraction, and stochastic memoization. An atom x₀ is generated and used as an argument for a function f₁, which performs a coin flip if the argument matches x₀. The outcome is then memoized and the result is returned in the second application. There are two execution traces, depending on the outcome of the coin flip (β ∈ {true, false}).

Lemma 4.7 If a configuration (γ, e, g, λ) is accessible, there exists a corresponding configuration judgement J(γ, e, g, λ) ≝ Γ | ∆ ⊢_c e : A, where γ ∈ Γ, such that J(γ, e, g, λ) is derivable (with Tables 1 and 2).
Since there is at most one redex per (extended) expression and we do not have recursion (so the dataflow graph is acyclic, with no self-loops), we can prove that:

Lemma 4.8 If an accessible configuration has a derivable judgement Γ | ∆ ⊢_c e[v@w] : A, then the memoization stack ∆ does not contain a function-atom label pair with v_γ as first component.
As a corollary, we can then prove that a configuration is accessible only if its memoization stack has no duplicates:

Lemma 4.9 If a configuration (γ, e, g, λ) is accessible and J(γ, e, g, λ) = Γ | ∆ ⊢_c e : A, then ∆ contains no duplicate function-atom label pairs.

This in turn enables us to ensure that the operational semantics satisfies the memoization equations:

Proposition 4.10 If e₁ and e₂ are programs given by the two sides of the memoization equation (2), then the configurations (∅, e₁, ∅, ∅) and (∅, e₂, ∅, ∅) have the same big-step operational semantics.

Denotational Semantics
In this section we propose a denotational model that satisfies the dataflow property (Def. 2.3, Theorem 5.5), supports memoization of constant Bernoulli functions (Theorem 5.8), and is sound with respect to the operational semantics of Section 4 (Theorem 5.10). Thus we show that criteria (1)-(5) of Section 1 are consistent. The memo-tables in memoization are a kind of hidden or local state, and our semantic domain is similar to other models of local state [28,37,44,46] in that it uses a possible-worlds semantics in the guise of a functor category.

Definition 5.1 [Total bigraph] A total bigraph is a partial bigraph (Def. 4.1) that does not have any undefined (⊥) elements. This represents a fully populated memo-table. We notate this g = (g_L, g_R, E_g), omitting the annotation when it is clear. An embedding between total bigraphs ι : g → g′ is a pair of injections (ι_L : g_L ↪ g′_L, ι_R : g_R ↪ g′_R) that do not add or remove edges (E_g(ℓ, a) = E_{g′}(ι_L(ℓ), ι_R(a))). Embeddings can be thought of as conservative extensions of the memo-table. We let BiGrph_emb be the category whose objects are total finite bigraphs and whose morphisms are graph embeddings.
We will interpret our types as covariant presheaves, i.e. functors in [BiGrph_emb, Set], and programs as natural transformations. We discuss this category in Section 5.1, before defining a monad (§5.2), giving a denotational semantics (§5.3), and proving a soundness theorem (§5.4).

Base category
We work in the category [BiGrph_emb, Set] of covariant presheaves on the category BiGrph_emb of finite bigraphs. The types A of the language are interpreted as presheaves ⟦A⟧. The idea is that once some functions and atomic names are fixed, and a memo-table g for them is given, then we can say what the values or expressions are: ⟦A⟧(g). The values can be renamed by permuting functions and atomic names, and are monotonic in that they remain unchanged when we conservatively extend the memo-table. This is the functorial action, ⟦A⟧(ι) : ⟦A⟧(g) → ⟦A⟧(g′).
Programs will be interpreted as natural transformations: the naturality ensures that they are invariant under permuting the functions and atomic names, or extending the memo-table. We write •_L and •_R for the one-vertex left and right graphs respectively; the denotations of the basic types are built from these. The presheaf category [BiGrph_emb, Set] has products and coproducts, given pointwise [35]. In particular, the denotation of the type of booleans is the constant presheaf 2 ≅ 1 + 1. The edge relations collect to form a natural transformation E : ⟦F⟧ × ⟦A⟧ → 2, given at stage g by (ℓ, a) ↦ E_g(ℓ, a). The category [BiGrph_emb, Set] is cartesian closed, as is any presheaf category. By currying E, we have an embedding of ⟦F⟧ into the function space 2^⟦A⟧, i.e. ⟦F⟧ ↪ 2^⟦A⟧. In fact, to keep this development simpler, we will focus on ⟦F⟧ rather than the full function space 2^⟦A⟧.
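The denotations of the basic types can plausibly be reconstructed from the surrounding text: ⟦bool⟧ is stated to be the constant presheaf, and the edge transformation E pairs elements of ⟦F⟧(g) with ⟦A⟧(g), suggesting the representables of the one-vertex graphs. This reconstruction is ours:

```latex
\llbracket \mathrm{bool} \rrbracket = 2 \quad (\text{constant presheaf}), \qquad
\llbracket \mathbb{A} \rrbracket = \mathrm{BiGrph}_{\mathrm{emb}}(\bullet_R, -) \cong (g \mapsto g_R), \qquad
\llbracket \mathbb{F} \rrbracket = \mathrm{BiGrph}_{\mathrm{emb}}(\bullet_L, -) \cong (g \mapsto g_L)
```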

Probabilistic local state monad
In the following, X, Y, Z : BiGrph_emb → Set denote presheaves, g = (g_L, g_R, E_g), g′, h, h′ ∈ BiGrph_emb denote bigraphs, and ι, ι′ : g ↪ g′ denote bigraph embeddings. We omit subscripts when they are clear from the context. Let P_f be the finite distribution monad. By considering the 'node-generation' monad N(X)(g) ≝ colim_{g↪h} X(h) on [BiGrph_emb, Set], one could be tempted to think that modelling name generation and stochastic memoization is a matter of composing these two monads. But this is not quite enough. We also need to remember, in the monadic computations, the probability of a function returning true on a fresh, unseen atom. To do so, inspired by Plotkin and Power's local state monad [44] (which was defined on the covariant presheaf category [Inj, Set], where Inj is the category of finite sets and injections), we model probabilistic and name-generation effects by the following monad, defined using a coend [35], which we name the 'probabilistic local state monad'.

Definition 5.2 [Probabilistic local state monad] For all covariant presheaves X : BiGrph_emb → Set and bigraphs g ∈ BiGrph_emb:

  T(X)(g) ≝ P_f( ∫^{g↪h} X(h) × [0,1]^{(h−g)_L} )^{[0,1]^{g_L}}

The monad T is similar to the read-only local state monad, except that any fresh node can be initialized. Every λ ∈ [0,1]^{g_L} is thought of as the probability of the corresponding function/left node yielding true on a new fresh atom; we refer to such a λ as a state of biases. The coend 'glues together' the extensions of the memo-table that are compatible with the constraints imposed by the current computation. The monad allows manipulating probability distributions over such extensions, while keeping track of the probability of new nodes. Equivalence classes in ∫^{g↪h} X(h) × [0,1]^{(h−g)_L} are written [x_h, λ_h]_g. In the coend, the quotient can be thought of as taking care of garbage collection: nodes that are not used in the bigraph environment can be discarded. We use Dirac's bra-ket notation |[x_h, λ_h]_g⟩ to denote a formal column vector of equivalence classes ranging over a finite set of h's; a formal convex sum Σ_h p_h |[x_h, λ_h]_g⟩ then denotes a finite distribution in T(X)(g). The functorial action T(X)(g ↪ g′) is defined by maps ψ_{g,g′}, where ι_L : g_L ↪ g′_L is the embedding restricted to left nodes, and h +_g g′ is the pushout in the category of graphs, regarded as an object of BiGrph_emb. T can be endowed with the structure of a [BiGrph_emb, Set]-enriched monad, that is, since [BiGrph_emb, Set] is a (cartesian) monoidal closed category, a strong monad. Its enriched unit η_X : 1 → (T X)^X and bind (−)* : (T Y)^X → (T Y)^{T X} are as follows.
η_X(g) sends an element of X(g) to the corresponding Dirac distribution, and in the bind each q_h is 0-padded accordingly. As argued before, to construct an abstract model of probability, we show that the monad is commutative. Affineness stems straightforwardly from the following lemma:

Lemma 5.4 Let X be a constant presheaf on the coslice category g/BiGrph_emb, i.e. suppose there exists a set S_0 such that X(g′) = S_0 for every g′ ∈ g/BiGrph_emb. Then T(X)(g) ≅ P_f(S_0)^{[0,1]^{g_L}}.

We have the desired dataflow property, meaning that T is an abstract model of probability [32]:

Theorem 5.5 The monad T is strong, commutative (5), and affine (6); hence the language satisfies the dataflow property (Def. 2.3).

In our language, the denotational interpretation of values, computations (return and let binding), and matching (elimination of bool's and product types) is standard. We interpret computation judgements Γ ⊢_c t : A as morphisms ⟦Γ⟧ → T(⟦A⟧), by induction on the structure of typing derivations. The context Γ is built from bool's, F, A and products, so ⟦Γ⟧ is isomorphic to a presheaf of the form ⟦bool⟧^k × ⟦F⟧^ℓ × ⟦A⟧^m, where k, ℓ, m are the numbers of booleans, functions and atoms in Γ, and X^n is the n-fold finite product in the category of presheaves. Computations of type A and F then have an intuitive interpretation:

Proposition 5.6 A computation of type A returns the label of an already existing atom, or a fresh one together with its connections to the already existing functions. A computation of type F returns the label of an already existing function, or creates a new function with its connections to already existing atoms and a fixed probabilistic bias.

For every bigraph g, we denote by R_g (resp. L_g) the set of bigraphs h ∈ g/BiGrph_emb having one more right (resp. left) node than g, and that are otherwise the same. For every e ∈ 2^{g_L} (resp. e ∈ 2^{g_R}), we denote by g +_e •_R ∈ R_g (resp. g +_e •_L ∈ L_g) the bigraph obtained by adding a new right (resp. left) node to g with connectivity e to the left (resp. right) nodes of g. We now give the denotational semantics of various constructs in our language. Henceforth, we denote normalization constants (easily inferred from the context) by Z.
Denotation of Γ ⊢_c flip(θ) : bool. First, by Lemma 5.4, we note that T(⟦bool⟧)(g) ≅ P_f(2)^{[0,1]^{g_L}} ≅ [0,1]^{[0,1]^{g_L}}. So, naturally, the map ⟦flip(θ)⟧_g is the constant function returning the bias θ.
Denotations of Γ, v : F, w : A ⊢_c v@w : bool and Γ, v : A, w : A ⊢_c v = w : bool. The map ⟦v@w⟧_g : ⟦Γ, v : F, w : A⟧(g) → [0,1]^{[0,1]^{g_L}} returns 1 if the left node corresponding to v is connected to that of w in g, and 0 otherwise; using the internal edge relation E, it is the internal composition with E. Similarly, the map ⟦v = w⟧_g : ⟦Γ, v : A, w : A⟧(g) → [0,1]^{[0,1]^{g_L}} is given by comparing the two atom labels, where [−, −] is the copairing and ι_true, ι_false : 1 → ⟦bool⟧ ≅ 2 are the coprojections.
The map ⟦fresh()⟧_g : ⟦Γ⟧(g) → T(⟦A⟧)(g) randomly chooses connections to each left node according to the state of biases, and creates a fresh right node with those connections. It suffices to consider only the bigraphs that belong to R_g, by the garbage collection afforded by the coend.
Denotation of Γ ⊢_c λמ x. u : F. λמ-abstractions are formed from computation judgements of the form Γ, x : A ⊢_c u : bool. We can decompose the extra variable x in the environment ⟦Γ, x : A⟧ ≅ ⟦Γ⟧ × ⟦A⟧; the denotation of u then gives us the edge probability of the left node (function) that we need to generate, both to the existing right nodes (atoms), and to any future right node (which needs to be remembered as a bias). This can be formalized into a natural transformation ⟦λמ x. u⟧ : ⟦Γ⟧ → T(⟦F⟧), provided that u satisfies the following property.

Definition 5.7 [Freshness-invariant functions] A function λמ x. u is freshness-invariant if, for every g, b^k ∈ 2^k, κ_i : •_L ↪ g, τ_j : •_R ↪ g and λ ∈ [0,1]^{g_L}, the probability ⟦u⟧_g returns true on a fresh atom with connectivity e is a constant p_u, independent of e ∈ 2^{g_L}. A sufficient condition for a function of the form λמ x. u to be freshness-invariant is that it has no subexpression of the form f@y, where y ∉ fv(λמ x. u).
An example thereof is λמ x. let val b ← f@x_0 in if b then true else (x = x_0). Non-examples are λמ x. let val y ← fresh() in f@y and λמ x. if f@x then false else true (the negation of f). We can interpret freshness-invariant functions as follows: the interpretation ranges over the embeddings • ↪ g picking out each a ∈ g_R, with p_u as in Def. 5.7. As a result, the probabilistic local state monad validates (2):

Theorem 5.8 The monad T supports stochastic memoization (Def. 2.1) for freshness-invariant functions (Def. 5.7), which include any function λמ x. u that does not contain a subexpression of the form f@y with y ∉ fv(λמ x. u) (so, in particular, the constant Bernoulli functions).
Proof (Sketch) The denotation of λמ-abstractions enables us to define a map T(⟦bool⟧)^{⟦A⟧} → T(⟦F⟧), which can in turn be postcomposed with φ : T(⟦F⟧) → T(⟦bool⟧^{⟦A⟧}), where

φ_g : T(⟦F⟧)(g) ≅ P_f(g_R + 2^{g_L})^{[0,1]^{g_L}} → [0,1]^{[0,1]^{g_L} × (g_R + 2^{g_L})} ≅ T(⟦bool⟧^{⟦A⟧})(g)

sends ϑ to the map (λ, a) ∈ [0,1]^{g_L} × (g_R + 2^{g_L}) ↦ p_a, where ϑ(λ) = Σ_{a′ ∈ g_R + 2^{g_L}} p_{a′} |a′⟩. We thus obtain mem : T(⟦bool⟧)^{⟦A⟧} → T(⟦bool⟧^{⟦A⟧}), and we then show eq. (1) in the presheaf topos. ✷

Proof (Sketch, of Theorem 5.10) As an intermediate step, we build a big-step semantics and show that it is sound, i.e. making a small step of the operational semantics (§4) does not change the distributions in the final big-step semantics. Next, we show that the big-step semantics of a configuration corresponds to the denotational semantics, for which the main thing to check is that the equivalence classes of the coend are respected. ✷
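In a finite bigraph, freshness-invariance (Def. 5.7) can be checked by brute force: model the body u as a map from the fresh atom's connectivity to the probability of returning true, and test constancy in that argument. A hedged Haskell sketch (all names here are ours, not the paper's):

```haskell
-- Enumerate all connectivities e in 2^{g_L} for nFuns existing functions.
allConnectivities :: Int -> [[Bool]]
allConnectivities 0 = [[]]
allConnectivities n =
  [ b : bs | b <- [False, True], bs <- allConnectivities (n - 1) ]

-- u is modelled as: fresh atom's connectivity |-> probability of true.
-- Freshness-invariance is constancy in that argument.
freshnessInvariant :: Int -> ([Bool] -> Rational) -> Bool
freshnessInvariant nFuns u = all (== head vals) vals
  where vals = map u (allConnectivities nFuns)

main :: IO ()
main = do
  print (freshnessInvariant 2 (const (1 / 2)))                  -- True
  print (freshnessInvariant 2 (\e -> if head e then 1 else 0))  -- False
```

The first check mirrors a constant Bernoulli body such as λמ x. flip(1/2); the second mirrors a body that reads f@x, i.e. the fresh atom's own edge, which is exactly the kind of non-example ruled out above.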

Haskell Implementation
We provide a practical Haskell implementation comparing the small-step operational, big-step operational, and denotational semantics, showcasing the soundness theorem with QuickCheck, in a setting analogous (albeit slightly different⁵, to better suit the specificities of Haskell) to the theoretical one presented here. The artefact is openly available [26].
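To convey the shape of the properties such an artefact checks, here is a self-contained toy (not the artefact's actual code; all names are ours): a configuration counts heads over n fair coin flips, and its direct "denotational" distribution must equal the probability-weighted mixture of the distributions of its one-step successors — exactly the pattern of the soundness theorem.

```haskell
import qualified Data.Map as Map

-- Finitely-supported distributions, just enough for this illustration.
newtype Dist a = Dist [(a, Rational)]

-- One small step: flip a fair coin, decrement the countdown, track heads.
smallStep :: (Int, Int) -> Dist (Int, Int)
smallStep (n, h) = Dist [((n - 1, h + 1), 1 / 2), ((n - 1, h), 1 / 2)]

-- Direct "denotational" semantics: after n fair flips starting from h heads,
-- the head count is h + Binomial(n, 1/2).
denote :: (Int, Int) -> Dist Int
denote (n, h) = Dist [ (h + k, choose n k / 2 ^ n) | k <- [0 .. n] ]

choose :: Int -> Int -> Rational
choose n k = fromIntegral (product [n - k + 1 .. n] `div` product [1 .. k])

-- Probability-weighted mixture of the successor denotations.
mixture :: Dist (Int, Int) -> ((Int, Int) -> Dist Int) -> Dist Int
mixture (Dist xs) f =
  Dist [ (b, p * q) | (c, p) <- xs, let Dist ys = f c, (b, q) <- ys ]

-- Normalize to a canonical association list for comparison.
collect :: Ord a => Dist a -> [(a, Rational)]
collect (Dist xs) = Map.toAscList (Map.fromListWith (+) xs)

main :: IO ()
main = do
  print (collect (denote (2, 0)))
  print (collect (mixture (smallStep (2, 0)) denote))
```

Both printed distributions agree, which is the one-configuration instance of the property that QuickCheck tests over many randomly generated configurations.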

Summary
In conclusion, we have successfully tackled the open problem of finding a semantic interpretation of stochastic memoization for a class of functions with diffuse domain that includes the constant Bernoulli functions.Our contributions pave the way for further exploration and development of probabilistic programming and the sound application of stochastic memoization in Bayesian nonparametrics.

Definition 4.2 and Example 4.3
If S is a finite set, Tree(S) ≅ ∐_{n≥0} C_n · S^{n+1} (where C_n is the n-th Catalan number, and C_n · S^{n+1} is a coproduct of C_n copies of S^{n+1}, one for each possible bracketing) denotes the set of all possible non-empty trees with internal nodes the cartesian product and leaf nodes taken in S. If S def= {s_1, s_2}, then s_1 ∈ Tree(S), (s_2, s_1) ∈ Tree(S), (s_1, (s_1, s_2)) ∈ Tree(S), …

Definition 4.4 [Set-theoretic denotation of contexts] Let g be a partial bigraph. The set-theoretic denotation ⟦−⟧ of a context Γ is defined as ⟦bool⟧ def= {true, false}, ⟦F⟧ def= g_L, ⟦A⟧ def= g_R, and ⟦−⟧ is readily extended to every context Γ. Moreover, in the following, γ ∈ ⟦Γ⟧ ⊆ Tree(2 + g_L + g_R)^Var denotes a context value.

Example 4.5 If Γ def= (x : bool, y : F, z : (
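As a sanity check of the counting in this definition, the trees in Tree(S) with n + 1 leaves number C_n · |S|^{n+1}, which a small Haskell enumeration confirms (Tree, treesWithLeaves and catalan are our own names for this sketch):

```haskell
-- Non-empty binary trees with pairing as internal nodes and leaves in S.
data Tree s = Leaf s | Pair (Tree s) (Tree s) deriving (Eq, Show)

-- All trees with exactly n >= 1 leaves drawn from the list s.
treesWithLeaves :: [s] -> Int -> [Tree s]
treesWithLeaves s 1 = map Leaf s
treesWithLeaves s n =
  [ Pair l r | i <- [1 .. n - 1]
             , l <- treesWithLeaves s i
             , r <- treesWithLeaves s (n - i) ]

-- n-th Catalan number: C_n = (2n)! / (n! (n+1)!).
catalan :: Int -> Integer
catalan n = product [fromIntegral (n + 2) .. fromIntegral (2 * n)]
            `div` product [1 .. fromIntegral n]

main :: IO ()
main = do
  print (toInteger (length (treesWithLeaves "ab" 3)))  -- 16
  print (catalan 2 * 2 ^ 3)                            -- 16 = C_2 * |S|^3
```

For |S| = 2 and three leaves, both the enumeration and the formula give C_2 · 2^3 = 16.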

Example 5.9
The denotation of let val x ← fresh() in let val f ← λמ y. flip(θ) in f@x is the composite map 1 → T(⟦bool⟧)^{T(⟦A⟧)} × T(⟦A⟧) →^{ev} T(⟦bool⟧), built from ⟦λמ y. flip(θ)⟧ and ⟦fresh()⟧, and given by * ↦ (λ ↦ θ|true⟩ + (1 − θ)|false⟩), as desired.

5.4 Soundness

Configurations are of the form (γ, e, g, λ), where e is of type A; they can be denotationally interpreted as

⟦(γ, e, g, λ)⟧ def= Σ_{ẽ ∈ 2^{U_g}} ( Π_{(ℓ,a) ∈ U_g} λ(ℓ)^{ẽ(ℓ,a)} (1 − λ(ℓ))^{1 − ẽ(ℓ,a)} ) ⟦e⟧_{g_ẽ}(γ)(λ) ∈ T(⟦A⟧)(g)

where U_g def= {(ℓ, a) | E(ℓ, a) = ⊥} ⊆ g_L × g_R, and g_ẽ extends g according to ẽ: E(ℓ, a) = ẽ(ℓ, a) for all (ℓ, a) ∈ U_g. We can then prove that the denotational semantics is sound with respect to the operational semantics:

Theorem 5.10 (Soundness) ⟦(γ, e, g, λ)⟧ ≅ Σ_{(γ,e,g,λ) → (γ′,e′,g′,λ′) with probability p} p · ⟦(γ′, e′, g′, λ′)⟧

This section discusses the law of stochastic memoization and provides examples in finite, countable, and non-enumerable domain settings. We then address the challenges posed by the naive use of the state monad, and we clarify our objective: finding a model of probability that supports stochastic memoization over non-enumerable domains, satisfies the dataflow property, and has function spaces. In what follows, we use two calculi: (a) the internal metalanguage of a cartesian closed category with a strong monad Prob, for which we use Haskell notation, but which is roughly Moggi's monadic metalanguage [42, §2.2]; (b) an ML-like programming language which is more useful for practical programming, but which would translate into language (a); this is roughly Moggi's 'simple programming language' [42, §2.3]. We assume passing familiarity with probability and monadic programming in this section; the informal discussion here sets the context, and we move to more formal arguments in Section 3.
(Recall some Haskell notation: we write \x → t for lambda abstraction, ≫= for monadic bind (i.e. Kleisli composition), and return for the unit; a do block allows a sequence of monadically bound instructions. We write const x for the constant-x function, const x = \y → x.)

Memoization law.

Definition 2.1 A strong monad supports stochastic memoization at type a → b if it is equipped with a morphism mem :: (a → Prob b) → Prob (a → b) that satisfies equation (1) in the metalanguage, for every x0 :: a and f :: a → Prob b.

There is no such measurable function representing white noise (e.g. [27, Ex. 1.2]), and there is no white-noise probability measure on [R → 2]. The intuitive reason is that, in quasi-Borel spaces, a probability measure on [R → 2] is given by a random element, i.e. a morphism Ω → [R → 2], which curries to a measurable function Ω × R → 2.

One implementation of memoization stores results in a memo-table, and keeps this memo-table in a local state associated with the function. Therefore one could use a semantic treatment of local state to analyze memoization; for example, one could build a state monad in quasi-Borel spaces. However, state effects in general do not support the dataflow property (Def. 2.3), since we cannot reorder memory assignments in general. Ideally, one could use a program logic to prove that this particular use of state does support the dataflow property. Although there are powerful program logics for local state and probability (e.g. [3]), we have not been able to use them to prove this.
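Over a finite domain, the mem morphism of Definition 2.1 is unproblematic: one can eagerly sample the whole memo-table. The following self-contained sketch uses a toy finitely-supported Dist monad in place of the abstract Prob (Dist, memBool and the helpers are our own names); it also exhibits the semantic difference memoization makes for a constant Bernoulli function:

```haskell
-- A toy finitely-supported probability monad standing in for Prob.
newtype Dist a = Dist { runDist :: [(a, Rational)] }

instance Functor Dist where
  fmap f (Dist xs) = Dist [ (f a, p) | (a, p) <- xs ]

instance Applicative Dist where
  pure a = Dist [(a, 1)]
  Dist fs <*> Dist xs = Dist [ (f a, p * q) | (f, p) <- fs, (a, q) <- xs ]

instance Monad Dist where
  Dist xs >>= k = Dist [ (b, p * q) | (a, p) <- xs, (b, q) <- runDist (k a) ]

bernoulli :: Rational -> Dist Bool
bernoulli th = Dist [(True, th), (False, 1 - th)]

-- Total probability of outcomes satisfying a predicate.
probOf :: (a -> Bool) -> Dist a -> Rational
probOf h (Dist xs) = sum [ p | (a, p) <- xs, h a ]

-- mem for the enumerable domain Bool: sample the whole memo-table eagerly.
memBool :: (Bool -> Dist b) -> Dist (Bool -> b)
memBool f = do
  vT <- f True
  vF <- f False
  return (\x -> if x then vT else vF)

-- A constant Bernoulli function, the paper's running example.
f0 :: Bool -> Dist Bool
f0 = const (bernoulli (1 / 2))

twoMemoizedCalls, twoFreshCalls :: Dist (Bool, Bool)
twoMemoizedCalls = do { g <- memBool f0; return (g True, g True) }
twoFreshCalls    = do { b1 <- f0 True; b2 <- f0 True; return (b1, b2) }

main :: IO ()
main = do
  print (probOf (uncurry (==)) twoMemoizedCalls)  -- 1 % 1
  print (probOf (uncurry (==)) twoFreshCalls)     -- 1 % 2
```

Querying the memoized function twice at the same point always agrees (probability 1), whereas two independent calls to f0 agree only with probability 1/2. For a non-enumerable domain such as R, this eager tabulation is unavailable, which is precisely the difficulty addressed in this paper.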

Table 2:
Extended expression typing rules and judgements. Here, (ℓ, a) ∉ g, for a bigraph g ∈ BiGrph_emb. Now, the extra part x is a right node, and its valuation will either be a node already in the graph described by the rest of the environment, or a new one with particular edges to the rest of the environment. The argument u can test (if it wants) what kind of node x is before returning a probability. As a result, the denotation ⟦u⟧_g