A Language for Evaluating Derivatives of Functionals Using Automatic Differentiation

We present a simple functional programming language, called Dual PCF, that implements forward mode automatic differentiation using dual numbers in the framework of exact real number computation. The main new feature of this language is the ability to evaluate correctly up to the precision specified by the user -- in a simple and direct way -- the directional derivative of functionals as well as first order functions. In contrast to other comparable languages, Dual PCF also includes the recursive operator for defining functions and functionals. We provide a wide range of examples of Lipschitz functions and functionals that can be defined in Dual PCF. We use domain theory both to give a denotational semantics to the language and to prove the correctness of the new derivative operator using logical relations. To be able to differentiate functionals -- including on function spaces equipped with their compact-open topology that do not admit a norm -- we develop a domain-theoretic directional derivative that is Scott continuous and extends Clarke's subgradient of real-valued locally Lipschitz maps on Banach spaces to real-valued continuous maps on Hausdorff topological vector spaces. Finally, we show that we can express arbitrary computable linear functionals in Dual PCF.


Introduction
In this paper, we describe a language for performing automatic differentiation on a wide set of functions, including higher-order functions and non-differentiable Lipschitz functions. This language combines the dual numbers (i.e., the algebra consisting of numbers of the form a + bε, where a and b are real and ε² = 0) with a domain-theoretic directional derivative which we introduce in this paper. The domain-theoretic directional derivative can properly and correctly handle higher-order functions and Lipschitz functions like the absolute value function or ReLU used in machine learning. The dual numbers are used to incorporate automatic differentiation into the language in a straightforward manner.
Due to its wide range of applications, automatic differentiation has been an active theoretical and practical area of research in recent years [24]. While there is a large body of work on the subject, automatic differentiation is not always implemented in a sound and rigorous way. A standard example is the if-then-else constructor, ubiquitous in numerical programs, which can give incorrect results when evaluated in automatic differentiation software [1,40].
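The failure mode can be reproduced in a few lines. The sketch below is a hypothetical toy implementation of forward-mode differentiation on (value, derivative) pairs; it is not code from [1,40] or from any particular library, and all function names are ours. The function is mathematically the identity, so its derivative is 1 everywhere, yet the branch taken at 0 returns a constant and the derivative computed there is 0.

```python
# Toy forward-mode AD on (value, derivative) pairs; all names are illustrative.

def identity_with_branch(x):
    """Equal to the identity on every real input, but defined with a test."""
    value, deriv = x
    if value == 0.0:
        # The constant 0 is returned, so the derivative information is dropped.
        return (0.0, 0.0)
    return (value, deriv)


def derivative(f, a):
    """Seed the derivative component with 1 and read off the result."""
    return f((a, 1.0))[1]


away_from_zero = derivative(identity_with_branch, 2.0)  # branch not taken
at_zero = derivative(identity_with_branch, 0.0)         # branch taken
print(away_from_zero, at_zero)
```

The two printed values differ even though the function has the same slope at both points, which is the kind of inconsistency a restriction on tests over differentiated values is meant to rule out.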
In this work, we develop a language, based on dual numbers and called Dual PCF, which has a rich set of definable functions. This set includes locally Lipschitz functions such as the absolute value function and ReLU that are not differentiable everywhere but are widely used in applications and have a set-valued generalised derivative called the Clarke subgradient, which generalises the classical gradient [10].
Several formal calculi containing a primitive for evaluating the derivative of functions have been proposed in the literature. Altogether, four main features distinguish our calculus from this other work: developing an exact computation framework using domain theory, having a recursive operator in the language, the possibility to deal with functions that are not infinitely differentiable, and the ability to compute the derivative of functionals on real-valued functions. None of the existing languages accommodates all four of these features together.
The recursive operator allows the recursive definition of functions on real numbers. Most of the programming languages in the literature, instead, assume the existence of a sufficiently rich set of basic functions over the real numbers and construct the other functions by composition, λ-abstraction, and test operator [3,4,29,33,45]. Since the recursive operator is missing, these languages are not Turing complete from the point of view of real number computation. In [33], it is claimed that the approach used in [40], together with other techniques, can be used to solve the problem of dealing with the recursive definition. However, the actual solution is left as future work.
To accommodate the recursive operator, the denotational semantics needs to describe partial elements, either in the form of partially defined functions over the real line as in [1,40] or as partial real numbers, as in the present work. The use of partial real numbers connects our treatment of the derivative operator to the field of exact real number computation. Other approaches assume that real numbers form a basic type where the arithmetic operations can be computed in constant time, giving the exact result. Therefore, the computation on reals is idealised, and the problem of the infinitary nature of real numbers is wholly avoided.
A second aspect by which our work differs from existing work is that we also consider functions that are Lipschitz but not differentiable, such as the absolute value function, which plays a key role in all applications. The presence of non-differentiable functions makes the repeated application of the derivative operator more complex. The calculi in [3,20] assume that all functions are infinitely differentiable; as a consequence, they cannot accommodate, in a coherent way, functions like absolute value or min (evaluating the minimum of two real numbers) that are Lipschitz but not everywhere differentiable. The formal languages developed in [1,40] accommodate a larger set of functions including the if-then-else constructor, but define a restricted domain for the input values and only ensure the correct evaluation of the derivative in the restricted domain.
However, the main feature of our calculus is the ability to compute the derivative of functionals on real-valued functions. We therefore extend the mechanism of automatic differentiation to provide the directional derivative of functionals. To this end, we have developed a domain-theoretic Scott continuous directional derivative for real-valued functions on Hausdorff topological vector spaces that extends the Clarke subgradient to functionals defined on function spaces where the Scott topology does not admit a norm. In addition, we are able to reduce the problem of definability of linear functionals in the language to the definability of first-order functions.
Automatic differentiation on functionals has also been considered in [8,28,33,45]. In [8], an approach quite different from ours is used: functions on reals are represented through a base of Chebyshev polynomials, and thereby the directional derivative of a functional is evaluated. In [28,45] no recursive definition of functions on reals is possible, and we have already commented on [33].
Because both recursion on real numbers and non-differentiable Lipschitz maps are included in our language, we cannot use the approach adopted by [29,45] in the denotational semantics, which creates a serious challenge for defining a denotational semantics as we undertake in this work. Here, we establish the correctness of the results given by automatic differentiation of higher-order functions and relate them to mathematical notions of derivatives.
The new domain-theoretic notion of directional derivative developed in the paper computes the support function of the well-established mathematical concept of the generalized Lebourg subgradient which reduces, when this subgradient is a singleton, to the Gateaux derivative of real-valued maps on topological vector spaces. In contrast, in [45,29] the directional derivatives developed using category theory are not related to any established mathematical notions of differentiation on topological vector spaces. In this respect, in [29], Section 7.2, which has the title "Canonical derivatives of higher order functions?", concludes with the following remark: "We hope that an exploration of such techniques might lead to an appropriate notion of computable derivative, even for higher order functions." Thus, the authors leave the problem of defining a canonical notion of derivative for higher order functions open. We claim that our approach satisfies at least some of the requirements for a canonical notion of directional derivative since we have shown the coincidence between the standard mathematical notion of derivative and a constructive notion of derivative obtained through domain theory, which, in turn, induces an effective way to differentiate functionals.
Dual PCF is simply a functional programming language with an extra basic type of dual numbers. In fact, dual numbers are at the basis of a standard approach to automatic differentiation, in which one evaluates the derivative of a function at a given point and along one variable. In our language, however, we show also that dual numbers can be used to obtain the directional derivative of a function in several variables, and the directional derivative of a functional. An important contribution is a formal proof of correctness for the computation of the directional derivative inside the language.
We also note that there exists a trick, well-known in functional analysis, to reduce the problem of evaluating the derivative of a functional along a given direction to the problem of evaluating the derivative of a first order function. Given a functional F : (R → R) → R, the derivative of F at f : R → R in the direction g : R → R can be reduced to evaluating the derivative of λx.F(λy.f(y) + x · g(y)). However, this technique has its own problems. Without an analysis of how functionals are described in automatic differentiation, there is no guarantee that the above technique will evaluate the derivative correctly. More specifically, our language contains higher-order primitives such as integration or supremum, and it is necessary to check that automatic differentiation is correctly implemented on these primitives. Furthermore, it is useful to be able to evaluate the derivative of a functional directly from the functional itself, without having to plunge the functional into another expression just to extract the derivative.
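The reduction λx.F(λy.f(y) + x · g(y)) can be illustrated on a simple discretised functional. The following is a minimal sketch, not the derivative operator of Dual PCF: the class `Dual`, the functional `energy` (a midpoint-rule approximation of the integral of f(y)² over [0, 1]) and the helper `directional_derivative` are hypothetical names introduced only for this example, and ordinary floating-point numbers stand in for exact reals.

```python
class Dual:
    """Dual number std + inf*eps with eps*eps = 0 (illustrative helper)."""

    def __init__(self, std, inf=0.0):
        self.std = std
        self.inf = inf

    @staticmethod
    def lift(value):
        """Cast a real number to a dual with zero infinitesimal part."""
        return value if isinstance(value, Dual) else Dual(float(value))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.std + other.std, self.inf + other.inf)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.std * other.std,
                    self.std * other.inf + self.inf * other.std)

    __rmul__ = __mul__


def energy(f, n=1000):
    """Midpoint-rule approximation of F(f) = integral over [0,1] of f(y)^2."""
    total = Dual(0.0)
    for k in range(n):
        v = f((k + 0.5) / n)
        total = total + v * v
    return total * (1.0 / n)


def directional_derivative(F, f, g):
    """Derivative at x = 0 of the first-order map x |-> F(y |-> f(y) + x*g(y))."""
    x = Dual(0.0, 1.0)  # the point 0 with infinitesimal part 1
    return F(lambda y: f(y) + x * g(y)).inf


# Direction g = 1 at f(y) = y: the exact value is the integral of 2*y, i.e. 1.
d = directional_derivative(energy, lambda y: y, lambda y: 1.0)
print(d)
```

The sketch works here only because every operation inside `energy` happens to propagate dual numbers correctly; for higher-order primitives such as integration or supremum that property has to be established separately.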
The wide range of applications of differentiation of functionals, some of which we elaborate in the paper, includes: calculus of variations [36], numerical solution of differential equations using Newton's method over function spaces [8], optimal control theory which features some non-differentiable functions such as the absolute value function [11], physical applications in analytical mechanics, Lagrangian and Hamiltonian mechanics [46,25] and in quantum chemistry [43].
As one of our key applications, we show how in Dual PCF one can solve initial value problems in the theory of ordinary differential equations [26]. Using the constant for integration in the language, we are able to construct Picard's functional for solving an initial value problem in the framework of domain theory [19]. To our knowledge, this is the first time a (theoretical) functional programming language can undertake this task.

Related works
Several simple calculi with a derivative operator as a primitive have been presented in the literature. A few of these do not aim to implement automatic differentiation [20,14]. Forward-mode automatic differentiation is realized in [38,39,45].
In recent years, driven by the application of reverse-mode automatic differentiation in deep learning, a series of formal calculi have been proposed, some of them based on reverse-mode automatic differentiation [9,21,37], others implementing both forward-mode and reverse-mode automatic differentiation [40]. In these works, the correctness of automatic differentiation is proved in the context of first order languages [1], or higher order functional languages [3,9,21,33,37,40]. Calculi that use logical relations to prove the correctness of derivative evaluation, a key aspect of our work, are provided in [5,14]. In several of these works, the semantics of differentiation is given using a categorical setting [3,28,37], in contrast to using the more concrete space of real numbers or its extension in domain theory. With the exception of [14,45], none of these formalisms uses the notion of Clarke subgradient. The main new result over our previous work [14] is the extension of the notion of derivative to functionals. This extension involves, among other things, transitioning from Clarke's subgradient for Banach spaces to the generalized Lebourg subgradient for topological vector spaces. Also, we consider the use of dual numbers to provide an alternative operational semantics. Additionally, we show that all computable linear functionals can be expressed in Dual PCF.
The paper is organised as follows. In the rest of this section, we recall the elementary facts about dual numbers, followed by the domain-theoretic notions and results required in this paper. In Section 2, we develop the domain-theoretic generalization of Clarke's subgradient for Hausdorff topological vector spaces. In Section 3, we introduce the domain of dual numbers. In Section 4, we present the syntax of Dual PCF, with its denotational and operational semantics, and prove adequacy. In Section 5, a wide range of examples of functions and functionals definable in Dual PCF are presented. In Section 6, we define the notion of local consistency between the real and dual parts of a function on the dual domain. In Section 7, we show that the semantic interpretations of the functions definable in Dual PCF are locally consistent. Finally, in Section 8, we prove that all linear functionals on real functions that vanish at infinity are definable in Dual PCF.
All proofs for the results are provided in the full version of this paper [15].

Dual-number preliminaries
The dual numbers are one of only three 2-dimensional "number systems" that extend the real numbers [31]. A dual number is an expression of the form a + εb, where a and b are reals, and ε is a new type of imaginary number with ε² = 0.
The ε can also be thought of as representing some very small number [7]. The number is not so small as to be zero, but it is small enough that its product with itself is zero. This leads to an intuitive picture where the dual numbers represent a one-dimensional number line where each number on that line is surrounded by a set of numbers which are infinitely close to it. We call the term a in a + bε the standard part, and the term b the infinitesimal part. This terminology is consistent with that of nonstandard analysis.
To appreciate the properties of dual numbers, let f(x) be some polynomial. It is easy to check that f(a + bε) = f(a) + b f′(a) ε, where f′ is the derivative of f. On a purely algebraic level, the above equation shows that the dual numbers are able to "accidentally" differentiate an arbitrary function. This feature of the dual numbers can be used to "induce" a computer language into computing the derivative of a subroutine, essentially by exploiting operator overloading [12]. The above presentation can be seen as an alternative formulation of "forward-mode automatic differentiation" [27], a method that is notable for not introducing any numerical approximations, and also for avoiding the exponential overhead of naive symbolic differentiation.
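The operator-overloading idea can be sketched as follows. This is a minimal illustration with floating-point components rather than exact reals; the class `Dual`, the helper `lift` and the polynomial `poly` are names chosen for this example and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A dual number a + b*eps with eps*eps = 0."""
    std: float  # standard part a
    inf: float  # infinitesimal part b

    def __add__(self, other):
        other = lift(other)
        return Dual(self.std + other.std, self.inf + other.inf)

    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0
        return Dual(self.std * other.std,
                    self.std * other.inf + self.inf * other.std)

    __rmul__ = __mul__


def lift(value):
    """Cast a real number to a dual number with zero infinitesimal part."""
    return value if isinstance(value, Dual) else Dual(float(value), 0.0)


def poly(x):
    """f(x) = 3x^2 + 2x + 1, written with ordinary arithmetic operators."""
    return 3 * x * x + 2 * x + 1


# Evaluating f at 2 + eps yields f(2) as standard part and f'(2) as
# infinitesimal part, without any symbolic manipulation of poly.
result = poly(Dual(2.0, 1.0))
print(result)
```

Here `poly` is unaware that it is being differentiated: the same subroutine accepts plain numbers or dual numbers, and only the overloaded `+` and `*` change.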

Scott topology and directional derivative on topological vector spaces
We first present the elements of domain theory and topology required here; see [2] and [23] for basic references to domain theory. We denote the closure and interior of a subset S of a topological space by S̄ and S° respectively. For a map f : for all x ∈ X. For any two bounded complete domains D and E, the function space (D → E) consisting of Scott continuous functions from D to E with the extensional order is a bounded complete domain with a basis consisting of lubs of bounded and finite families of single-step functions. If the lattice ΩX of open sets of X is a domain and if D is a bounded complete domain then for any continuous function f : Since R ⊂ IR is dense, any continuous map f : R → R ⊂ IR, considered as a continuous map f : R → IR, has a maximal extension f⋆ : IR → IR given by f⋆(x) = f[x]. We also have the following.

Proposition 2.2 The set of functions
is dense in (IR → IR) with respect to the Scott topology.
In this paper, we construct a language for differentiation of functions of first-order and functionals of second-order. The input and output of these functions and functionals will be given by bounded complete domains whose set of maximal elements consists of real numbers, in the case of functions, or contains the continuous real-valued functions, in the case of functionals. We will first study the topological properties of the space of first-order real-valued functions in this section.

Scott topologies on function spaces
The space of maximal elements of a bounded complete domain, equipped with its relative Scott topology, is Hausdorff [30] and is in fact a complete separable metrisable space [34]. Moreover, the domains encountered in this paper are constructed from IR by using Cartesian product and function space construction, and therefore they inherit the operations for addition and scalar multiplication from interval arithmetic operations on IR; e.g., for f, g : is the sum of two intervals f(x) and g(x).
As a major example that we work with in this paper, consider the function space (R → R) of all continuous real-valued functions on the real line. This function space is a real vector space with the usual operations of addition of functions and multiplication by real numbers. It is in one-to-one correspondence with the subset of functions in the maximal elements of the bounded complete domain (IR → IR) consisting of the maximal extensions of continuous real-valued functions. This subset of maximal elements inherits the relative Scott topology from the domain, but does not admit a norm (see Proposition 2.5 below).
The compact-open topology on the function space (Y → Z), the collection of all continuous functions between topological spaces Y and Z, has sub-basic open sets of the form {f ∈ (Y → Z) : f[K] ⊆ O}, where K ⊆ Y is compact and O ⊆ Z is open. The function space (R → R), equipped with the compact-open topology, is an example of a Hausdorff topological vector space. Recall that a Hausdorff topological vector space is a vector space with a Hausdorff topology with respect to which addition of vectors and scalar multiplication are continuous operations.

Proposition 2.5 The function space (R → R) with the compact-open topology does not admit a norm.
We make two additional remarks here. The sup norm topology on the function space ([0, 1] → IR) coincides with the compact-open topology as is easy to check. However, the compact-open topology on the function space (R →_b R), the set of bounded continuous maps, is strictly weaker than the sup norm topology [17, p. 284]. The same is true for the space C_0(R) := (R →_0 R), the set of continuous maps of type R → R that vanish at infinity.
Therefore, for computational reasons, we will work with Hausdorff topological vector spaces which are more general than normed vector spaces and give a unifying framework for function spaces as well as the more basic finite dimensional Euclidean spaces we consider in this paper.
Finally, we have the following result which follows from Proposition 2.2. Consider the function space (R → R) → R with its compact-open topology.

Corollary 2.6 Any continuous functional
Let f : X → R be a real-valued map on a Hausdorff topological vector space X (i.e., a vector space with a Hausdorff topology with respect to which addition and scalar multiplication are continuous operations).We have the following new notion of directional derivative, which generalises Clarke's generalised directional derivative to real-valued maps on topological vector spaces.

Definition 2.7 The domain-theoretic directional derivative
In the full version of the paper [15], we show that Lf is Scott continuous, extends Clarke's notion of generalised directional derivative on a Banach space (i.e., a complete normed vector space) [10] and, like the latter, satisfies a weaker calculus (with equality replaced by ⊑) compared to the classical derivative; this calculus includes a weaker chain rule for composition of two functions [15, section 2.2].
We will work with a well-behaved family of so-called locally Lipschitzian maps on Hausdorff topological vector spaces defined as follows. Say Lf : For a Banach space the two notions of locally Lipschitzian map and locally Lipschitz map coincide [15, section 2.2]. If f is locally Lipschitzian, then for each x ∈ X, there exists, as in the construction by Lebourg in [35], a non-empty weak* compact subset, denoted ∂f(x), of the dual space of X, such that for all x′ ∈ X, we have Lf(x, x′) = {Ax′ : A ∈ ∂f(x)}, which is a compact real interval [15, Theorem 1(i)]. If X is a Banach space, ∂f(x) coincides with Clarke's subgradient [10].
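As a worked instance of the formula Lf(x, x′) = {Ax′ : A ∈ ∂f(x)}, consider the standard textbook example (not taken from [15]) X = R and f(x) = |x|, where the dual space of R is identified with R itself:

```latex
\[
\partial f(x) =
\begin{cases}
\{1\}   & \text{if } x > 0,\\
[-1,1]  & \text{if } x = 0,\\
\{-1\}  & \text{if } x < 0,
\end{cases}
\qquad
Lf(0, x') = \{\, A x' : A \in [-1,1] \,\} = [-|x'|,\ |x'|].
\]
```

For x ≠ 0 the subgradient is a singleton and Lf(x, x′) is the single value sign(x) · x′, the classical directional derivative; only at x = 0 is the result a non-degenerate interval.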

Domain of dual number intervals
This section introduces a hierarchy of continuous domains for dual numbers, D_τ, and defines a family of mappings, (−)^d_τ, that embeds the spaces of functions on real numbers and those of functionals on functions into these domains. The domain-theoretic directional derivative of a function f on reals defines the infinitesimal part of (f)^d_τ. On the other hand, we define two families of maps, (−)^s_τ and (−)^i_τ, that extract, from a total function g on D_τ, the function f on real numbers that g represents as well as an infinitesimal perturbation of f that is used in the computation of directional derivatives.
The maps ( )^d_τ, ( )^s_τ, and ( )^i_τ are given on a limited hierarchy of types τ formally defined by the following. We define first-order function types to be the types having the form δ → (... → (δ → δ)), and second-order function types to be the types having the form τ_1 → (... → (τ_n → δ) ...) with τ_1, ..., τ_n either a first-order function type or equal to δ. Notice that, by the above definition, a first-order function type is always a second-order function type. We define a first-order function to be a function f having first-order function type. We define a second-order function, or a functional, to be a function F having a strictly second-order function type. By uncurrying, a first-order function f can be seen to take n dual values and return a dual value.
The domain for dual numbers, DR, is the domain IR × IR. The first component is the standard part of a dual number (albeit interval-valued instead of single-valued) and the second component is the infinitesimal part. The domains and codomains of the maps ( )^d_τ, ( )^s_τ, and ( )^i_τ are defined by the following.
Definition 3.1 By induction on the structure of a second-order function type τ, we define the following families of topological spaces:
• R_δ = R and R_{τ_1→τ_2} is the topological vector space of continuous functions from R_{τ_1} to R_{τ_2} with the compact-open topology;
• R^p_δ = IR and R^p_{τ_1→τ_2} is the topological space of continuous functions from R_{τ_1} to R^p_{τ_2} with the Scott topology;
• D^t_δ is the subset of DR containing elements whose standard part is total, i.e., maximal; D^t_{τ_1→τ_2} is the subset of D_{τ_1→τ_2} containing functions mapping all elements in D^t_{τ_1} to elements in D^t_{τ_2}.
Functions in D^t_τ are called standard maximal preserving.
Notice that, equipped with the compact-open topology, the topological vector space R_{τ_1→...→τ_n→δ} is homeomorphic to the space (R_{τ_1} × ... × R_{τ_n}) → R_δ [17, p. 261]. Moreover, the space of maximal elements of the bounded complete domains listed in Definition 3.1, equipped with their relative Scott topology, will be Polish, and hence Hausdorff topological vector spaces [34], and thus the results at the end of Section 2, and proved in [15, where ( )⋆ is the envelope operator of Proposition 2.1. In defining h, we need to use the envelope (Lf)⋆ because (d_i)_i can be a partial real number, or a function returning partial real numbers. Since the topological space D^t_δ is dense in D_δ, by an obvious generalization of Proposition 2.2, for any first order type τ, D^t_τ is dense in D_τ; it follows by Proposition 2.1 that the envelope h⋆ exists for both first order and second order functions. The definition of ( )^d_τ can be seen as an extension of the construction in [15, Corollary 1] for extending a functional of type (R → R) → R to (IR → IR) → IR.

Proposition 3.3 For any second order type
, by the infinitesimal property of ε, we have: Note that for first-order types τ, the functions ( )^d_τ are just set-theoretic functions since they are not continuous functions on the infinitesimal component. In the above and in the following, we use the pointwise extension of the multiplication by the dual number ε, and the addition operation +. That is, if * is an operation defined on the domain D, the operation * on the domain C → D is defined by (f * g)(x) = f(x) * g(x), and similarly for other operations and functions.
In the following, we will sometimes omit the type τ from ( )^d_τ, ( )^s_τ, ( )^i_τ when the type τ is clear from the context. We will also implicitly assume, where necessary, that any real number is automatically "cast" to a dual number.
Denote by St and In the envelopes of the functions ( )^s_δ and ( )^i_δ respectively. They are functions in DR → IR defined by: St(x + εx′) = x, In(x + εx′) = x′.

A language for differentiable functionals
Next we present Dual PCF, a language with a primitive operator for the evaluation of directional derivatives of functionals.The language is a simply typed λ-calculus extended with a suitable set of constants.
The types of Dual PCF are defined by the grammar: where o is the type of booleans, ν is the type of natural numbers, π is the type of real numbers, and δ is the type of dual numbers. The derivative operator is defined only on the type of dual numbers δ. We assign to variable x the type π if we are not interested in evaluating the derivative with respect to x; values of type π have implicitly an infinitesimal part equal to 0. The set of expressions in the language is defined by the grammar: where x^τ ranges over a set of typed variables and c over a set of constants. For simplicity, here we present only a minimal set of basic constants, sufficient to express any other computable function. In a real programming language this minimal set will be extended with other functions. All constants defining functions on dual numbers, for example max : δ → δ → δ, have a corresponding version on real numbers, max : π → π → π, acting in the obvious way. To avoid repetition, we present just the dual number versions, implicitly assuming the definition for the real number version. The basic constants in the language are as follows:
(i) The three total arithmetic operations, +, −, * : δ → δ → δ.
(iii) Minimum and maximum, min, max : δ → δ → δ, evaluating the minimum and maximum of two dual numbers.
(iv) Two casting functions (explicit conversion), from naturals to reals, in_π : ν → π, and from reals to duals, in_δ : π → δ.
(v) A zero-test on reals, (0 <) : π → o, that cannot be applied to dual values. This restriction ensures that functions on dual numbers do not have points of discontinuity on maximal elements. For example, a function from dual values to dual values returning 0 on strictly negative values and 1 on strictly positive ones will not be definable. This fact, in turn, is necessary to guarantee the correctness of the derivative operator.
To ensure that for a function f of type δ → δ, for example, the infinitesimal part describes the derivative of the evaluated part, the language has restrictions on the way δ values are used. It is impossible to convert a dual into a real or to test whether a dual is less or greater than 0. Consequently, all functions from dual numbers to Booleans are constant, so the if-then-else operator cannot be used to define functions on dual numbers that have no generalised derivative. As in [14], we have used the min/max operators as a safe alternative to if-then-else.
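The role of max as a replacement for a test can be sketched as follows. This is a simplified model and not the semantics of Dual PCF: the standard part is a single floating-point number rather than an interval, the infinitesimal part is a closed interval stored as a pair, and the rule used at a tie (take the smallest interval containing both infinitesimal parts) is our assumption, chosen so that the result agrees with the Clarke subgradient of the absolute value at 0. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    std: float   # standard part (a single real, for simplicity)
    inf: tuple   # infinitesimal part as a closed interval (lo, hi)


def neg(x):
    """Negation; the interval endpoints swap roles."""
    lo, hi = x.inf
    return Dual(-x.std, (-hi, -lo))


def dmax(x, y):
    """Maximum on duals; at a tie the infinitesimal parts are merged."""
    if x.std > y.std:
        return x
    if y.std > x.std:
        return y
    # Tie (assumed rule): smallest interval containing both infinitesimal parts.
    return Dual(x.std, (min(x.inf[0], y.inf[0]), max(x.inf[1], y.inf[1])))


def dabs(x):
    """Absolute value as max(x, -x), with no Boolean test exposed to the user."""
    return dmax(x, neg(x))


def variable(a):
    """The input a with infinitesimal part 1, i.e. a + eps."""
    return Dual(a, (1.0, 1.0))


for point in (2.0, 0.0, -3.0):
    print(point, dabs(variable(point)))
```

Away from 0 the infinitesimal part is the singleton slope of the absolute value; at 0 it is the whole interval of slopes between the two branches, instead of an arbitrary choice made by a test.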

Operational semantics
We define a small-step operational semantics. We note that we cannot use a standard PCF operational semantics as in [40] because in our work, unlike [40], we implement exact computation on real numbers, so a real number cannot be defined by a single finite value. In [13,14], an operational semantics for exact computation on real numbers is given by representing real numbers as streams of digits and using lazy evaluation to implement functions on them, but this has its drawbacks. It relies on parallel computation and is difficult to simulate in a programming language. In this paper, we propose an alternative approach, reminiscent of some work in [6,45], with the advantage of a fairly direct translation into Haskell.
In the operational semantics, we use a set of basic constants for the rational intervals [a, b], together with a set of basic constants on dual numbers, made up of pairs of rational intervals, [a, b] + ε[a′, b′]. We consider the infinite interval (−∞, +∞) to be a special case of a rational interval. We avoid introducing these values directly in the main syntax of the language because they are partial values, and we prefer to have constants only for totally defined values, as is common in most programming languages. Using the functions in_π : ν → π and in_δ : π → δ and the arithmetic operations, all rational values are readily available in the language.
In the operational semantics, we need to address the problem that by unfolding the fixed-point operator Y_τ, on the one hand, one can obtain better and better approximations for a real value, but, on the other hand, the unfolding needs to be stopped at some point; otherwise, the computation diverges. In exact real number computation, which deals with infinitary objects, one rarely has base cases in recursive definitions. For example, one can define a real number by the recursive equation x = 1/4 + x/4, and the infinite unfolding of this definition is an infinitary expression representing the value 1/3. The partial approximations of 1/3 actually used in the computation are obtained by forcing a stop in the infinite unfolding. Similar considerations hold for the Riemann integral: by increasing the number of sub-intervals with which the unit interval is partitioned, one obtains better approximations of the integral, but the partition of the unit interval cannot be refined indefinitely.
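The effect of stopping the unfolding of x = 1/4 + x/4 after a fixed number of steps can be sketched with rational intervals. This is an illustration only: the function names are ours, and the starting interval [0, 1] is an assumption of the sketch (a bounded initial approximation is needed, since unfolding from (−∞, +∞) never leaves (−∞, +∞)).

```python
from fractions import Fraction


def step(interval):
    """One unfolding of x = 1/4 + x/4 on a rational interval (lo, hi)."""
    lo, hi = interval
    quarter = Fraction(1, 4)
    return (quarter + lo * quarter, quarter + hi * quarter)


def approximate(n, start=(Fraction(0), Fraction(1))):
    """Unfold the recursive definition exactly n times, then stop."""
    interval = start
    for _ in range(n):
        interval = step(interval)
    return interval


# Each extra unfolding shrinks the interval by a factor of 4 around 1/3.
for n in range(4):
    lo, hi = approximate(n)
    print(n, lo, hi, hi - lo)
```

The argument `n` plays the role of a bound on the computational effort: a larger bound yields an interval contained in the previous one, and every interval in the sequence contains 1/3.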
To solve this problem, we extend the syntax used in the operational semantics by building expressions in the form e, n (or e, (m, n)). The parameter n (or (m, n)) represents a measure of the complexity of the computation along which the expression e is going to be evaluated. Extending the syntax of the terms is a way to introduce in the syntax some information about the evaluation strategy that is useful in defining the reduction rules, but is not very meaningful in describing a function. One can avoid grammar extensions by adding extra information to the reduction rules, but the current approach is simpler to define.
The operational semantics allows one to derive judgements in the form e, n whose intended meaning is that with a computation bounded by a cost n, the expression e reduces to the rational dual In more detail, Dual PCF has two forms of the recursive operator Y_τ: a bounded one, when τ is a continuous type, that is, a function type having a continuous range space (τ = τ′ → π or τ = τ′ → δ), and a standard one for any other type τ. Higher values of the parameter n imply more effort in the computation so it will be always the case that if m ≤ n, In other words, the evaluation of e, 0 , e, 1 , e, 2 , ..., produces a sequence of intervals each one contained in the previous one and converging to the denotational semantics of e. This approach to the operational semantics is somewhat similar to and inspired by [45,6].
Formally, for the operational semantics, we consider an extended language obtained by adding extra constants and three extra production rules to the expression grammar of Dual PCF, as in Equation (2): The set of constants is extended by: • a constant In for a function δ → π returning the infinitesimal part of a dual number. The evaluation contexts are the standard ones for a call-by-name reduction: The reduction rules for generic terms are: The reduction rules for a generic binary operation on dual numbers op are: together with the rules defining operations on rational intervals; e.g.: The reduction rules for the functionals on dual numbers make use of the parameter n. We have the following rules: The operational semantics for the derivative operator is defined by: where +_τ and ϵ_τ denote the expressions: +_{σ→τ} = λf.λg.λx.(f x) +_τ (g x) and ϵ_{σ→τ} = λf.λx.ϵ_τ (f x), while ϵ(x + εx′) is a shorthand for (0 + ε) * (x + εx′). The operational semantics for the fixed-point operator Y_σ on a continuous type σ is defined by: where The remaining rules can be found in the full version of the paper [15].

Denotational semantics
The continuous Scott domain $D_\tau$, used to give a semantic interpretation to expressions of arbitrary type $\tau$, is recursively defined by:
The semantic interpretation of any PCF constant is the usual one. The general schema for giving semantics to constants representing functions on dual numbers is the following: given a constant $c$ of type $\tau$ that denotes a function $f_c$ on the real line $\mathbb{R}$, the semantic interpretation of $c$ is given by $\mathcal{B}[\![c]\!]$, defined by:
To help the reader, we explicitly define the semantic interpretation of some of these constants:
Integration on the dual domain is reduced as
where $\int^{\star}_{[0,1]} : (\mathbf{I}[0,1] \to \mathbf{I}\mathbb{R}) \to \mathbf{I}\mathbb{R}$ is the envelope of the Riemann integral functional $\int_{[0,1]} : ([0,1] \to \mathbb{R}) \to \mathbb{R}$ as in Proposition 2.1; it coincides with the integration constructor developed in Real PCF and in interval analysis [18], [42]. It sends a continuous function of type $\mathbf{I}[0,1] \to \mathbf{I}\mathbb{R}$ to an interval in the domain of reals $\mathbf{I}\mathbb{R}$ and extends the Riemann integral in the sense that if $f$ : , where as usual we identify a real number with its singleton. Note that the integration constructor, using the above method, can compute the value of (
For clarity, to distinguish the classical Riemann integral from the domain-theoretic integral, we always denote the classical Riemann integral of $f : [0,1] \to \mathbb{R}$ over any interval $[a, b]$ by $\int_a^b f(x)\,dx$, while $\int_{[u,v]} g(x)\,dx$, i.e., with the range of integration as a subscript to the integral sign, always denotes the extended interval-valued Riemann integral for a continuous function $g : \mathbf{I}\mathbb{R} \to \mathbf{I}\mathbb{R}$.
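The idea of an interval-valued integral that encloses the classical Riemann integral can be sketched in Python. This is only an illustration in the spirit of interval analysis, not the integration constructor of Dual PCF: the names `integral_enclosure` and `square_ext`, the uniform partition, and the use of exact rationals are all assumptions made for the example.

```python
from fractions import Fraction


def integral_enclosure(f_ext, n):
    """Enclose the Riemann integral of f over [0, 1] using n equal pieces.

    f_ext is an interval extension of f: it maps endpoints (a, b) to a pair
    (lo, hi) such that lo <= f(x) <= hi for every x in [a, b].
    """
    width = Fraction(1, n)
    lower = upper = Fraction(0)
    for i in range(n):
        a, b = i * width, (i + 1) * width
        lo, hi = f_ext(a, b)
        lower += lo * width  # contribution to the lower sum
        upper += hi * width  # contribution to the upper sum
    return lower, upper


def square_ext(a, b):
    # Exact range of x -> x^2 on [a, b], valid when 0 <= a <= b.
    return a * a, b * b


coarse = integral_enclosure(square_ext, 4)
fine = integral_enclosure(square_ext, 8)
```

Refining the partition yields a smaller interval contained in the previous one, and both intervals contain the classical value 1/3; this mirrors the way a larger computation budget gives a more precise approximation.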
The semantic interpretation of the derivative operator $L_\tau$ is defined by:
The interpretation of the other constants that cannot be obtained by the general scheme is the following:
We point out that dual numbers, or functions on dual numbers, obtained by using the recursion operator $Y$ quite often have an unbounded infinitesimal part, meaning that, depending on the type, the infinitesimal part is $\bot = (-\infty, +\infty)$, or is the function that maps every element to $\bot$. A simple example is the following recursive definition of the value 1: $Y(\lambda x^{\delta}.\,(\mathrm{pr}\; x^{\delta} + 1)/2)$. This fact can be explained as follows: the semantic interpretation functions are linear on the infinitesimal parts, and a linear function, when applied to the bottom value $(-\infty, +\infty)$, returns either 0 (if the linear map is identically 0) or the bottom value itself. It follows that each element in the chain $(f^i(\bot_\sigma))_{i \in \mathbb{N}}$, whose least upper bound gives the semantic interpretation of $Y f$, has an unbounded infinitesimal part.
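The phenomenon can be observed in a small Python simulation of the chain of iterates. The sketch rests on assumptions that are not taken from the paper: an interval dual number is modelled as a pair of float intervals, and `pr` is assumed to be the projection onto $[-1, 1]$, with slopes in $[0, 1]$ wherever the input interval touches the boundary.

```python
import math

BOTTOM = (-math.inf, math.inf)


def pr(x):
    """Assumed projection onto [-1, 1], lifted to interval dual numbers."""
    (a, b), (da, db) = x
    st = (max(-1.0, min(a, 1.0)), max(-1.0, min(b, 1.0)))
    if -1.0 < a and b < 1.0:
        inf = (da, db)  # slope 1 on the whole interval
    elif b < -1.0 or a > 1.0:
        inf = (0.0, 0.0)  # slope 0 on the whole interval
    else:
        # Slopes range over [0, 1]: [0, 1] * [da, db] is the hull of 0 and [da, db].
        inf = (min(0.0, da), max(0.0, db))
    return st, inf


def step(x):
    """One unfolding of the body x -> (pr x + 1) / 2."""
    (a, b), (da, db) = pr(x)
    return ((a + 1) / 2, (b + 1) / 2), (da / 2, db / 2)


x = (BOTTOM, BOTTOM)
for _ in range(20):
    x = step(x)
standard, infinitesimal = x
```

After twenty unfoldings the standard part is a narrow interval with upper endpoint 1, while the infinitesimal part is still $(-\infty, +\infty)$, as the linearity argument predicts.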
A solution to this problem, as in [14], consists of introducing a second type of dual values, $\delta_l$, having the infinitesimal part bounded by the interval $[-1, 1]$. The basic functions on the type $\delta_l$ need to be non-expansive. Therefore most of the basic functions defined on $\delta$ must be replaced by non-expansive versions. For example, addition $+$ is replaced by a function evaluating the average of two values. To motivate this restriction, notice that by using addition it is possible to build a function that doubles its argument: $\lambda x.\,x + x$; this function maps $0 + \varepsilon 1$ to $0 + \varepsilon 2$, and therefore cannot have type $\delta_l \to \delta_l$. Inside the type $\delta_l$, it is possible to use the fixed-point operator to obtain functions with informative infinitesimal parts. The functions obtained in this way can later be embedded in the larger type $\delta$. For lack of space, and to focus on the main subject of this paper, which is evaluating derivatives of second-order functionals, we do not fully present this solution and refer the interested reader to [14].
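The contrast between addition and averaging can be checked on a toy model. In this sketch a dual number is simply a pair (standard part, infinitesimal part) of floats, and `in_delta_l` is a hypothetical membership test for the bounded type; neither is part of Dual PCF.

```python
def add(x, y):
    # Component-wise addition of dual numbers.
    return (x[0] + y[0], x[1] + y[1])


def average(x, y):
    # Non-expansive replacement for addition.
    return ((x[0] + y[0]) / 2, (x[1] + y[1]) / 2)


def in_delta_l(x):
    # Hypothetical test: the infinitesimal part lies in [-1, 1].
    return -1 <= x[1] <= 1


x = (0.0, 1.0)            # 0 + eps * 1
doubled = add(x, x)       # 0 + eps * 2, leaves the bounded type
averaged = average(x, x)  # 0 + eps * 1, stays inside it
```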
The semantic interpretation function E is defined, by structural induction, in the standard way:

Adequacy
The correspondence between the denotational and the operational semantics is shown by the following result. For a closed expression $e$ of type $\delta$, let us denote by $[a, b] + \varepsilon[a', b'] \ll \mathrm{Eval}(e)$ the property that there exists a natural number $n$ and a dual rational interval
Theorem 4.1 On type $\delta$ the operational semantics is sound and complete with respect to the denotational semantics; that is, for any closed expression $e : \delta$, for any partial rational dual number

Some functions and functionals in Dual PCF
In this section, we give various examples of functions and functionals expressible in our language. In some cases, we also show how to use the operational semantics to compute the derivatives of these functionals. Some of these examples are motivated by actual areas of application. In many cases (for instance, [16]), the problem of solving an integral or differential equation can be reduced to finding the roots of an integral or differential operator; being able to evaluate the derivative of a functional is required for such problems.

Absolute value function
The absolute value function $f(x) = |x|$ can be written in the language as $\lambda x.\,\max(x, -x)$. The domain-theoretic directional derivative of this function at 0 is then correctly evaluated by the following reduction:
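Independently of the reduction rules, the expected outcome can be reproduced outside Dual PCF with interval-valued dual numbers. The following Python sketch is not the operational semantics of the paper: the pair representation and the rule used for `dual_max` (take the hull of both infinitesimal parts whenever the standard parts cannot be separated) are assumptions made for illustration.

```python
def neg(x):
    (a, b), (da, db) = x
    return (-b, -a), (-db, -da)


def dual_max(x, y):
    """Maximum of two interval dual numbers (assumed rule, see lead-in)."""
    (a1, b1), d1 = x
    (a2, b2), d2 = y
    st = (max(a1, a2), max(b1, b2))
    if a1 > b2:
        inf = d1  # x is certainly the larger argument
    elif a2 > b1:
        inf = d2  # y is certainly the larger argument
    else:
        # Undecided: keep every candidate slope.
        inf = (min(d1[0], d2[0]), max(d1[1], d2[1]))
    return st, inf


def abs_dual(x):
    return dual_max(x, neg(x))


def point(v):
    # The real number v with direction 1, i.e. v + eps * 1.
    return (v, v), (1.0, 1.0)


at_zero = abs_dual(point(0.0))
at_two = abs_dual(point(2.0))
at_minus_three = abs_dual(point(-3.0))
```

Away from the origin the infinitesimal part is the singleton slope $1$ or $-1$, whereas at 0 it is the whole interval $[-1, 1]$, which agrees with Clarke's subgradient of $|x|$ at 0.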

Comparison with Chebyshev software
From [8, section 3.2], consider taking the directional derivative of the operator $G = \lambda g.\lambda x.\,x + g(x)^2$ at the point $f = \lambda u.\,u^2$ in the direction $k$. We refer to the operational semantics:
In other words, we have $L\,G(f, k) = \lambda y.\,2y^2 k(y)$. This is the same result as was obtained by the software system in the above paper, except that their autodiff procedure is far more involved than ours.
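The result can be checked numerically with ordinary (non-interval) dual numbers. This is a minimal Python sketch, not the Dual PCF evaluator: the class `Dual`, the helper `directional_derivative`, and the particular direction `k` are hypothetical names chosen for the example. The derivative is obtained as the infinitesimal part of $G(\lambda x.\,f(x) + \varepsilon k(x))$ applied to $y + \varepsilon 0$.

```python
class Dual:
    """Dual number st + eps * inf with eps^2 = 0."""

    def __init__(self, st, inf=0.0):
        self.st, self.inf = st, inf

    def __add__(self, other):
        other = lift(other)
        return Dual(self.st + other.st, self.inf + other.inf)

    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other)
        return Dual(self.st * other.st,
                    self.st * other.inf + self.inf * other.st)

    __rmul__ = __mul__


def lift(v):
    # Embed a plain number as a dual number with zero infinitesimal part.
    return v if isinstance(v, Dual) else Dual(v)


EPS = Dual(0.0, 1.0)


def directional_derivative(F, f, k, y):
    """Infinitesimal part of F(x -> f(x) + eps * k(x)) at y + eps * 0."""
    perturbed = lambda x: f(x) + EPS * k(x)
    return F(perturbed)(Dual(y)).inf


G = lambda g: lambda x: x + g(x) * g(x)  # the operator G
f = lambda u: u * u                      # the point f
k = lambda u: 3 * u + 1                  # an arbitrary direction (assumption)

y = 1.5
result = directional_derivative(G, f, k, y)
expected = 2 * y ** 2 * (3 * y + 1)      # 2 y^2 k(y)
```

For this choice of `k` and `y` both values equal 24.75, matching $\lambda y.\,2y^2 k(y)$.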
consistent does not have a straightforward proof. The second definition of consistency is based on logical relations. Logical relations are a standard proof technique used in the semantics of functional languages; they are used for proving that the semantic interpretation of terms satisfies some desired properties. A general introduction to logical relations can be found in [41], while in [5,14] logical relations are used in a way similar to our work. We define a set of logical relations which, if they are preserved by a function $f$, imply that the function $f$ is locally consistent. For any rational number $r > 0$, let $R^r_\delta$ be a ternary relation over generalised dual numbers $D\mathbb{R}$ defined by: R^r_δ (x
Next, we have the following implication regarding functionals. We conjecture that the reverse implication also holds, but since it is not required in this work, we do not consider it here.
Proposition 7.3 Any second-order function $F : D_{\tau \to \delta}$ is locally consistent if it is logically consistent.
Note that, with the single exception of $L_\tau$, all the constants in the language are logically consistent. The proof is routine for almost all constants. To prove that the fixed-point operator preserves the above relations, one shows that the bottom elements are self-related by $R^r_\sigma$, and that the relation is closed under the lub of chains. Note that $(< 0)$ preserves the relation when applied to the domain $D_\pi$, but it has no logically consistent extension to the domain $D_\delta$.
Using the technique of logical relations [41], it is straightforward to show:
Proposition 7.4 The semantic interpretation $\mathcal{E}[\![e]\!]$ of any closed expression $e : \tau$ not containing $L$ is logically consistent.

Corollary 7.5
The semantic interpretation $\mathcal{E}[\![e]\!]$ of any closed expression $e$ having second-order function type and not containing $L$ is locally consistent.

Corollary 7.6
The derivative operator $L$ is sound, i.e., for any closed expression $L\,F(f)(g)$, if $F$ is a second-order function, $\mathcal{E}[\![F]\!]$, $\mathcal{E}[\![f]\!]$ are standard maximal preserving, and $L$ is not contained in $F$, $f$, $g$ then
Since our language contains the if-then-else operator, and it is well known that the if-then-else constructor can produce inconsistent results under automatic differentiation, the above result may appear contradictory. Note, however, that there are restrictions on the functions that can be defined on dual numbers. In particular, it is impossible to convert a dual number into a real number, or to test whether a dual value is less or greater than 0. Consequently, all functions from duals to Booleans are constant, so the if-then-else operator cannot be used to define functions on duals that have no generalised derivative. As in [14], we have used the min/max operators as a safe alternative to if-then-else.
topological spaces $X$ and $Y$, denote the image of any subset $S \subset X$ by $f[S]$. The compact-open topology of the function space $(X \to Y)$ has sub-basic open sets of the form $(C, O) = \{f : f[C] \subset O\}$, with $C$ compact and $O$ open. If $X$ and $Y$ are metric spaces, $f$ is locally Lipschitz if for any $x \in X$ there exist an open neighbourhood $O$ of $x$ and $k$
is a non-empty compact real interval, we write $I = [I^-, I^+]$. A directed complete partial order (dcpo) $D$ is a partial order in which every directed set $A \subset D$ has a lub (least upper bound) or supremum $\bigsqcup A$. The way-below relation $\ll$ in a dcpo $(D, \sqsubseteq)$ is defined by $x \ll y$ if whenever there is a directed subset $A \subset D$ with $y \sqsubseteq \bigsqcup A$, then there exists $a \in A$ with $x \sqsubseteq a$. A subset $B \subset D$ is a basis if for all $y \in D$ the set $\{x \in B : x \ll y\}$ is directed with lub $y$. By a domain we mean a dcpo with a basis. Domains are also called continuous dcpo's. If $D$ has a countable basis then it is called a countably based domain. In a domain $D$ with basis $B$, we have the interpolation property: the relation $x \ll y$, for $x, y \in D$, implies there exists $z \in B$ with $x \ll z \ll y$. A subset $A \subset D$ is bounded if there exists $d \in D$ such that for all $x \in A$ we have $x \sqsubseteq d$. If a pair of elements $d_1, d_2 \in D$ is bounded above (consistent), we write $d_1 \uparrow d_2$ and refer to the predicate $\uparrow$ as the consistency relation. If any bounded subset of $D$ has a lub then $D$ is called bounded complete. In particular, a bounded complete domain has a bottom element $\bot$ that is the lub of the empty subset. A bounded complete domain $D$ has the property that any non-empty subset $S \subset D$ has an infimum or greatest lower bound $\bigsqcap S$.
All domains in this paper are bounded complete and countably based. The set of non-empty compact intervals of the real line, ordered by reverse inclusion and augmented with the whole real line as bottom, is the prototype bounded complete domain for real numbers, denoted by $\mathbf{I}\mathbb{R}$, in which $I \ll J$ iff $J \subset I^{\circ}$. It has a basis consisting of all intervals with rational endpoints. For two non-empty compact intervals $I$ and $J$, their infimum $I \sqcap J$ is the convex closure of $I \cup J$. The Scott topology on a domain $D$ with basis $B$ has sub-basic open sets of the form ${\uparrow\!\!\!\uparrow} b := \{x \in D : b \ll x\}$ for any $b \in B$. The upper set of an element $x \in D$ is given by ${\uparrow} x = \{y \in D : x \sqsubseteq y\}$. The lattice of Scott open sets of a bounded complete domain is continuous. The basic Scott open sets for $\mathbf{I}\mathbb{R}$ are of the form $\{J \in \mathbf{I}\mathbb{R} : J \subset I^{\circ}\}$ for any $I \in \mathbf{I}\mathbb{R}$. The maximal elements of $\mathbf{I}\mathbb{R}$ are the singletons $\{x\}$ for $x \in \mathbb{R}$, which we identify with real numbers, i.e., we write $\mathbb{R} \subset \mathbf{I}\mathbb{R}$, as the mapping $x \mapsto \{x\}$ is a topological embedding when $\mathbb{R}$ is equipped with its Euclidean topology and $\mathbf{I}\mathbb{R}$ with its Scott topology. Similarly, $\mathbf{I}[a, b]$ is the domain of non-empty compact intervals of $[a, b]$ ordered by reverse inclusion. If $X$ is any topological space with some open set $O \subset X$ and $d \in D$ lies in the domain $D$, then the single-step function $d\chi_O : X \to D$, defined by $d\chi_O(x) = d$ if $x \in O$ and $\bot$ otherwise, is a Scott continuous function. The partial order on $D$ induces by point-wise extension a partial order on continuous functions of type X

Lemma 2.3
for any compact subset $C \subset Y$ and open $O \subset Z$ [17]. There is a simple characterisation of the compact-open topology for the function space $(\mathbb{R} \to \mathbb{R})$ or $([0, 1] \to \mathbb{R})$. The compact-open topology on $(\mathbb{R} \to \mathbb{R})$ is generated by the sub-basis consisting of subsets of the form $U(O_1, O_2)$, where $O_1$ and $O_2$ are open intervals with compact closures. Let $\mathrm{Max}(D)$ denote the set of maximal points of a bounded complete domain $D$ and let $(\cdot)^{\star} : (\mathbb{R} \to \mathbb{R}) \to (\mathbf{I}\mathbb{R} \to \mathbf{I}\mathbb{R})$ be the maximal extension (envelope) operator given in Proposition 2.1, where $(\mathbb{R} \to \mathbb{R})$ is equipped with the compact-open topology.
Proposition 2.4 The map $(\cdot)^{\star}$ is a topological embedding, i.e., it is injective, continuous and is an open map onto its image in $\mathrm{Max}(\mathbf{I}\mathbb{R} \to \mathbf{I}\mathbb{R})$ with respect to the relative Scott topology on $\mathrm{Max}(\mathbf{I}\mathbb{R} \to \mathbf{I}\mathbb{R})$.
)) $\to \mathrm{In}(\lambda x.\,x + (x^2 + \varepsilon k(x)) * (x^2 + \varepsilon k(x)))(y + \varepsilon 0)) \to \mathrm{In}(y + (y^4 + \varepsilon\, 2y^2 k(y))) \to 2y^2 k(y)$
The rules for the constants $+$, $-$, $*$, $/$, $\min$, $\max$, $\mathrm{pr}$, $\mathrm{int}$, and $\sup$ acting on real values are the obvious restriction of the corresponding rules for dual numbers. The reduction rules for the PCF constructors and constants are the standard ones, together with the rules: if $\langle e, n \rangle$ then $\langle e_1, n \rangle$ else $\langle e_2, n \rangle$, and $\langle c, n \rangle \to c$, with $c$ any PCF constant different from $Y_\sigma$ with $\sigma$ a continuous type.
$R^r_\delta(x_1, x_2, x_3)$ holds whenever $\mathrm{St}(x_3) \sqsubseteq \mathrm{St}(x_1) \sqcap \mathrm{St}(x_2)$ and $\mathrm{In}(x_3) \uparrow \mathrm{St}(x_2 - x_1)/r$ or, equivalently, $R^r_\delta(I_1 + \varepsilon I'_1, I_2 + \varepsilon I'_2, I_3 + \varepsilon I'_3)$ holds whenever $I_3 \sqsubseteq I_1 \sqcap I_2$ and $r I'_3 \uparrow I_2 - I_1$. For the other ground domains, $D_o$ and $D_\nu$, the relations $R^r_o$ and $R^r_\nu$ are defined as follows: $R^r_\nu(n_1, n_2, n_3)$ holds whenever $n_3 \sqsubseteq n_1 \sqcap n_2$, and $n_1$, $n_2$ are consistent. The rationale behind this definition consists in repeating the definition of $R^r_\delta$ by considering Boolean values and natural numbers as having a hidden infinitesimal part equal to zero. The relations are extended inductively to higher-order domains in the usual way for logical relations: $R^r_{\sigma \to \tau}(f_1, f_2, f_3)$ iff for every $d_1, d_2, d_3 \in D_\sigma$, the relation $R^r_\sigma(d_1, d_2, d_3)$ implies $R^r_\tau(f_1(d_1), f_2(d_2), f_3(d_3))$.
Definition 7.1 An element $f$ in the domain $D_\sigma$ is logically consistent if it is self-related by $R^r_\sigma$, i.e., $R^r_\sigma(f, f, f)$, for any positive rational number $r$. We call a constant $c$ in the language logically consistent if its semantic interpretation $\mathcal{B}[\![c]\!]$ is logically consistent.
Any first-order function $f : D_\tau$ is locally consistent if and only if it is logically consistent.
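The interval form of the relation, $I_3 \sqsubseteq I_1 \sqcap I_2$ and $r I'_3 \uparrow I_2 - I_1$, can be tested on concrete values. The Python sketch below is an illustration only: interval dual numbers are modelled as pairs of rational intervals, and the helper names (`hull`, `below`, `consistent`, `related`, `square`) are hypothetical. It checks that squaring maps one related triple to a related triple.

```python
from fractions import Fraction as Fr


def hull(i, j):
    # Infimum in the interval domain: convex closure of the union.
    return (min(i[0], j[0]), max(i[1], j[1]))


def below(i, j):
    # i is below j in the reverse-inclusion order: j is contained in i.
    return i[0] <= j[0] and j[1] <= i[1]


def consistent(i, j):
    # i and j have an upper bound: the intervals intersect.
    return max(i[0], j[0]) <= min(i[1], j[1])


def sub(i, j):
    return (i[0] - j[1], i[1] - j[0])


def mul(i, j):
    products = [a * b for a in i for b in j]
    return (min(products), max(products))


def related(r, x1, x2, x3):
    """R^r_delta on interval dual numbers, for a rational r > 0."""
    (i1, _), (i2, _), (i3, d3) = x1, x2, x3
    scaled = (r * d3[0], r * d3[1])
    return below(i3, hull(i1, i2)) and consistent(scaled, sub(i2, i1))


def square(x):
    i, d = x
    cross = mul(i, d)
    return mul(i, i), (2 * cross[0], 2 * cross[1])


r = Fr(1, 2)
x1 = ((Fr(1), Fr(1)), (Fr(0), Fr(0)))
x2 = ((Fr(3, 2), Fr(3, 2)), (Fr(0), Fr(0)))
x3 = ((Fr(1), Fr(3, 2)), (Fr(1), Fr(1)))

inputs_related = related(r, x1, x2, x3)
outputs_related = related(r, square(x1), square(x2), square(x3))
```

Here the infinitesimal part of `square(x3)` is $[2, 3]$, and $r \cdot [2, 3] = [1, 3/2]$ contains the difference $9/4 - 1 = 5/4$: the relation expresses that the derivative enclosure over the interval accounts for the increment of the function between the two end points.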