Neural Ordinary Differential Equations

Introduction

As someone truly fascinated by the application of mathematics in general, I was intrigued when I was first exposed to deep learning: the amalgamation of two of my favourite fields, math and computer science. As I continued to explore, I saw a pattern of rapid evolution in computer hardware, faster algorithms, and ever more powerful architectures, yet the mathematical concepts were left behind. We accelerated through GPUs and came up with more and more model architectures, but the equations and algorithms were always created with computer science and applications in mind, and for good reason.

Imagine my fascination upon discovering a novel approach to model architecture, one in which the ancient treasure trove of mathematics was finally opened again! It was a breath of fresh air, with some heavy mathematics to tackle! Hence I am writing this essay to convey, as intuitively as I can, what I have learned. (To anyone wondering: no, I do not think the improvements that came before this concept were math-deprived, but that was my perspective as far as I had studied ML.) A warning is due: this essay may get long, and familiarity with all of the previous concepts in this blog, as well as rudimentary integration and differential equations, is necessary.

A differential approach

We begin by looking at where differential equations can even be utilised in this field. Differential equations are among the most useful concepts in mathematics, physics, chemistry, biology, economics, and essentially every field that deals with change. In all deep neural models, the process is very simple: initialise a hidden state, perform a forward pass, and the result is the next hidden state. Let us envelope the forward pass in a neat function f, which takes in the previous hidden state and the parameters (which the model will tweak) and outputs the next hidden state. We do this because different model architectures perform the forward pass in different ways, which means this method can be applied to any model, regardless of architecture!

$$h_{t+1} = f(h_t, \theta_t)$$

We are missing an important (and quite revolutionary) step here: adding the input back to the output. This was first introduced in ResNets, to tackle the degradation and vanishing-gradient problems of very deep networks, and it worked like magic. So we add the input back at the end of the forward pass. Why stop there? Why not add the input back after every block, after every layer? So here we are:

$$h_{t+1} = h_t + f(h_t, \theta_t)$$

Now comes the neat part: remember how I said differential equations study the effects of change? (Well, calculus itself does, anyway.) If you look closely, the only thing changing here is our hidden state, $h_t$. Why not simply move it to the left side?

$$h_{t+1} - h_t = f(h_t, \theta_t)$$

Here we are essentially parametrising the change in our hidden state. And a little calculus-1 helps: why not take it to the limit? In doing so, we get the most out of the model by moving from the world of the discrete to the world of the continuous. (Since we are now moving in a continuous fashion, we can parametrise h with respect to t, or time, and write it as h(t).)

$$\lim_{\Delta t \to 0} \frac{h_{t+\Delta t} - h_t}{\Delta t} = f(h(t), t, \theta_t) \implies \frac{dh(t)}{dt} = f(h(t), t, \theta_t)$$

Here we have our desired differential equation! Solving it gives us the final output, which, as we know by now, must be compared with the ground truth in order to calculate our loss. Let L be our loss function (left ambiguous on purpose; any loss function can be used), and let x(t_0) be our input at time t_0, with the output produced at time t_1. The final output of a forward pass is our starting point plus the "sum" of all the changes x(t) goes through (how the input changes after each layer), and since we are in the world of the continuous, we integrate.

$$\int_{t_0}^{t_1} \frac{dx(t)}{dt}\,dt = \int_{t_0}^{t_1} f(x(t), t, \theta_t)\,dt \implies x(t_1) = x(t_0) + \int_{t_0}^{t_1} f(x(t), t, \theta_t)\,dt$$

Hence our final loss:

$$L(x(t_1)) = L\!\left(x(t_0) + \int_{t_0}^{t_1} f(x(t), t, \theta_t)\,dt\right)$$
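Before the full numerical example later in this essay, here is a minimal sketch of the idea in code. Note the assumptions: `f` is a stand-in dense layer (any architecture could take its place), and a fixed-step Euler loop plays the role of the ODE solver.

```python
import numpy as np

def f(h, t, theta):
    """Stand-in dynamics: a single dense layer with ReLU.
    Any architecture could be plugged in here."""
    W, b = theta
    return np.maximum(0.0, h @ W + b)

def odeint_euler(f, h0, t0, t1, theta, steps=100):
    """Forward pass as integration: h(t1) = h(t0) + integral of f(h, t, theta) dt,
    approximated with a fixed-step Euler scheme."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h, t, theta)  # h(t + dt) ~ h(t) + dt * dh/dt
        t += dt
    return h
```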

My usual approach is to go through a numerical example with an input, but in this case the theory is a tad different from the practical approach. The theory is quite complex and initially unintuitive (at least compared to the other models), and must be dealt with on its own first.

Reverse Mode Automatic Differentiation: A differential approach

While we may have come up with the formulation of a differential equation and used it to parametrise our model, that was just the forward pass. As is generally the case in deep learning, backpropagation is the main, and often the more complex, part. The necessary gradients of our loss function could be acquired in the traditional way, but that turns out to be quite computationally expensive and numerically inaccurate. The authors of the paper came up with a different approach: the adjoint sensitivity method. This method is not new; it was first introduced in 1962. It has many similarities with backprop, and only differs in a slight manner. There are a few different perspectives for understanding what we are up to here, but I will use the most straightforward one (other perspectives introduce certain terms which, although useful, are not relevant here).

The adjoint sensitivity method works by introducing an adjoint state, which acts as an intermediary between the loss and the quantities we want gradients of. Let's say our hidden state, parametrised by time, is h(t). Our adjoint state is simply the gradient of the loss with respect to this hidden state.

$$a(t) = \frac{dL}{dh(t)}$$

The adjoint state can also be seen as a Lagrange multiplier, but why that is so is not necessary in this context, as it would lead us into the whole field of constrained optimization and wouldn't help us going forward. I recommend going through this video if anyone is interested. Moving on, our main concern is knowing the dynamics of how the adjoint state changes, in order to solve for the gradient we are looking for. To get there, we simply need to remember the chain rule. Recall the chain rule in the case of a discrete forward pass, and of a continuous one:

$$\text{Discrete: } \frac{dL}{dh_t} = \frac{dL}{dh_{t+\epsilon}}\cdots\frac{dh_{t+3}}{dh_{t+2}}\frac{dh_{t+2}}{dh_{t+1}}\frac{dh_{t+1}}{dh_t} \qquad \text{Continuous: } \frac{dL}{dh(t)} = \frac{dL}{dh(t+\epsilon)}\frac{dh(t+\epsilon)}{dh(t)}$$

A key note to remember is that an adjoint state can be defined for any function parametrised by time; in this case we happen to be looking at the hidden state. Let's look at how our hidden state changes after $\epsilon$ time, that is, $t \to t + \epsilon$. We simply add the integrated differential to our initial input.

$$h(t+\epsilon) = h(t) + \int_t^{t+\epsilon} f(h(t), t, \theta_t)\,dt = T_{h_t}$$

Let us call this $T_{h_t}$ for ease of notation. We now simply plug this equation into our previous continuous derivative (adding the partial derivative sign, since the equation is multivariate):

$$\frac{dL}{dh(t)} = \frac{dL}{dh(t+\epsilon)}\frac{\partial h(t+\epsilon)}{\partial h(t)} = a(t+\epsilon)\frac{\partial T_{h_t}}{\partial h(t)} \implies a(t) = a(t+\epsilon)\frac{\partial T_{h_t}}{\partial h(t)}$$

Our final goal is to find out how this adjoint state changes, which is:

$$\frac{da(t)}{dt}$$

We need this value so that we can sum (integrate) over it to find the gradients of the loss with respect to the underlying function (we will see that later). Next, we solve for the derivative using limits.

Taking the limit:
$$\frac{da(t)}{dt} = \lim_{\epsilon \to 0^+} \frac{a(t+\epsilon) - a(t)}{\epsilon}$$
Substituting $a(t) = a(t+\epsilon)\frac{\partial T_{h_t}}{\partial h(t)}$:
$$= \lim_{\epsilon \to 0^+} \frac{a(t+\epsilon) - a(t+\epsilon)\frac{\partial T_{h_t}}{\partial h(t)}}{\epsilon}$$
Substituting $T_{h_t}$ with its Taylor series expansion around $h(t)$, where $O(\epsilon^2)$ denotes terms multiplied by higher powers of $\epsilon$:
$$= \lim_{\epsilon \to 0^+} \frac{a(t+\epsilon) - a(t+\epsilon)\frac{\partial}{\partial h(t)}\left(h(t) + \epsilon f(h(t), t, \theta_t) + O(\epsilon^2)\right)}{\epsilon}$$
Unpacking the partial derivatives ($I$ denotes the identity matrix):
$$= \lim_{\epsilon \to 0^+} \frac{a(t+\epsilon) - a(t+\epsilon)\left(I + \epsilon\frac{\partial f(h(t), t, \theta_t)}{\partial h(t)} + O(\epsilon^2)\right)}{\epsilon}$$
Opening up the brackets:
$$= \lim_{\epsilon \to 0^+} \frac{a(t+\epsilon) - a(t+\epsilon) - \epsilon\,a(t+\epsilon)\frac{\partial f(h(t), t, \theta_t)}{\partial h(t)} + O(\epsilon^2)}{\epsilon}$$
Cancellation of $a(t+\epsilon)$:
$$= \lim_{\epsilon \to 0^+} \frac{-\epsilon\,a(t+\epsilon)\frac{\partial f(h(t), t, \theta_t)}{\partial h(t)} + O(\epsilon^2)}{\epsilon}$$
Splitting the fraction and cancelling the denominator:
$$= \lim_{\epsilon \to 0^+} \left(-a(t+\epsilon)\frac{\partial f(h(t), t, \theta_t)}{\partial h(t)} + O(\epsilon)\right)$$
Taking the limit:
$$\frac{da(t)}{dt} = -a(t)\frac{\partial f(h(t), t, \theta_t)}{\partial h(t)}$$

With this straightforward proof, we have acquired the equation necessary to calculate the gradient of a function with respect to our loss. Recall the original definition of a(t), where t can be any point in time. Suppose a(t_0) is the desired gradient:

$$a(t_0) = \frac{dL}{dh(t_0)}$$

We solve this just as we did during the forward pass: simply add the integral of the differential back to the input. Except this time, we add it back to the output, which is a(t_N), and integrate not from 0 to N but backwards, from N to 0. This can be seen as integrating backwards in time.

$$a(t_0) = a(t_N) + \int_{t_N}^{t_0} \frac{da(t)}{dt}\,dt \implies a(t_0) = a(t_N) - \int_{t_N}^{t_0} a(t)\frac{\partial f(h(t), t, \theta_t)}{\partial h(t)}\,dt$$

It is very important to understand that this method and equation help us calculate the gradient of the loss with respect to a function. Under certain criteria, we can take any function (parametrised by time), get that function's adjoint state, and calculate its gradients backwards in time. There are essentially three variables: the input x(t), our parameters θ, and the initial and end times t_0 and t_N, so just t. Hence our three variables have three differential equations, though since the parameters do not change during the forward pass and we are differentiating with respect to time:

$$\frac{\partial \theta(t)}{\partial t} = 0, \qquad \frac{\partial t(t)}{\partial t} = 1$$

And we can define an adjoint state for each variable, each with its own integral. Since the time gradients aren't required in the model, I won't go into their details. Suppose we require the gradient of θ(t) at the initial time 0; following the above equations, and remembering that the parameter adjoint at the final time is zero, $a_\theta(t_N) = 0$, we get:

$$a_\theta(t_0) = a_\theta(t_N) - \int_{t_N}^{t_0} a(t)\frac{\partial f(h(t), t, \theta_t)}{\partial \theta(t)}\,dt \implies a_\theta(t_0) = -\int_{t_N}^{t_0} a(t)\frac{\partial f(h(t), t, \theta_t)}{\partial \theta(t)}\,dt$$

or in other words:

$$\frac{dL}{d\theta} = -\int_{t_N}^{t_0} a(t)\frac{\partial f(h(t), t, \theta_t)}{\partial \theta(t)}\,dt$$
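As a computational sketch of these last few equations: assuming the same stand-in dense-layer `f` as in the earlier snippet (so the Jacobians ∂f/∂h, ∂f/∂W, ∂f/∂b can be written by hand), integrating the adjoint and the parameter gradients backwards in time might look like the following. A real implementation would instead hand this to an ODE solver and use vector-Jacobian products from autodiff.

```python
def adjoint_backward(h1, t0, t1, theta, dL_dh1, steps=100):
    """Backpropagation by integrating backwards in time, t1 -> t0:
        da/dt = -a * df/dh            (adjoint dynamics)
        dL/dtheta = integral of a * df/dtheta dt  (accumulated along the way)
    h is re-solved backwards alongside a, so nothing from the forward
    pass needs to be stored (the method's main selling point)."""
    W, b = theta
    h, a = h1.copy(), dL_dh1.copy()
    dW, db = np.zeros_like(W), np.zeros_like(b)
    dt = (t1 - t0) / steps
    for _ in range(steps):
        z = h @ W + b
        mask = (z > 0).astype(h.dtype)     # ReLU'(z)
        af = a * mask                      # a * diag(ReLU'(z))
        dW += dt * np.outer(h, af)         # a * df/dW, accumulated
        db += dt * af                      # a * df/db, accumulated
        a = a + dt * (af @ W.T)            # reversed Euler step for a
        h = h - dt * np.maximum(0.0, z)    # reversed Euler step for h
    return a, dW, db                       # a is now (approximately) dL/dh(t0)
```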

With this, we have officially backpropagated through the entire network! Even though the proofs and derivations were straightforward (and hopefully intuitive), it can still be tough to see how to implement all of this in practice. There are more nuances to how it is actually done, so we shall go over a numerical example and precisely follow the process of forward and backward propagation to get a complete and thorough understanding of the concept.

Solving a Neural-ODE numerically

Though the equations given above might help in understanding how we can model a neural network using differential equations, they might not be enough to build the intuition for the entire pipeline: how an input flows through the model, and subsequently how we backpropagate through the network to arrive at the gradients. To build a stronger foundation for Neural ODEs, I will go through a numerical example. While the backpropagation method was derived in the last section, the practical implementation is handled a little differently (you could say we built the theory in the last section, and here we apply it). Let us begin by defining our input; let's call it y_0, an n-dimensional vector.

$$y_0 = \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix}_{1 \times n}$$

Now we must define a function f, which parametrises the change in our input. In this case (and also to keep the calculations simple), I will simply use a dense layer:

$$f(y_0, t, \theta) = \sigma\!\left(y_0\,\mathbf{W}_{n \times n} + \mathbf{B}_{1 \times n}\right), \quad \text{where } \theta = \{\mathbf{W}_{n \times n},\, \mathbf{B}_{1 \times n}\} \text{ and } \sigma = \text{ReLU}$$
$$\mathbf{W}_{n \times n} = \begin{bmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{n1} & \cdots & w_{nn} \end{bmatrix}, \qquad \mathbf{B}_{1 \times n} = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix}$$

In the above equation, t simply represents the time steps we want the differential equation to take; it is a (T+1)-dimensional vector (0-indexed) of equidistant points starting from 0. Since Neural ODEs are primarily designed to model sequential data (which evolves through time), the dimensions of the input and the output remain the same, as the structure or state of the input never changes.

$$t = \begin{bmatrix} t_0 & t_1 & t_2 & \cdots & t_T \end{bmatrix}_{1 \times (T+1)}$$
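To make the shapes concrete in code, here is a toy setup (all values are made up purely for illustration), with n = 2 and T = 10:

```python
import numpy as np

n = 2
y0 = np.array([0.5, -0.3])          # input y0, shape (n,)
W  = np.array([[0.1,  0.4],
               [0.2, -0.1]])        # weights W, shape (n, n)
B  = np.array([0.05, 0.02])         # bias B, shape (n,)
t  = np.linspace(0.0, 1.0, 11)      # T+1 equidistant time points from 0
true = np.array([1.0, 0.0])         # ground-truth label (used later)
```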

Our next step is simply to get to the output, which we will call y_1. Recall from the previous sections that since we are going through T time steps in a continuous manner, we shall integrate from 0 to T, using our function f and the input y_0.

$$y_1 = y_0 + \int_0^T f(y(t), t, \theta_t)\,dt \implies y_1 = y_0 + \int_0^T \sigma\!\left(\begin{bmatrix} y_1 & \cdots & y_n \end{bmatrix}\begin{bmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{n1} & \cdots & w_{nn} \end{bmatrix} + \begin{bmatrix} b_1 & \cdots & b_n \end{bmatrix}\right)dt$$

Here any numerical method can be used to solve the integral, such as a Runge-Kutta method or Euler discretization; the choice is left to the programmer.
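For instance, a fixed-step Euler discretization of the integral, using the toy values above, is just a loop over the time grid (a sketch, not the only way to do it):

```python
# y1 = y0 + integral of sigma(y W + B) dt, approximated step by step
y = y0.copy()
for k in range(len(t) - 1):
    dt_k = t[k + 1] - t[k]                      # step size on the grid
    y = y + dt_k * np.maximum(0.0, y @ W + B)   # y += dt * f(y, t, theta)
y1 = y
```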

Once we get our output, the next logical step is to define and utilise a loss function. While the choice of losses is vast, I have used the most basic one for simplicity: MSE, or mean squared error. Let L be our loss function and true be our truth vector, or label.

$$\text{true} = \begin{bmatrix} tr_1 & tr_2 & \cdots & tr_n \end{bmatrix}_{1 \times n}$$
$$L(y_1) = \left(\begin{bmatrix} tr_1 & \cdots & tr_n \end{bmatrix} - y_0 - \int_0^T \sigma\!\left(\mathbf{y}(t)\,\mathbf{W}_{n \times n} + \mathbf{B}_{1 \times n}\right)dt\right)^2$$

Now we need the gradient of the loss with respect to the output, which is very straightforward to calculate thanks to the simple nature of mean squared error.

$$\frac{dL}{dy_1} = \frac{d(\text{true} - y_1)^2}{dy_1} = -2\,(\text{true} - y_1)$$
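In code, with the toy values, this is a one-liner (note the minus sign picked up from the inner derivative of (true − y1)):

```python
L = np.sum((true - y1) ** 2)    # MSE loss
dL_dy1 = -2.0 * (true - y1)     # dL/dy1 -- the initial adjoint a(t_N)
```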

The next step is just as we discussed in the previous section: we backpropagate by integrating backwards in time. Remember that our loss essentially takes in three parameters: the initial input y_0, the time vector t, and the function f, which defines the dynamics our input goes through (in code, the function passed in can be a class). This means we can calculate the gradient of the loss with respect to each of these variables, and thus have to define an adjoint state for each one of them. Once we have an adjoint state, we find its derivative with respect to time and integrate to get the final gradient (as was done in the theoretical section). To simplify the calculation, and to avoid invoking the ODE solver again and again, we define an augmented state, which is just a vector made up of the variables we want gradients of. We likewise define an augmented adjoint state, which contains the adjoint dynamics of our variables.

$$\text{augmented state: } \begin{bmatrix} \mathbf{y} \\ \theta \\ t \end{bmatrix}$$

How this state vector changes through time is governed by:

$$\frac{d}{dt}\begin{bmatrix} \mathbf{y}(t) \\ \theta(t) \\ t(t) \end{bmatrix} = f_{\text{aug}}\!\left(\begin{bmatrix} \mathbf{y}(t) \\ \theta(t) \\ t(t) \end{bmatrix}\right) = \begin{bmatrix} f(\mathbf{y}(t), t, \theta_t) \\ \mathbf{0} \\ 1 \end{bmatrix}$$

The zero and the one are there because the parameters (θ) do not change during a forward pass, and the time variable differentiated with respect to time itself is one. Following this result, we define an adjoint vector with the same structure:

$$\mathbf{a}_{\text{aug}} = \begin{bmatrix} \mathbf{a}_y \\ \mathbf{a}_\theta \\ \mathbf{a}_t \end{bmatrix} = \begin{bmatrix} \dfrac{dL}{d\mathbf{y}(t)} & \dfrac{dL}{d\theta(t)} & \dfrac{dL}{dt(t)} \end{bmatrix}^T$$

Another important equation is the derivative of $f_{\text{aug}}$ with respect to our augmented vector:

$$\frac{\partial f_{\text{aug}}}{\partial \begin{bmatrix} \mathbf{y} & \boldsymbol{\theta} & \mathbf{t} \end{bmatrix}} = \begin{bmatrix} \dfrac{\partial f}{\partial \mathbf{y}} & \dfrac{\partial f}{\partial \boldsymbol{\theta}} & \dfrac{\partial f}{\partial \mathbf{t}} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} \end{bmatrix}$$

Going further, we simply get the derivative of the augmented adjoint state:

From
$$\frac{da(t)}{dt} = -a(t)\frac{\partial f(\mathbf{y}(t), t, \theta_t)}{\partial \mathbf{y}(t)}$$
we get
$$\frac{d\mathbf{a}_{\text{aug}}(t)}{dt} = -\begin{bmatrix} \mathbf{a}_y & \mathbf{a}_\theta & \mathbf{a}_t \end{bmatrix}\begin{bmatrix} \dfrac{\partial f}{\partial \mathbf{y}} & \dfrac{\partial f}{\partial \boldsymbol{\theta}} & \dfrac{\partial f}{\partial \mathbf{t}} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} \end{bmatrix} = -\begin{bmatrix} \mathbf{a}_y\dfrac{\partial f}{\partial \mathbf{y}} & \mathbf{a}_y\dfrac{\partial f}{\partial \boldsymbol{\theta}} & \mathbf{a}_y\dfrac{\partial f}{\partial \mathbf{t}} \end{bmatrix}(t)$$
(Only the first row of the Jacobian is nonzero, so every component of the result is driven by $\mathbf{a}_y$.)
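As a sketch of the augmented dynamics in code (the tuple-based structure here is purely illustrative; real implementations flatten everything into one vector for the ODE solver):

```python
def f_aug(y, theta, tt):
    """Dynamics of the augmented state [y, theta, t]:
    y evolves under f, theta stays constant, and t moves at unit rate."""
    W, b = theta
    dy  = np.maximum(0.0, y @ W + b)                # dy/dt = f(y, t, theta)
    dth = (np.zeros_like(W), np.zeros_like(b))      # dtheta/dt = 0
    dtt = 1.0                                       # dt/dt = 1
    return dy, dth, dtt
```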

The reason we derived the gradient of the loss with respect to the output, dL/dy_1, is that it is passed in as the first component of the initial augmented adjoint state, from which dL/dy_0 can be calculated (although that particular value may not end up being used).

Of the three components in the output, the first can be used to obtain the gradient with respect to the hidden states, the second the gradient with respect to our parameters, and the final one the gradient with respect to the time vector. Since we do not need the first and final elements here, we can simply ignore them. Let's focus on the second term, which describes the dynamics of the adjoint state of our parameters. Our final gradient for the parameters is obtained by integrating backwards in time, from T to 0.

$$\frac{dL}{d\theta} = -\int_{t_N}^{t_0} a(t)\frac{\partial f(h(t), t, \theta_t)}{\partial \theta(t)}\,dt = -\int_T^0 \mathbf{a}_y\frac{\partial f}{\partial \boldsymbol{\theta}}\,dt = -\int_T^0 \frac{dL}{d\mathbf{y}(t)}\,\frac{\partial\,\sigma\!\left(\mathbf{y}(t)\,\mathbf{W}_{n \times n} + \mathbf{B}_{1 \times n}\right)}{\partial \begin{bmatrix} \mathbf{W} & \mathbf{B} \end{bmatrix}}\,dt$$

With this, we have finally found the gradient with respect to the parameters! Since there were a lot of equations, the overall flow of inputs and gradients may have become a little foggy. Here is the entire process, step by step, to get a clearer picture.
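As one possible summary in code, here is the whole pipeline stitched together from the sketches above (Euler throughout, with a made-up learning rate for the final update):

```python
# 1. Forward pass: integrate the dynamics from t = 0 to t = T
y1 = odeint_euler(f, y0, 0.0, 1.0, (W, B), steps=len(t) - 1)

# 2. Loss gradient at the output -- the initial adjoint state a(t_N)
dL_dy1 = -2.0 * (true - y1)

# 3. Backward pass: integrate the adjoint system backwards in time,
#    recovering dL/dy0 and accumulating dL/dW, dL/dB along the way
a0, dW, dB = adjoint_backward(y1, 0.0, 1.0, (W, B), dL_dy1,
                              steps=len(t) - 1)

# 4. A gradient-descent step on the parameters (learning rate made up)
W -= 0.01 * dW
B -= 0.01 * dB
```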

Conclusion

Neural ODEs are closely related to what are sometimes called liquid neural networks, which build on the same continuous-time idea. This method offers a different approach to solving the problems that neural networks are useful for. It was quite popular when it came out in 2018 (it won a best paper award at NeurIPS), and yet it failed to see adoption in mainstream models. There can be various reasons why this happened; one is that these models (like SSMs) are better suited to continuous datasets, and model data best when the underlying generator of the data is continuous in nature. This architecture can also be viewed as part of the larger family of physics-informed neural networks, a topic I am hoping to explore more! Thank you.
