Exact numeric nth derivatives

jordigh · on Oct 24, 2013

Am I missing something or is this begging the question?

For any function that is not a combination of polynomials, you need to have its Taylor expansion up to the desired order of derivatives, so you can't just take an "arbitrary" function and use this method to compute its derivative in exact arithmetic.

So for anything other than polynomials, you just reword the problem of finding exact derivatives to finding exact Taylor series, and in order to find Taylor series in most cases, you have to differentiate or express your function in terms of the Taylor series of known functions.

Edit: Indeed, take the only non-polynomial example here, a rational function (division by a polynomial). In order to make this work, you have to know the geometric series expansion of 1/(1-x). For each function that you want to differentiate this way, you have to keep adding more such pre-computed Taylor expansions.

crntaylor · on Oct 24, 2013

The typical approach is to provide overloaded versions of primitive functions (generally addition, multiplication, subtraction, division, powers, trig and hyperbolic trig functions, exponential and logarithm) for which you explicitly tell the program what the first N terms in the Taylor series are.

The magic is that you also tell it how to compute Taylor series of function compositions, if the Taylor series of the functions being composed are already known - then any arbitrary function composed out the primitive functions can have its Taylor series computed automatically!

For your example, the function 1/(1-x) is the composition of

    x -> -x
    x -> 1+x
    x -> 1/x

and so its Taylor series is already known as long as you have already defined negation, addition and reciprocation.

jordigh · on Oct 24, 2013

So instead of implementing a full CAS that knows linearity of differentiation along with the product and chain rules, you implement something that for a little less computation can give you exact derivatives at any given point. Am I understanding this correctly?

crntaylor · on Oct 24, 2013

Yes, that's more or less it. The system you implement still knows about linearity of differentiation, the product rule, chain rule etc but it's not a full blown CAS. It can't give you the symbolic derivative of a function (although I guess you can technically apply AD to any numeric type, including a symbolic type..) but it can compute the numerical value of the derivatives at arbitrary points.

jliszka · on Oct 24, 2013

You're right, it is begging the question, for any simple function, you basically have to tell it what all the derivatives are for it to work.

The power of this technique is that it handles the product rule and chain rule for you, so composition and multiplication of functions work automatically without extra work.

For the case of 1/(1-x), you only need to tell it the derivatives (wrt x) of

f(x, a) = x * a

g(x, a) = x + a

h(x) = 1/x

Then it automatically knows how to compute the derivatives of h(g(f(x, -1), 1)).

kvb · on Oct 24, 2013

No, really you don't need Taylor series at all; see for example Conal Elliott's "Beautiful Differentiation"[1], which shows how to handle arbitrary derivatives of multidimensional functions and never uses Taylor series for anything; all you need to know is that d/dx sin(x) = cos(x), etc. Just as a warning, Conal's paper is quite dense and is probably not a good introduction.

[1] http://conal.net/papers/beautiful-differentiation/

jordigh · on Oct 24, 2013

But knowing that sin' = cos is again begging the question, just rephrasing it back.

I think I get it, though. You can get away with simplifying intermediate steps of a differentiation computation if you only care about the value of the function at a single point.

timmclean · on Oct 25, 2013

I am not a mathematician, but I don't think this is actually begging the question.

> you can't just take an arbitrary function and use this method

Actually, you can, AFAIK. The relation f(x + E) = f(x) + f'(x)E still holds.

If we try this method with a rational function:

    f(x) = (x - 1)/(x - 2)

    f(x + E)
     = (x + E - 1) / (x + E - 2)
     = (x + E - 1)(x - E - 2) / ((x + E - 2)(x - E - 2))
     = (x^2 - x E - 2 x + x E - E^2 - 2 E - x + E + 2) / (x^2 - x E - 2 x + x E - E^2 - 2 E - 2 x + 2 E + 4)
     = (x^2 - 3 x + 2 - E) / (x^2 - 4 x + 4)
     = (x^2 - 3 x + 2) / (x^2 - 4 x + 4) - (1 / (x^2 - 4 x + 4)) E
     = (x - 1) / (x - 2) - (1 / (x - 2)^2) E

so

    f'(x) = -1 / (x - 2)^2

since

    f(x + E) = f(x) + f'(x) E

Edit to add: the key idea here being that no knowledge of differentiation is needed, just tricks for manipulating expressions involving x and E until they are in normal form.

crntaylor · on Oct 24, 2013

Very neat. Presumably there is a more efficient method for implementing Nth order automatic differentiation than encoding the dual numbers as NxN matrices, though? To multiply the matrices takes O(N^3) time, whereas by exploiting their known structure I think you should be able to do it in O(N^2) time. Am I wrong?

jliszka · on Oct 24, 2013

You're right, there's only O(n) information in these particular n x n matrices, so you can multiply them in O(n^2).

The code provided memoizes cell values using the key (r - c), so only O(n^2) work is required to read off the top row of the matrix. I'm planning to make that more explicit in a follow-up post.

on Oct 24, 2013

[deleted]

crntaylor · on Oct 24, 2013

True, but I don't think Strassen and other efficient algorithms are much used in practice. If you go poke around in the source code for BLAS or LAPACK you'll see that the matrix multiplication algorithm used is an O(N^3) algorithm.

It's not the naive O(N^3) algorithm though - it's a block multiplication algorithm which generally gives faster results than the naive algorithm, because it's able to exploit locality so that you don't need to shift data into and out of the CPU cache all the time.

30thElement · on Oct 24, 2013

If I remember correctly, the most efficient algorithms have a ridiculously large constant factor, to the point where you don't have enough RAM to even store the matrix, let alone do anything with it.

rcthompson · on Oct 24, 2013

That's not what GP is talking about. Those algorithms are the sub-cubic ones, the ones that are O(x^2.something). GP is talking about an algorithm that is still O(n^3), but is not just a straight translation of the matrix multiplication formula into code, but rather does things in a way that is more friendly to the CPU cache, resulting in significant speed gains.

dhammack · on Oct 24, 2013

There's an interesting python library [1] which implements AD as well as has some neat features like automatic compilation to optimized C. It's developed by the AI lab at the University of Montreal, and is pretty popular in deep learning circles. I've found it to be a huge time saver to not worry whether you screwed up your gradient calculations when doing exploratory research!

[1] http://deeplearning.net/software/theano/

lightcatcher · on Oct 25, 2013

My personal favorite feature of Theano is the automatic compilation to CUDA code (which would get about 8-15x speed-up over the optimized C code for the deep learning research I was doing).

tel · on Oct 24, 2013

See also the ad package [1] for Haskell which has a number of interesting features of this vein.

[1] http://hackage.haskell.org/package/ad

fdej · on Oct 24, 2013

This seems to be essentially the same thing as "power series arithmetic" (first-order "dual arithmetic" is equivalent to arithmetic in the ring of formal power series modulo x^2, but you can make that x^n).

Encoding power series as matrices is sometimes convenient for theoretical analysis (or, as here, educational purposes), but it's not very efficient. The space and time complexities with matrices are O(n^2) and O(n^3), versus O(n) and O(n^2) (or even O(n log n) using FFT) using the straightforward polynomial representation (in which working with hundreds of thousands of derivatives is feasible). In fact some of my current research focuses on doing this efficiently with huge-precision numbers, and with transcendental functions involved.

BoppreH · on Oct 24, 2013

I couldn't believe it would work, so I made a toy implementation in Python using simple operator overloading: https://github.com/boppreh/derivative

All values tried so far agree with Wolfram Alpha, so color me surprised and happy for learning something new.

backprojection · on Oct 24, 2013

I think it's worth noting that the problem with numerical differentiation, fundamentally, is that differentiation is an unbounded operator. In finite-differences, (the more obvious approach), you assume that your data are samples of some, general, function.

The problem then, is that that general functions have no (essential) bandlimit [1]. Remember that differentiation acts as a multiplication by a monomial, in the frequency domain [2]. Non-constant polynomials always eventually blow up away from 0, so in differentiating, you're multiplying a function by something that blows up, in the frequency domain. This means that, in the result, higher frequencies are going to dominate over lower frequencies, at a polynomial rate.

Let me be clear, the problem with numerical differentiation is not just that rounding errors accumulate, it's that differentiation is fundamentally unstable, and not something you want to apply to real-world data.

It depends very much on what your application is, however, I think generally a better approach to AD is to redefine your differentiation, by composing it with a low-pass filter. If designed properly, your low-pass filter will 'go to zero' faster (in the frequency-domain) than any polynomial, thus making this new operator bounded, and hence numerically more stable. It's not a panacea, but it begins to address the fundamental problem.

One example of such a filter is Gamma(n+1, n x^2)/Factorial[n], where Gamma is the incomplete gamma function [3].

In Python:

scipy.special.gammaincc(n+1,nx2) or mpmath.gammainc(n+1,nx2, regularized=True)

To see why this is a nice choice, notice item 2 in [4]. This filter is simply the product of exp(- x^2) (the Gaussian) multiplied by the first n-terms of the Taylor series of exp(+ x^2), (1/ the-Gaussian). Since this series converges unconditionally everywhere, as n-> +infinity, this filter converges to 1 for a fixed x (as you increase n), however, since it's still a gaussian times a polynomial, it always converges to 0 as you increase x, but fix n.

This is my area of research, so if anyone's interested I can give more details.

[1] https://en.wikipedia.org/wiki/Band-limit [2] https://en.wikipedia.org/wiki/Fourier_transform#Analysis_of_... [3] https://en.wikipedia.org/wiki/Incomplete_gamma_function [4] https://en.wikipedia.org/wiki/Incomplete_gamma_function#Spec...

jules · on Oct 24, 2013

Read the article, it's not about computing derivatives of real world data (using finite differences or whatever), it's about exact derivatives of rational functions specified by a computer program. While what you wrote is interesting, none of it applies to this article.

acjohnson55 · on Oct 24, 2013

^ Totally this. It's not meant for differentiating a function known only by its points. It's meant for evaluating nth-order derivatives of functions you can already evaluate at arbitrary values. The functions don't even have to be rational. As long as the function can be calculated by a computer by some method, so can its derivatives.

This comes in handy in all sorts of things where you might have designed this fancy kernel function to use in some process where you need to be able to calculate the value of the function and also of some derivatives--backpropagation for example [1].

As I was told by someone in the field, at one point, people used to generate machine learning publications simply by finding functions that required fancy mathematical tricks to find closed-form derivatives of chosen functions so that they would be usable in learning algorithms. But in many cases, this work is unnecessary if you use automatic differentiation.

It's a really cool concept, applicable in specific situations. If you need to know the derivative of a function that's not fully specified, you need numerical differentiation [2]. If you need a closed-form expression for your derivative function, that's when you need symbolic differentiation [3].

[1] http://en.wikipedia.org/wiki/Backpropagation [2] http://en.wikipedia.org/wiki/Numerical_differentiation [3] http://en.wikipedia.org/wiki/Symbolic_differentiation

backprojection · on Oct 24, 2013

> It's not meant for differentiating a function known only by its points. It's meant for evaluating nth-order derivatives of functions you can already evaluate at arbitrary values

This is correct. So my comment "I think generally a better approach to AD is to redefine your differentiation", doesn't really apply to AD. Essentially, AD is about differentiating functions, not data.

jordigh · on Oct 24, 2013

I find it funny how you think of differentiation in terms of "frequency domain", "bandlimit" and "filter". Very signal-processy, very engineery. A mathematician would speak about unbounded operators in a Banach space. :-)

Edit: Reminds me of the list on page 3 here: http://arxiv.org/pdf/math/9404236v1

backprojection · on Oct 24, 2013

> I find it funny how you think of differentiation in terms of "frequency domain", "bandlimit" and "filter". Very signal-processy, very engineery. A mathematician would speak about unbounded operators in a Banach space. :-)

This more or less depends on what kind of mathematician you're talking to. A mathematician working on Frames and signals almost certainly uses that language. But someone working in PDEs probably doesn't, even if the theory intersects a lot. It's probably a sign that there's nothing wrong with mathematicians and engineers sharing concepts :).

crntaylor · on Oct 24, 2013

This is interesting. I am generally very skeptical about taking derivatives of functions computed from real-world data for exactly this reason. What I normally end up doing is applying some form of kernel smoothing (e.g. nearest K points with a Gaussian kernel) to approximate the function, and then compute derivatives of that instead.

Would you recommend another way?

backprojection · on Oct 24, 2013

This is pretty much exactly what I'm saying. So you're convolving with a gaussian kernel. That's the exact same thing as multiplying, in the frequency domain, by another gaussian. (convolution theorem, and the FT of a gaussian is a gaussian). The Gamma function filter goes one step further, it throws in a polynomial to make that gaussian 'flatter', which is more suitable for a low-pass filter.

Try the script at the end of this comment. Here's the output http://imgur.com/5948cW6

Notice that the 0-th one is just a gaussian. If you're interested, these things have traditionally been called HDAFs, I have a library to do this kind of thing https://bitbucket.org/nmaxwell/hdaf it may need more documentation. I don't mind if anyone provides constructive feedback on it.

The functionality you'd be interested in is hdaf.hdaf_periodic_samples, - you can convolve with that to low-pass filter, and use hertz_to_sigma to get the sigma parameter (from cutoff frequency).

  import numpy, scipy.special
  import matplotlib.pyplot as plt

  x=numpy.linspace(-2,5,1000)
  for n in [0,1,4,10]:
      s=1 if n==0 else n
      h=scipy.special.gammaincc(n+1,s*x**2)
      plt.plot(x, h)
  plt.savefig("filters")

EDIT: formatting.

irremediable · on Oct 24, 2013

Thanks for the insight! It's interesting to read a mathematician's more formal take on this -- I'm from an engineering background, and I've often seen the problem in control theory, where we approximate signals using observers and then differentiate the observed signal.

Edit: I should clarify that it's cool to read a mathematician's insight into the problem as seen by engineers, written in a way that engineers can understand.

short_circut · on Oct 25, 2013

Low pass filters are absolute miracles when working with numerical derivatives. I have had instances where that was the only reason I ever got anything meaningful out of them. It really comes down to the fact that the numerical representation of a function is in general not truly analytic. It is has kinks and bends all over the place.

n00b101 · on Oct 24, 2013

Do you have any references for the approach that you recommended? You're saying that differentiation (not even just numerical differentiation) is "not something you want to apply to real-world data" ... This is surprising since differentiation obviously is used in the real-world in some very critical applications.

mrcactu5 · on Oct 24, 2013

I believe you, but what applications do you have in mind?

backprojection · on Oct 24, 2013

[EDIT] As others have said below, this comment, while maybe useful, doesn't really apply to the article (or AD), since AD is about differentiating functions, not data.

on Oct 24, 2013

[deleted]

waqf · on Oct 24, 2013

Well, it's exact until you evaluate it at a float — because it's the floats that are inexact, not the technique.

And that limitation is just as true of symbolic algebra. Symbolic algebra is "exact" iff you can do exact evaluation, or if your application doesn't require you to evaluate at all. Automatic differentiation is the same, it's just that the unevaluated forms in symbolic algebra are "nicer".

on Oct 24, 2013

[deleted]

porges · on Oct 24, 2013

>Dual numbers cannot, because they use plain old floating point math in the derivative equations.

There's nothing stopping you from switching out the floating point for rationals or something like that.

For example, the Haskell `ad` package has its functions parameterized over all instances of Num: http://hackage.haskell.org/package/ad-3.4/docs/Numeric-AD.ht...

ot · on Oct 24, 2013

> they are almost never used in any real automatic differentiation system

They're efficient enough for first-order derivatives. For example they are used in Ceres, Google's library for non-linear least-squares optimization

https://ceres-solver.googlesource.com/ceres-solver/+/master/...

pmr_ · on Oct 24, 2013

The exact part struck me the most. I was expecting some technique that tries to figure out the maximal error during the calculations and provides enough space for an exact floating point presentation in advance to avoid the usual problem of continually increasing memory demands that you usually get with exact numbers. Such techniques can be used while calculating delaunay triangulations and maybe they are applicable here as well.

fusiongyro · on Oct 25, 2013

Very neat article. This is essentially calculus with infinitesimals (also called "nonstandard analysis") implemented on the machine. If you like the approach, a more general and rigorous investigation can be had by reading H. Jerome Keisler's book Elementary Calculus, which is freely available online in the 2nd edition here:

  http://www.math.wisc.edu/~keisler/calc.html

The third edition is now in print. I've been studying calculus with it off-and-on for a while and I find the approach very intuitive, though Spivak's Calculus is probably a better book, the "standard analysis" is a little less intuitive (and now, evidently, harder to teach a machine).

mrcactu5 · on Oct 24, 2013

congratulations, you have implmented the Zariski tangent space using nilpotent matrices. welcome to the beautiful theory of algebraic geometry and schemes.

http://math.stanford.edu/~vakil/0708-216/216class21.pdf

This really does fall in the ream of algebraic geometry since this method only works for rational functions - as he implemented it.

To numerically compute sin(x + ε) you need the Taylor series.

gpsarakis · on Oct 24, 2013

Nice analysis. Hope you don't mind me adding that by omitting terms of the Taylor series you do have some loss of precision, however small. Also, solving linear equation systems may even introduce instability as the following must be preserved: http://en.wikipedia.org/wiki/Diagonally_dominant_matrix

crntaylor · on Oct 24, 2013

The point is that when you're considering the Taylor series for a dual number argument, you don't lose any precision, because higher powers of the "imaginary" part of the dual number vanish. The example he gives is

    f(a + be) = f(a) + f'(a)be + 0.5 * f''(a) b^2 e^2 + O(e^3)
              = f(a) + f'(a)be

because e^n=0 for all n>1. This isn't an approximation - it's an exact relationship for dual numbers!

You will lose some precision by using floating point numbers instead of an arbitrary-precision real number type, but this is a limitation of the machine you're working on. The method is exact.

jordigh · on Oct 24, 2013

No, the coefficients of the Taylor series are the exact derivatives, assuming the actual arithmetic were exact (it's not, because IEEE 754). There's no loss of precision there.

Pitarou · on Oct 25, 2013

How does this technique compare to a computer implementation of the kinds of techniques we learnt in High School? Is it easier to implement? More efficient? Are there some situations where it isn't appropriate?

Bill_Dimm · on Oct 24, 2013

There seems to be a typo at the beginning of the "Implementing dual numbers" section. It says:

  The number a+bd can be encoded as...

Should be:

  The number a+b*epsilon can be encoded as...

jliszka · on Oct 24, 2013

Thanks! fixing...

tomrod · on Oct 25, 2013

This read nicely until I got to the code block. Does anyone else see this as yellow and gray (with syntax highlighting) coloration--such that it's virtually impossible to read?

svantana · on Oct 24, 2013

Is it just me or this article pretty naive? The headline's use of the word "exact" would imply integer arithmetic only, but the computations are done with floating point. So basically (s)he is trading one rounding error for another, which seems to be small-ish in some particular cases. What about discontinuities? And why forward derivatives only? I hope noone will use this for any application that actually relies on exact derivatives.

Leszek · on Oct 24, 2013

They're exact in the sense that they give the same value as if you had calculated the value of the analytic derivative. This is different from numerical differentiation, which approximates the derivative with finite differences.

Bahamut · on Oct 25, 2013

As a former mathematician, the second sentence is an abomination:

"...but to give you an overview, the idea is that you introduce an algebraic symbol ϵ such that ϵ≠0 but ϵ^2=0"