10

In multivariable calculus, I was taught to compute the chain rule by drawing a "tree diagram" (a directed acyclic graph) representing the dependence of one variable on the others. I now want to understand the theory behind it.

Examples: Let $y$ and $x$ both be functions of $t$. Let $z$ be a function of both $x$ and $y$.

The derivative of $z$ with respect to $t$ is: $$\frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt}$$

To compute this derivative, I was taught to draw a graph with the following edges: $x \to z$, $y \to z$, $t \to x$, and $t \to y$. Source: http://www.math.hmc.edu/calculus/tutorials/multichainrule/

[Tree diagram from the tutorial, with the edges listed above: $x \to z$, $y \to z$, $t \to x$, and $t \to y$.]

These tree diagrams can be constructed for arbitrarily complex functions with many variables.

In general, to find the derivative of a dependent variable with respect to an independent variable, you sum over all of the different paths from the independent variable to the dependent variable. Traveling along a path, you multiply the derivatives attached to its edges (e.g. $\frac{\partial z}{\partial x} \cdot \frac{dx}{dt}$).
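For concreteness, here is a small symbolic check of this path-sum rule (just a sketch in sympy; the particular choices $z = x^2 y$, $x = \cos t$, $y = \sin t$ are purely illustrative and not part of the rule itself):

```python
# Verify the path-sum rule symbolically for one illustrative choice of functions.
import sympy as sp

t, x, y = sp.symbols('t x y')
x_expr, y_expr = sp.cos(t), sp.sin(t)   # x(t) and y(t)
z = x**2 * y                            # z(x, y)

# Sum over paths: (t -> x -> z) + (t -> y -> z), multiplying derivatives along each path
path_sum = (sp.diff(z, x) * sp.diff(x_expr, t)
            + sp.diff(z, y) * sp.diff(y_expr, t)).subs({x: x_expr, y: y_expr})

# Direct computation: substitute first, then differentiate with respect to t
direct = sp.diff(z.subs({x: x_expr, y: y_expr}), t)

print(sp.simplify(path_sum - direct))   # prints 0, so the two agree
```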

Why does this work?

  • 2
    Everything should pretty much follow from the chain rule and induction. Have you tried proving it yourself? – 2010-12-16
  • 0
    As I know it, the chain rule only works for single-variable functions. Part of my confusion about this is why differentiating in multiple variables corresponds to addition. Any insight? – 2010-12-17
  • 0
    The issue is with the way you have written it. If you use the partial differentiation notation, it will be clear. – 2010-12-17
  • 0
    @dsg: so you have never formally learned the multivariate chain rule?! – 2010-12-17
  • 4
    Is the [graph-theory] tag really relevant? – 2010-12-17
  • 2
    @Qiaochu, not everyone learns it formally; most people are more than happy with learning how to use it to get done with a course. – 2010-12-17
  • 0
    @Moron: Yes, the [graph-theory] tag is relevant. There is a deep reason why summing over all paths from source (independent variable) to sink (dependent variable) is equivalent to taking the derivative of the complex function. I just don't know the deep reason yet! – 2010-12-17
  • 2
    @dsg: When you find it, feel free to add it back. Meanwhile I am going to remove it. – 2010-12-17
  • 0
    @Qiaochu: Did you mean to remove the [group-theory] tag and add back the [graph-theory] tag? – 2010-12-17
  • 0
    @Moron: yes. There is a connection to graph theory here, while I don't see a connection to group theory. – 2010-12-17
  • 0
    @Qiaochu: I don't think we should add the [graph-theory] tag just because we can call the diagram a graph... I suppose a lot of category theory questions ought to be tagged too then! – 2010-12-17
  • 0
    @Moron: this is a well-known connection between matrix theory and incidence structures, including graphs. It has many nontrivial applications and is part of the question the OP is asking, so I think the question deserves to be tagged as such. – 2010-12-17
  • 0
    @Qiaochu: Fine. I won't mess with the tag :-) – 2010-12-17

3 Answers

8

The point of derivatives in one variable is to provide linear approximations $f(x) = f(p) + f'(p) (x - p) + o(|x - p|)$ to nice functions. Multivariate derivatives work the same way, except "linear approximation" here means approximation by a general linear transformation (a matrix) instead of a scalar.

This is made precise by the following definition: we say that a function $f : \mathbb{R}^n \to \mathbb{R}^m$ has total derivative a linear transformation $df_p : \mathbb{R}^n \to \mathbb{R}^m$ at a point $p$ if there exists $\epsilon > 0$ and a function $E_p(h)$ defined for $|h| < \epsilon$ such that

$$f(p + h) = f(p) + df_p(h) + |h| E_p(h)$$

where $\lim_{h \to 0} E_p(h) = 0$. The matrix of $df_p$ (with respect to the standard bases) is called the Jacobian of $f$ at $p$. In little-o notation, we write this

$$f(p + h) = f(p) + df_p(h) + o(|h|).$$
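As a quick numerical illustration of this definition (a sketch with an assumed example $f(x, y) = (xy, \sin x)$, not part of the original argument): the Jacobian matrix plays the role of $df_p$, and the error term $E_p(h)$ visibly shrinks as $h \to 0$.

```python
# Check that f(p + h) = f(p) + df_p(h) + |h| E_p(h) with E_p(h) -> 0, for an illustrative f.
import numpy as np

def f(v):
    x, y = v
    return np.array([x * y, np.sin(x)])

def jacobian(p):
    x, y = p
    return np.array([[y, x],
                     [np.cos(x), 0.0]])   # matrix of the total derivative df_p

p = np.array([1.0, 2.0])
direction = np.array([0.6, -0.8])          # a fixed direction; we shrink its length
for scale in (1e-1, 1e-2, 1e-3, 1e-4):
    h = scale * direction
    E = (f(p + h) - f(p) - jacobian(p) @ h) / np.linalg.norm(h)
    print(scale, np.linalg.norm(E))        # the error term tends to 0 with |h|
```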

This might seem unnecessarily complicated, but it is the key to understanding the multivariate chain rule. Suppose that in addition to $f$ we have another function $g : \mathbb{R}^m \to \mathbb{R}^k$ with a total derivative $dg_q$ at some point $q$, and suppose that $f(p) = q$. Then

$$gf(p + h) = g \left( f(p) + df_p(h) + o(|h|) \right) = gf(p) + dg_q df_p(h) + o(|h|)$$

where the second equality applies the definition of $dg_q$ at $q = f(p)$ to the increment $df_p(h) + o(|h|)$ (which is $O(|h|)$) and uses the linearity of $dg_q$. In other words,

The total derivative $d(gf)_p$ of $gf$ at $p$ is the (matrix) product of the total derivatives $dg_q$ and $df_p$.

This is the most general statement of the multivariate chain rule. The relationship to tree diagrams is that one can model matrix multiplication using composition of incidence matrices, which come from graphs depicting incidence relationships between sets.
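To make the incidence-matrix remark concrete, here is a toy sketch (the layer sizes and edges are hypothetical, chosen only to illustrate the idea): the $(i, k)$ entry of a product of $0/1$ incidence matrices counts the paths from $i$ to $k$ through the middle layer, and replacing the $0/1$ entries by partial derivatives turns "count the paths" into "sum the products of derivatives along each path", which is the tree-diagram rule.

```python
# Entry (i, k) of A @ B is sum_j A[i, j] * B[j, k]: a sum over all two-step paths i -> j -> k.
import numpy as np

A = np.array([[1, 1, 0],
              [0, 1, 1]])    # edges from a 2-node layer to a 3-node layer
B = np.array([[1, 0],
              [1, 1],
              [0, 1]])       # edges from the 3-node layer to a 2-node layer

print(A @ B)                 # each entry counts the paths between the corresponding endpoints
```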

In your particular example, you have a function $t \mapsto (x, y) : \mathbb{R}^1 \to \mathbb{R}^2$ and another function $(x, y) \mapsto z : \mathbb{R}^2 \to \mathbb{R}^1$. The total derivative of the first function is $\left[ \begin{array}{c} \frac{dx}{dt} \\ \frac{dy}{dt} \end{array} \right]$ and the total derivative of the second function is $\left[ \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y} \right]$, so the total derivative of their composition is the product

$$\frac{dz}{dt} = \left[ \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y} \right] \left[ \begin{array}{c} \frac{dx}{dt} \\ \frac{dy}{dt} \end{array} \right] = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt}$$

and this is precisely the formula you give. The connection to diagrams is that one can represent a composition of linear transformations $\mathbb{R}^1 \to \mathbb{R}^2$ and $\mathbb{R}^2 \to \mathbb{R}^1$ using a pair of incidence matrices, one to represent incidences between a $1$-element set and a $2$-element set, and the other to represent incidences between that $2$-element set and another $1$-element set.
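Here is the same composition written out as a product of total derivatives (a sketch; the specific functions $x = \cos t$, $y = \sin t$, $z = x^2 y$ are an illustrative choice, not from the question):

```python
# Multiply the 1x2 Jacobian of (x, y) |-> z by the 2x1 Jacobian of t |-> (x, y).
import sympy as sp

t, x, y = sp.symbols('t x y')
x_expr, y_expr = sp.cos(t), sp.sin(t)
z = x**2 * y

df = sp.Matrix([[sp.diff(x_expr, t)],
                [sp.diff(y_expr, t)]])      # total derivative of t |-> (x, y), a column
dg = sp.Matrix([[sp.diff(z, x),
                 sp.diff(z, y)]]).subs({x: x_expr, y: y_expr})  # of (x, y) |-> z, a row

dz_dt = (dg * df)[0, 0]                     # row times column = dz/dt
direct = sp.diff(z.subs({x: x_expr, y: y_expr}), t)
print(sp.simplify(dz_dt - direct))          # prints 0
```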

  • 0
    This is fantastic! Can you elaborate or provide a reference discussing how "one can model matrix multiplication using composition of incidence matrices"? Thank you. – 2010-12-17
  • 0
    @dsg: this is just a combinatorial interpretation of matrix multiplication and is straightforward to prove, but it is discussed, for example, in Brualdi and Cvetkovic: http://books.google.com/books?id=pwx6t8QfZU8C&printsec=frontcover&dq=combinatorial+approach+to+matrix+theory – 2010-12-17
  • 0
    thank you for a great answer! – 2010-12-17
  • 0
    The Brualdi and Cvetkovic text can be accessed here: http://www.crcnetbase.com/isbn/9781420082241 – 2011-04-04
2

Think about the differentiation in terms of derivatives along the different axes.

So what you have is, in essence, $\frac{dz}{dt} = \frac{\partial z}{\partial x}\big|_y \frac{dx}{dt} + \frac{\partial z}{\partial y}\big|_x \frac{dy}{dt}$.

The sum appears because you are not travelling along either of those axes: your path has a component along each axis, and adding the two contributions gives the rate of change of $z$ along that path.

  • 0
    @dsg, does this answer help somewhat? I am not quite sure what you expect though. – 2010-12-17
1

This video will certainly clarify things: http://www.youtube.com/watch?v=2bF6H_xu0ao.

Although it may take a bit longer, I personally find that computing the total differential is substantially easier and more intuitive than a tree diagram.
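For what it's worth, here is a small sketch of that total-differential route (the functions $z = x^2 y$, $x = \cos t$, $y = \sin t$ are again just an illustrative choice, not from the answer):

```python
# Form the total differential dz = z_x dx + z_y dy, then divide through by dt.
import sympy as sp

t, x, y, dx, dy = sp.symbols('t x y dx dy')
z = x**2 * y

dz = sp.diff(z, x) * dx + sp.diff(z, y) * dy       # total differential of z
print(dz)                                          # 2*x*y*dx + x**2*dy

# Replace dx, dy by dx/dt, dy/dt (and x, y by x(t), y(t)) to get dz/dt
dz_dt = dz.subs({dx: sp.diff(sp.cos(t), t), dy: sp.diff(sp.sin(t), t),
                 x: sp.cos(t), y: sp.sin(t)})
print(sp.simplify(dz_dt - sp.diff(sp.cos(t)**2 * sp.sin(t), t)))   # prints 0
```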