Vector Calculus Series Part 3 | The Chain Rule

The Rule That Chains Everything Together

How to differentiate a function of a function, and why it powers all of deep learning.

May 20, 2026

In Part 2 we introduced the partial derivative and the gradient. We learned to differentiate a function of several variables by varying one input at a time. But what happens when the inputs of a function are themselves functions of other variables?

Vector Calculus Series Part 2 | The Gradient

Halik Mamam

May 19


This situation is everywhere in practice. And the tool that handles it is the chain rule: the rule that lets you differentiate through layers of composition. It is the most important differentiation rule in this series, and the mathematical backbone of backpropagation in neural networks.

Meet Nadia

Nadia is a meteorologist working for a weather service in Switzerland. Her job is to predict the temperature at any location in the Alps. She knows two things from her models.

First, the altitude at any position (x, y) on the map (where x runs east along the longitude axis and y runs north along the latitude axis, both measured in kilometers from the summit) is given by:

h(x, y) = 3000 - x² - 2y²

The altitude is 3000 meters at the origin (the summit) and decreases as you move away. The altitude h is measured in meters.

Second, the temperature at any altitude h is given by:

T(h) = 20 - 0.006 · h

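Temperature drops by 0.006°C for every meter of altitude gained, or 6°C per kilometer. This rate of cooling with height is called the environmental lapse rate; 6°C per kilometer is a rounded version of the roughly 6.5°C per kilometer used in the standard atmosphere.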

Nadia wants to know: if someone moves slightly east (increasing x), how does the temperature at their location change? This requires differentiating T with respect to x, but T does not depend directly on x. It depends on h, which depends on x. We need to chain the two relationships together.
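
To make the composition concrete, here is a minimal Python sketch of Nadia's two models and the function obtained by chaining them. The function names (altitude, temperature, temperature_at) are illustrative choices, not part of the original model.

```python
def altitude(x, y):
    """Altitude h(x, y) in meters, using the model from the article."""
    return 3000 - x**2 - 2 * y**2


def temperature(h):
    """Temperature T(h) in degrees Celsius at altitude h (meters)."""
    return 20 - 0.006 * h


def temperature_at(x, y):
    """Composition T(h(x, y)): temperature as a function of position."""
    return temperature(altitude(x, y))


# At the summit (0, 0) the altitude is 3000 m, so T is about 20 - 18 = 2 °C.
print(temperature_at(0, 0))
```

Notice that temperature_at never uses x or y directly: position only influences temperature through the altitude, which is exactly the situation the chain rule is built for.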

The Chain Rule For One Variable

Before tackling Nadia’s situation, let us recall the chain rule from Part 1 in its simplest form.

Vector Calculus Series Part 1 | The Derivative

Halik Mamam

May 18


If h is a function of x, and T is a function of h, then the composed function T(h(x)) has derivative:

dT/dx = (dT/dh) · (dh/dx)

Read it as: the rate at which T changes with x equals the rate at which T changes with h, times the rate at which h changes with x.

The name chain rule comes from exactly this: you chain the rates of change together. If h changes twice as fast as x, and T changes three times as fast as h, then T changes six times as fast as x. The rates multiply.
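
The "rates multiply" claim can be checked numerically. The sketch below uses two made-up linear functions (h doubles its input, T triples its input) and a central-difference helper named slope; all three are assumptions introduced only for this illustration.

```python
def h(x):
    return 2 * x          # h changes twice as fast as x


def T(h_value):
    return 3 * h_value    # T changes three times as fast as h


def slope(f, x, dx=1e-6):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)


x0 = 1.5
rate_h = slope(h, x0)                          # dh/dx
rate_T = slope(T, h(x0))                       # dT/dh, evaluated at h(x0)
rate_composed = slope(lambda x: T(h(x)), x0)   # dT/dx of the composition

print(rate_h * rate_T, rate_composed)  # both close to 6
```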

Let us apply this to Nadia’s problem along the x direction (holding y fixed):

dT/dh = -0.006

∂h/∂x = -2x

∂T/∂x = (dT/dh) · (∂h/∂x) = (-0.006) · (-2x) = 0.012x

At position x = 100 km east of the summit:

∂T/∂x = 0.012 × 100 = 1.2 °C per km

Moving east at that position, the temperature increases by 1.2°C per kilometer. This makes sense: moving away from the summit decreases altitude, and lower altitude means warmer temperature.
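
One way to sanity-check this rate is to take a small step east and compare the actual temperature change with the change the derivative predicts. The sketch below assumes a step of 0.1 km and an arbitrary fixed y of 0; both are illustrative choices. At this distance the quadratic altitude model has already dropped below sea level, so treat the numbers as a check of the mechanics rather than a forecast.

```python
def temperature_at(x, y):
    h = 3000 - x**2 - 2 * y**2   # altitude in meters
    return 20 - 0.006 * h        # temperature in degrees Celsius


def dT_dx(x):
    # Chain rule: (dT/dh) * (∂h/∂x) = (-0.006) * (-2x) = 0.012 x
    return (-0.006) * (-2 * x)


x0, y0 = 100.0, 0.0   # y0 is an arbitrary fixed value (assumption)
step = 0.1            # move 0.1 km east (assumption)

predicted = dT_dx(x0) * step
actual = temperature_at(x0 + step, y0) - temperature_at(x0, y0)

print(predicted, actual)  # roughly 0.12 and 0.12006
```

The two numbers agree to within the small second-order term, and the gap shrinks as the step gets smaller.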

The Chain Rule In The Multivariate Setting

Nadia’s function depends on two variables x and y, not just x. The chain rule extends naturally to this case.

The composed function T(h(x, y)) has two partial derivatives, one for each input. Each is computed by the same chaining idea:

∂T/∂x = (dT/dh) · (∂h/∂x)

∂T/∂y = (dT/dh) · (∂h/∂y)

For Nadia’s functions:

dT/dh = -0.006

∂h/∂x = -2x

∂h/∂y = -4y

∂T/∂x = (-0.006) · (-2x) = 0.012x

∂T/∂y = (-0.006) · (-4y) = 0.024y

The gradient of T with respect to position (x, y) is:

∇T(x, y) = (0.012x, 0.024y)

At the position (x, y) = (100, 50):

∇T(100, 50) = (0.012 × 100, 0.024 × 50) = (1.2, 1.2)

Moving east and moving north both increase the temperature at the same rate at this position, 1.2°C per km in each direction.
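
The gradient above can be cross-checked against finite differences. This is a minimal sketch: the helper names and the step size of 0.001 are assumptions made for the illustration.

```python
def temperature_at(x, y):
    h = 3000 - x**2 - 2 * y**2   # altitude in meters
    return 20 - 0.006 * h        # temperature in degrees Celsius


def grad_T_chain(x, y):
    """Gradient of T with respect to (x, y), built with the chain rule."""
    dT_dh = -0.006
    dh_dx, dh_dy = -2 * x, -4 * y
    return (dT_dh * dh_dx, dT_dh * dh_dy)


def grad_T_numeric(x, y, step=1e-3):
    """Central-difference estimate of the same gradient."""
    dT_dx = (temperature_at(x + step, y) - temperature_at(x - step, y)) / (2 * step)
    dT_dy = (temperature_at(x, y + step) - temperature_at(x, y - step)) / (2 * step)
    return (dT_dx, dT_dy)


print(grad_T_chain(100.0, 50.0))    # close to (1.2, 1.2)
print(grad_T_numeric(100.0, 50.0))  # close to (1.2, 1.2)
```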

The General Form

The real power of the chain rule emerges when both h and T are multivariate functions. Suppose:

x is a vector in ℝⁿ
h(x) is a function from ℝⁿ to ℝᵐ (m intermediate variables)
T(h) is a function from ℝᵐ to ℝ

Then the gradient of T with respect to x is:

∇ₓT = Jₕ(x)ᵀ · ∇ₕT

where Jₕ(x) is the m × n Jacobian matrix of h at x (the matrix of all partial derivatives of h with respect to x, which we will define properly in the next post) and ∇ₕT is the gradient of T with respect to h, evaluated at h(x).
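
Nadia's example is the special case n = 2, m = 1, so the Jacobian of h is a single row. The sketch below writes the product Jₕ(x)ᵀ · ∇ₕT with plain Python lists; the helper name jacobian_transpose_times is an illustrative assumption, not a library function.

```python
def jacobian_transpose_times(jacobian, grad):
    """Return Jᵀ · grad, where jacobian is a list of m rows of length n."""
    m, n = len(jacobian), len(jacobian[0])
    return [sum(jacobian[i][j] * grad[i] for i in range(m)) for j in range(n)]


x, y = 100.0, 50.0
J_h = [[-2 * x, -4 * y]]   # 1 x 2 Jacobian of h(x, y) = 3000 - x² - 2y²
grad_h_T = [-0.006]        # gradient of T(h) = 20 - 0.006 h with respect to h

grad_x_T = jacobian_transpose_times(J_h, grad_h_T)
print(grad_x_T)  # close to [1.2, 1.2]
```

With more than one intermediate variable, the same function sums one contribution per row of the Jacobian, which is where the matrix product does real work.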

Do not worry if this looks abstract for now. The key message is this:

The chain rule in the multivariate setting is a matrix multiplication.

This is not a coincidence. It is why gradient computations in neural networks can be organised as sequences of matrix multiplications, an operation that modern hardware (GPUs) performs extremely efficiently. The chain rule is not just a mathematical nicety; it is a large part of why training deep networks is computationally practical.
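
As a rough illustration of "a sequence of matrix multiplications", consider a hypothetical two-layer chain h = A x, z = B h, T = c · z. The matrices A and B and the vector c are made-up numbers chosen for easy arithmetic. Because each layer is linear, its Jacobian is just its matrix, and the gradient with respect to x is obtained by applying the transposed Jacobians from the output back toward the input.

```python
def transpose_times(matrix, vector):
    """Return matrixᵀ · vector for a matrix stored as a list of rows."""
    rows, cols = len(matrix), len(matrix[0])
    return [sum(matrix[i][j] * vector[i] for i in range(rows)) for j in range(cols)]


# Hypothetical two-layer chain: h = A x, z = B h, T = c · z.
A = [[1, 2], [3, 4]]   # Jacobian of h with respect to x
B = [[0, 1], [1, 1]]   # Jacobian of z with respect to h
c = [1, 1]             # gradient of T with respect to z

grad_h = transpose_times(B, c)        # Bᵀ · c
grad_x = transpose_times(A, grad_h)   # Aᵀ · (Bᵀ · c)

print(grad_h, grad_x)  # [1, 2] [7, 10]
```

Working from the output backward keeps every intermediate result a vector, so each step is a matrix-vector product rather than a matrix-matrix product.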

Wrapping Up

The chain rule states that when a function T depends on intermediate variables h, which in turn depend on inputs x, the derivative of T with respect to x is obtained by multiplying the rates of change along the chain. In the multivariate setting this becomes a matrix multiplication involving the Jacobian, which we will define properly in the next post.

We saw the chain rule at work in two settings: Nadia’s temperature model along a single direction x, and the two-variable extension that gave us the full gradient.

In the next post we take one step back and ask: what happens when the function itself returns a vector instead of a number? The answer is the Jacobian matrix, and it brings the chain rule and the determinants from our earlier series together in one place.
