
I'm currently looking at the objective function for a multiclass classifier, which has the form

$$\sum\limits_{i=1}^{N_I} \sum\limits_{k=1,\atop k \neq y_i}^{N_K} L(1+ \mathbf{w_k}\cdot\mathbf{x_i}-\mathbf{w_{y_i}}\cdot\mathbf{x_i})$$

where
$X$ is an $N_I \times N_F$ data matrix whose rows are the instances $\mathbf{x_i}$
$y$ is a vector of class labels, with $y_i$ the label of instance $i$
$W$ is an $N_K \times N_F$ matrix where each row $\mathbf{w_k}$ holds the weights of the hyperplane separating class $k$ from the rest
$L$ is some arbitrary loss function returning a real number, e.g. squared loss, absolute loss, hinge loss, etc.
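
For concreteness, here is a minimal NumPy sketch of the objective as I understand it (the function names, array shapes, and the `hinge` helper are just placeholders for one possible setup):

```
import numpy as np

def objective(W, X, y, L):
    """Sum over i and over k != y_i of L(1 + w_k.x_i - w_{y_i}.x_i)."""
    N_I = X.shape[0]
    scores = X @ W.T                        # (N_I, N_K); entry [i, k] = w_k . x_i
    correct = scores[np.arange(N_I), y]     # w_{y_i} . x_i for each instance i
    loss = L(1.0 + scores - correct[:, None])
    loss[np.arange(N_I), y] = 0.0           # drop the skipped k == y_i terms
    return loss.sum()

def hinge(m):
    # one possible choice of L
    return np.maximum(0.0, m)
```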

I've been told that its partial derivative with respect to $\mathbf{w_k}$ looks like

$$\partial_{\mathbf{w_k}} = \sum\limits_{i=1 \atop y_i \neq k}^{N_I} \ L'(1+ \mathbf{w_k}\cdot\mathbf{x_i}-\mathbf{w_{y_i}}\cdot\mathbf{x_i})\cdot\mathbf{x_i} - \sum\limits_{i=1 \atop y_i =k}^{N_I} \sum\limits_{l=1,\atop l \neq k}^{N_K} L'(1+ \mathbf{w_l}\cdot\mathbf{x_i}-\mathbf{w_{y_i}}\cdot\mathbf{x_i})\cdot\mathbf{x_i},$$ where $L'$ is the derivative of the loss function.

This answer doesn't make sense to me. Where does the second summation $\left(\sum\limits_{i=1 \atop y_i =k}^{N_I} \sum\limits_{l=1,\atop l \neq k}^{N_K}\right)$ come from? After all, in the initial equation the inner sum skips the case $k = y_i$, so I don't see how that case ends up contributing to the partial derivative at all. When I attempt the derivative myself, I get the first summation but not the second.
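
For concreteness, here is a rough NumPy sketch of the quoted formula (the names and the `L_prime` helper, meaning the derivative of the chosen loss, are just placeholders), which could be checked against a finite-difference estimate of the objective above:

```
import numpy as np

def gradient_wk(W, X, y, k, L_prime):
    """Partial derivative of the objective with respect to row w_k of W,
    following the quoted formula (L_prime = derivative of the loss)."""
    N_I, N_F = X.shape
    scores = X @ W.T
    correct = scores[np.arange(N_I), y]
    margins = 1.0 + scores - correct[:, None]   # [i, l] = 1 + w_l.x_i - w_{y_i}.x_i

    grad = np.zeros(N_F)
    # First sum: instances with y_i != k, where w_k enters as +w_k.x_i
    mask = (y != k)
    grad += (L_prime(margins[mask, k])[:, None] * X[mask]).sum(axis=0)
    # Second sum: instances with y_i == k, where w_k enters as -w_{y_i}.x_i
    # in every l != k term of the inner sum
    for i in np.where(y == k)[0]:
        for l in range(W.shape[0]):
            if l != k:
                grad -= L_prime(margins[i, l]) * X[i]
    return grad
```

Comparing this against a numerical derivative of the objective would be a quick way to see which form of the second summation is right.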

  • You remembered to apply the chain rule, didn't you? (2010-12-01)
  • Yes. That's why the $\mathbf{x_i}$ appears at the end of the first summation. (2010-12-02)
