Hi,
I have recently been reading your excellent continual-learning implementation, in particular the SI part. In the following line of code, you use p.grad, which is the gradient of the regularized loss. However, based on my understanding of SI, the gradient should be computed on the data loss alone, so that it measures how much each weight contributes to the fitting error of the present task. Am I wrong about this, or have I missed an important factor in your implementation? Thanks in advance for your clarification.
Line 248 in d281967

```python
W[n].add_(-p.grad*(p.detach()-p_old[n]))
```
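To make the question concrete, here is a minimal sketch of the alternative I have in mind. The names `data_loss`, `surrogate_loss`, `W`, and `p_old` are hypothetical placeholders (not taken from your code) for the task loss, the SI regularization term, the running path integral, and the previous parameter values. The idea is to capture the data-loss gradients with a separate `torch.autograd.grad` call before the regularized backward pass:

```python
import torch

def training_step(model, data_loss, surrogate_loss, W, p_old, optimizer):
    # Hypothetical sketch: accumulate the SI path integral from the
    # gradient of the DATA loss only, not the regularized total loss.
    named_params = [(n, p) for n, p in model.named_parameters()
                    if p.requires_grad]

    # Gradients of the data loss alone, taken at the pre-update parameters.
    # retain_graph=True so the full loss can still be backpropagated below.
    data_grads = torch.autograd.grad(
        data_loss, [p for _, p in named_params], retain_graph=True
    )

    # Standard backward pass on the regularized loss for the actual update.
    total_loss = data_loss + surrogate_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    # Accumulate W with the data-loss gradient times the parameter change,
    # then record the new parameter values for the next step.
    with torch.no_grad():
        for (n, p), g in zip(named_params, data_grads):
            W[n].add_(-g * (p.detach() - p_old[n]))
            p_old[n] = p.detach().clone()
```

This costs one extra gradient computation over the data loss per step; I am not sure whether that overhead is why the implementation uses `p.grad` from the combined loss instead.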