WiSE-FT: Robust fine-tuning of zero-shot models

Large pretrained models like CLIP give strong zero-shot performance, and fine-tuning on a target dataset improves accuracy further. However, fine-tuning reduces robustness to distribution shifts and causes catastrophic forgetting. WiSE-FT 1 addresses this by ensembling the weights of the zero-shot and fine-tuned models. This could be useful in continual learning.
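The weight-space ensembling is just a per-parameter linear interpolation between the two checkpoints. A minimal numpy sketch (the `wise_ft` helper and the toy state dicts are illustrative, not the authors' code):

```python
import numpy as np

def wise_ft(zero_shot_weights, finetuned_weights, alpha=0.5):
    """Weight-space ensembling: interpolate every parameter between the
    zero-shot and fine-tuned models. alpha=0 recovers the zero-shot model,
    alpha=1 the fine-tuned one."""
    return {
        name: (1 - alpha) * zero_shot_weights[name] + alpha * finetuned_weights[name]
        for name in zero_shot_weights
    }

# toy "state dicts" with a single weight matrix each
w_zs = {"proj": np.zeros((2, 2))}
w_ft = {"proj": np.ones((2, 2))}
w_mix = wise_ft(w_zs, w_ft, alpha=0.5)  # every entry becomes 0.5
```

Sweeping `alpha` traces out the robustness/accuracy trade-off curve between the two endpoints.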


Steering CLIP’s vision transformer with sparse autoencoders

CVPR Workshop on Mechanistic Interpretability for Vision. Paper from Mila: Joseph, Suresh, Richards et al.

Context: Foundation models often compress information about multiple unrelated concepts into a single neuron (polysemanticity), which makes them hard to interpret. For transformer-based language models, sparse autoencoders (SAEs) trained on internal layer activations have helped with interpretability: since only a few features are allowed to activate at once, the SAE is forced to learn disentangled features, making its dictionary vectors interpretable.

Now, if we can identify which part of the ViT is responsible for a particular task, we can raise its learning rate when a similar task appears while keeping the other parts less learnable (remember SLCA: Slow Learner with Classifier Alignment).

In the paper by Joseph et al 2, the authors train SAEs on CLIP's vision transformer. They find that 10-15% of neurons and features are steerable, which could be exploited. Another observation: the \(L_0\) values of SAEs trained on spatial tokens are highest at the center of the image, and higher overall than those of the CLS token or of language-model SAEs. Thus, language and vision models may call for different sparsity levels.

For an input activation \(x \in R^{d_{model}}\), the SAE computes the decomposition \(x = \hat{x}(x) + \epsilon(x) = \sum_{j=1}^{d_{SAE}} f_j (x)\, n_j + b + \epsilon (x)\)

where \(n_j\in R^{d_{model}}\) are normalized dictionary vectors, the feature activations \(f_j(x)\in R\) serve as sparse coefficients, \(b\in R^{d_{model}}\) is a bias, and \(\epsilon(x)\) is the reconstruction error.

The papers "Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)" and "Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers" demonstrate that CLIP's internal representations align naturally with human-interpretable concepts.

Also, SAEs are preferable to concept bottleneck models (CBMs): CBMs rely on LLM-generated concept lists, while SAEs are unsupervised and learn from the model's own internal representations.

Models being trained:

CLIP ViT-B/32 is used. The vanilla SAE uses a ReLU activation with sparsity induced by L1 regularization, and is trained on ImageNet-1k.

How to Steer?

CLIP SAEs can be used to control feature activations: “manipulating SAE features to influence model outputs”.

To measure steering effects, a feature \(f\) is selected and its activation across all patches is replaced with a steering strength \(s\) during the forward pass.
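The intervention itself is a one-line edit on the SAE feature activations. A sketch with made-up shapes (`steer_feature` and the dimensions are illustrative; in the paper the steered activations are then decoded back into the ViT's residual stream):

```python
import numpy as np

def steer_feature(feature_acts, feature_idx, strength):
    """Replace one SAE feature's activation with a fixed strength s
    across all patch tokens, leaving every other feature untouched."""
    steered = feature_acts.copy()
    steered[:, feature_idx] = strength
    return steered

acts = np.random.default_rng(0).random((50, 256))  # (patches, SAE features), toy sizes
steered = steer_feature(acts, feature_idx=7, strength=5.0)
```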

I didn't quite get how these features are obtained while remaining unsupervised.

What the SAE is doing: the base model's 768-dimensional activation space is mapped to a dictionary of \(768\times 64\) features, with the initial encoder weights set to the transpose of the decoder weights.

I didn't look closely at the evaluation metric here, so I won't comment.


Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering

NeurIPS 2024 MINT Workshop paper 3

Authors: Postmus, Abreu from Groningen

Proposes using "conceptors" for steering LLM outputs (a form of activation engineering). Conceptors are mathematical constructs that represent sets of activation vectors as ellipsoidal regions, acting as soft projection matrices.

Conceptors are high-dimensional ellipsoids that describe the overall shape and spread of the activations' underlying pattern or state-space region, capturing correlations between activations. Each conceptor thus corresponds to a pattern of activations. Projecting an activation onto one particular pattern helps "steer" the output toward that pattern, and multiple patterns can be enforced via Boolean operations on conceptors.

Activation engineering: a steering method that directly modifies the model's activations at inference time, without changing model parameters or running any optimization. A steering vector representing the desired behavior is computed directly, or contrastively from positive and negative examples.

Here, cached activations are used to compute a conceptor, i.e. a steering matrix. Activations are then softly projected via a matrix-vector multiplication instead of having a steering vector added to them.

It is a bit like having a separate activation function (like ReLU) conditioned on the activations. The modified activation is \(h' = Ch\), where \(C\) is the conceptor matrix computed for a particular layer.

In addition-based steering, in-context prompts like good:bad, hot:cold, etc. are given. The model learns these mappings implicitly, and the hidden activation at some layer \(l\) is recorded for each such prompt, \(h_l(p_i)\), encoding a representation of the function the model is performing. The steering vector is then the mean \(\bar{h}_l^f = \frac{1}{|P_f|} \sum_{p_i \in P_f} h_l(p_i)\)

During inference, \(h' = h + \beta_{add}\, \bar{h}_l^f\)

Thus the model produces opposites (for this example).
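The additive baseline can be sketched in a few lines; the cached activations here are random stand-ins for the real \(h_l(p_i)\), and the names and `beta_add` value are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64
# cached layer-l activations for the in-context prompts p_i in P_f
# ("good:bad", "hot:cold", ...); random stand-ins here
H = rng.normal(size=(8, d_model))

steer_vec = H.mean(axis=0)          # \bar{h}_l^f: mean over the prompt set

def add_steer(h, beta_add=4.0):
    """Addition-based activation engineering: translate the activation
    along the steering direction."""
    return h + beta_add * steer_vec

h = rng.normal(size=d_model)
h_prime = add_steer(h)
```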

Conceptors instead act as a soft filter:

\(h'_l = \beta_c C_l^f h_l\) where \(C_l^f\) has eigenvalues \(\in [0,1]\).

Before, vectors were translated; now they are projected onto an ellipsoid that represents where task activations live.

From the paper: “the conceptor “softly projects” the activation vector \(h_l\) toward the pattern represented by \(C_l^f\) by scaling its components according to the patterns’ principal directions.”
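A sketch of how a conceptor can be built and applied. The formula \(C = R(R + \alpha^{-2}I)^{-1}\), with \(R\) the activation correlation matrix and \(\alpha\) the "aperture", comes from the conceptor literature; the dimensions and names here are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32
H = rng.normal(size=(200, d))                      # cached activations, one row each
R = H.T @ H / len(H)                               # activation correlation matrix
alpha = 10.0                                       # aperture hyperparameter
C = R @ np.linalg.inv(R + alpha**-2 * np.eye(d))   # conceptor: eigenvalues in [0, 1)

def conceptor_steer(h, beta_c=1.0):
    """Soft projection: rescale h along the pattern's principal directions
    (each scaled by an eigenvalue in [0, 1)) instead of translating it."""
    return beta_c * (C @ h)

h_prime = conceptor_steer(rng.normal(size=d))
```

Because the eigenvalues lie strictly between 0 and 1, components of \(h\) aligned with strong directions of the activation pattern pass through nearly unchanged while off-pattern components are shrunk, which is exactly the "soft projection" described in the quote above.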

Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws

ICML 2025 Spotlight 4

Authors: Wei, Lin, Yang et al

The idea: use a trained model as a reference to guide and enhance the training of a target model via strategic data selection or weighting - kind of like WiSE-FT!

This is named model steering.

Semantic Drift Compensation for Incremental Learning: SDC-IL

CVPR 2020 paper

Authors: Yu, van de Weijer et al

Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning

ICML 2025 5

Authors: Wu, Zhang, Wang et al.

CIL tries to learn new classes sequentially while retaining previous knowledge. The gap in feature distribution between novel and existing tasks is driven by differences in the mean and covariance moments, so both mean-drift compensation and covariance calibration are required. This is done by tracking each class's mean and estimating the task shift as a weighted average of embedding changes, weighted by proximity to the previous class mean.

What does that mean?

Further, a Mahalanobis-distance constraint is used for covariance calibration: it aligns class-specific embeddings between the old and current networks to mitigate covariance shift.

Dataset: a single dataset split into \(T\) tasks, the \(t\)-th containing \(n^t\) inputs and corresponding labels.

Architecture: a frozen CLIP ViT is used as the backbone with learnable task-specific LoRA modules. Each LoRA module updates a weight as \(W = W_0 + BA\); at task \(t\), the accumulated update is \(W = W_0 + \sum_{i=1}^t B_i A_i\). Output class tokens are forwarded through task-specific classifiers to give class scores.
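The accumulated LoRA update can be sketched as follows (toy dimensions; `effective_weight` and the random pairs are illustrative, standing in for the learned per-task adapters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                               # toy dims; r is the LoRA rank
W0 = rng.normal(size=(d, d))               # frozen pretrained weight, never updated

# one low-rank pair (B_i, A_i) is learned per task
lora_pairs = [(rng.normal(size=(d, r)) * 0.01,
               rng.normal(size=(r, d)) * 0.01) for _ in range(3)]

def effective_weight(t):
    """W = W_0 + sum_{i=1}^{t} B_i A_i after t tasks."""
    return W0 + sum(B @ A for B, A in lora_pairs[:t])

W_after_2 = effective_weight(2)
```

Since each \(B_i A_i\) has rank at most \(r\), the per-task storage cost is \(2dr\) parameters instead of \(d^2\), and the frozen backbone is shared across all tasks.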

Before training \(f_n\), the covariance of each class is precomputed using \(f_{n-1}\), where \(f_k\) denotes the network trained on the \(k\)-th task. These covariance matrices are used to align the distribution of representations generated by the current network with that of the old network.

After we obtain \(f_n\), class means are updated using mean-shift compensation and the classifier heads are retrained using the calibrated class statistics.

The class-mean shift is estimated as a weighted average of the drift of each sample of that class. The covariance shift is also minimized using \(L_{cov}\).

For classifier alignment, samples are drawn for each class \(c\) from a Gaussian parameterized by its stored feature mean and covariance.

In other words:

Summary of the process: take the current samples from \(c \in C^t\) and obtain features with \(f_{t-1}\) to get the mean and covariance \(\mu_c^{t-1}, \Sigma_c^{t-1}\). Next, train on these samples to obtain \(f_t\), which finishes training the feature extractor. We then need to retrain the classifier to ensure catastrophic forgetting doesn't occur.

Post training, for \(c \in \bigcup_{i=1}^{t-1}C^i\), draw \(s\) samples per class from all previous tasks, then estimate and compensate the class-mean shift using \(\mu_c^t = \mu_c^{t-1} + \Delta\mu_c^{t-1\rightarrow t}\)

Sample from the Gaussian with \(\mu_c, \Sigma_c\) to retrain the classifiers up to task \(t\), and store the new mean and covariance.
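The mean-shift compensation and the Gaussian pseudo-feature sampling can be sketched together. The proximity-weighted drift estimate below follows the SDC idea the paper builds on; the Gaussian weighting kernel, `sigma`, and the synthetic drift are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
mu_old = rng.normal(size=d)                  # stored class mean under f_{t-1}

# embeddings of current-task samples under the old and new extractors
# (toy stand-ins for f_{t-1}(x_i) and f_t(x_i))
E_old = mu_old + rng.normal(size=(100, d))
E_new = E_old + 0.3                          # pretend every embedding drifted by 0.3

def compensate_mean(mu, e_old, e_new, sigma=1.0):
    """Drift estimate: average the per-sample drift, weighted by how close
    each sample's old embedding is to the stored class mean."""
    w = np.exp(-np.sum((e_old - mu) ** 2, axis=1) / (2 * sigma**2))
    drift = (w[:, None] * (e_new - e_old)).sum(axis=0) / w.sum()
    return mu + drift                        # mu_c^t = mu_c^{t-1} + delta_mu

mu_new = compensate_mean(mu_old, E_old, E_new)

# classifier retraining then uses pseudo-features drawn from the stored Gaussian
pseudo = rng.multivariate_normal(mu_new, np.eye(d), size=32)
```

The key point is that old-class means are updated without replaying any old data: only the drift of current-task samples, which we do have under both extractors, is used as a proxy.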

  1. Wortsman, M., Ilharco, G., Kim, J.W., Li, M., Kornblith, S., Roelofs, R., Lopes, R.G., Hajishirzi, H., Farhadi, A., Namkoong, H. and Schmidt, L., 2022. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7959-7971). 

  2. Joseph, S., Suresh, P., Goldfarb, E., Hufe, L., Gandelsman, Y., Graham, R., Bzdok, D., Samek, W. and Richards, B.A., 2025. Steering CLIP’s vision transformer with sparse autoencoders. arXiv preprint arXiv:2504.08729. 

  3. Postmus, J. and Abreu, S., 2024. Steering large language models using conceptors: Improving addition-based activation engineering. arXiv preprint arXiv:2410.16314. 

  4. Wei, X., Lin, M., Ye, F., Song, F., Cao, L., Thai, M.T. and Yang, T., 2025. Model steering: Learning with a reference model improves generalization bounds and scaling laws. arXiv preprint arXiv:2505.06699. 

  5. Wu, F., Cheng, L., Tang, S., Zhu, X., Fang, C., Zhang, D. and Wang, M., 2025. Navigating semantic drift in task-agnostic class-incremental learning. arXiv preprint arXiv:2502.07560.