Not much more to say about it. Building transformers for computer vision tasks is an emerging topic, and this is a more or less technical paper showing the progress of transformers toward replacing ResNet-like architectures wherever they are still used, this time (not the first attempt, though) in self-supervision.
The loss itself is a superposition of the MoCo v2 and BYOL approaches: it keeps the queue of negative examples and the contrastive loss from MoCo v2, combined with the model-level asymmetry from BYOL, where one branch is momentum-updated (a rough sketch of such a combined objective is below).
Results are on par with SoTA under the linear evaluation scheme, but the method (1) needs fewer complicated tricks and (2) is applicable to transfer learning for detection and segmentation.
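For concreteness, here is a minimal sketch of how such a hybrid objective can look: an InfoNCE loss over a queue of negatives on the MoCo v2 side, with the key branch being a momentum-updated copy of the online branch on the BYOL side. Names, shapes and hyperparameters are my own assumptions, not the paper's code; the online embeddings are assumed to already include the BYOL-style prediction head.

```python
import torch
import torch.nn.functional as F

def hybrid_contrastive_loss(q, k, queue, temperature=0.2):
    """InfoNCE over a memory queue, MoCo v2 style.

    q: (B, D) embeddings from the online (gradient) branch.
    k: (B, D) embeddings from the momentum-updated branch (no gradient).
    queue: (K, D) stored negative keys from previous batches.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1).detach()
    queue = F.normalize(queue, dim=1).detach()

    l_pos = torch.einsum("bd,bd->b", q, k).unsqueeze(1)   # (B, 1)
    l_neg = torch.einsum("bd,kd->bk", q, queue)           # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(online_net, key_net, m=0.99):
    """BYOL/MoCo-style EMA update of the key branch."""
    for p_o, p_k in zip(online_net.parameters(), key_net.parameters()):
        p_k.data.mul_(m).add_(p_o.data, alpha=1.0 - m)
```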
Forwarded from Just links
Self-Supervised Learning with Swin Transformers https://arxiv.org/abs/2105.04553
Contrastive Conditional Transport for Representation Learning.
This paper tries to make a step similar to the GAN -> WGAN step, but for representation learning. The first idea is that, instead of training with a more-or-less classical SimCLR loss, one can simply minimize (C+) - (C-), where (C+) is the mean distance between the anchor and the positive samples (different views of the anchor) and (C-) is the mean distance between the anchor and the negative samples.
Since this alone does not work out, the authors add a more elaborate weighting procedure: positive samples are weighted by their distance to the anchor (larger distance, larger weight), and vice versa for the negative samples (smaller distance, larger weight).
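A rough sketch of how such a weighted (C+) - (C-) objective could be implemented. The softmax-over-distances weighting, the temperature and the function name are my own choices; the paper's exact weighting may differ.

```python
import torch
import torch.nn.functional as F

def cct_style_loss(anchor, positives, negatives, tau=0.5):
    """Weighted (C+) - (C-) sketch.

    anchor:    (D,)   embedding of the anchor
    positives: (P, D) embeddings of positive views
    negatives: (N, D) embeddings of negatives
    Distances are squared Euclidean on L2-normalized embeddings.
    """
    a = F.normalize(anchor, dim=0)
    pos = F.normalize(positives, dim=1)
    neg = F.normalize(negatives, dim=1)

    d_pos = (pos - a).pow(2).sum(dim=1)   # (P,)
    d_neg = (neg - a).pow(2).sum(dim=1)   # (N,)

    # Far positives get larger weights, near negatives get larger weights.
    w_pos = F.softmax(d_pos / tau, dim=0)
    w_neg = F.softmax(-d_neg / tau, dim=0)

    c_plus = (w_pos * d_pos).sum()
    c_minus = (w_neg * d_neg).sum()
    return c_plus - c_minus
```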
Although the description of the idea is somewhat chaotic, the reported results look good. One more positive side effect: this loss easily works with multiple positive samples drawn in one minibatch.
Source here.
It was quite a vacation, huh. Now back to the matter.
Object-aware Contrastive Learning for Debiased Scene Representation, from current NeurIPS.
The authors propose to alter the Class Activation Map method a bit to make it ready for contrastive learning. They name the thing ContraCAM. It's just the usual CAM with:
1. loss replaced with contrastive loss
2. negative gradients dropped
3. iterative accumulation of the masks.
And this alone already yields unsupervised object localization with SoTA IoU (a rough sketch of the computation is below).
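A very loose sketch of one Grad-CAM-style step with the three modifications above. The function name, shapes and the way the contrastive score enters are my assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def contracam_step(feats, score):
    """One Grad-CAM-like step with negative gradients dropped.

    feats: (C, H, W) feature maps of the anchor image; must be an
           intermediate activation kept in the autograd graph.
    score: scalar contrastive score of the anchor vs. the other images
           in the batch (exact definition is in the paper).
    """
    grads, = torch.autograd.grad(score, feats, retain_graph=True)
    grads = grads.clamp(min=0)                       # drop negative gradients
    weights = grads.mean(dim=(1, 2))                 # GAP over spatial dims
    cam = F.relu((weights[:, None, None] * feats).sum(dim=0))
    return cam / (cam.max() + 1e-8)

# The paper then iterates: mask out the regions found so far, recompute
# the score and the CAM, and accumulate the masks (e.g. elementwise max).
```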
Based on this localization, the authors proposed two augmentations to reduce negative biases in contrastive learning:
1. guided random crop, so that a single crop does not contain multiple objects; this reduces over-reliance on co-occurring objects.
2. background replacement (using a soft mask from the localization); this helps to avoid over-reliance on the typical background of a sample (a minimal blending sketch is below).
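The background-replacement part is essentially soft-mask blending. A minimal sketch, assuming the ContraCAM mask has already been resized to the image resolution and that the background comes from some other image; naming is mine.

```python
import torch

def background_mix(img, other_img, soft_mask):
    """Keep the object of `img`, take the background from `other_img`.

    img, other_img: (3, H, W) tensors in [0, 1]
    soft_mask:      (1, H, W) object mask (e.g. from ContraCAM), values in [0, 1]
    """
    return soft_mask * img + (1.0 - soft_mask) * other_img
```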
Since the localization is obtained without additional supervision, this remains a self-supervised approach and can therefore be directly compared with other self-supervised methods.
The authors evaluate these augmentations with both the self-supervised localization and ground-truth masks, and find that both variants give a notable boost to MoCo v2 and BYOL results.
More and with images here.
Source here.
PiCIE: Unsupervised Semantic Segmentation using Invariance and Equivariance in Clustering
Paper from CVPR'21
One of the more or less classical approaches to deep unsupervised segmentation is: cluster your embeddings and use the cluster assignments as pseudo labels; add tricks, repeat multiple times. In this paper the authors go one step further and unite it with self-supervision: they design a loss function that enforces invariance of these clustered representations to color augmentations and equivariance to spatial augmentations.
The loss is computed as follows:
1. Get two representations of the same image, each with a different color augmentation but the same spatial augmentation: in the first case the image is augmented before going through the network, in the second the output of the network is augmented. Ideally both outputs should be identical, which expresses invariance to the color augmentations and equivariance to the spatial ones. Call these representations z1 and z2.
2. For each of the two outputs, run K-means clustering over the embeddings. Call the obtained centroids µ1 and µ2.
3. The next step finally mixes the two spaces. Let L(z, µ) be a loss that, for each vector in z, pulls it closer to the nearest vector of µ (prototype learning waves). Then:
3.1. We enforce clustering within each representation with L(z1, µ1) + L(z2, µ2).
3.2. We enforce that this clustering holds across the representations with L(z1, µ2) + L(z2, µ1) (a small code sketch of these terms follows below).
And that's it. Training with this approach achieves SoTA on unsupervised segmentation and produces qualitatively good object masks. The most improved part is "thing" (foreground object) segmentation, which is systematically problematic for unsupervised learning because of the huge imbalance in class sizes.
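A minimal sketch of how I read the loss terms above. The concrete form of L(z, µ) here (cross-entropy over negative distances to the centroids, with the nearest centroid as the pseudo label) is my assumption; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def proto_loss(z, mu, temperature=1.0):
    """L(z, mu): pull each embedding towards its nearest centroid.

    z:  (N, D) per-pixel embeddings (flattened over the batch)
    mu: (K, D) K-means centroids
    """
    d = torch.cdist(z, mu)            # (N, K) Euclidean distances
    labels = d.argmin(dim=1)          # nearest-centroid assignment
    logits = -d / temperature
    return F.cross_entropy(logits, labels)

def picie_style_loss(z1, z2, mu1, mu2):
    # within-view clustering + cross-view consistency
    within = proto_loss(z1, mu1) + proto_loss(z2, mu2)
    cross = proto_loss(z1, mu2) + proto_loss(z2, mu1)
    return within + cross
```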
More here.
Source here.
Lilian Weng's overview of contrastive learning is a good way to get a quick introduction to the topic.
MERLIN: another example of how important it is to know your data.
Imaging with SAR (Synthetic Aperture Radar) introduces a very specific type of noise called speckle. The problem with training a denoising model for this case is obtaining very specific data: images that carry the same information but have decorrelated noise, which can be very costly in the areas where SAR is applied, e.g. satellite imagery.
The authors propose to exploit the structure of the image received by SAR. These images come as a pair of values per pixel, amplitude and phase; typically the phase is considered unimportant for imaging, as the amplitude is what is of interest.
The authors demonstrate that, using a statistical model of the speckle noise, they can extract from this data two noisy images that carry the same information but have different, i.i.d. noise. This way, they can apply the Noise2Noise framework and train a network to predict the true, noise-free amplitude.
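A minimal sketch of the resulting Noise2Noise-style training step, assuming the two decorrelated noisy views have already been extracted from the complex SAR data as the paper describes. The function name and the plain MSE loss are my simplifications; the paper's actual loss follows the speckle statistics.

```python
import torch
import torch.nn.functional as F

def noise2noise_step(model, a1, a2, optimizer):
    """One Noise2Noise-style training step.

    a1, a2: (B, 1, H, W) two observations of the same scene whose speckle
            realizations are (approximately) independent.
    The network predicts a clean amplitude from one noisy view and is
    supervised by the other noisy view.
    """
    pred = model(a1)
    loss = F.mse_loss(pred, a2)   # simplification; see the paper for the exact loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```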
This allows training a neural network for each detector specifically, without the need to obtain expensive training data or to construct artificial data.
Source: here
How Useful is Self-Supervised Pretraining for Visual Tasks?
A relatively old paper (CVPR 2020) by the standards of our fast-moving field. Nevertheless, it offers a couple of practical takeaways.
The authors create a synthetic dataset with several degrees of freedom controlling difficulty, ranging from almost monochrome objects to randomized textures and object positioning.
The goal is to compare how well different self-supervised approaches help when tuning for different downstream tasks, from classification to depth estimation.
Two practical takeaways are:
1. The utility of a self-supervised method depends wildly on the task, the amount of labeled data and even the data complexity.
2. A linear evaluation score, so popular in papers, has almost no correlation with actual fine-tuning results.
The authors find that there is no improvement from self-supervised pre-training when a lot of labeled data is available (which has become fairly well known since then). Based on this, they hypothesize that the benefit of SSL pre-training is a kind of regularization rather than optimization: SSL pre-training helps to find a wider optimum, not a better one. Though, to really claim this, some kind of loss-landscape investigation would be more convincing.
Source: here
Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration
from NeurIPS2021.
It has already been noted that the quality of contrastive learning may suffer from overly intense augmentation. In this paper, the authors go one step further and try to understand the source of this.
The main hypothesis: if augmentations are too intense, the assumption that the image's information is invariant to augmentation simply breaks. That is, we augment images so hard that it is no longer meaningful to ask a model to predict close embeddings for such different inputs.
To mitigate this, the authors propose to model the distribution of view embeddings (positive samples, i.e. different augmentations of the same image) as a normal distribution with a shared covariance matrix (experiments show that the shared covariance matrix is, somehow, very effective). Each component of the loss is then weighted by a normalized distance between the two views that this component pulls together, where the distance is the Mahalanobis distance defined by the fitted distribution (a rough sketch is below).
To put it simpler: if two positive samples are too far away from each other, maybe they are not so positive after all?
This keeps contrastive methods from over-relying on the invariance-to-augmentation assumption and makes them more aware of what happens in the embedding space itself.
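A rough sketch of how such Mahalanobis-based weighting could look. The function name, the normalization, and the sign of the weighting (down-weighting pairs whose views are unusually far apart, following the outlier intuition) are my own assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def view_pair_mahalanobis(z1, z2, eps=1e-5):
    """Squared Mahalanobis distance between the two views of each image,
    under a Gaussian fit with a single shared covariance matrix.

    z1, z2: (B, D) embeddings of the two augmented views.
    """
    diff = z1 - z2                                    # (B, D)
    centered = diff - diff.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (diff.size(0) - 1)  # shared covariance
    eye = torch.eye(cov.size(0), device=cov.device)
    d2 = torch.einsum("bi,ij,bj->b", diff, torch.linalg.inv(cov + eps * eye), diff)
    return d2

# One possible use (my reading of the outlier intuition): down-weight loss
# terms whose views are unusually far apart under the fitted distribution.
# w = F.softmax(-view_pair_mahalanobis(z1, z2), dim=0) * z1.size(0)
# loss = (w * per_pair_contrastive_loss).mean()
```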
The authors demonstrate consistent improvements across different contrastive losses.
source: here
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
from the ICML2020.
It was previously noted that if one swaps the contrastive loss for a tighter bound on MI, the downstream quality decreases. The authors therefore propose to move from the InfoMax intuition to two rather simple concepts: alignment and uniformity. The former enforces that positive pairs stay as close as possible, and the latter enforces that all samples stay as evenly distributed on the hypersphere as possible.
These components are empirically important for downstream performance. Furthermore, their direct optimization may outperform the classical contrastive loss training.
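Roughly, the two losses can be written like this (x and y are L2-normalized embeddings of positive pairs, alpha and t are hyperparameters); this follows my recollection of the paper's formulation, so treat it as a sketch rather than the reference code.

```python
import torch

def align_loss(x, y, alpha=2):
    # x, y: (B, D) L2-normalized embeddings of positive pairs
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # log of the average pairwise Gaussian potential over the batch
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```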
With images and a bit longer: here
Source: here
Well, it has been more than three years since the last post here, and a lot has changed in that time. I'm done with my PhD at Heidelberg University and have moved on to JetBrains to lead a team working on AI agents. With all this on my hands, I will have even less time for writing the kind of reviews I'd like to read. But on the other hand, I'd still like to share the papers I read.
So, instead, I will post links to the papers that I read. You can view this experiment as copycatting @j_links, but with a bias towards LLMs and probably agents.