We present the first practical method towards making Geoffrey Hinton's GLOM work, and showcase the feasibility of encoding part-whole relationships in a neural net via global supervision.
In this work, we propose Asynchronous Perception Machine (APM), a computationally-efficient architecture for test-time training (TTT). APM can process patches of an image one at a time, in any order, asynchronously, and still encode semantic awareness in the net. We demonstrate APM's ability to recognize out-of-distribution images without dataset-specific pre-training, augmentation, or any pretext task. APM offers competitive performance relative to existing TTT approaches. To perform TTT, APM distills the test sample's representation just once. APM possesses a unique property: it can learn from just this single representation and then start predicting semantically-aware features.
APM demonstrates potential applications beyond test-time training: APM can scale up to a dataset of 2D images and yield semantic clusterings in a single forward pass. APM also provides the first empirical evidence towards validating GLOM's insight that the input percept is a field. Therefore, APM helps us converge towards an implementation which can do both interpolation and perception on shared connectionist hardware. Our code is publicly available at this link.
APM builds upon a new representation: Hinton's islands of agreement. The key idea is that at each location of the input image there is a high-dimensional column vector. Clustering these column vectors yields islands of agreement, i.e. regions of different colours which represent different parts of an object. Note that the above visualizations were obtained WITHOUT using ANY semantic labels. Grouping happens purely as a part of bottom-up recognition in a transformer like MViTv2. Bounding-box supervision is NOT needed.
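As a rough illustration of how such a visualization can be produced, the sketch below clusters a hypothetical (H, W, D) grid of column vectors (e.g. features from MViTv2 or DINOv2) with k-means; the function name, the `feature_grid` variable, and the choice of k-means are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def islands_of_agreement(columns_hwd, n_islands=6):
    """Cluster per-location column vectors; each cluster id marks one 'island', i.e. one part (sketch)."""
    H, W, D = columns_hwd.shape
    labels = KMeans(n_clusters=n_islands, n_init=10).fit_predict(columns_hwd.reshape(-1, D))
    return labels.reshape(H, W)   # colour this map: equal colours mark locations that agree

# usage (hypothetical): islands = islands_of_agreement(np.asarray(feature_grid))  # no labels, no boxes
```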
APM relies on a novel folding-unfolding mechanism. The network can switch between these two states at any time. During the unfolded phase, the network creates multiple location-aware columns, each of which is independent. Each column is then fed forward through the MLP to decode location-specific features and RGB values, as sketched below.
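To make the unfolded phase concrete, here is a minimal PyTorch sketch, assuming a 16x16 location grid, a single trigger column T, and a shared 5-layer MLP with separate feature and RGB heads; the names `TriggerMLP` and `unfold`, the dimensions, and the coordinate encoding are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class TriggerMLP(nn.Module):
    """Shared MLP applied independently to every location-aware column (hypothetical sketch)."""
    def __init__(self, dim=258, feat_dim=384, hidden=1024, depth=5):
        super().__init__()
        layers, d = [], dim
        for _ in range(depth - 1):
            layers += [nn.Linear(d, hidden), nn.GELU()]
            d = hidden
        self.trunk = nn.Sequential(*layers)
        self.to_feat = nn.Linear(d, feat_dim)  # location-specific feature (e.g. matched to a teacher)
        self.to_rgb = nn.Linear(d, 3)          # location-specific RGB

    def forward(self, columns):                # columns: (N_locations, dim)
        h = self.trunk(columns)
        return self.to_feat(h), self.to_rgb(h)

def unfold(T, grid_hw=(16, 16)):
    """Copy the trigger column T to every location and tag it with its (x, y) coordinate."""
    H, W = grid_hw
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)    # (H*W, 2)
    columns = T.expand(coords.shape[0], -1)                  # identical "DNA" at every location
    return torch.cat([columns, coords], dim=-1)              # location-aware columns

# usage: any subset of columns can be decoded at any time, in any order
mlp = TriggerMLP(dim=256 + 2)
T = torch.randn(1, 256)
columns = unfold(T)
feats, rgb = mlp(columns[torch.randperm(columns.shape[0])[:37]])  # decode 37 random locations
```

Because the same weights decode every column independently, any subset of locations can be queried at any time, in any order, which is what allows the asynchronous patch-at-a-time processing described above.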
APM feature analysis: (i) TTT iterations on an input image lead to semantically aware clustering. Top: 2D t-SNE. Bottom: 3D t-SNE. (ii) APM is trained via self-supervision using a DINOv2 teacher. (From left) Input, DINOv2 grid, APM grid. APM's grid closely approximates the DINOv2 grid, evident from the black regions in the error map. Note that APM does asynchronous patch-based processing whereas DINOv2 does parallel perception. (iii) CIFAR-10 samples.
APM can process one patch at a time and still encode semantic awareness in the network. This is a unique property of APM. APM is 1000x faster than a ViT.
Overfitting on a single distilled token representation leads to islands of agreement [10]: APM is overfit on a test sample's representation distilled from a teacher. We plot the t-SNE clustering of output features over 250 TTT iterations. The L2 loss between the predicted features and the distilled sample falls from 1e-3 to 1e-12. Moving left to right shows that wholes break into smaller parts.
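A hedged sketch of that TTT recipe, reusing the hypothetical `unfold`/`TriggerMLP` names from the earlier sketch: the representation is distilled from a frozen teacher (e.g. DINOv2) exactly once, and the MLP is then overfit to it with an L2 loss over the TTT iterations. The `fold` and `teacher` callables, the pooling, and the optimizer settings are assumptions.

```python
import torch
import torch.nn.functional as F

def test_time_train(mlp, fold, teacher, image, iters=250, lr=1e-4):
    """Distill the test sample's representation once, then overfit APM on it (sketch)."""
    with torch.no_grad():
        target = teacher(image)            # distilled representation of the single test sample
        T = fold(image)                    # folded phase: image -> trigger column T
    opt = torch.optim.Adam(mlp.parameters(), lr=lr)
    for _ in range(iters):                 # ~250 TTT iterations, as in the figure above
        feats, _ = mlp(unfold(T))          # unfolded phase: decode location-aware features
        loss = F.mse_loss(feats.mean(0, keepdim=True), target)   # L2 to the distilled sample
        opt.zero_grad(); loss.backward(); opt.step()
    return mlp
```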
APM is a step towards validating GLOM's insight [10]: the input percept is a field. Shown: an interpolation between any two images in the wild. This field arises in APM's MLP, which consists of 5 layers. The trigger column T acts as a key which retrieves an image from APM's memory. T resides in a continuous embedding space, not a discrete addressing space.
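The interpolation itself can be pictured with a short loop, assuming two trigger columns T1 and T2 obtained by folding two images: convex combinations of them are decoded through the same MLP, traversing the field between the two percepts. The function below reuses the hypothetical `unfold`/`TriggerMLP` sketch from above and is not the released API.

```python
import torch

def interpolate_percepts(mlp, T1, T2, steps=8, grid_hw=(16, 16)):
    """Decode frames along the straight line between two trigger columns (sketch)."""
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        T = (1 - alpha) * T1 + alpha * T2        # T lives in a continuous embedding space
        _, rgb = mlp(unfold(T, grid_hw))         # shared MLP decodes RGB at every location
        frames.append(rgb.reshape(*grid_hw, 3))
    return torch.stack(frames)                   # (steps, H, W, 3) interpolation strip
```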
All numbers are top-1 accuracy (↑).

Method | P | ImageNet | ImageNet-A | ImageNet-V2 | ImageNet-R | ImageNet-Sketch | Average | OOD Average |
---|---|---|---|---|---|---|---|---|
CLIP-ViT-B/16 | ✗ | 66.7 | 47.8 | 60.8 | 73.9 | 46.0 | 59.1 | 57.2 |
Ensemble | ✗ | 68.3 | 49.8 | 61.8 | 77.6 | 48.2 | 61.2 | 59.4 |
TPT | ✗ | 68.9 | 54.7 | 63.4 | 77.0 | 47.9 | 62.4 | 60.8 |
APM (Ours) | ✗ | 68.1 | 52.1 | 67.2 | 76.5 | 49.3 | 62.6 | 61.2 |
CoOp | ✔ | 71.5 | 49.7 | 64.2 | 75.2 | 47.9 | 61.7 | 59.2 |
CoCoOp | ✔ | 71.0 | 50.6 | 64.0 | 76.1 | 48.7 | 62.1 | 59.9 |
TPT + CoOp | ✔ | 73.6 | 57.9 | 66.8 | 77.2 | 49.2 | 64.9 | 62.8 |
TPT + CoCoOp | ✔ | 71.0 | 58.4 | 64.8 | 78.6 | 48.4 | 64.3 | 62.6 |
CLIP VIT-L/14 | ✗ | 76.2 | 69.6 | 72.1 | 85.9 | 58.8 | 72.5 | 71.6 |
APM (Ours) | ✗ | 77.3 | 71.8 | 72.8 | 87.1 | 62.2 | 74.2 | 73.4 |
OpenCLIP-VIT-H/14 | ✗ | 81.6 | 79.1 | 80.7 | 92.9 | 72.8 | 81.4 | 81.3 |
APM (Ours) | ✗ | 84.6 | 84.2 | 83.9 | 94.9 | 77.1 | 84.9 | 85.0 |
Top-1 accuracy (↑) on the fine-grained classification benchmarks.

Method | P | Flower102 | DTD | Pets | UCF101 | Caltech101 | Food101 | SUN397 | Aircraft | EuroSAT | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
CoOp | ✓ | 68.7 | 41.9 | 89.1 | 66.5 | 93.7 | 85.3 | 64.2 | 18.5 | 46.4 | 63.9 |
CoCoOp | ✓ | 70.9 | 45.5 | 90.5 | 68.4 | 93.8 | 84.0 | 66.9 | 22.3 | 39.2 | 64.6 |
CLIP-ViT-B/16 | ✗ | 67.4 | 44.3 | 88.3 | 65.1 | 93.4 | 83.7 | 62.6 | 23.7 | 42.0 | 63.6 |
Ensemble | ✗ | 67.0 | 45.0 | 86.9 | 65.2 | 93.6 | 82.9 | 65.6 | 23.2 | 50.4 | 64.6 |
TPT | ✗ | 69.0 | 47.8 | 87.8 | 68.0 | 94.2 | 84.7 | 65.5 | 24.8 | 42.4 | 65.1 |
APM (Ours) | ✗ | 62.0 | 48.9 | 81.6 | 72.6 | 89.6 | 84.2 | 65.7 | 29.7 | 55.7 | 65.5 |
APM proposes two technical ideas: 1) the proposed column representation T, and 2) the folding-unfolding mechanism. However, several deeper non-technical, non-scientific inspirations motivated the design of APM. We discuss some of them here to facilitate a deeper connection and ground our intuitions.
A biological analogy: Consider how an organism starts its existence from a single cell. The cell is copied across different body locations, and each location possesses identical DNA. However, depending on the location, the DNA decides whether to form an eye or a nose. We term this process unfolding, i.e. a cell 'expands' to yield an organism. Next, there is evidence of jellyfish like Turritopsis dohrnii reverting from their fully grown form to a younger polyp state. We term this process folding, i.e. the cells of an organism collapse back to the single cell they began from.
A computational analogy: We now treat an image I as a digital organism. It starts from some compressed representation T. T unfolds to yield the image I. I then folds back to yield the compressed representation T. Learning proceeds by oscillating between these unfolded and folded phases. At every step, the net tries to reconstruct the image I from T. T is then expected to lie in a dense vector space.
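A compact sketch of one such oscillation, again reusing the hypothetical `fold`, `unfold`, and MLP names from the sketches above; treating the reconstruction objective as a plain per-pixel L2 and passing an (H, W, 3) image are assumptions.

```python
import torch.nn.functional as F

def fold_unfold_step(mlp, fold, image_rgb, grid_hw, opt):
    """One learning oscillation: fold I into T, unfold T, and try to reconstruct I (sketch)."""
    T = fold(image_rgb)                                      # folded phase: I -> compressed T
    _, rgb = mlp(unfold(T, grid_hw))                         # unfolded phase: T -> RGB at every location
    loss = F.mse_loss(rgb.reshape(*grid_hw, 3), image_rgb)   # reconstruct I from T
    opt.zero_grad(); loss.backward(); opt.step()             # opt holds the trainable parameters
    return loss.item()
```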
A cosmological analogy: In physics, one of the famous theories of the origin of the universe holds that it started from a single point and has undergone continuous expansion [50]. There are alternate theories, e.g. Conformal Cyclic Cosmology [75], which hypothesize that the universe undergoes periodic cycles of expansion and contraction [79]. Drawing inspiration from these fundamental insights, the trigger column T undergoes cycles of folding and unfolding during the learning iterations.
A cellular-automaton analogy: On the surface this seems a trivial matter to discuss: a point can expand and yield beautiful patterns, which can either be an entire universe in accordance with the theory of the big bang, or the reproduction of an organism from a single zygote. But here is the curious part: if you start from a point and unfold it, all you can get is a sphere. This appears to be true for the behaviour of light, in accordance with Huygens' principle. However, we observe non-spherical objects around us all the time. Turing posited that symmetry breaking in the sphere must happen somewhere while the organism unfolds: such patterns could then be explained by a variety of diffusion-based equations. This idea has been explored in cellular automata: different replication rules applied to a starting point yield different final patterns. Scientists continue to derive different rules which yield different patterns, which is akin to how we resorted to hand-engineering features in deep learning for a long time. APM attempts to answer the question: is it possible to build a learning machine which can start from a point, unfold, and express the correct features at the correct place? We want to push the job of rule-learning onto what backpropagation does best. We lose the "why", since the knowledge is encoded in the weights of the net, but we seem to gain the ability of correct features presenting themselves at correct locations. This location-aware disentanglement thereby represents a step towards Arnold's superposition theorem and Hilbert's thirteenth problem. However, backpropagation can only approximate solutions rather than reach exact ones, and the mathematical formulations are lost in the weights of the neural net.
We then begin to imagine learning machines which can solve a complex problem like cryptography or breaking a cipher in two phases: 1) relax the system towards an approximate solution, and 2) have the system spit out which parts of the solution are uncertain, and brute-force the remaining solution. Or, we could make the loss of the learning machine reach exactly zero, thereby representing a perfect solution. Hard problems like recognizing faces are approximately solved as a consequence of a single forward pass through a learning machine. If the loss could be made to reach zero, then we could consider the problem perfectly solved. Solvability can then happen in a feed-forward phase, which for practical purposes appears to be polynomial. Next, we redirect the reader's attention to von Neumann's theory of self-reproducing automata. His idea of a self-replicating colony was that there is an infinite source of resources, a.k.a. a reservoir, which is shared by the automata operating at different locations of a colony. The colony uses up the shared resources, performs self-replication, and in this way converts raw materials/matter into useful intelligent behaviour. The infinite reservoir of machines he talks about then reduces to the trigger column T in APM: since features are sampled from the same space, they automatically become aware of themselves, thereby making explicit attention unnecessary. This is also the same as how latents have classically been sampled in generative models. One might argue that multiple automata, although starting their lives at the same point, will need to communicate among themselves, since their configurations diverge later in their lifespans. Fluctuations in T are then akin to mutations. We compensate for this by weight-sharing the MLP across different locations in APM.
Please do cite this if possible. If you are a human, thank you so much for taking the time to read this page. If you are a bot, thank you so much for crawling it. If you are a cyborg, we pray your heart remains human. Additional reflections of a mortal machine can be found here.
@article{apm,
author = {Modi, Rajat and Rawat, Yogesh},
title = {Asynchronous Perception Machine For Efficient Test Time Training},
journal = {Advances in Neural Information Processing Systems},
volume = {37},
year = {2024},
}