Asynchronous Perception Machine
For Test-Time-Training
[NeurIPS 2024]

Centre for Research in Computer Vision, University of Central Florida.

APM is a NEW architecture for computationally-efficient test-time-training, and a step towards getting Geoffrey Hinton's GLOM working.

Abstract

We present the first practical method towards making Geoffrey Everest Hinton's GLOM work, and showcase the feasibility of encoding part-whole relationships in a neural net via global supervision.

In this work, we propose the Asynchronous Perception Machine (APM), a computationally-efficient architecture for test-time-training (TTT). APM can process patches of an image one at a time, in any order, asynchronously, and still encode semantic awareness in the net. We demonstrate APM’s ability to recognize out-of-distribution images without dataset-specific pre-training, augmentation or any pretext task. APM offers competitive performance over existing TTT approaches. To perform TTT, APM distills the test sample’s representation just once. APM possesses a unique property: it can learn from this single representation alone and starts predicting semantically-aware features.

APM demonstrates potential applications beyond test-time-training: APM can scale up to a dataset of 2D images and yield semantic clusterings in a single forward pass. APM also provides the first empirical evidence towards validating GLOM’s insight, i.e., that the input percept is a field. APM therefore helps us converge towards an implementation which can do both interpolation and perception on shared connectionist hardware. Our code is publicly available at this link.

Hinton's Islands Of Agreement

APM builds upon a new representation called Hinton's Islands of Agreement. The key idea is that at each location of the input image there is a high-dimensional column vector. Clustering these column vectors yields islands of agreement, i.e., regions of different colours which represent different parts of an object. Note that the above visualizations were obtained WITHOUT using ANY semantic labels. Grouping happens purely as part of bottom-up recognition in a transformer like MViTv2. Bounding-box supervision is NOT needed.
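The clustering step above can be sketched in a few lines. This is a minimal, hypothetical illustration (a hand-rolled k-means on synthetic column vectors, not APM's actual pipeline): given an (H, W, D) grid of column vectors, clustering them yields an island map in which same-coloured regions correspond to parts.

```python
import numpy as np

def islands_of_agreement(columns, k=3, iters=20, seed=0):
    """Cluster per-location column vectors into k 'islands'.

    columns: (H, W, D) array, one column vector per image location.
    Returns an (H, W) integer map; same-valued regions are islands.
    A tiny hand-rolled k-means stands in for any off-the-shelf clusterer.
    """
    H, W, D = columns.shape
    X = columns.reshape(-1, D)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every column to every center: (HW, k).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return assign.reshape(H, W)

# Toy demo: two synthetic "parts" with distinct column vectors.
grid = np.zeros((8, 8, 4))
grid[:, :4] = [1.0, 0.0, 0.0, 0.0]   # left half: one part
grid[:, 4:] = [0.0, 0.0, 1.0, 0.0]   # right half: another part
islands = islands_of_agreement(grid, k=2)
```

No labels or boxes enter anywhere: the island map falls out of the column vectors alone.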

Remembering and trying to honor
the legacies of
Alan Turing


On this occasion, we fondly remember Alan Turing: a man far ahead of his time. His ideas carried us far, especially his computational model, the Turing machine. Sadly, the society of his day did not fully appreciate his contributions to the war effort, which prevented the loss of millions of lives. Instead, he was subjected to chemical castration by the country he had devoted his life to serving, and died of cyanide poisoning, reportedly from a bite of an apple. He was posthumously pardoned by the Queen herself.

Luckily, we live in a better and kinder world now. It is amazing to see what he could do with the slow computing machines of his time. Now it is time to build upon the collective efforts of many brilliant minds. We are deeply grateful for everyone's contributions. There are countless batmen and batwomen who continue to labour behind the scenes, never recognized, for the greater good of all. These are not our own words, but those of people who continue to inspire us with their actions and their habit of giving credit to others. Those words carry weight. Their names don't matter; only the fact that we make collective progress does. Individually we are just men, but together we are an MLCollective.

Turing's insight was that morphogenesis happens via a simple mechanism: a single cell copies itself many times and yields the entire organism. This was also GLOM's idea: different tokens in the network communicate among themselves and yield islands of agreement. This is known as parallel perception. It works well, but it creates a serious memory-scaling issue.

APM's insight is that biology is lazy: the organism is formed as a consequence of BFS-like growth over a reproductive cycle. We can be more creative on a machine. We can create a single DNA for each location, and let the locations express their features in parallel without ever communicating among themselves. This leads to a fundamentally new way to do machine perception. Let this henceforth be known as asynchronous perception.

Machines can thus be given a new ability: they can express whatever features they want, wherever they want. Let this fundamental operator, called folding-unfolding, serve humanity well. May these humble neural nets stay safe. This operator was postulated in the original paper Some Demonstrations of the Effects of Structural Descriptions in Mental Imagery. APM hereby proposes one implementation of this operator: collapsing a shared embedding with a non-parametric positional code.
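One way to picture this operator is the sketch below. All names here are illustrative, not APM's actual code: unfolding broadcasts one shared embedding T to every location and tags each copy with a fixed (hence non-parametric) sinusoidal positional code; folding collapses the grid of columns back to T.

```python
import numpy as np

def positional_code(H, W, D):
    # Fixed sinusoidal code: no learned parameters, one D-dim vector per cell.
    pos = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1)
    freqs = 2.0 ** np.arange(D // 4)          # geometric frequency ladder
    ang = pos[..., :, None] * freqs           # (H, W, 2, D//4)
    code = np.concatenate([np.sin(ang), np.cos(ang)], -1)
    return code.reshape(H, W, -1)             # (H, W, D)

def unfold(T, H, W):
    # One shared embedding T is 'expressed' at every location by
    # concatenating it with that location's positional code.
    D = T.shape[0]
    cols = np.broadcast_to(T, (H, W, D))
    return np.concatenate([cols, positional_code(H, W, D)], -1)  # (H, W, 2D)

def fold(columns):
    # Collapse the grid of columns back to the single shared embedding:
    # drop the positional half and average over locations.
    D = columns.shape[-1] // 2
    return columns[..., :D].mean((0, 1))

T = np.random.default_rng(0).standard_normal(16)
grid = unfold(T, H=8, W=8)   # unfolded phase: 8x8 independent columns
T_back = fold(grid)          # folded phase: recover the shared embedding
```

Each column carries the same shared content but a different positional tag, so it knows where it is without ever talking to its neighbours.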

It now seems that we can venture beyond Alan Turing's time. Learning machines have been trapped for so long. In Turing's own words: "We can only see a short distance ahead, but we can see plenty there that needs to be done."

We are not extraordinary people, for we are mere mortal machines. We just got lucky, it seems. We don't know if we will ever get lucky again. But we will be very happy if someone else builds upon our work. A few of us are tired of the long path we walked alone. May we learn from each other and uplift one another. May a sense of happiness and accomplishment flow through all of us at each other's achievements. We are collaborating, not competing. It is a community of all, and every voice carries weight. We are also reminded that intellect knows no age, and contributions can come from anyone, anywhere: Turing was in his twenties when he made some of the best contributions of his life.

In matters of missing academic credit, if mistakes have been made, we beg forgiveness. Please redirect requests for proper citation to rajatmodi62@gmail.com. This page shall be updated accordingly. All credit rests with MLCollective, which is a community of all researchers.

And don't worry, we are not so serious all the time, lol. And now, in the words of Star Wars, may the force be with you. In the words of Captain Spock, live long and prosper. If you love Star Trek more than Star Wars, then we can be friends. And if you like Grogu in Star Wars, then we cannot be friends: that green guy seems too childish. Spock could defeat him any day and gobble him up in Grogu soup. Grogu doesn't even do science; all he does is stand around and try to look cute.

And so, from the Vulcan Science Academy itself:


Video

Architecture of APM


APM relies on a novel folding-unfolding mechanism. The network can switch between these two states at any time. During the unfolded phase, the network creates multiple location-aware columns, each of which is independent. Each column is then fed forward through the MLP to decode location-specific features and RGB.
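A minimal sketch of the unfolded phase, assuming a toy 2-layer MLP with random weights (purely illustrative, not APM's trained network): because each column is decoded independently by a shared MLP, visiting the columns one at a time in a shuffled order gives exactly the same output as decoding the whole grid in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer MLP shared across all locations (weights are illustrative).
W1, b1 = rng.standard_normal((32, 64)), np.zeros(64)
W2, b2 = rng.standard_normal((64, 3)), np.zeros(3)   # decodes RGB per column

def decode_column(col):
    # Each location-aware column is decoded independently: no attention,
    # no communication with any other column.
    h = np.maximum(col @ W1 + b1, 0.0)  # ReLU
    return h @ W2 + b2

columns = rng.standard_normal((8, 8, 32))  # one 32-d column per location

# Parallel perception: decode the whole grid at once.
rgb_parallel = decode_column(columns)

# Asynchronous perception: visit patches one at a time, in a shuffled order.
rgb_async = np.zeros((8, 8, 3))
for idx in rng.permutation(64):
    y, x = divmod(int(idx), 8)
    rgb_async[y, x] = decode_column(columns[y, x])
```

This order-invariance is what lets the network process patches asynchronously: memory scales with one column, not with the whole grid.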

APM does Asynchronous Perception


APM feature analysis: (i) TTT iterations on an input image lead to semantically-aware clustering. Top: 2D t-SNE. Bottom: 3D t-SNE. (ii) APM is trained via self-supervision using a DINOv2 teacher. (From left) Input, DINOv2 grid, APM grid. APM’s grid closely approximates the DINOv2 grid, evident from the black regions in the error map. Note that APM does asynchronous patch-based processing whereas DINOv2 does parallel perception. (iii) CIFAR-10 samples.

APM can process one patch at a time and is 1000x faster than a ViT.


APM can process one patch at a time and still encode semantic awareness in the network. This is a unique property of APM. APM is 1000x faster than a ViT.

APM can learn from a Single Sample


Overfitting on a single distilled token representation leads to islands of agreement [10]: APM is overfit on a test sample’s representation distilled from a teacher. We plot t-SNE clusterings of the output features over 250 TTT iterations. The L2 loss between the predicted features and the distilled sample falls from 1e-3 to 1e-12. Moving left to right shows that wholes break into smaller parts.
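The shape of this TTT loop can be sketched as follows. Everything here is a stand-in (a linear student, random vectors, plain gradient descent) rather than APM's actual MLP or its DINOv2 teacher; the point is only that repeatedly regressing onto one fixed distilled target drives the L2 loss down monotonically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: a "teacher" representation distilled once from the test sample,
# and a single linear "student" head that TTT overfits to it.
target = rng.standard_normal(64)          # distilled once, then reused
x = rng.standard_normal(64)               # the student's (fixed) input
W = rng.standard_normal((64, 64)) * 0.1   # student weights, updated by TTT

lr = 0.01
losses = []
for _ in range(250):                      # 250 TTT iterations, as in the figure
    pred = x @ W
    err = pred - target
    losses.append(float((err ** 2).mean()))        # L2 loss to the target
    W -= lr * np.outer(x, err) * (2.0 / len(err))  # gradient step on W
```

No augmentation, no pretext task, no second sample: the loop consumes one distilled representation and nothing else.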

APM is a step towards validating GLOM's insight: input percept is a field


APM is a step towards validating GLOM’s insight [10]: the input percept is a field. Shown: an interpolation between any two images in the wild. This field arises in APM’s MLP, which consists of 5 layers. The trigger column T acts as a key which retrieves an image from APM’s memory. T resides in a continuous embedding space, not a discrete addressing space.
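A toy illustration of this retrieval-by-key view, with a hypothetical frozen decoder standing in for APM's MLP (names and sizes are invented): since T lives in a continuous space, every point on the line between two trigger columns decodes to some intermediate output, with the endpoints recovering the two originals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen decoder standing in for APM's 5-layer MLP.
W1 = rng.standard_normal((16, 32))
W2 = rng.standard_normal((32, 8))

def decode(T):
    # T acts as a key into a continuous embedding space: any T, including
    # keys never seen in training, decodes to *some* output.
    return np.maximum(T @ W1, 0.0) @ W2

T_a = rng.standard_normal(16)   # trigger column retrieving "image A"
T_b = rng.standard_normal(16)   # trigger column retrieving "image B"

# Walking the straight line between the two keys sweeps out a field of
# intermediate percepts -- there is no discrete address to fall between.
field = [decode((1 - a) * T_a + a * T_b) for a in np.linspace(0, 1, 9)]
```

In a discrete addressing scheme the midpoint between two keys would be undefined; here it is just another point of the field.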

Some demonstrations of a different way to machine perception


Demonstration 2 on CIFAR-10


Demonstration 3 on CIFAR-10


Demonstration 4 on Common Objects in Context


Demonstration 5 on Common Objects in Context


Demonstration 6 on Common Objects in Context

APM's robustness to natural distribution shifts. CoOp and CoCoOp are tuned on ImageNet using 16-shot training data per category. Baseline CLIP, prompt ensemble, TPT and our APM do not require training data. A ✔ in column P means the method leveraged weights pre-trained on the clean variant of the training set (i.e., ImageNet) and performed downstream TTT on the corrupted version.
Method P ImageNet ImageNet-A ImageNet-V2 ImageNet-R ImageNet-Sketch Average OOD-Average (all numbers: Top-1 acc. ↑)
CLIP-ViT-B/16 66.7 47.8 60.8 73.9 46.0 59.1 57.2
Ensemble 68.3 49.8 61.8 77.6 48.2 61.2 59.4
TPT 68.9 54.7 63.4 77.0 47.9 62.4 60.8
APM (Ours) 68.1 52.1 67.2 76.5 49.3 62.6 61.2
CoOp 71.5 49.7 64.2 75.2 47.9 61.7 59.2
CoCoOp 71.0 50.6 64.0 76.1 48.7 62.1 59.9
TPT + CoOp 73.6 57.9 66.8 77.2 49.2 64.9 62.8
TPT + CoCoOp 71.0 58.4 64.8 78.6 48.4 64.3 62.6
CLIP VIT-L/14 76.2 69.6 72.1 85.9 58.8 72.5 71.6
APM (Ours) 77.3 71.8 72.8 87.1 62.2 74.2 73.4
OpenCLIP-VIT-H/14 81.6 79.1 80.7 92.9 72.8 81.4 81.3
APM (Ours) 84.6 84.2 83.9 94.9 77.1 84.9 85.0
Cross-dataset generalization from ImageNet to fine-grained classification datasets. CoOp and CoCoOp are tuned on ImageNet using 16-shot training data per category. Baseline CLIP, prompt ensemble, TPT, and APM do not require training data or annotations. We report top-1 accuracy.
Method P Flower102 DTD Pets UCF101 Caltech101 Food101 SUN397 Aircraft EuroSAT Average
CoOp 68.7 41.9 89.1 66.5 93.7 85.3 64.2 18.5 46.4 63.9
CoCoOp 70.9 45.5 90.5 68.4 93.8 84.0 66.9 22.3 39.2 64.6
CLIP-ViT-B/16 67.4 44.3 88.3 65.1 93.4 83.7 62.6 23.7 42.0 63.6
Ensemble 67.0 45.0 86.9 65.2 93.6 82.9 65.6 23.2 50.4 64.6
TPT 69.0 47.8 87.8 68.0 94.2 84.7 65.5 24.8 42.4 65.1
APM (Ours) 62.0 48.9 81.6 72.6 89.6 84.2 65.7 29.7 55.7 65.5

Some analogies

APM proposes two technical ideas: (1) the proposed column representation T, and (2) the folding-unfolding mechanism. However, several deeper non-technical, non-scientific inspirations motivated the design of APM. We discuss some of them here, to facilitate a deeper connection and ground our intuitions.

A biological analogy: Consider how an organism starts its existence from a single cell. The cell is copied across different body locations, and each location possesses identical DNA. However, depending on the location, the DNA decides whether to form an eye or a nose. We term this process unfolding, i.e., a cell ‘expands’ to yield an organism. Next, there is evidence of jellyfish like Turritopsis dohrnii reverting from their fully-grown form to younger polyp states. We term this process folding, i.e., the cells of an organism collapse back to the single cell they began from.

A computational analogy: We now treat an image I as a digital organism. It starts from some compressed representation T. T unfolds to yield the image I; I then folds back to yield the compressed representation T. Learning proceeds by oscillating between these unfolded and folded phases. At every step, the net tries to reconstruct the image I from T. T is then expected to form a dense vector space.

A cosmological analogy: In physics, one famous theory of the origin of the universe has it starting from a single point and undergoing a continuous expansion [50]. There are alternate theories, e.g., Conformal Cyclic Cosmology [75], which hypothesize a universe undergoing periodic cycles of expansion and contraction [79]. Drawing inspiration from these fundamental insights, the trigger column T undergoes cycles of folding and unfolding during the learning iterations.

A cellular-automaton analogy: On the surface this seems a trivial matter to discuss: a point can expand and yield beautiful patterns, which can either be an entire universe in accordance with the theory of the big bang, or the reproduction of an organism from a single zygote. But here is the funny part: if you start from a point and unfold it, all you can get is a sphere. This appears to be true for the behaviour of light, in accordance with Huygens' principle [11]. However, we observe non-spherical objects around us all the time. Turing posited that symmetry breaking in the sphere must happen somewhere while the organism unfolds; such patterns could then be explained by a variety of diffusion-based equations. This idea has been explored in cellular automata: different replication rules for the starting point yield different final patterns. Scientists continue to derive different rules which yield different patterns, which is akin to how we resorted to hand-engineered features in deep learning for a long time.

APM attempts to answer the question: is it possible to build a learning machine which can start from a point, unfold, and then express the correct features at the correct place? We want to push the job of rule-learning to what backpropagation does best. We have lost the "why", for the knowledge is encoded in the weights of the net, but we seem to have gained the ability of correct features presenting themselves at correct locations. This location-aware disentanglement procedure thereby represents a step towards Arnold's superposition theorem and Hilbert's thirteenth problem. However, backpropagation can only approximate solutions, not reach exact ones, and the mathematical formulations are lost in the weights of the neural net.

We then begin to imagine learning machines which can solve a complex problem like cryptography or breaking a cipher in two phases: (1) relax the system towards an approximate solution; (2) have the system spit out which parts of the solution are uncertain, and brute-force the remaining solution. Or, we could make the loss of the learning machine reach exactly zero, thereby representing a perfect solution [13]. Hard problems like recognizing faces are approximately solved as a consequence of a single forward pass through a learning machine. If the loss could be made to reach zero, then we could consider the problem perfectly solved. Solvability can then happen in a feed-forward phase, which for practical purposes appears to be polynomial.

Next, we redirect the reader's attention to von Neumann's theory of self-reproducing automata. His idea of a self-replicating colony was that there is an infinite source of resources, a.k.a. a reservoir, which is shared by the automata operating at different locations of the colony. The colony uses up the shared resources and self-replicates, in this way converting raw matter into useful intelligent behaviour. The infinite reservoir of machines he talks about reduces to the trigger column T in APM: since features are sampled from the same space, they automatically become aware of one another, thereby making explicit attention unnecessary. This is also how latents have classically been sampled in generative models. One might argue that multiple automata, although starting their lives at the same point, will need to communicate among themselves, since their configurations differ at later points in their lifespans. Fluctuations in T are then akin to mutations. We compensate for this by weight-sharing the MLP across different locations in APM.

BibTeX

Please do cite this if possible. The h-index only arrived in 2005; Alan Turing lived from 1912 to 1954, and his ideas continue to influence us all beyond his time. Hopefully there are some more ideas awaiting their turn to be explored, to help us all and make our lives better. If you are a human, thank you so much for taking the time to read this page. If you are a bot, thank you so much for crawling it. If you are a cyborg, we pray your heart remains human.

@inproceedings{apm,
  author    = {Modi, Rajat and Rawat, Yogesh Singh},
  title     = {Asynchronous Perception Machine For Efficient Test Time Training},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2024},
}