Skip to main content

New method improves efficiency of ‘vision transformer’ AI systems

Close shot of a human eye reflecting light.

Vision transformers (ViTs) are powerful artificial intelligence (AI) technologies that can identify or categorize objects in images – however, there are significant challenges related to both computing power requirements and decision-making transparency. Researchers have now developed a new methodology that addresses both challenges, while also improving the ViT’s ability to identify, classify and segment objects in images.

Transformers are among the most powerful existing AI models. For example, ChatGPT is an AI that uses transformer architecture, but the inputs used to train it are language. ViTs are transformer-based AI that are trained using visual inputs. For example, ViTs could be used to detect and categorize objects in an image, such as identifying all of the cars or all of the pedestrians in an image.

However, ViTs face two challenges.

First, transformer models are very complex. Relative to the amount of data being plugged into the AI, transformer models require a significant amount of computational power and use a large amount of memory. This is particularly problematic for ViTs, because images contain so much data.

Second, it is difficult for users to understand exactly how ViTs make decisions. For example, you might have trained a ViT to identify dogs in an image. But it’s not entirely clear how the ViT is determining what is a dog and what is not. Depending on the application, understanding the ViT’s decision-making process, also known as its model interpretability, can be very important.

The new ViT methodology, called “Patch-to-Cluster attention” (PaCa), addresses both challenges.

“We address the challenge related to computational and memory demands by using clustering techniques, which allow the transformer architecture to better identify and focus on objects in an image,” says Tianfu Wu, corresponding author of a paper on the work and an associate professor of electrical and computer engineering at North Carolina State University. “Clustering is when the AI lumps sections of the image together, based on similarities it finds in the image data. This significantly reduces computational demands on the system. Before clustering, computational demands for a ViT are quadratic. For example, if the system breaks an image down into 100 smaller units, it would need to compare all 100 units to each other – which would be 10,000 complex functions.

“By clustering, we’re able to make this a linear process, where each smaller unit only needs to be compared to a predetermined number of clusters. Let’s say you tell the system to establish 10 clusters; that would only be 1,000 complex functions,” Wu says.

“Clustering also allows us to address model interpretability, because we can look at how it created the clusters in the first place. What features did it decide were important when lumping these sections of data together? And because the AI is only creating a small number of clusters, we can look at those pretty easily.”

The researchers did comprehensive testing of PaCa, comparing it to two state-of-the-art ViTs called SWin and PVT.

“We found that PaCa outperformed SWin and PVT in every way,” Wu says. “PaCa was better at classifying objects in images, better at identifying objects in images, and better at segmentation – essentially outlining the boundaries of objects in images. It was also more efficient, meaning that it was able to perform those tasks more quickly than the other ViTs.

“The next step for us is to scale up PaCa by training on larger, foundational data sets.”

The paper, “PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers,” will be presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, being held June 18-22 in Vancouver, Canada. First author of the paper is Ryan Grainger, a Ph.D. student at NC State. The paper was co-authored by Thomas Paniagua, a Ph.D. student at NC State; Xi Song, an independent researcher; and Naresh Cuntoor and Mun Wai Lee of BlueHalo.

The work was done with support from the Office of the Director of National Intelligence, under contract number 2021-21040700003; the U.S. Army Research Office, under grants W911NF1810295 and W911NF2210010; and the National Science Foundation, under grants 1909644, 1822477, 2024688 and 2013451.


Note to Editors: The study abstract follows.

“PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers”

Authors: Ryan Grainger, Thomas Paniagua and Tianfu Wu, North Carolina State University; Xi Song, independent researcher; Naresh Cuntoor and Mun Wai Lee, BlueHalo

Presented: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, June 18-22, Vancouver, Canada

Abstract: Vision Transformers (ViTs) are built on the assumption of treating image patches as “visual tokens” and learn patch-to-patch attention. The patch embedding based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. The patch-to-patch attention suffers from the quadratic complexity issue, and also makes it nontrivial to explain learned ViTs. To address these issues in ViT, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in our PaCa-ViT starts with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks. In experiments, the proposed methods are tested on ImageNet-1k image classification, MS-COCO object detection and instance segmentation and MIT-ADE20k semantic segmentation. Compared with the prior art, it obtains better performance in all the three benchmarks than the SWin and the PVTs by significant margins in ImageNet-1k and MIT-ADE20k. It is also significantly more efficient than PVT models in MS-COCO and MIT-ADE20k due to the linear complexity. The learned clusters are semantically meaningful. Code and model checkpoints are available at

This post was originally published in NC State News.