Transformer Was Once Called CargoNet
It took seven years for the authors of ‘Attention Is All You Need’, all except Niki Parmar, to come together in the same room. The moment finally arrived during the NVIDIA GTC 2024 session ‘Transforming AI’, hosted by GPU king Jensen Huang.
Noam Shazeer, founder of Character AI, revealed that the Transformer architecture was once called ‘CargoNet’, but nobody really paid it any attention.
“There were a lot of names; there was something called CargoNet [short for Convolution, Attention, Recognition, and Google],” said Shazeer excitedly. However, the name failed to impress, with everyone unanimously downvoting it as “horrible”. “Wise people,” quipped Huang, pulling Shazeer’s leg.
Jakob Uszkoreit eventually came up with the name Transformer. “The reason it became such a generic name is that, on paper, we weren’t focused solely on translation. We were definitely aware that we were trying to create something very general, something that could truly transform anything into anything else,” said Llion Jones, co-founder of Sakana AI.
Speaking on the Transformer’s multimodality, Aidan Gomez, founder of Cohere, said, “When we were building the Tensor2Tensor library, we were really focused on scaling up autoregressive training. It wasn’t just for language; there were components in there for images, audio, and text, both in input and output.”
What Are the Creators of Transformer Up to Now?
Illia Polosukhin was the first one to leave Google, in 2017. He ended up building NEAR Protocol, a blockchain platform designed to be faster, cheaper, and more user-friendly than existing options like Ethereum.
Ashish Vaswani left Google in 2021. “One of the big reasons why I left was that the only way to make these models smarter was not just by working in the vacuum of a lab; you actually had to go out and put them into people’s hands,” he said.
In late 2022, he and Niki Parmar founded a company called Essential AI. “We’re really excited about building models that can ultimately learn to solve new tasks at the same level of efficiency as humans as they watch what we do,” said Vaswani, adding that their ultimate goal is to change the way we interact with computers and how we work.
Meanwhile, Shazeer founded Character AI in 2021. “The biggest frustration at that time was, here’s this incredible technology, and it’s not getting out to everyone; it has so many uses,” said Shazeer with such energy that Huang quipped, “This is what Zen looks like”.
Gomez founded Cohere in 2019. He said the idea behind Cohere was the same as Noam’s: he felt this technology would change the world once computers started speaking back to humans.
“I think the way that I’ve gone about it is a bit different from Noam’s in the sense that Cohere builds for enterprises. We create a platform for every enterprise to adopt and integrate it (genAI) into their product, as opposed to directly going to consumers,” said Gomez.
In 2023, Jones co-founded Sakana AI, a nature-inspired Japanese AI startup; sakana means ‘fish’ in Japanese. The company is currently working on a technique called Evolutionary Model Merge, which combines different models from the vast ocean of open-source models with diverse capabilities.
“We’re making the algorithms by hand. To do it, we took all the models available on Hugging Face and then used very large amounts of computation to run an evolutionary search through the space of how to merge and stack the layers,” said Jones.
“I want to remind you that, with the massive amount of computation that NVIDIA has given us, there are other things we can do apart from gradient descent,” he added.
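As a rough illustration of the idea behind Evolutionary Model Merge, the sketch below evolves a ‘recipe’ that blends the corresponding layers of several toy models and keeps the recipes that score best. Everything here is hypothetical: the models are random matrices, the fitness function is a placeholder, and the recipe format is invented for the example; it only sketches the weight-blending half of the technique, not the layer-stacking search Jones mentions.

    import random
    import numpy as np

    # Toy stand-ins for "models": each is a list of layer weight matrices.
    # In the real setting these would be open-source checkpoints with compatible shapes.
    NUM_LAYERS, DIM, NUM_MODELS = 4, 8, 3
    models = [[np.random.randn(DIM, DIM) for _ in range(NUM_LAYERS)]
              for _ in range(NUM_MODELS)]

    def merge(recipe):
        # Layer i of the merged model is a weighted blend of layer i from every source model.
        merged = []
        for i, weights in enumerate(recipe):
            w = np.array(weights) / sum(weights)
            merged.append(sum(w[m] * models[m][i] for m in range(NUM_MODELS)))
        return merged

    def fitness(model):
        # Placeholder score; a real run would evaluate the merged model on a benchmark.
        x = np.ones(DIM)
        for layer in model:
            x = np.tanh(layer @ x)
        return float(x.sum())

    def mutate(recipe):
        # Nudge one blending weight, keeping it positive.
        child = [list(layer) for layer in recipe]
        i, m = random.randrange(NUM_LAYERS), random.randrange(NUM_MODELS)
        child[i][m] = max(1e-3, child[i][m] + random.gauss(0, 0.2))
        return child

    # A simple evolutionary loop over merge recipes: keep the best, mutate them, repeat.
    population = [[[random.random() for _ in range(NUM_MODELS)] for _ in range(NUM_LAYERS)]
                  for _ in range(16)]
    for _ in range(30):
        survivors = sorted(population, key=lambda r: fitness(merge(r)), reverse=True)[:4]
        population = survivors + [mutate(random.choice(survivors)) for _ in range(12)]

    best = max(population, key=lambda r: fitness(merge(r)))
    print([[round(w, 2) for w in layer] for layer in best])

In a real run, the placeholder fitness would be replaced by an actual benchmark score for each merged model, which is where the very large amounts of computation Jones refers to come in.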
Lukasz Kaiser joined OpenAI in 2021. “That’s where the best Transformers were built. It’s a lot of fun at the company. We know you can take a ton of data and a ton of compute and make something nice,” said Kaiser.
Uszkoreit founded Inceptive AI in 2021 to use AI to design novel biological molecules for vaccines, therapeutics, and other treatments, essentially creating a new kind of ‘biological software’. “My first child was born during the pandemic, which certainly, but also otherwise, gave me a newfound appreciation for the fragility of life,” said Uszkoreit.
What Comes After the Transformer?
Huang asked the panel about the most significant improvements to the base Transformer design. Gomez replied that extensive work has been done on the inference side to speed these models up. However, he said he is quite unhappy that all the developments happening today are still built on top of the Transformer.
“I still think it kind of disturbs me how similar to the original form we are. I think the world needs something better than the Transformer,” he said, adding that he hopes it will be succeeded by a ‘new plateau of performance’. “I think it is too similar to the thing that was there six or seven years ago.”
Jones said that companies like OpenAI are currently using a lot of computation. “I think they’re doing a lot of wasted computation,” he said when Huang asked about interest in larger context windows and faster token generation. Huang quickly chipped in, saying, “We are trying to make it efficient”.
Uszkoreit thinks the solution to the computation problem is the right allocation. “It’s really about spending the right amount of effort and ultimately energy.” As for SSMs (state space models), he is of the opinion that they are ‘too complicated’ and ‘not elegant enough’.
Meanwhile, Ashish Vaswani, chief of Essential AI, believes that to build better models, the right interface is essential. “If we ultimately want to build models that can mimic and learn how to solve tasks by watching us, the interface is going to be absolutely huge,” he said.
Jones believes that many young researchers have forgotten the pre-Transformer age. He said that all the problems they were facing back then while trying to get things working are likely still present in these models. “People seem to have forgotten the pre-Transformer age, so they have to rediscover all those problems,” he added.
Polosukhin pointed out that the Transformer can be run with recurrent steps. “The fun fact is that I find that nobody is actually playing with the fact that you can run a Transformer for a variable number of steps, and train that differently.”
Meanwhile, Lukasz Kaiser is not sure recurrence can even be tamed. “I have this personal belief that we have never truly learned how to train recurrent layers with gradient descent. Maybe it’s just impossible,” he said.
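To make Polosukhin’s point concrete, here is a minimal, hypothetical sketch of a weight-tied Transformer-style block applied a variable number of times, so the same parameters are reused recurrently and the depth can differ from one call to the next. It is an illustration of the general idea, not code from any of the panelists, and the toy block omits layer norm, multiple heads, and everything else a real model would need.

    import numpy as np

    rng = np.random.default_rng(0)
    DIM = 16

    # One set of parameters for a toy attention + feed-forward block, shared across all steps.
    Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (DIM, DIM)) for _ in range(4))
    W1 = rng.normal(0, 0.1, (DIM, 4 * DIM))
    W2 = rng.normal(0, 0.1, (4 * DIM, DIM))

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def block(x):
        # A single weight-tied block: self-attention followed by a small MLP, with residuals.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        x = x + softmax(q @ k.T / np.sqrt(DIM)) @ v @ Wo
        return x + np.maximum(0.0, x @ W1) @ W2

    def run(x, num_steps):
        # Apply the same block num_steps times: depth becomes a knob, not a fixed choice.
        for _ in range(num_steps):
            x = block(x)
        return x

    tokens = rng.normal(size=(10, DIM))   # ten toy token embeddings
    shallow = run(tokens, num_steps=2)    # a cheap pass
    deep = run(tokens, num_steps=8)       # same parameters, more recurrent computation
    print(float(np.abs(deep - shallow).mean()))

Training such a model, with gradients flowing through a variable and potentially large number of repeated applications of the same block, is exactly the part Kaiser is sceptical about.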