Perceiver AR: general-purpose, long-context autoregressive generation

Over the last few years, autoregressive Transformers have brought a steady stream of breakthroughs in generative modeling. These models generate each element of a sample, such as the pixels of an image, the characters of a text (typically in “token” chunks), or the samples of an audio waveform, by predicting one element after the other. When predicting the next element, the model can look back at those that were created earlier.
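As a minimal sketch of that loop, the Python below samples one element at a time and passes everything generated so far back to the model. The `generate` helper, the `toy_model` function, and its three-letter vocabulary are hypothetical stand-ins for illustration, not part of Perceiver AR.

```python
import random


def generate(next_element_distribution, prompt, num_new, seed=0):
    """Sample `num_new` elements, one at a time, after `prompt`."""
    rng = random.Random(seed)
    sequence = list(prompt)
    for _ in range(num_new):
        # The model looks back at every element created so far.
        elements, weights = next_element_distribution(sequence)
        sequence.append(rng.choices(elements, weights=weights, k=1)[0])
    return sequence


def toy_model(history):
    # Hypothetical stand-in for a trained network: it simply favours
    # repeating the most recent element.
    vocab = ["a", "b", "c"]
    weights = [3.0 if history and v == history[-1] else 1.0 for v in vocab]
    return vocab, weights


sample = generate(toy_model, prompt=["a"], num_new=5)
print(sample)
```

A real model would replace `toy_model` with a network that returns a distribution over tokens, pixels, or audio samples; the surrounding loop would stay the same.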

However, each of a Transformer’s layers grows more expensive as more elements are used as input, and practitioners can only afford to train deep Transformers on sequences no more than about 2,048 elements in length. And so, most Transformer-based models ignore all elements beyond the most recent past (around 1,500 words or 1/6 of a small image) when making a prediction.

In contrast, our recently developed Perceiver models give excellent results on a variety of real-world tasks with up to around 100,000 elements. Perceivers use cross-attention to encode inputs into a latent space, decoupling the input’s compute requirements from model depth. Perceivers also spend a fixed cost, regardless of input size, at nearly every layer.

While latent-space encoding handles all elements in a single pass, autoregressive generation assumes processing happens one element at a time. To address this problem, Perceiver AR proposes a simple solution: align the latents one by one with the final elements of the input, and carefully mask the input so latents see only earlier elements.
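The alignment and masking can be sketched as a boolean cross-attention mask. This is an illustration under stated assumptions, not the released implementation: the function name `cross_attention_mask` is hypothetical, and it assumes each latent may attend to the input position it is aligned with and to everything before it, and then predicts the element that follows.

```python
def cross_attention_mask(num_inputs, num_latents):
    """Return mask[i][j] == True where latent i may attend to input j.

    Assumption: latent i is aligned with input position
    (num_inputs - num_latents + i) and sees that position and all
    earlier ones, never later ones.
    """
    if not 0 < num_latents <= num_inputs:
        raise ValueError("need 0 < num_latents <= num_inputs")
    offset = num_inputs - num_latents
    return [
        [j <= offset + i for j in range(num_inputs)]
        for i in range(num_latents)
    ]


tokens = list("PerceiverAR")
mask = cross_attention_mask(len(tokens), num_latents=3)
for i, row in enumerate(mask):
    visible = "".join(t for t, allowed in zip(tokens, row) if allowed)
    print(f"latent {i} sees {visible!r}")
```

With 11 input characters and 3 latents, the first latent sees only the first nine characters, while the last latent sees the whole input. No latent can look ahead of its own position.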

Perceiver AR maps an input sequence (P e r c e i v e r A R) to a small latent space by cross-attention to produce one latent for each target token (3 latents shown, one each for the targets A, R, and <EOS>, for End Of Sequence). These latents are then processed by a deep stack of self-attention layers. Perceiver AR can be trained for end-to-end autoregressive generation, all while making use of very long input sequences.

The result is an architecture (shown above) that attends to inputs up to 50x longer than standard Transformers can handle, while deploying as widely (and essentially as easily) as standard decoder-only Transformers.

As context length or model size increases, the amount of compute needed to train a model grows. We can quantify the compute budget for different models by measuring their speed on real hardware (steps per second on TPUv3) as the input context length and model size increase. Unlike other generative models like Transformer or Transformer-XL, Perceiver AR decouples input context length from model depth, allowing us to easily deploy the deep models needed to model long sequences on current-generation TPUs or GPUs.
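To make the decoupling concrete, here is a rough back-of-the-envelope sketch that counts only query-key pairs in attention. It is not a measurement of real hardware speed. It assumes a standard Transformer applies full self-attention over the whole context at every layer, while a Perceiver AR-style model pays for one cross-attention from the latents to the input and then runs self-attention over the latents alone. The function names and the figure of 1,024 latents are illustrative assumptions; causal masking, attention heads, and MLP cost are ignored.

```python
def transformer_pairs(context_len, num_layers):
    """Query-key pairs when every layer attends over the full context."""
    return num_layers * context_len * context_len


def perceiver_ar_pairs(context_len, num_latents, num_layers):
    """Query-key pairs for one cross-attention plus a latent-only stack."""
    cross = num_latents * context_len                    # latents read the input once
    latent_stack = num_layers * num_latents * num_latents  # depth is independent of context_len
    return cross + latent_stack


context_len = 8192   # context length mentioned in the post
num_layers = 60      # depth mentioned in the post
num_latents = 1024   # hypothetical latent count, chosen only for illustration

dense = transformer_pairs(context_len, num_layers)
latent = perceiver_ar_pairs(context_len, num_latents, num_layers)
print(f"full-context self-attention: {dense:,} pairs")
print(f"cross-attention + latents:   {latent:,} pairs")
print(f"ratio: {dense / latent:.1f}x")
```

In this simplified count, only the single cross-attention term depends on the context length, so making the model deeper does not multiply the cost of a long input.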

Perceiver AR scales considerably better with size than both standard Transformers and Transformer-XL models at a range of sequence lengths in real terms. This property allows us to build very effective long-context models. For example, we find that a 60-layer Perceiver AR with context length 8192 outperforms a 42-layer Transformer-XL on a book-length generation task, while running faster in real wall-clock terms.

On standard, long-context image (ImageNet 64×64), language (PG-19), and music (MAESTRO) generation benchmarks, Perceiver AR produces state-of-the-art results. Increasing input context by decoupling input size from compute budget leads to several intriguing results:

  • Compute budget can be adapted at eval time, allowing us to spend less and smoothly degrade quality or to spend more for improved generation.
  • A larger context allows Perceiver AR to outperform Transformer-XL, even when spending the same on compute. We find that greater context leads to improved model performance even at affordable scale (~1B parameters).
  • Perceiver AR’s sample quality exhibits much less sensitivity to the order in which it generates elements. This makes Perceiver AR easy to apply to settings that don’t have a natural left-to-right ordering, such as images and other data with structure that spans more than one dimension.

Using a dataset of piano music, we trained Perceiver AR to generate new pieces of music from scratch. Because each new note is predicted based on the full sequence of notes that came before, Perceiver AR is able to produce pieces with a high level of melodic, harmonic, and rhythmic coherence.

Learn more about using Perceiver AR:

  • Download the JAX code for training Perceiver AR on GitHub
  • Read our paper on arXiv
  • Check out our spotlight presentation at ICML 2022

See the Google Magenta blog post with more music!

Date: 2022-07-15 20:00:00

Alina A, Toronto

