
Transformer

What are the motivations for this work?

Recurrent Neural Networks were the state-of-the-art approach in NLP, and people wanted to improve both their model quality and their training efficiency (for example, by making better use of GPU parallelism).

The fundamental constraint of sequential computation limits how far this can go: each step depends on the previous hidden state, so computation cannot be parallelized across positions within a sequence.

What is the proposed solution?

The Transformer eschews recurrence (the sequential part of the computation) and relies entirely on an attention mechanism, which allows for significantly more parallelization.

Multi-head Attention

Instead of performing a single attention function over d_model-dimensional queries, keys, and values, the model linearly projects them h times with different learned projections, runs attention in parallel on each projected version (a "head"), then concatenates the head outputs and projects the result once more. This lets the model jointly attend to information from different representation subspaces at different positions.
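A minimal NumPy sketch of this, assuming a single sequence input X of shape (seq_len, d_model) and square projection matrices W_q, W_k, W_v, W_o; the variable names, toy shapes, and random weights are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); each W_*: (d_model, d_model).
    Project into heads, attend in parallel, concatenate, project back."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Linear projection, then split the last dim into heads: (h, seq_len, d_head).
    def split_heads(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (h, seq_len, d_head)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage with random weights (in the real model these are learned).
rng = np.random.default_rng(0)
d_model, seq_len, h = 64, 10, 8
X = rng.normal(size=(seq_len, d_model))
W = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *W, num_heads=h)   # shape (10, 64)
```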

Position-wise Feed-Forward Networks

Each of the layers in the encoder and decoder also contains a fully connected feed-forward network, applied to each position separately and identically.
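In the paper this is two linear transformations with a ReLU in between, FFN(x) = max(0, xW1 + b1)W2 + b2. A minimal NumPy sketch, assuming a (seq_len, d_model) input and the paper's dimensions d_model = 512 and d_ff = 2048 (the random toy weights are only illustrative):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
    X: (seq_len, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model)."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

# Toy usage with the dimensions used in the paper.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 10
X = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
out = position_wise_ffn(X, W1, b1, W2, b2)   # shape (10, 512)
```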

  • RNN
  • Encoder-Decoder architecture
  • Self-attention
    • Attention, broadly construed, is a method for taking a query and softly looking up information in a key-value store by picking the value(s) of the key(s) most similar to the query (see the sketch after this list).
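
A minimal NumPy sketch of that soft lookup, in the form of scaled dot-product attention; the toy shapes and the Q/K/V variable names are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Soft key-value lookup: each query scores every key, and the output
    is the attention-weighted sum of the values.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # soft "pick the most similar keys"
    return weights @ V                   # weighted sum of the corresponding values

# Toy usage: 3 queries looking up a store of 5 key-value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))
out = scaled_dot_product_attention(Q, K, V)   # shape (3, 16)
```

In self-attention, the queries, keys, and values all come from the same sequence, so every position can look up information from every other position in one parallel step.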

  • https://youtu.be/P127jhj-8-Y?si=ORJMt5Mam7pNxCQN
  • https://web.stanford.edu/class/cs224n/readings/cs224n-self-attention-transformers-2023_draft.pdf
  • https://zhuanlan.zhihu.com/p/604739354
