Attention Is All You Need

1 Introduction

<aside> 💡 The paper proposes the Transformer, a parallelizable, high-performing model designed to capture global dependencies between input and output!

</aside>

📌 Definition of the Transformer

A model that keeps the encoder-decoder structure of conventional seq2seq but, true to the paper's title, is built entirely from attention, without any RNNs.

2 Background

The Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.

3 Model Architecture

The encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn).

Given z, the decoder then generates the output sequence (y1, ..., ym) one element at a time.

The model is auto-regressive: at each time step it consumes the previously generated symbols as additional input when producing the next symbol.
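A minimal sketch of this auto-regressive loop, using PyTorch's nn.Transformer. The BOS_ID/EOS_ID tokens, vocabulary size, and layer counts are illustrative assumptions rather than values from the paper, and greedy argmax decoding stands in for the beam search the authors actually use.

```python
import torch
import torch.nn as nn

# Toy setup: BOS_ID, EOS_ID, vocab size, and model sizes are illustrative assumptions.
BOS_ID, EOS_ID, VOCAB, D_MODEL = 1, 2, 1000, 512

embed = nn.Embedding(VOCAB, D_MODEL)
model = nn.Transformer(d_model=D_MODEL, nhead=8, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
project = nn.Linear(D_MODEL, VOCAB)

src = torch.randint(0, VOCAB, (1, 10))   # input sequence (x1, ..., xn), batch of 1
ys = torch.tensor([[BOS_ID]])            # decoder input starts from a start symbol

for _ in range(20):                      # emit y1, ..., ym one element at a time
    # causal mask so each position only attends to previously generated positions
    tgt_mask = model.generate_square_subsequent_mask(ys.size(1))
    out = model(embed(src), embed(ys), tgt_mask=tgt_mask)
    next_id = project(out[:, -1]).argmax(dim=-1, keepdim=True)  # greedy choice of next symbol
    ys = torch.cat([ys, next_id], dim=1)  # previously generated symbols become additional input
    if next_id.item() == EOS_ID:          # stop once the end symbol is produced
        break
```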


The Transformer follows this overall encoder-decoder structure.

It stacks self-attention and point-wise, fully connected layers in both the encoder and the decoder, as in the sketch below.
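As a rough illustration, one encoder layer can be written as self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. The sizes (d_model = 512, 8 heads, d_ff = 2048, N = 6) follow the base model, but this is an assumed simplification, not the authors' implementation; it omits dropout, padding masks, embeddings, and positional encodings.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention + position-wise feed-forward,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # self-attention: query = key = value = x
        x = self.norm1(x + attn_out)            # Add & Norm
        x = self.norm2(x + self.ff(x))          # point-wise FFN applied identically at every position
        return x

# The encoder stacks N identical layers (N = 6 in the base model).
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
z = encoder(torch.randn(1, 10, 512))            # (batch, n, d_model) -> continuous representations z
```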

transformer_attention_overview.png

3.1 Encoder and Decoder Stacks

1.PNG