Fast Inference from Transformers via Speculative Decoding

Written: 2024.02

Recommended companion paper → Accelerating Large Language Model Decoding with Speculative Sampling

This post covers the content of the paper, some brief additional information as of 2024.02, and a short opinion from the author.

Introduction

For large Transformer models, a single decode step takes a long time. On top of that, generating K tokens requires K single decode steps, run one after another.
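The serial dependency can be made concrete with a minimal sketch. The function `toy_decode_step` is a hypothetical stand-in for one forward pass of a large model (it is not from the paper); the point is only that each step needs the token produced by the previous one, so K tokens cost K model calls.

```python
from typing import Callable, List, Tuple


def toy_decode_step(prefix: List[int]) -> int:
    """Hypothetical stand-in for one (expensive) forward pass of a large model."""
    return (sum(prefix) + 1) % 50  # deterministic toy rule for the next token


def generate(
    prefix: List[int], k: int, step: Callable[[List[int]], int]
) -> Tuple[List[int], int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prefix)
    calls = 0
    for _ in range(k):
        next_token = step(tokens)  # depends on every token generated so far
        calls += 1
        tokens.append(next_token)  # the next step cannot start before this
    return tokens, calls


tokens, calls = generate([1, 2, 3], k=5, step=toy_decode_step)
print(f"generated {len(tokens) - 3} tokens with {calls} serial model calls")
```

Because step `t` consumes the output of step `t-1`, the loop cannot be parallelized as written. This is the cost that speculative decoding tries to reduce.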

Various approaches have been proposed to address this problem, but most of them change the model architecture, change the training procedure, require re-training, or do not guarantee identical model output.

This paper proposes a new inference method built on two observations: in most large-model inference settings the bottleneck is memory bandwidth rather than arithmetic operations, and some inference steps are easy to predict while others are hard.
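A rough back-of-envelope estimate shows why a decode step tends to be memory-bound. Every number below (parameter count, bytes per parameter, memory bandwidth, peak FLOP/s) is a hypothetical round figure chosen for illustration, not taken from the paper. The sketch also assumes batch size 1, about 2 FLOPs per parameter per token, and ignores attention and the KV cache.

```python
def decode_step_time_estimates(
    n_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
    peak_flops_per_s: float,
) -> tuple:
    """Return (memory_time, compute_time) in seconds for one decode step."""
    # Every weight has to be read from memory once per generated token.
    weight_bytes = n_params * bytes_per_param
    memory_time = weight_bytes / mem_bandwidth_bytes_per_s
    # Assumption: roughly one multiply-add (2 FLOPs) per parameter per token.
    flops = 2 * n_params
    compute_time = flops / peak_flops_per_s
    return memory_time, compute_time


# Hypothetical setup: 10B parameters, 2 bytes each, 1 TB/s, 100 TFLOP/s.
memory_time, compute_time = decode_step_time_estimates(10e9, 2, 1e12, 100e12)
ratio = memory_time / compute_time
print(f"memory: {memory_time * 1e3:.1f} ms, compute: {compute_time * 1e3:.2f} ms")
print(f"reading weights takes about {ratio:.0f}x longer than the arithmetic")
```

Under these assumptions the arithmetic units sit idle most of the time, which leaves spare compute for scoring several candidate tokens in a single pass.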

The proposed methods are called ‘speculative sampling’ and ‘speculative decoding’. With them, there is no need to change the existing model architecture, change the training procedure, or re-train, and the model output is exactly the same as when using the single model alone.

Speculative Decoding

Let $M_p$ be the target model (the model we originally wanted to use), and let $p(x|x_{<t})$ be the distribution obtained by feeding the prefix $x_{<t}$ into the target model.

Likewise, let $M_q$ be a more efficient approximation model, and let $q(x|x_{<t})$ be the distribution obtained by feeding the prefix $x_{<t}$ into the approximation model.

<aside> 👉 Approximation model

This paper uses the term ‘approximation model’, whereas the DeepMind paper uses ‘draft model’. The author prefers ‘draft model’. From here on this post uses ‘draft model’, which should be read as meaning the same thing as ‘approximation model’.

Review of the DeepMind paper → Accelerating Large Language Model Decoding with Speculative Sampling

</aside>
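To make the notation concrete, here is a minimal sketch of $p(x|x_{<t})$ and $q(x|x_{<t})$ as next-token distributions over the same vocabulary. The functions `target_logits` and `draft_logits` are hypothetical toy stand-ins for $M_p$ and $M_q$, and the 4-token vocabulary is likewise an assumption for illustration.

```python
import math
from typing import List

VOCAB_SIZE = 4  # hypothetical tiny vocabulary


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_logits(prefix: List[int]) -> List[float]:
    """Hypothetical M_p: the large, slow model we actually want to sample from."""
    return [float((len(prefix) + i) % VOCAB_SIZE) for i in range(VOCAB_SIZE)]


def draft_logits(prefix: List[int]) -> List[float]:
    """Hypothetical M_q: a cheaper model giving a flatter approximation of M_p."""
    return [0.5 * z for z in target_logits(prefix)]


prefix = [2, 0, 1]                   # x_{<t}
p = softmax(target_logits(prefix))   # p(x | x_{<t})
q = softmax(draft_logits(prefix))    # q(x | x_{<t})
print("p:", [round(v, 3) for v in p])
print("q:", [round(v, 3) for v in q])
```

Both models assign probabilities to the same tokens given the same prefix. In this toy setup $q$ agrees with $p$ on the most likely token but is less confident about it, which is the kind of mismatch between draft and target that the method has to handle.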