[Paper Review] Large Language Models are Zero-Shot Reasoners, 2022

jun0823 2023. 1. 31. 22:37

https://arxiv.org/pdf/2205.11916.pdf

0. Abstract

Pre-trained large language models (LLMs) are widely used in many sub-fields of NLP and are generally known as good few-shot learners when given task-specific exemplars.

In particular, Chain of Thought (CoT) prompting, which solves complex multi-step problems in a step-by-step manner, recently achieved SOTA performance on difficult system-2 tasks that do not follow the standard scaling laws.

These successes are often attributed to the few-shot learning ability of LLMs, but the authors show that LLMs are decent zero-shot reasoners when the sentence "Let's think step by step" is simply added before each answer.

The authors confirm that this method performs very well across a variety of models.

The versatility of this single prompt across diverse reasoning tasks hints at zero-shot capabilities of LLMs that are still unexplored or understudied, and suggests that high-level, multi-task, broad cognitive capabilities can be extracted by simple prompting.

1. Introduction

Scaling up the size of language models has been a key ingredient of the recent revolution in NLP.

The success of LLMs is often attributed to few-shot and zero-shot learning.

Models can solve a variety of tasks simply by being conditioned on a few examples (few-shot) or on an instruction describing the task (zero-shot).
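To make the two kinds of conditioning concrete, here is a minimal sketch of how such prompts might be assembled. The "Q: / A:" layout, the function names, and the toy arithmetic questions are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Condition on a few solved (question, answer) examples, then ask the new question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {question}\nA:"


def zero_shot_prompt(instruction: str, question: str) -> str:
    """Condition on a task description only, with no solved examples."""
    return f"{instruction}\nQ: {question}\nA:"


# Toy data, invented for this illustration.
examples = [("What is 1 + 1?", "2"), ("What is 2 + 2?", "4")]
fs = few_shot_prompt(examples, "What is 3 + 4?")
zs = zero_shot_prompt("Answer the arithmetic question.", "What is 3 + 4?")
print(fs)
print(zs)
```

In both cases the model weights stay fixed; only the text placed in front of the question changes.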

This way of conditioning a language model is called "prompting", and designing prompts, either manually or automatically, has become a hot topic in NLP.

While LLMs perform well on intuitive, single-step system-1 tasks with task-specific few-shot or zero-shot prompting, even language models with 100B or more parameters have struggled on system-2 tasks that require slow, multi-step reasoning.

To address this shortcoming, prior work proposed Chain of Thought (CoT) prompting, which feeds the model step-by-step reasoning examples instead of standard question-answer examples.

CoT leads the model to generate a reasoning path that decomposes complex reasoning into several easier steps.

Notably, with CoT the reasoning performance satisfies the scaling laws better and increases with the size of the language model.

These successes are often attributed to the few-shot learning ability of LLMs, but the authors show that LLMs are decent zero-shot reasoners when the sentence "Let's think step by step" is simply added before each answer.

Despite this simplicity, the paper's Zero-shot-CoT successfully generates plausible reasoning paths and produces correct answers on problems where the standard zero-shot approach fails.

Importantly, unlike earlier prompt engineering in the form of examples (few-shot) or templates (zero-shot), the proposed Zero-shot-CoT is versatile and task-agnostic.

It can produce step-by-step answers across diverse reasoning tasks, such as arithmetic, symbolic reasoning, and commonsense reasoning (e.g., StrategyQA), without modifying the prompt per task.

Moreover, with this single fixed prompt, zero-shot LLMs show a significantly better scaling curve than with standard zero-shot prompting, comparable to that of the few-shot CoT baseline.

3. Zero-shot Chain of Thought

The paper proposes Zero-shot-CoT, a zero-shot template-based prompting method for chain of thought reasoning.

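It differs from the original chain of thought prompting in that it does not require step-by-step few-shot examples, and it differs from earlier template prompting in that it is inherently task-agnostic and elicits multi-hop reasoning with a single template.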

The core idea of the method is illustrated in Fig 1.

3.1 Two-stage prompting

Zero-shot-CoT is conceptually simple, but as explained in Fig 2., it prompts the model twice to extract both the reasoning and the answer.

In contrast, the zero-shot baseline at the bottom left of Fig 1. already uses prompting of the form "The answer is" to extract the answer in the correct format.

Few-shot prompting, whether standard or CoT, avoids the need for answer-extraction prompting by explicitly designing the few-shot example answers to end in such a format, as in the two examples at the top of Fig 1.

In summary, Few-shot-CoT requires human engineering because each task needs examples with a specific answer format, whereas Zero-shot-CoT requires less engineering but has to prompt the LLM twice.

1st prompt: reasoning extraction

In this step, the input question x is converted into a prompt x' using the simple template "Q: [X]. A: [T]".

[X] : input slot for x

[T] : slot for a hand-crafted trigger sentence t that extracts a chain of thought for answering the question x

For example, if "Let's think step by step" is used as the trigger sentence, the prompt x' becomes "Q: [X]. A: Let's think step by step."

The prompted text x' is fed into the language model, which generates the subsequent sentence z.
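The first-stage template can be sketched in a few lines. This is only an illustration under assumptions: the function name `build_reasoning_prompt` and the sample question are made up here, while the template string and the trigger sentence follow the description above.

```python
def build_reasoning_prompt(question: str,
                           trigger: str = "Let's think step by step.") -> str:
    """Fill the "Q: [X]. A: [T]" template: [X] <- question x, [T] <- trigger t."""
    return "Q: {x}. A: {t}".format(x=question, t=trigger)


# Hypothetical input question, used only to show the resulting prompt x'.
x_prime = build_reasoning_prompt("What is 2 + 3")
print(x_prime)
```

Because the trigger is just a parameter, other trigger sentences can be swapped in without touching the rest of the template.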

2nd prompt: answer extraction

In the second step, the prompted sentence x' and the sentence z generated by the language model are used together to extract the final answer.

More concretely, the authors concatenate three elements in the form "[X'] [Z] [A]".

[X'] : 1st prompt x'

[Z] : sentence z generated in the first step

[A] : trigger sentence for extracting the answer

The prompt in this step is self-augmented, because it contains the sentence z generated by the same language model.

The answer trigger varies slightly with the answer format. For example, multi-choice QA uses "Therefore, among A through E, the answer is", while math problems that require a numerical answer use "Therefore, the answer (arabic numerals) is".
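The second-stage prompt is a plain concatenation, which the sketch below tries to show. The dictionary keys, the function name `build_answer_prompt`, and the sample x' and z strings are assumptions for illustration; only the two trigger sentences and the "[X'] [Z] [A]" order come from the text above.

```python
# Answer triggers quoted in the review, keyed by hypothetical format names.
ANSWER_TRIGGERS = {
    "multiple_choice": "Therefore, among A through E, the answer is",
    "numeric": "Therefore, the answer (arabic numerals) is",
}


def build_answer_prompt(x_prime: str, z: str, answer_format: str) -> str:
    """Concatenate [X'] [Z] [A] into the self-augmented second prompt."""
    trigger = ANSWER_TRIGGERS[answer_format]
    return " ".join([x_prime, z.strip(), trigger])


# Made-up first prompt and reasoning text, standing in for real model output.
x_prime = "Q: What is 2 + 3. A: Let's think step by step."
z = " 2 plus 3 equals 5."
second_prompt = build_answer_prompt(x_prime, z, "numeric")
print(second_prompt)
```

Keeping the triggers in one table makes it visible that the answer format is the only task-dependent part of the method.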

Finally, the prompted text is fed into the language model as input to generate the final answer y_hat.
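Putting the two stages together, the whole procedure could look roughly like the sketch below. It assumes the language model is available as a plain Python callable from prompt to completion; `zero_shot_cot` and `toy_lm` are hypothetical names, and `toy_lm` returns canned text instead of calling a real model.

```python
from typing import Callable, List


def zero_shot_cot(question: str,
                  lm: Callable[[str], str],
                  answer_trigger: str,
                  reasoning_trigger: str = "Let's think step by step.") -> str:
    """Two-stage prompting: extract reasoning z first, then the answer y_hat."""
    x_prime = f"Q: {question}. A: {reasoning_trigger}"  # 1st prompt x'
    z = lm(x_prime).strip()                             # reasoning text z
    second_prompt = f"{x_prime} {z} {answer_trigger}"   # [X'] [Z] [A]
    return lm(second_prompt).strip()                    # final answer y_hat


calls: List[str] = []  # records every prompt sent to the stand-in model


def toy_lm(prompt: str) -> str:
    """Canned stand-in for a real language model (illustration only)."""
    calls.append(prompt)
    if prompt.endswith("the answer (arabic numerals) is"):
        return " 5."
    return " 2 plus 3 equals 5."


y_hat = zero_shot_cot("What is 2 + 3", toy_lm,
                      "Therefore, the answer (arabic numerals) is")
print(y_hat)
print(len(calls))
```

The `calls` list makes the cost noted earlier visible: one question leads to two model calls, and the second prompt contains the model's own first output.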

Result

Zero-shot-CoT vs. Zero-shot

Conclusion

The paper proposes Zero-shot-CoT, a single zero-shot prompt that elicits chain of thought from large language models across a variety of reasoning tasks.

This simple method not only provides a minimalist and the strongest zero-shot baseline for system-2 reasoning tasks that have long evaded the scaling laws of LLMs, but also encourages the discovery of multi-task prompts that elicit broad cognitive abilities.