This short tutorial explains the training objectives used to develop ChatGPT, OpenAI's new chatbot language model.
Timestamps:
0:00 - Non-intro
0:24 - Training overview
1:33 - Generative pretraining (the raw language model)
4:18 - The alignment problem
6:26 - Supervised fine-tuning
7:19 - Limitations of supervision: distributional shift
8:50 - Reward learning based on preferences
10:39 - Reinforcement learning from human feedback
13:02 - Room for improvement
ChatGPT:
Relevant papers for learning more:
InstructGPT: Ouyang et al., 2022 -
GPT-3: Brown et al., 2020 -
PaLM: Chowdhery et al., 2022 -
Efficient reductions for imitation learning: Ross & Bagnell, 2010 -
Deep reinforcement learning from human preferences: Christiano et al., 2017 -
Learning to summarize from human feedback: Stiennon et al., 2020 -
Scaling laws for reward model overoptimization: Gao et al., 2022 -
Proximal policy optimization algorithms: Schulman et al., 2017 -
Special thanks to Elmira Amirloo for feedback on this video.
Links:
YouTube:
Twitter:
Homepage:
If you’d like to help support the channel (completely optional), you can donate a cup of coffee via the following:
Venmo:
PayPal: