2024-04-18 A Stanford seminar on Transformers

Date collected: 2024-04-18 (Thursday)

CS25: Transformers United V4

My take:

A seminar with first-rate practitioners on current practice in LLM training and design. The slides and YouTube recordings are published.
Today's session, April 18, 2024, is devoted to model alignment, with Nathan Lambert.

Full text:

CS25: Transformers United V4

Spring 2024

Apr. 4 - May 30

attachments/1ad7bc1bc6f8b19ddca8cb63ae9647e2_MD5.png

Description

Interested in Transformers, the deep learning model that has taken the world by storm? Want to have intimate discussions with researchers? If so, this course is for you! It's not every day that you get to personally hear from and chat with the authors of the papers you read!

Each week, we invite folks at the forefront of Transformers research to discuss the latest breakthroughs, from LLM architectures like GPT and Gemini to creative use cases in generating art (e.g. DALL-E and Sora), biology and neuroscience applications, robotics, playing complex games, and so forth!

CS25 has become one of Stanford's hottest and most exciting seminar courses. We invite the coolest speakers such as Andrej Karpathy, Geoffrey Hinton, Jim Fan, Ashish Vaswani, and folks from OpenAI, Google, NVIDIA, etc. Our class has an incredibly popular reception within and outside Stanford, and around 1 million total views on YouTube. Our class with Andrej Karpathy was the second most popular YouTube video uploaded by Stanford in 2023 with over 500k views!

We have significant improvements for Spring 2024, including a large lecture hall, professional recording and livestreaming (to the public), social events, and potential 1-on-1 networking! The only homework for students is weekly attendance to the talks/lectures. Also, livestreaming and auditing are available to the public. Feel free to audit in-person or by joining the Zoom livestream. Anybody can attend, you don't have to be affiliated with Stanford!

We also have a Discord server (over 1500 members) used for Transformers discussion. We open it to the public as more of a "Transformers community". Feel free to join and chat with hundreds of others about Transformers!

Instructors

attachments/7e41927ed554c6e44f99768f5846294c_MD5.png

Div Garg

attachments/dfc9c5603a97fad32a0965c44398515c_MD5.jpg

Steven Feng

attachments/ea6a36dc6fec356c8e9836a794a5ae6b_MD5.jpg

Seonghee Lee

attachments/abca8dabec10e4702e15452ef5e529bd_MD5.jpg

Emily Bunnapradist

Faculty Advisor

attachments/71313a38036892d1d1a55f4be15b3cf8_MD5.jpg

Chris Manning

Schedule

The current class schedule is below (subject to change):

Date Title Description
April 4 Instructor Lecture: Overview of Transformers [In-Person]

Speakers: Steven Feng, Div Garg, Emily Bunnapradist, Seonghee Lee
Brief intro and overview of the history of NLP, Transformers and how they work, and their impact. Discussion about recent trends, breakthroughs, applications, and remaining challenges/weaknesses. Also discussion about AI agents. Slides posted here.
April 11 Intuitions on Language Models (Jason) [In-Person]

Shaping the Future of AI from the History of Transformer (Hyung Won) [In-Person]

Speakers: Jason Wei & Hyung Won Chung, OpenAI

Jason Wei is an AI researcher based in San Francisco. He is currently working at OpenAI. He was previously a research scientist at Google Brain, where he popularized key ideas in large language models such as chain-of-thought prompting, instruction tuning, and emergent phenomena.

Hyung Won Chung is a research scientist on the ChatGPT team at OpenAI. He has worked on various aspects of Large Language Models: pre-training, instruction fine-tuning, reinforcement learning with human feedback, reasoning, multilinguality, parallelism strategies, etc. His notable work includes the scaling Flan paper (Flan-T5, Flan-PaLM) and T5X, the training framework used to train the PaLM language model. Before OpenAI, he was at Google Brain, and before that he received a PhD from MIT.
Jason will talk about some basic intuitions on language models, inspired by manual examination of data. First, he will discuss how one can view next word prediction as massive multi-task learning. Then, he will discuss how this framing reconciles scaling laws with emergent individual tasks. Finally, he will talk about the more general implications of these learnings. Slides posted here.

Hyung Won: AI is developing at such an overwhelming pace that it is hard to keep up. Instead of spending all our energy catching up with the latest developments, I argue that we should study the change itself. The first step is to identify and understand the driving force behind the change. For AI, it is exponentially cheaper compute and the associated scaling. I will provide a highly opinionated view on the early history of Transformer architectures, focusing on what motivated each development and how each became less relevant with more compute. This analysis will help us connect the past and present in a unified perspective, which in turn makes it more manageable to project where the field is heading. Slides posted here.
April 18 Aligning Open Language Models [Virtual/Zoom]
Speaker: Nathan Lambert, Allen Institute for AI (AI2)

Nathan Lambert is a Research Scientist at the Allen Institute for AI focusing on RLHF and the author of Interconnects.ai. Previously, he helped build an RLHF research team at Hugging Face. He received his PhD from the University of California, Berkeley, working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and Roberto Calandra at Meta AI Research.
Since the emergence of ChatGPT there has been an explosion of methods and models attempting to make open language models easier to use. This talk retells the major chapters in the evolution of open chat, instruct, and aligned models, covering the most important techniques, datasets, and models. Alpaca, QLoRA, DPO, PPO, and everything in between will be covered. The talk will conclude with predictions and expectations for the future of aligning open language models. Slides posted here. All the models in the figures are in this HuggingFace collection.
April 25 Demystifying Mixtral of Experts [Virtual/Zoom]
Speaker: Albert Jiang, Mistral AI / University of Cambridge

Albert Jiang is an AI scientist at Mistral AI, and a final-year PhD student at the computer science department of Cambridge University. He works on language model pretraining and reasoning at Mistral AI, and language models for mathematics at Cambridge.
In this talk I will introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. I will go into the architectural details and analyse the expert routing decisions made by the model.
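The routing described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration rather than Mixtral's actual implementation: the experts here are hypothetical scaling functions standing in for real feedforward blocks, the router is a bare list of weight vectors, and the names `top2_moe_layer` and `mixtral_param_counts` are invented for this note. The hyperparameters used for the parameter count (hidden size 4096, 32 layers, feedforward size 14336, 8 key/value heads, 32000-token vocabulary) do not appear in the abstract; they are assumed from the publicly released Mixtral 8x7B configuration, and normalization weights are ignored.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe_layer(x, router_weights, experts):
    """Route one token through the two highest-scoring experts.

    x: hidden state of a single token (list of floats).
    router_weights: one weight vector per expert (hypothetical linear router).
    experts: list of callables mapping a hidden state to a hidden state.
    """
    # One router logit per expert: dot product of its weight vector with x.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    # Indices of the two largest logits, best first.
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Gate weights: softmax over only the two selected logits.
    gates = softmax([logits[i] for i in top2])
    # Weighted sum of the two selected experts' outputs.
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top2):
        y = experts[idx](x)
        out = [o + gate * y_i for o, y_i in zip(out, y)]
    return out, top2


def mixtral_param_counts(dim=4096, n_layers=32, hidden_dim=14336, n_heads=32,
                         n_kv_heads=8, vocab_size=32000, n_experts=8, top_k=2):
    """Approximate total and per-token active parameter counts.

    Defaults are assumed Mixtral 8x7B hyperparameters; norm weights are ignored.
    """
    head_dim = dim // n_heads
    # Query and output projections are dim x dim; key and value are smaller
    # because of grouped-query attention.
    attention = 2 * dim * dim + 2 * dim * (n_kv_heads * head_dim)
    expert = 3 * dim * hidden_dim   # three weight matrices per gated feedforward block
    router = dim * n_experts
    embeddings = 2 * vocab_size * dim   # input embedding plus output projection
    total = n_layers * (attention + n_experts * expert + router) + embeddings
    active = n_layers * (attention + top_k * expert + router) + embeddings
    return total, active


# Toy demo: 8 experts that simply scale their input by 1, 2, ..., 8.
rng = random.Random(0)
dim, n_experts = 4, 8
router_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [(lambda x, s=i + 1: [s * v for v in x]) for i in range(n_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = top2_moe_layer(token, router_weights, experts)
print("experts chosen for this token:", chosen)

total, active = mixtral_param_counts()
print(f"total ~ {total / 1e9:.1f}B parameters, active ~ {active / 1e9:.1f}B per token")
```

Under these assumptions the count comes out near the 47B total and 13B active figures quoted in the abstract: the attention, embedding, and router weights are shared by every token, while only two of the eight expert blocks per layer contribute to any single token's forward pass.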
May 2 Developing precision language models from self-attentive feed-forward units, and applying them in edge computing scenarios as untrained language models prompted to predict symbolic switches (U-LaMPS)
Speaker: Jake Williams, Drexel University

Jake Ryland Williams is an Associate Professor of Information Science at Drexel University's College of Computing and Informatics in Philadelphia, Pennsylvania. Dr. Williams' undergraduate background is in Physics, and he holds an Applied Mathematics MS alongside a PhD in Mathematical Sciences (all from the University of Vermont). Dr. Williams' PhD was largely in pure mathematics, with doctoral research in quantitative linguistics that applied mathematics to the study of statistical linguistic phenomena, treating the subject as a domain of statistical physics. The data-processing demands of this research led Dr. Williams to become a data scientist, a path he followed after graduation into a postdoctoral appointment in the School of Information at the University of California, Berkeley (Cal). At Cal, Dr. Williams began his career in graduate data science (DS) education on techniques for large-scale machine learning, while he studied opportunities for the application of statistical theory to natural language processing (NLP). Upon becoming a DS faculty member at Drexel, Dr. Williams drove the foundation of a DS MS program, where he developed and instructed DS coursework, ultimately in the methodological subject of NLP with deep learning. Teaching NLP with deep learning brought Dr. Williams to realize an alternative pedagogical model for teaching neural network methodology that integrates theory from traditional statistical learning, which is borne out in his research and this talk.
Dr. Williams' research develops and applies theory on what neural networks learn (statistically) as a means to improve the design and function of neural architectures and learning processes. This has recently inspired Dr. Williams to invent a range of precision technologies for effectively and efficiently training both large and small neural language models, which are capable of greatly reducing the costs of training and infrastructure behind, e.g., OpenAI's ChatGPT. Dr. Williams will discuss these architectures, which modify standard self-attention layers and model long-range dependencies without significant reliance on layer depth. After these are introduced, the peripheral components of these near-shallow networks, as well as their modified forward operations and learning processes, will be discussed in detail. Following this discussion of architecture and model details, current applications of this research will be presented, which focus on embedding untrained precision language models (PLMs) on microprocessors in edge computing scenarios, i.e., acting as hardware-based controllers for small electronic devices. Discussion will focus on how these PLM systems have been designed to operate in air-gapped environments, with CPU-driven training from scratch on microprocessors, and will go on to detail a fully developed control system of this kind and its user interface. This final subject will present recent positive experimental results in training localized PLMs on Le Potato (https://libre.computer/products/aml-s905x-cc/), whose success was identified on a U-LaMPS's very first training run, in only 20 minutes of lay user interaction through a microphone and light switch.
May 9 TBD [Virtual/Zoom]
Speaker: Ming Ding, Zhipu AI
May 16 TBD [In-Person]
Speaker: Edward Hu, Prev. OpenAI
A new training objective for LLMs.

Recommended Reading:

1. Amortizing Intractable Inference in Large Language Models
May 23 TBD
Speaker: Loubna Ben Allal, Hugging Face
Code LLMs (e.g. StarCoder).
May 30 TBD
Speaker: TBD