CS/Stat 184 Intro to RL

CS/Stat 184: Introduction to Reinforcement Learning

Modern AI systems often need the ability to make sequential decisions in an unknown, uncertain, possibly hostile environment, by actively interacting with the environment to collect relevant data. Reinforcement Learning (RL) is a general framework that can capture the interactive learning setting and has been used to design intelligent agents that achieve high-level performance in challenging applications such as Go, computer games, robotic manipulation, health care, and education.

This course provides an introduction to reinforcement learning covering a range of problem formulations, algorithms, and theory. The three main themes of the course are (1) Markov decision processes (Bellman equations/optimality, planning, UCB, unknown environments, linear quadratic control, exploration, imitation learning), (2) bandits (epsilon-greedy, UCB, Thompson sampling, contextual bandits, linear bandits, exploration in MDPs), and (3) deep RL and methods for large-scale systems (policy gradient methods, Monte Carlo tree search, Q-learning, imitation learning).

There will also be an Embedded Ethics lecture on ethical issues arising in reinforcement learning. The assignments will focus on a mix of algorithmic and statistical principles, along with their programming implementations.

After taking this course, students will be able to understand fundamental RL algorithms and their analysis.

The course will go through algorithms and their analysis. All homework will have a programming component to give students more hands-on experience with the concepts.

Staff and Organization

Instructors: Lucas Janson, Sham Kakade

TFs: Benjamin Schiffer

CAs: Luke Bailey, Alex Dazhen Cai, Kevin Yee Du, Kevin Yifan Huang, Saket Joshi, Thomas Kaminsky, Patrick McDonald, Eric Meng Shen, Natnael Mekuria Teshome, Jeffrey George Wang

Lecture time: Monday/Wednesday 11:15am - 12:30pm

Lecture location: 114 Western Ave, 2111+2112

Sections (starting 9/11/23):

Mon 5-6pm, SC706

Tue 10:30-11:30am, SC706

Wed 12:45-1:45pm, SEC LL2.221

Thu 11am-12pm, SC706

Fri 2-3pm, SC706

**Please double check the website for office hour location changes/cancellations before you arrive.**

Instructor office hours:

Lucas Janson: Thu 10:00-11:00am, SC 710

Sham Kakade: ~~Th 3-4p, SEC 4.410~~

TA office hours:

Luke & Thomas: Mon 7-9pm, Dunster DHall

Ben: Tue 11:30am-1:30pm, SC706

Alex & Patrick: Tue 7-9pm, SC706

Kevin D & Kevin H & Eric: Wed 7-9pm, SC706

Natnael & Saket: Fri 3-6pm, SC706

Discussion: Ed discussion board

Contact Info:

Please communicate with the instructors only by making a post that is “Private”, i.e., “Visible to you and staff only” in Ed. Course-related email sent directly to the instructors will not be answered in a timely manner.

Announcements:

Please make sure you monitor for (and receive) announcements from both the official class mailing list and from Ed. Ed is a convenient way to send out some announcements, such as homework corrections and clarifications. It is important for you to make sure you get these announcements in a timely manner.

Prerequisites

Lectures will focus on algorithm design and analysis. We require a background in: calculus & linear algebra (e.g., AM22a, Math 21b), probability theory (e.g., Stat 110), and programming in Python. The following topics are recommended but not required: linear regression, supervised learning, algorithms.

Homeworks will have a programming component, and we expect students to be comfortable with programming in Python (or committed to quickly learning it). We will use Python as the programming language in all HWs.

Grading Policies

(TENTATIVE) Participation 5%; Assignments 45% (HW0: 5%, HW1-HW4: 40% total); Midterm 20%; Project 30%;

The course is letter-graded by default, but you may switch to SAT/UNSAT if you prefer.

In order to pass the course, you must attempt and submit every homework, even if it is submitted for zero credit (as per the late day policy below). We will also have an "embedded ethics" lecture with 1-2 corresponding questions, either incorporated into a homework or as a standalone short assignment (with the grading scheme adjusted appropriately). All homeworks are mathematical and have a programming component (we use Python and OpenAI Gym).

Participation: 5% of the grade will be participation. People can participate in the course in many different ways, including regular attendance of lectures (there will be a form after each class where students can record their attendance), participating in section, in the Ed forum, and more. At the end of the term, you will write a paragraph on how you participated in the course. The requirements to get the full 5% contribution will not be too onerous, and regularly attending the lectures will suffice. If for some reason you are not able to regularly attend all the lectures, then increased participation in Ed and section will be sufficient. If you have another responsibility that prevents you from attending all the lectures, please let us know by making a post that is “Private”, i.e., “Visible to you and staff only” in Ed, and we will take this into consideration.

Homework Policies: Collaboration is permitted though each student must understand, write, and hand in their own submission. In particular, it is acceptable for students to discuss problems with each other; it is not acceptable for students to look at another student's written answers. It is also not acceptable to publicly post your (partial) solution on Ed, but it is encouraged for you to ask public questions on Ed. You must also indicate on each homework with whom you collaborated and what online resources you used.

Each student will have 96 cumulative hours of late time (as measured on Gradescope), which will be forgiven. After this cumulative amount of time has passed, any assignment that is turned in late will receive zero credit. Furthermore, only up to 48 hours of late time may be used on any one assignment; any assignment turned in more than 48 hours late will receive zero credit.

The final homework score for HW1-4 will be determined by summing up the total points earned across all four assignments. This sum will then be divided by the total possible points to calculate the overall percentage score for the HW1-4 component of the course.
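As an illustration only, the calculation described above can be sketched in Python. The point values below are hypothetical and are not the actual point totals of any assignment; the function names are likewise made up for this example. The sketch also shows that pooling points across assignments is not the same as averaging the four per-assignment percentages.

```python
def pooled_hw_score(earned, possible):
    """Total points earned divided by total points possible."""
    if len(earned) != len(possible):
        raise ValueError("earned and possible must have the same length")
    return sum(earned) / sum(possible)


def mean_of_percentages(earned, possible):
    """Average of per-assignment percentages (NOT the method described above)."""
    return sum(e / p for e, p in zip(earned, possible)) / len(earned)


# Hypothetical points for HW1-HW4, for illustration only.
earned = [40, 45, 30, 50]
possible = [50, 50, 40, 60]

pooled = pooled_hw_score(earned, possible)
averaged = mean_of_percentages(earned, possible)
print(f"pooled: {pooled:.4f}, mean of percentages: {averaged:.4f}")
```

With these made-up numbers, the pooled score is 165/200 = 82.5%. Under pooling, an assignment worth more points carries proportionally more weight in the HW1-4 component.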

We highly encourage you to use LaTeX. We will also accept neatly handwritten homework.

Homeworks must be submitted through Gradescope. PDF files of the homeworks can be accessed on Gradescope. PDF and LaTeX files for the homeworks will also be uploaded to Canvas.

Regrading Policy: All homework regrading requests must be submitted on Gradescope within seven days after the grades are released. For example, if we return the grades on Monday, then you have until midnight the following Monday to submit any regrade requests. If you feel that we have made an error in grading your homework, please let us know with a written explanation. This policy is to ensure that we can address any concerns in a timely and fair manner. Office hours and in-person discussions are limited to knowledge-related questions. Grade-related questions must be submitted by making a post that is “Private”, i.e., “Visible to you and staff only” in Ed.

Project: Please see the course project page.

Diversity and Inclusiveness

While many academic disciplines have historically been dominated by one cross section of society, the study of and participation in STEM disciplines is a joy that the instructors hope that everyone can pursue, regardless of their socio-economic background, race, gender, etc. We encourage students to both be mindful of these issues, and, in good faith, try to take steps to fix them. You are the next generation here.

You should expect to be treated by your classmates and the course staff with respect. You belong here, and we are here to help you learn and enjoy this course. If any incident occurs that challenges this commitment to a supportive and inclusive environment, please let the instructors know so that the issue can be addressed. We are personally committed to this and subscribe to Harvard's Values for Inclusion.

Honor Code

You must always understand and write up your own solutions.

Collaborations only where explicitly allowed.

Do not use forums like Course Hero, Chegg, etc.

Properly cite any outside materials you use for your HWs. Do not directly search for answers on the internet. If you are unclear about whether some online material can be used, please ask the course staff first.

No sharing of your solutions within or outside class at any time.

Do not use Generative AI tools to explicitly obtain answers. Think of generative AI tools as you would a message board or collaborator: you can use it for assistance (and if you do, you should cite it) but you may not directly ask it for the answer.

The above is not an exhaustive list, and in general, common sense rules about academic integrity apply. If you are in doubt about something, please ask us whether it is OK before you do it. Also see the Harvard College Honor Code.

Course Materials

Slides will be posted before each lecture, and annotated slides (with all notes taken on them by the instructor during lecture) will be posted after each lecture. We will make reasonable attempts to record and post each lecture. It is possible that some lectures may not be recorded, in which case we will not be able to do any make-ups of that lecture. We encourage students to attend the lectures in person (see the Participation Policy) and participate in the class discussion.

Section materials will also be posted by the TFs. These materials serve as the reference material for the course content.

We will often post "Supplementary Reading" for a lecture, but this reading is strictly supplementary: homework and midterm questions will be based purely on material covered in the lectures and sections.

One source of supplementary material will be from a draft textbook being written for this course. This material, when available, should closely follow the lecture content, notation, and structure, with some additional material and examples. Feedback on this draft is welcome and appreciated via the Ed forum.

More advanced supplementary reading may come from the working draft of the book "Reinforcement Learning Theory and Algorithms", available here. Note that this is an advanced RL theory book, with much material out of the scope of this class, so we will only use very select subsections of it as supplemental reading. If you find typos or errors, please let the authors (e.g., Sham) know--they would appreciate it!

You can also self-study from the classic book "Reinforcement Learning: An Introduction", available here.

Schedule (tentative)

Date	Lecture	Slides	Supplementary Reading
9/6/23	MDPs: Introduction to RL and Markov Decision Processes	Slides Annotated Slides	Textbook: 1.1, 1.2.0-1.2.4 AJKS: 1.1.1, 1.1.2
9/11/23	MDPs: Dynamic Programming	Slides Annotated Slides	Textbook: 1.2.5 AJKS: 1.1.1, 1.1.2
9/13/23	MDPs: Discounted, Infinite Horizon MDPs	Slides Annotated Slides	Textbook: 1.3.0-1.3.4 AJKS: 1.1.1, 1.1.2
9/18/23	MDPs: Value and Policy Iteration	Slides Annotated Slides	Textbook: 1.3.5
9/20/23	Control: Optimal Control in Linear Quadratic Regulators (LQRs)	Slides Annotated Slides
9/25/23	Control: Control for Nonlinear Systems (Iterative LQR)	Slides Annotated Slides
9/27/23	Bandits: Introduction to Bandits	Slides Annotated Slides
10/2/23	Bandits: Explore-Then-Commit (ETC), ε-greedy	Slides Annotated Slides
10/4/23	Bandits: Upper Confidence Bound (UCB)	Slides Annotated Slides
10/9/23	No class
10/11/23	Bandits: Thompson Sampling	Slides Annotated Slides
10/16/23	Learning: Supervised Learning	Slides Annotated Slides
10/18/23	Learning: Fitted Dynamic Programming	Slides Annotated Slides
10/23/23	PG: (Stochastic) Gradient Descent & Policy Gradient	Slides Annotated Slides
10/25/23	Midterm
10/30/23	PG: Estimation and Baselines	Slides Annotated Slides
11/1/23	PG: Trust Region Methods and Natural PG	Slides Annotated Slides
11/6/23	Embedded EthiCS
11/8/23	PG: NPG and PPO	Slides Annotated Slides
11/13/23	PG: PPO & Importance Sampling	Slides Annotated Slides
11/15/23	Imitation Learning: Behavior Cloning & DAgger	Slides Annotated Slides
11/20/23	MCTS: Monte Carlo Tree Search	Slides Annotated Slides
11/22/23	No class
11/27/23	Exploration: Exploration in MDPs and UCB-VI	Slides Annotated Slides
11/29/23	Bandits: Linear Bandits	Slides Annotated Slides
12/4/23	Bandits: Contextual Bandits and Case Study: RL in the Real World	Slides Annotated Slides