CS/Stat 184: Introduction to Reinforcement Learning
Enrollment
information: This year,
because it is a new course, CS/Stat 184 will have its enrollment
capped at 120. We will respond to all petitions the morning of
August 26, so in order to ensure consideration for a spot in the
course, submit an enrollment petition (it can be blank) before
August 26. On August 26, we will approve petitions in the following
order: (1) any senior undergraduate, (2) any other undergraduate who
has a concentration or secondary in CS or Statistics, in decreasing
order of seniority, (3) any other undergraduates, in decreasing
order of seniority, (4) all graduate students (note this is an
undergraduate class, hence priority is given to undergraduates). Any
ties will be broken randomly among petitions with equal priority. If
your petition is not approved on August 26, you may still be able to
take the class; if spots open up as students drop, we will enroll
further students in the same order as above.
Modern Artificial Intelligence (AI) systems often need the ability to make sequential decisions in an
unknown, uncertain, possibly hostile environment, by actively interacting with the environment to collect
relevant data.
Reinforcement Learning (RL) is a general framework that captures this interactive learning setting and
has been used to design intelligent agents that achieve superhuman-level performance on
challenging tasks such as Go, computer games, and robotic manipulation.
This course focuses on the basics of Reinforcement Learning. The four main parts of the course are
(1) multi-armed bandits, (2) planning and
control in MDPs, (3) learning in large MDPs (function approximation), and (4)
advanced topics.
After taking this course, students will be able to understand fundamental RL algorithms and their
analysis.
All lectures will be math-heavy: we will go
through algorithms and their analysis. All homework will have
a programming component to give students more hands-on
experience with the concepts.
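As a small taste of part (1), the sketch below simulates an ε-greedy learner on a Bernoulli multi-armed bandit in pure Python. This is an illustrative example only, not course or homework code; the function name and parameters are our own.

```python
import random

def run_epsilon_greedy(true_means, epsilon, horizon, seed=0):
    """Simulate epsilon-greedy on a Bernoulli multi-armed bandit.

    true_means: success probability of each arm (unknown to the learner).
    Returns the empirical mean estimates and pull counts per arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # number of times each arm was pulled
    estimates = [0.0] * k     # running empirical mean reward per arm
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: pick a uniformly random arm
        else:
            # exploit: pick the arm with the highest empirical mean so far
            arm = max(range(k), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # incremental update of the empirical mean for the pulled arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

est, counts = run_epsilon_greedy([0.3, 0.7], epsilon=0.1, horizon=5000)
```

Over a long horizon, the learner pulls the better arm most of the time while still exploring at rate ε; the course analyzes exactly how such exploration/exploitation trade-offs translate into regret bounds.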

Staff and Organization
Instructors: Lucas Janson, Sham Kakade
TFs:
Daniel Garces, Kuanhao Jiang, Yanke Song
CAs:
Alex Dazhen Cai, Howie Guo, Angela Yilin Li, Richard Qiu, Eric Meng
Shen, Lara Zeng, Saba Zerefa
Lecture time: Tuesday/Thursday 10:30–11:45am
Lecture location: Maxwell Dworkin G115
**Please double-check the
website for office hour location
changes/cancellations before you arrive.**
Instructor office hours:
Lucas Janson: Mon 1:30–2:30pm, SC 710
Sham Kakade: Thu 4–5pm, SEC 4.410
TA office hours:
Alex, Howie & Richard: Thu 3–5pm, Sever Hall 110
Angela & Eric: Sun 2–4pm, Quincy Dhall
Daniel: Fri 11am–noon, SEC 1.414
Kuanhao: Tue & Thu 2:30–3:30pm, Sever Hall 214
Lara & Saba: Tue 8:30–10:30pm, Lowell Dhall
Yanke: Mon & Wed 5:30–6:30pm, SC 222
Discussion: Ed discussion board
Contact Info:
Please communicate with the instructors only
by making a private post to the "Instructors" group on Ed.
Any course-related email sent directly to the
instructors will not be responded to in a timely manner.
Announcements:
Please make sure you monitor for (and receive) announcements from both the official class mailing
list and from Ed. Ed is a convenient way to send out some announcements, such as homework corrections
and clarifications. It is important for you to make sure you get these announcements in a timely manner.

Prerequisites
Lectures will be mathematically oriented, where we
focus on algorithm design and analysis. We require a
background in: calculus & linear algebra (e.g., AM22a, Math
21b), probability theory (e.g., Stat 110), and programming
in Python. The following topics are recommended but not
required: linear regression, supervised learning,
algorithms.
Homeworks will
have a programming component, and we expect students to be
comfortable with programming in Python (or committed to
quickly learning it). We will use Python as the programming language in all HWs.

Grading Policies
Assignments 70% (HW0: 6%, HW1–HW4: 16%
each); Final 30%
The course
is letter-graded by default, but you may switch to SAT/UNSAT if you prefer.
In order to pass the course, you must attempt and
submit all homeworks, even if they are submitted for zero
credit (as per the late day policy below).
We will also have an "embedded ethics" lecture with
1–2 corresponding questions, either incorporated into a homework or as a
standalone short assignment (with the grading scheme adjusted appropriately).
All homeworks are mathematical and have a
programming component (we use Python and OpenAI Gym). The final
exam covers concepts and algorithms,
and it does not contain a programming component.
Homework Policies: Collaboration is permitted though each
student must understand, write, and hand in their own
submission. In particular, it is acceptable for students to
discuss problems with each other; it is not acceptable for
students to look at another student's written answers. It is
also not acceptable to publicly post your (partial)
solutions on Ed, but you are encouraged to ask
public questions on Ed.
You must also indicate on each homework with whom you
collaborated and what online resources you used.
Each student will have 96 cumulative hours of late time (as
measured on Canvas), which will be forgiven. Once this
cumulative amount of time has been used up, any assignment that is
turned in late will receive zero credit. Furthermore,
at most 48 hours of late time may be used on any one
assignment; any assignment turned in more than 48 hours
late will receive zero credit.
We highly encourage you to use LaTeX. We will also accept
neatly handwritten homework.
Homeworks must be submitted through Gradescope. PDF files of the homeworks can be
accessed on Gradescope. PDF and LaTeX files for the homeworks will also be uploaded to Canvas.
Regrading Policy:
All homework regrading requests must be submitted on Gradescope within
seven days after the grades are released. For example,
if we return the grades on Monday, then you have until midnight the
following Monday to submit any regrade requests. If you feel that we
have made an error in grading your homework, please let us know with a
written explanation. This policy is to ensure that we can address any
concerns in a timely and fair manner. Office hours and in-person
discussions are limited to knowledge-related
questions; grade-related questions must be submitted to the
course mailing list.
Final Exam:
If you are not able to make the final exam on the official date (and
do not have an exception based on university policies), then please do not
enroll in the course. The course is in Exam Group FAS14_B.

Diversity and Inclusiveness
While many academic disciplines have historically been dominated by one cross section of society,
the study of and participation in STEM disciplines is a joy that the instructors hope that everyone can
pursue,
regardless of their socioeconomic background, race, gender, etc.
We encourage students to both be mindful of these issues, and,
in good faith, try to take steps to fix them. You are the next generation here.
You should expect to be treated by your classmates and the course staff with respect.
You belong here, and we are here to help you learn and enjoy this course.
If any incident occurs that challenges this commitment to a supportive and inclusive environment,
please let the instructors know so that the issue can be addressed. We are personally committed to this
and subscribe to
Harvard's
Values for Inclusion.

Honor Code
Collaborations only where explicitly allowed.
Do not use forums like Course Hero, Chegg, etc.
Properly cite any outside materials you use for your HWs.
If you are unclear about
whether some online material can be used, please ask the course staff first.
No sharing of your solutions within or outside class at any time.
The above is not an exhaustive list, and in general,
common sense rules about academic integrity apply. If
something is in doubt, please ask us whether it is OK before you
do it. Also see the Harvard
College Honor Code.

Course Materials
Slides will be posted before each lecture, and
annotated slides (with all notes taken on them by the
instructor during lecture) will be posted after each
lecture. We will make reasonable attempts to record and post
each lecture. It is possible that some lectures may not be recorded,
in which case we will not be able to make up that lecture. We
encourage the students to attend the lectures in person and participate
in the class discussion.
Section materials will also be posted by the TFs. These
materials serve as the reference material for the course
content.
We will often post "Supplementary Reading" for a lecture, but this reading is strictly supplementary:
homework and final exam questions will be based purely on material covered in the lectures and sections.
The supplementary reading will sometimes come from the working draft of
the book "Reinforcement Learning Theory and
Algorithms", available
here.
Note that this is an advanced RL theory book, with much material out of the scope of this class, so we
will only use very select subsections of it as supplemental reading.
If you find typos or errors, please let the authors (e.g., Sham) know; they would appreciate it!
You can also self-study from the classic book "Reinforcement Learning: An Introduction", available here.

Schedule (tentative)


Date | Lecture | Slides | Section / Supplementary Reading
9/1/22 | Bandits: Introduction to RL and Bandits | Slides | Section 1 Solutions
9/6/22 | Bandits: Explore-Then-Commit (ETC), ε-greedy | Slides, Annotated Slides |
9/8/22 | Bandits: Upper Confidence Bound (UCB) | Slides, Annotated Slides | Section 2 Solutions
9/13/22 | Bandits: Instance-dependent regret for UCB | Slides, Annotated Slides |
9/15/22 | Bandits: Thompson Sampling | Slides | Section 3 Solutions
9/20/22 | Bandits: Gittins Index | Slides, Annotated Slides |
9/22/22 | Bandits: Contextual Bandits | Slides, Annotated Slides | Section 4 Solutions
9/27/22 | MDPs: Markov Decision Processes | Slides, Annotated Slides | AJKS: 1.1.1, 1.1.2
9/29/22 | MDPs: Markov Decision Processes (continued) | Slides, Annotated Slides | AJKS: 1.1.1, 1.1.2
10/4/22 | MDPs: Value Iteration | | AJKS: 1.1.1, 1.1.2
10/6/22 | MDPs: Policy Evaluation | | AJKS: 1.4.1
10/11/22 | MDPs: Policy Iteration | | AJKS: 1.4.2
10/13/22 | Control: Linear Quadratic Regulators (LQRs) | |
10/18/22 | Control: Optimal Control in LQRs | |
10/20/22 | Control: Control for Nonlinear Systems (Iterative LQR) | |
10/25/22 | Learning: Model-based RL w/ Generative Model | |
10/27/22 | Learning: Supervised Learning & Approximate Policy Iteration | |
11/1/22 | Learning: Approximate Policy Iteration & Performance Difference Lemma | |
11/3/22 | Learning: Conservative Policy Iteration | |
11/8/22 | Learning: (Stochastic) Gradient Descent & Policy Gradient | |
11/10/22 | Learning: Policy Gradient (continued) | |
11/15/22 | Learning: Trust Region and Natural PG | |
11/17/22 | Learning: NPG (continued) and Review | |
11/22/22 | Imitation Learning: Behavior Cloning | |
11/24/22 | No Class | |
11/29/22 | Imitation Learning: Interactive Learning w/ DAgger | |
12/1/22 | LAST DAY: DAgger (continued) | |