CS/Stat 184: Introduction to Reinforcement Learning

Enrollment information: This year, because it is a new course, CS/Stat 184 will have its enrollment capped at 120. We will respond to all petitions the morning of August 26, so in order to ensure consideration for a spot in the course, submit an enrollment petition (it can be blank) before August 26. On August 26, we will approve petitions in the following order: (1) any senior undergraduate, (2) any other undergraduate who has a concentration or secondary in CS or Statistics, in decreasing order of seniority, (3) any other undergraduates, in decreasing order of seniority, (4) all graduate students (note this is an undergraduate class, hence priority is given to undergraduates). Any ties will be broken randomly among petitions with equal priority. If your petition is not approved on August 26, you may still be able to take the class; if spots open up as students drop, we will enroll further students in the same order as above.

Modern Artificial Intelligence (AI) systems often need to make sequential decisions in an unknown, uncertain, possibly hostile environment, by actively interacting with the environment to collect relevant data. Reinforcement Learning (RL) is a general framework that captures this interactive learning setting and has been used to design intelligent agents that achieve super-human performance on challenging tasks such as Go, computer games, and robotic manipulation.

This course focuses on the basics of Reinforcement Learning. The four main parts of the course are (1) multi-armed bandits, (2) planning and control in MDPs, (3) learning in large MDPs (function approximation), and (4) advanced topics.

After taking this course, students will be able to understand fundamental RL algorithms and their analysis.

All lectures will be math heavy. We will go through algorithms and their analysis. All homework will have a programming component to give students more hands-on experience with the concepts.

Staff and Organization

Instructors: Lucas Janson, Sham Kakade

TFs: Daniel Garces, Kuanhao Jiang, Yanke Song

CAs: Alex Dazhen Cai, Howie Guo, Angela Yilin Li, Richard Qiu, Eric Meng Shen, Lara Zeng, Saba Zerefa

Lecture time: Tuesday/Thursday 10:30am - 11:45am

Lecture location: Maxwell Dworkin G115

**Please double check the website for office hour location changes/cancellations before you arrive.**

Instructor office hours:

Lucas Janson: Mon 1:30-2:30pm, SC 710

Sham Kakade: Thu 4-5pm, SEC 4.410

TF and CA office hours:

Alex, Howie & Richard: Thu 3-5pm, Sever Hall 110

Angela & Eric: Sun 2-4pm, Quincy Dhall

Daniel: Fri 11am-noon, SEC 1.414

Kuanhao: Tue & Thu 2:30-3:30pm, Sever Hall 214

Lara & Saba: Tue 8:30-10:30pm, Lowell Dhall

Yanke: Mon & Wed 5:30-6:30pm, SC 222

Discussion: Ed discussion board

Contact Info:

Please communicate with the instructors only by making a private post to the "instructors" on Ed. Any course-related email sent directly to the instructors will not be responded to in a timely manner.


Please make sure you monitor for (and receive) announcements from both the official class mailing list and from Ed. Ed is a convenient way to send out some announcements, such as homework corrections and clarifications. It is important for you to make sure you get these announcements in a timely manner.


Lectures will be mathematically oriented, where we focus on algorithm design and analysis. We require a background in: calculus & linear algebra (e.g., AM22a, Math 21b), probability theory (e.g., Stat 110), and programming in Python. The following topics are recommended but not required: linear regression, supervised learning, algorithms.

Homeworks will have a programming component, and we expect students to be comfortable with programming in Python (or committed to quickly learning it). We will use Python as the programming language in all HWs.

Grading Policies

Assignments 70% (HW0: 6%, HW1-HW4: 16% each); Final 30%
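As an illustration of how the weights above combine, here is a minimal sketch in Python. The function name `course_score` and the example scores are hypothetical, and the sketch assumes each component is scored out of 100; it does not account for the embedded ethics adjustment mentioned below or for any mapping to letter grades.

```python
# Weights as stated in the grading policy (percent of the course grade).
WEIGHTS = {"HW0": 6, "HW1": 16, "HW2": 16, "HW3": 16, "HW4": 16, "Final": 30}

# Sanity check: the components should account for the whole grade.
assert sum(WEIGHTS.values()) == 100


def course_score(scores):
    """Return the weighted course percentage.

    `scores` maps each component name in WEIGHTS to a score in [0, 100].
    (Hypothetical helper for illustration only.)
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Made-up scores, purely for illustration.
example = {"HW0": 100, "HW1": 90, "HW2": 80, "HW3": 85, "HW4": 95, "Final": 88}
print(course_score(example))
```

With these made-up scores, the assignments contribute 62 points out of 70 and the final contributes 26.4 out of 30.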

The course is letter-graded by default, but you may switch to SAT/UNSAT if you prefer.

In order to pass the course, you must attempt and submit all homeworks, even if they are submitted for zero credit (as per the late day policy below).

We will also have an "embedded ethics" lecture with 1-2 corresponding questions, either incorporated into a homework or as a standalone short assignment (with the grading scheme adjusted appropriately).

All homeworks are mathematical and have a programming component (we use Python and OpenAI Gym). The final exam covers concepts and algorithms, and it does not contain a programming component.

Homework Policies: Collaboration is permitted, though each student must understand, write, and hand in their own submission. In particular, it is acceptable for students to discuss problems with each other; it is not acceptable for students to look at another student's written answers. It is also not acceptable to publicly post your (partial) solution on Ed, but you are encouraged to ask public questions on Ed. You must also indicate on each homework with whom you collaborated and what online resources you used.

Each student will have 96 cumulative hours of late time (as measured on Canvas), which will be forgiven. After this cumulative amount of time has passed, any assignment that is turned in late will receive zero credit. Furthermore, only up to 48 hours of late time may be used on any one assignment; any assignment turned in more than 48 hours late will receive zero credit.
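The following Python sketch shows one possible reading of the late policy. It is an illustration, not the official accounting (late time is measured on Canvas). It assumes that assignments are processed in submission order, that a late assignment keeps credit only if it is at most 48 hours late and fits within the remaining 96-hour budget, and that an assignment receiving zero credit does not consume any budget. The last two assumptions are not stated in the policy, and the function name `apply_late_policy` is hypothetical.

```python
TOTAL_BUDGET_HOURS = 96        # cumulative late time forgiven (from the policy)
PER_ASSIGNMENT_CAP_HOURS = 48  # most late time usable on one assignment


def apply_late_policy(hours_late_by_assignment):
    """Return {assignment: True if it keeps credit, False if zero credit}.

    `hours_late_by_assignment` maps assignment names to hours late
    (0 if on time), listed in submission order.
    """
    used = 0.0
    keeps_credit = {}
    for name, hours_late in hours_late_by_assignment.items():
        if hours_late <= 0:
            keeps_credit[name] = True  # on time: no late hours needed
            continue
        within_cap = hours_late <= PER_ASSIGNMENT_CAP_HOURS
        within_budget = used + hours_late <= TOTAL_BUDGET_HOURS
        if within_cap and within_budget:
            used += hours_late
            keeps_credit[name] = True
        else:
            # ASSUMPTION: a zero-credit submission does not use up budget.
            keeps_credit[name] = False
    return keeps_credit


# Made-up lateness values, purely for illustration.
example = {"HW0": 0, "HW1": 40, "HW2": 50, "HW3": 48, "HW4": 10}
print(apply_late_policy(example))
```

In this example, HW2 loses credit because it exceeds the 48-hour cap, and HW4 loses credit because HW1 and HW3 have already used 88 of the 96 hours.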

We highly encourage you to use LaTeX. We will also accept neatly handwritten homework.

Homeworks must be submitted through Gradescope. PDF files of the homeworks can be accessed on Gradescope. PDF and LaTeX files for the homeworks will also be uploaded to Canvas.

Regrading Policy: All homework regrading requests must be submitted on Gradescope within seven days after the grades are released. For example, if we return the grades on Monday, then you have until midnight the following Monday to submit any regrade requests. If you feel that we have made an error in grading your homework, please let us know with a written explanation. This policy is to ensure that we can address any concerns in a timely and fair manner. Office hours and in-person discussions are limited to questions about the course material. Grade-related questions must be submitted to the course mailing list.

Final Exam: If you are not able to make the final exam on the official date (and do not have an exception based on university policies), then please do not enroll in the course. The course is in Exam Group FAS14_B.

Diversity and Inclusiveness

While many academic disciplines have historically been dominated by one cross section of society, the study of and participation in STEM disciplines is a joy that the instructors hope that everyone can pursue, regardless of their socio-economic background, race, gender, etc. We encourage students to both be mindful of these issues, and, in good faith, try to take steps to fix them. You are the next generation here.

You should expect to be treated by your classmates and the course staff with respect. You belong here, and we are here to help you learn and enjoy this course. If any incident occurs that challenges this commitment to a supportive and inclusive environment, please let the instructors know so that the issue can be addressed. We are personally committed to this and subscribe to Harvard's Values for Inclusion.

Honor Code

  • Collaborations only where explicitly allowed.
  • Do not use forums like Course Hero, Chegg, etc.
  • Properly cite any outside materials you use for your HWs. If you are unclear about whether some online material can be used, please ask the course staff first.
  • No sharing of your solutions within or outside class at any time.
  • The above is not an exhaustive list, and in general, common sense rules about academic integrity apply. If you are in doubt about something, please ask us whether it is OK before you do it. Also see the Harvard College Honor Code.

    Course Materials

    Slides will be posted before each lecture, and annotated slides (with all notes taken on them by the instructor during lecture) will be posted after each lecture. We will make reasonable attempts to record and post each lecture. It is possible that some lectures will not be recorded, in which case we will not be able to do any make-ups of that lecture. We encourage students to attend the lectures in person and participate in the class discussion.

    Section materials will also be posted by the TFs. These materials serve as the reference material for the course content.

    We will often post "Supplementary Reading" for a lecture, but this reading is strictly supplementary: homework and final exam questions will be based purely on material covered in the lectures and sections.

    The supplementary reading will sometimes come from the working draft of the book "Reinforcement Learning Theory and Algorithms", available here. Note that this is an advanced RL theory book, with much material out of the scope of this class, so we will only use very select subsections of it as supplemental reading. If you find typos or errors, please let the authors (e.g., Sham) know--they would appreciate it!

    You can also self-study from the classic book "Reinforcement Learning: An Introduction", available here.

    Schedule (tentative)

    | Date | Lecture | Slides | Supplementary Reading |
    |---|---|---|---|
    | 9/1/22 | Bandits: Introduction to RL and Bandits | Slides, Section_1 | |
    | 9/6/22 | Bandits: Explore-Then-Commit (ETC), ε-greedy | Slides, Annotated Slides | |
    | 9/8/22 | Bandits: Upper Confidence Bound (UCB) | Slides, Annotated Slides | |
    | 9/13/22 | Bandits: Instance-dependent regret for UCB | Slides, Annotated Slides | |
    | 9/15/22 | Bandits: Thompson Sampling | Slides, Annotated Slides | |
    | 9/20/22 | Bandits: Gittins Index | Slides, Annotated Slides | |
    | 9/22/22 | Bandits: Contextual Bandits | Slides, Annotated Slides | |
    | 9/27/22 | MDPs: Markov Decision Processes | Slides, Annotated Slides | AJKS: 1.1.1, 1.1.2 |
    | 9/29/22 | MDPs: Markov Decision Processes (Continue) | Slides, Annotated Slides | AJKS: 1.1.1, 1.1.2 |
    | 10/4/22 | MDPs: Value Iteration | Slides, Annotated Slides | AJKS: 1.1.1, 1.1.2 |
    | 10/6/22 | MDPs: Policy Evaluation and Policy Iteration | Slides, Annotated Slides | AJKS: 1.4.1 |
    | 10/11/22 | Control: Linear Quadratic Regulator (LQRs) & Finite Horizon MDPs | Slides, Annotated Slides | |
    | 10/13/22 | Control: Optimal Control in LQRs | Slides, Annotated Slides | |
    | 10/18/22 | Control: Control for Nonlinear systems (Iterative LQR) | Slides, Annotated Slides | |
    | 10/20/22 | Learning: (Stochastic) Gradient Descent & Policy Gradient | Slides, Annotated Slides | |
    | 10/25/22 | Learning: PG Continue | Slides, Annotated Slides | AJKS: 11.1 |
    | 10/27/22 | Learning: PG w/ Baselines & Supervised Fitting Methods | Slides, Annotated Slides | AJKS: 11.3 |
    | 11/1/22 | Learning: Supervised Learning and fitted Dynamic Programming | Slides, Annotated Slides | |
    | 11/3/22 | Learning: Trust Region and Natural PG | Slides, Annotated Slides | |
    | 11/8/22 | No class | | |
    | 11/10/22 | Learning: NPG | Slides, Annotated Slides | AJKS: 12.4, 14.2 |
    | 11/15/22 | Embedded EthiCS | Guided Worksheet, Section_11 | |
    | 11/17/22 | Imitation Learning: Behavior Cloning | Slides, Annotated Slides | |
    | 11/22/22 | Imitation Learning: Interactive Learning w/ DAgger | Slides, Annotated Slides | |
    | 11/24/22 | No Class | | |
    | 11/29/22 | Contextual Bandits | Slides, Annotated Slides | |
    | 12/1/22 | Exploration in MDPs and UCB-VI | Slides, Annotated Slides | |