Important Dates


Proposal due: 11/20/2023

Milestone report due: 12/4/2023

Final report due: 12/15/2023

Grading


Proposal: 10%

Milestone report: 20%

Final report: 70%

Policies

Projects must be undertaken in groups of 2 or 3 people; individual projects are not permitted. All projects must have an empirical component: students can implement algorithms on MDPs they construct, replicate prior work on small MDPs, or use simulators or real data. Students who construct their own MDPs should think carefully about which class of MDPs would be interesting to study.

Report, Milestone, and Proposal Formats: All submissions must use the NeurIPS LaTeX format.

Proposal: Your proposal should be at most 2 pages (in the NeurIPS format) and must state your project title and team members. It should also include preliminary potential formulations, a timeline of the steps needed to reach the milestone and complete the project, a description of the data you intend to use (e.g., a simulator, real/offline data, or synthetic MDPs you construct), and a brief discussion of the algorithms you plan to explore, along with your implementation strategy. Make sure your timeline leaves time for understanding the algorithmic approaches you plan to use.

Milestone Report: Your milestone submission should be at most 4 pages, and you may reuse any relevant material from it in your final report. In your milestone, restate your problem formulation as it currently stands, including a clear outline of the RL aspects under investigation, your motivation for choosing them, and related work if applicable. Also describe the data you have gathered, whether synthetic or real, and give an overview of your preliminary code development or any experiments run to date. Place particular emphasis on demonstrating that you have thought through how your project will explore some RL component.

Final Report: Your final report should be a maximum of 9 pages, excluding references. It will be evaluated based on the following criteria:

  • Merit: Is the question well motivated, and do you have sound reasoning for your approach? Are you taking a justifiably simple approach, or, if you choose a more complicated method, do you have sound reasoning for doing so?
  • Technical depth: How technically challenging was what you did? Did you use a package or write your own code? It is fine if you use a package, though this means other aspects of your project must be more ambitious.
  • Presentation: Does your report comprehensively explain your methodology, results, and interpretations? Did you incorporate effective graphs and visualizations? How clear is your writing? Have you justified your chosen approach?

Project Ideas

We provide a few project ideas below. Please remember that all reports must have an empirical component.


Multi-agent Reinforcement Learning: Explore what happens when multiple agents act in the same environment, each with its own reward. In some cases, such as games, the agents may directly oppose each other; see, e.g., Zhang et al.

Restless Multi-Armed Bandits: Explore the "restless MAB" setting, where each arm also has a state that changes over time; see, e.g., Qian et al.
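
As an illustrative starting point (not course-provided code), the sketch below simulates a small restless bandit in which every arm's state evolves according to its own two-state Markov chain at every step, whether or not the arm is pulled; the transition matrices, reward values, and random policy are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                                      # number of arms (placeholder)
# Each arm has a 2-state Markov chain that evolves even when the arm is not pulled.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # arm 0: rows = current state, cols = next state
              [[0.7, 0.3], [0.4, 0.6]],    # arm 1
              [[0.5, 0.5], [0.5, 0.5]]])   # arm 2
reward_of_state = np.array([0.0, 1.0])     # reward for pulling an arm in state 0 or 1

states = np.zeros(K, dtype=int)            # all arms start in state 0
T, total_reward = 1000, 0.0

for t in range(T):
    arm = rng.integers(K)                  # placeholder policy: pull a uniformly random arm
    total_reward += reward_of_state[states[arm]]
    for k in range(K):                     # "restless": every arm transitions at every step
        states[k] = rng.choice(2, p=P[k, states[k]])

print(f"average reward of random policy: {total_reward / T:.3f}")
```

A natural first experiment is to replace the random policy with an index policy (e.g., a Whittle-index heuristic) and compare average reward.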

2048 Gameplay with MDPs: Explore how the game 2048 can be studied with MDPs; see, e.g., Li et al., Goenawan et al., Sijtsma.

Flappy Bird Gameplay with RL: Explore how the game Flappy Bird can be studied with reinforcement learning; see, e.g., Blog Post.

Continuous MDPs: Most of the RL methods we have covered in class concern discrete state and action spaces. However, in many RL settings the state and/or action space may in fact be continuous. In this setting we have two choices: discretize the state and/or action space, or use an RL method that supports continuous state and/or action spaces. This project would involve setting up some continuous MDP environments and investigating when discretization of state spaces works "well" or "badly" in terms of final performance, training speed, compute, etc. You could also focus on comparing different discretization methods, or on continuous methods alone, rather than studying both.
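
To make the discretization option concrete, here is a minimal self-contained sketch (illustrative code, not from the course): tabular Q-learning on a toy 1-D continuous environment, where the state is binned into a grid index before table lookup. The environment, bin counts, and learning parameters are all arbitrary placeholders; the point is only to show where the discretization enters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D continuous environment (placeholder): state in [0, 1], goal region near 0.9.
def step(x, action, noise=0.02):
    x = np.clip(x + (0.05 if action == 1 else -0.05) + rng.normal(0, noise), 0.0, 1.0)
    reward = 1.0 if abs(x - 0.9) < 0.05 else 0.0
    return x, reward, reward > 0

def discretize(x, n_bins):
    """Map a continuous state in [0, 1] to a bin index in {0, ..., n_bins - 1}."""
    return min(int(x * n_bins), n_bins - 1)

def q_learning(n_bins, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    Q = np.zeros((n_bins, 2))
    for _ in range(episodes):
        x = rng.uniform(0.0, 0.5)
        for _ in range(200):                       # cap episode length
            s = discretize(x, n_bins)
            a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
            x, r, done = step(x, a)
            s2 = discretize(x, n_bins)
            Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
            if done:
                break
    return Q

# Compare coarse vs. fine discretizations of the same continuous problem.
for n_bins in (5, 20, 100):
    Q = q_learning(n_bins)
    print(n_bins, "bins -> learned state values:", np.round(Q.max(axis=1), 2)[:5], "...")
```

Swapping in standard continuous-control environments (e.g., Pendulum or MountainCar from Gymnasium) and sweeping the number of bins per dimension would turn this toy into the comparison described above.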

Aggregated Delayed Reward MAB: In the multi-armed bandit setting, the learner pulls one of K arms and aims to minimize its regret. In some settings, however, the reward of each individual pull is not observed; feedback is aggregated across arms or delayed. The goal is to extract the signal of the optimal action even under such aggregated or delayed feedback. This project would involve investigating existing methods for aggregated delayed-reward MABs and implementing them in example environments. Possible extensions include finding new algorithms for this task and deriving their regret bounds, or applying this framework to new aggregate-reward scenarios.
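
As a toy illustration of the feedback model (placeholder code, not a proposed method), the sketch below draws per-pull rewards but only reveals their sum once per batch; the naive credit assignment that splits the aggregate evenly across the pulls in the batch is exactly the kind of baseline a project could try to improve on. The arm means, batch size, and epsilon-greedy policy are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4
true_means = np.array([0.2, 0.5, 0.7, 0.4])   # unknown to the learner (placeholder values)
batch = 10                                     # feedback arrives only as a sum over each batch

pulls = np.zeros(K)
value_estimates = np.zeros(K)
buffer_arms, buffer_rewards = [], []

for t in range(2000):
    # Placeholder policy: epsilon-greedy on the (coarse) per-arm estimates.
    arm = rng.integers(K) if rng.random() < 0.1 else int(value_estimates.argmax())
    buffer_arms.append(arm)
    buffer_rewards.append(rng.binomial(1, true_means[arm]))   # drawn now, revealed later

    if len(buffer_arms) == batch:
        # Only the aggregate is observed; naively credit it equally to every pull in the batch.
        aggregate = sum(buffer_rewards)
        for a in buffer_arms:
            pulls[a] += 1
            value_estimates[a] += (aggregate / batch - value_estimates[a]) / pulls[a]
        buffer_arms, buffer_rewards = [], []

print("estimates:", np.round(value_estimates, 2), " true means:", true_means)
```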

Tabular MDPs: Come up with some interesting (and difficult) candidate tabular MDPs and test some example algorithms. What are hard tabular MDPs? Hard for which algorithms?
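
One way to get started (an illustrative sketch with an arbitrary random-MDP construction): generate candidate tabular MDPs programmatically and use value iteration as a ground-truth baseline against which learning algorithms can be judged.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_tabular_mdp(n_states=10, n_actions=3):
    """Generate a random tabular MDP (placeholder construction)."""
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over next states
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # expected immediate rewards
    return P, R

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

P, R = random_tabular_mdp()
V, policy = value_iteration(P, R)
print("optimal values:", np.round(V, 3))
print("greedy policy:", policy)
```

Replacing the random construction with hand-designed structure (long corridors, sparse rewards, misleading local optima) is where the "hard for which algorithms?" question becomes interesting.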

Contextual MAB: We will cover contextual bandits later in the course, but the key idea is that each arm has a "context" vector that has some relationship to the unknown distribution parameter. Choose an application of MAB that has some form of context. Either find real data or simulate synthetic data in this setting. How much does having the context improve algorithm performance? What are the best algorithms you can find (either covered in class or from other sources) for contextual MAB in your setting? What happens if you add adversarial or random corruptions to your contexts?
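
One widely used baseline for this setting is LinUCB with disjoint linear models (one ridge-regression estimate per arm). The sketch below runs it on synthetic contexts; the dimensions, noise level, and exploration parameter alpha are arbitrary placeholder choices, not course-specified values.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K, T = 5, 4, 5000
theta_true = rng.normal(size=(K, d))           # hidden parameter per arm (synthetic)

alpha = 1.0                                    # exploration strength (placeholder)
A = np.stack([np.eye(d) for _ in range(K)])    # A_k = I + sum of x x^T for pulls of arm k
b = np.zeros((K, d))

regret = 0.0
for t in range(T):
    x = rng.normal(size=d)                     # context observed this round
    expected = theta_true @ x                  # true expected rewards (for regret bookkeeping only)

    ucb = np.empty(K)
    for k in range(K):
        A_inv = np.linalg.inv(A[k])
        theta_hat = A_inv @ b[k]
        ucb[k] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
    arm = int(ucb.argmax())

    reward = expected[arm] + rng.normal(0, 0.1)
    A[arm] += np.outer(x, x)
    b[arm] += reward * x
    regret += expected.max() - expected[arm]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Corrupting x before it is shown to the learner (while keeping the clean context for the regret bookkeeping) is one simple way to study the corruption question above.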

Miscellaneous Project Ideas: Explore examples with one of the advanced topics from class: MCTS, Imitation Learning, Policy Gradient Approaches, etc.

Miscellaneous Project Ideas: For more project idea inspiration, please check out the projects from previous reinforcement learning courses. Some of these are graduate courses, but you can still take inspiration from their projects. Course 1, Course 2