# Biography

I am a first-year PhD student in reinforcement learning at Facebook AI Research Paris and in the SequeL team at Inria Lille, under the supervision of Alessandro Lazaric and Michal Valko.

### Interests

• Reinforcement Learning (RL)
• Unsupervised RL
• Goal-oriented RL

### Education

• MSc in Machine Learning (MVA), 2018

École Normale Supérieure Paris-Saclay

• MSc in Applied Mathematics, 2017

École polytechnique

• BSc in Applied Mathematics, 2016

École polytechnique

# Experience

#### Research Intern

May 2018 – Sep 2018 Paris, France
– Research project in reinforcement learning
– Active exploration in Markov decision processes
– Supervisor: Alessandro Lazaric

#### Cornell University

Apr 2017 – Aug 2017 Ithaca, NY, USA
– Research project in operations research and statistics
– Incorporating correlation in travel time prediction for route planning in stochastic networks
– Supervisor: Samitha Samaranayake

#### Infosys

Jun 2016 – Aug 2016 Bangalore, India
– Research project in NLP
– Extraction and visualization of lexical ontologies from OSS incident ticket data
– Supervisor: Suman Roy

# Publications

### No-Regret Exploration in Goal-Oriented Reinforcement Learning

Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP problems, with most of the theoretical literature focusing on different problems (i.e., fixed-horizon and infinite-horizon) or making the restrictive loop-free SSP assumption (i.e., no state can be visited twice during an episode). In this paper, we study the general SSP problem with no assumption on its dynamics (some policies may actually never reach the goal). We introduce UC-SSP, the first no-regret algorithm in this setting, and prove a regret bound scaling as $\widetilde{O}(D S \sqrt{ A D K})$ after $K$ episodes for any unknown SSP with $S$ states, $A$ actions, positive costs and SSP-diameter $D$, defined as the smallest expected hitting time from any starting state to the goal. We achieve this result by crafting a novel stopping rule, such that UC-SSP may interrupt the current policy if it is taking too long to achieve the goal and switch to alternative policies that are designed to rapidly terminate the episode.
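The stopping-rule idea at the heart of UC-SSP can be illustrated with a toy episode loop. This is a minimal sketch, not the paper's algorithm: the `run_episode`, `env_step`, `policy`, and `fallback_policy` names, the chain environment, and the fixed `pivot_horizon` are all illustrative assumptions standing in for the data-dependent switching condition and the learned policies of the actual method.

```python
def run_episode(env_step, policy, fallback_policy, start, goal, pivot_horizon):
    """Toy episode loop with a UC-SSP-style stopping rule: if the current
    policy has not reached the goal within `pivot_horizon` steps, switch
    to a fallback policy designed to terminate the episode quickly."""
    state, steps, cost = start, 0, 0.0
    while state != goal:
        # Switch policies once the interruption threshold is exceeded.
        action = policy(state) if steps < pivot_horizon else fallback_policy(state)
        state, c = env_step(state, action)
        cost += c
        steps += 1
    return cost, steps

# Toy deterministic chain: states 0..3, "right" moves one step toward the
# goal, any other action stays put; every step costs 1.
def env_step(state, action):
    return (state + 1 if action == "right" else state), 1.0

cost, steps = run_episode(env_step,
                          policy=lambda s: "stay",        # never reaches the goal
                          fallback_policy=lambda s: "right",
                          start=0, goal=3, pivot_horizon=5)
# Without the stopping rule this episode would never terminate; with it,
# the fallback policy takes over after 5 wasted steps and finishes.
```

The key point the sketch captures is that in general SSP problems some policies may never reach the goal, so an episode cannot simply be run to completion; interrupting and switching is what keeps the regret controlled.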

### Active Exploration in Markov Decision Processes

We introduce the active exploration problem in Markov decision processes (MDPs). Each state of the MDP is characterized by a random value and the learner should gather samples to estimate the mean value of each state as accurately as possible. Similarly to active exploration in multi-armed bandit (MAB), states may have different levels of noise, so that the higher the noise, the more samples are needed. As the noise level is initially unknown, we need to trade off the exploration of the environment to estimate the noise and the exploitation of these estimates to compute a policy maximizing the accuracy of the mean predictions. We introduce a novel learning algorithm to solve this problem showing that active exploration in MDPs may be significantly more difficult than in MAB. We also derive a heuristic procedure to mitigate the negative effect of slowly mixing policies. Finally, we validate our findings on simple numerical simulations.