Tsallis-INF: Optimal Algorithms for Stochastic and Adversarial Multi-Armed Bandits

Yevgeny Seldin, University of Copenhagen, Copenhagen, Denmark.

Seldin started with a general introduction to online learning and then presented the Tsallis-INF algorithm, which achieves pseudo-regret that is optimal up to constant factors in both stochastic and adversarial multi-armed bandits, without prior knowledge of the regime or the time horizon. The algorithm also achieves the optimal regret guarantee in several intermediate regimes, including stochastically constrained adversarial bandits and stochastic bandits with adversarial corruptions. He presented an empirical evaluation of the algorithm, showing that it significantly outperformed UCB1 and EXP3 in stochastic environments. He also gave examples of adversarial environments in which UCB1 and Thompson Sampling exhibited almost linear regret, whereas Tsallis-INF suffered only logarithmic regret. Seldin also surveyed several follow-up works, including an optimal algorithm for adversarial bandits with arbitrary delays, and an algorithm for stochastic and adversar
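For readers who have not seen the algorithm, here is a minimal Python sketch of one common form of Tsallis-INF: follow-the-regularized-leader with the 1/2-Tsallis entropy regularizer and importance-weighted loss estimates. It is an illustration under stated assumptions, not the speaker's implementation. The learning rate schedule `eta = 2 / sqrt(t)`, the bisection solver for the normalization constant, the Bernoulli loss environment, and all function names are assumptions introduced for this example.

```python
import math
import random


def tsallis_inf_weights(cum_loss, eta, iters=60):
    """Sampling distribution w_i = 4 / (eta * (L_i - x))^2.

    The scalar x < min(L) is chosen so that the weights sum to one.
    Here x is found by bisection (an assumption of this sketch; a
    Newton iteration would also work).
    """
    k = len(cum_loss)
    min_loss = min(cum_loss)
    lo = min_loss - 2.0 * math.sqrt(k) / eta  # weights sum to at most 1 here
    hi = min_loss - 2.0 / eta                 # weights sum to at least 1 here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(4.0 / (eta * (l - mid)) ** 2 for l in cum_loss)
        if total > 1.0:
            hi = mid
        else:
            lo = mid
    x = 0.5 * (lo + hi)
    w = [4.0 / (eta * (l - x)) ** 2 for l in cum_loss]
    s = sum(w)
    return [wi / s for wi in w]  # renormalize away the residual bisection error


def run_tsallis_inf(mean_losses, horizon, seed=0):
    """Play a Bernoulli-loss bandit (hypothetical environment); return pull counts."""
    rng = random.Random(seed)
    k = len(mean_losses)
    cum_loss = [0.0] * k  # importance-weighted cumulative loss estimates
    pulls = [0] * k
    for t in range(1, horizon + 1):
        eta = 2.0 / math.sqrt(t)  # anytime schedule: no horizon needed (assumed constant)
        w = tsallis_inf_weights(cum_loss, eta)
        arm = rng.choices(range(k), weights=w)[0]
        loss = 1.0 if rng.random() < mean_losses[arm] else 0.0
        cum_loss[arm] += loss / w[arm]  # unbiased estimate; unplayed arms get 0
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(run_tsallis_inf([0.1, 0.9], horizon=2000))
```

The sketch never takes the time horizon or the type of environment as input. This mirrors the claim in the abstract that the algorithm needs no prior knowledge of the regime or the horizon.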
