Reward tampering problems and solutions in reinforcement learning: a causal influence diagram perspective
| dc.contributor.author | Everitt, Tom | |
| dc.contributor.author | Hutter, Marcus | |
| dc.contributor.author | Kumar, Ramana | |
| dc.contributor.author | Krakovna, Victoria | |
| dc.date.accessioned | 2023-11-07T03:26:49Z | |
| dc.date.issued | 2021 | |
| dc.date.updated | 2022-09-04T08:16:45Z | |
| dc.description.abstract | Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding? Or will sufficiently capable RL agents always find ways to bypass their intended objectives by shortcutting their reward signal? This question impacts how far RL can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we study when an RL agent has an instrumental goal to tamper with its reward process, and describe design principles that prevent instrumental goals for two different types of reward tampering (reward function tampering and RF-input tampering). Combined, the design principles can prevent reward tampering from being an instrumental goal. The analysis benefits from causal influence diagrams, which provide intuitive yet precise formalizations. | en_AU |
| dc.format.mimetype | application/pdf | en_AU |
| dc.identifier.issn | 0039-7857 | en_AU |
| dc.identifier.uri | http://hdl.handle.net/1885/305604 | |
| dc.language.iso | en_AU | en_AU |
| dc.publisher | Springer International Publishing AG | en_AU |
| dc.rights | © 2021 The authors | en_AU |
| dc.source | Synthese | en_AU |
| dc.subject | AGI safety | en_AU |
| dc.subject | Reinforcement learning | en_AU |
| dc.subject | Bayesian learning | en_AU |
| dc.subject | Causality | en_AU |
| dc.subject | Decision theory | en_AU |
| dc.subject | Causal influence diagrams | en_AU |
| dc.title | Reward tampering problems and solutions in reinforcement learning: a causal influence diagram perspective | en_AU |
| dc.type | Journal article | en_AU |
| local.bibliographicCitation.lastpage | 33 | en_AU |
| local.bibliographicCitation.startpage | 1 | en_AU |
| local.contributor.affiliation | Everitt, Tom, College of Engineering and Computer Science, ANU | en_AU |
| local.contributor.affiliation | Hutter, Marcus, College of Engineering and Computer Science, ANU | en_AU |
| local.contributor.affiliation | Kumar, Ramana, DeepMind, UK | en_AU |
| local.contributor.affiliation | Krakovna, Victoria, Google DeepMind | en_AU |
| local.contributor.authoruid | Everitt, Tom, u5210859 | en_AU |
| local.contributor.authoruid | Hutter, Marcus, u4350841 | en_AU |
| local.description.embargo | 2099-12-31 | |
| local.description.notes | Imported from ARIES | en_AU |
| local.identifier.absfor | 460202 - Autonomous agents and multiagent systems | en_AU |
| local.identifier.absfor | 461105 - Reinforcement learning | en_AU |
| local.identifier.absfor | 500306 - Ethical theory | en_AU |
| local.identifier.ariespublication | a383154xPUB19446 | en_AU |
| local.identifier.citationvolume | 198 | en_AU |
| local.identifier.doi | 10.1007/s11229-021-03141-4 | en_AU |
| local.identifier.scopusID | 2-s2.0-85104302114 | |
| local.identifier.thomsonID | WOS:000652095400002 | |
| local.publisher.url | https://link.springer.com/ | en_AU |
| local.type.status | Published Version | en_AU |
Downloads
Original bundle
- Name: s11229-021-03141-4.pdf
- Size: 968.31 KB
- Format: Adobe Portable Document Format