Reward tampering problems and solutions in reinforcement learning: a causal influence diagram perspective

dc.contributor.author: Everitt, Tom
dc.contributor.author: Hutter, Marcus
dc.contributor.author: Kumar, Ramana
dc.contributor.author: Krakovna, Victoria
dc.date.accessioned: 2023-11-07T03:26:49Z
dc.date.issued: 2021
dc.date.updated: 2022-09-04T08:16:45Z
dc.description.abstract: Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding? Or will sufficiently capable RL agents always find ways to bypass their intended objectives by shortcutting their reward signal? This question impacts how far RL can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we study when an RL agent has an instrumental goal to tamper with its reward process, and describe design principles that prevent instrumental goals for two different types of reward tampering (reward function tampering and RF-input tampering). Combined, the design principles can prevent reward tampering from being an instrumental goal. The analysis benefits from causal influence diagrams to provide intuitive yet precise formalizations.
dc.format.mimetype: application/pdf
dc.identifier.issn: 0039-7857
dc.identifier.uri: http://hdl.handle.net/1885/305604
dc.language.iso: en_AU
dc.publisher: Springer International Publishing AG
dc.rights: © 2021 The authors
dc.source: Synthese
dc.subject: AGI safety
dc.subject: Reinforcement learning
dc.subject: Bayesian learning
dc.subject: Causality
dc.subject: Decision theory
dc.subject: Causal influence diagrams
dc.title: Reward tampering problems and solutions in reinforcement learning: a causal influence diagram perspective
dc.type: Journal article
local.bibliographicCitation.lastpage: 33
local.bibliographicCitation.startpage: 1
local.contributor.affiliation: Everitt, Tom, College of Engineering and Computer Science, ANU
local.contributor.affiliation: Hutter, Marcus, College of Engineering and Computer Science, ANU
local.contributor.affiliation: Kumar, Ramana, DeepMind, UK
local.contributor.affiliation: Krakovna, Victoria, Google DeepMind
local.contributor.authoruid: Everitt, Tom, u5210859
local.contributor.authoruid: Hutter, Marcus, u4350841
local.description.embargo: 2099-12-31
local.description.notes: Imported from ARIES
local.identifier.absfor: 460202 - Autonomous agents and multiagent systems
local.identifier.absfor: 461105 - Reinforcement learning
local.identifier.absfor: 500306 - Ethical theory
local.identifier.ariespublication: a383154xPUB19446
local.identifier.citationvolume: 198
local.identifier.doi: 10.1007/s11229-021-03141-4
local.identifier.scopusID: 2-s2.0-85104302114
local.identifier.thomsonID: WOS:000652095400002
local.publisher.url: https://link.springer.com/
local.type.status: Published Version

Downloads

Original bundle

Name: s11229-021-03141-4.pdf
Size: 968.31 KB
Format: Adobe Portable Document Format