Redteaming AI security research agendas
, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Christian Schroeder de Witt
Demonstrates how backdoors can be seamlessly integrated into transformer models, calling pre-deployment detection strategies into question.
Paper
Raymond Douglas, Jacek Karwowski, Chan Bae,
, Victoria Krakovna
Outlines structural reasons why predictive models can fail when turned into agents, including auto-suggestive delusions and predictor-policy incoherence.
Paper