ML Researcher • AI Alignment
Machine learning researcher interested in AI alignment, neural networks, and music transcription. Currently focused on provably hard cases for AI alignment methods.
, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Christian Schroeder de Witt
Demonstrates how backdoors can be integrated into transformer models without being detected, calling into question the reliability of pre-deployment detection strategies.
Paper
Raymond Douglas, Jacek Karwowski, Chan Bae,
, Victoria Krakovna
Outlines structural reasons why predictive models can fail when turned into agents, including auto-suggestive delusions and predictor-policy incoherence.
Paper