AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper,...
24 - Superalignment with Jan Leike
Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top researchers, attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com/
Topics we discuss, and timestamps:
0:00:37 - The superalignment team
0:02:10 - What's a human-level automated alignment researcher?
0:06:59 - The gap between human-level automated alignment researchers and superintelligence
0:18:39 - What does it do?
0:24:13 - Recursive self-improvement
0:26:14 - How to make the AI AI alignment researcher
0:30:09 - Scalable oversight
0:44:38 - Searching for bad behaviors and internals
0:54:14 - Deliberately training misaligned models
1:02:34 - Four year deadline
1:07:06 - What if it takes longer?
1:11:38 - The superalignment team and...
1:11:38 - ... governance
1:14:37 - ... other OpenAI teams
1:18:17 - ... other labs
1:26:10 - Superalignment team logistics
1:29:17 - Generalization
1:43:44 - Complementary research
1:48:29 - Why is Jan optimistic?
1:58:32 - Long-term agency in LLMs?
2:02:44 - Do LLMs understand alignment?
2:06:01 - Following Jan's research
The transcript: axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html
Links for Jan and OpenAI:
OpenAI jobs: openai.com/careers
Jan's substack: aligned.substack.com
Jan's twitter: twitter.com/janleike
Links to research and other writings we discuss:
Introducing Superalignment: openai.com/blog/introducing-superalignment
Let's Verify Step by Step (process-based feedback on math): arxiv.org/abs/2305.20050
Planning for AGI and beyond: openai.com/blog/planning-for-agi-and-beyond
Self-critiquing models for assisting human evaluators: arxiv.org/abs/2206.05802
An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143
Language models can explain neurons in language models: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
Our approach to alignment research: openai.com/blog/our-approach-to-alignment-research
Training language models to follow instructions with human feedback (aka the Instruct-GPT paper): arxiv.org/abs/2203.02155
7/27/2023
2:08:29
23 - Mechanistic Anomaly Detection with Mark Xu
Is there some way we can detect bad behaviour in our AI system without having to know exactly what it looks like? In this episode, I speak with Mark Xu about mechanistic anomaly detection: a research direction based on the idea of detecting strange things happening in neural networks, in the hope that that will alert us to potential treacherous turns. We talk about both the core problems of relating these mechanistic anomalies to bad behaviour and the paper "Formalizing the presumption of independence", which formulates the problem of formalizing heuristic mathematical reasoning, in the hope that this will let us mathematically define "mechanistic anomalies".
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com/
Topics we discuss, and timestamps:
0:00:38 - Mechanistic anomaly detection
0:09:28 - Are all bad things mechanistic anomalies, and vice versa?
0:18:12 - Are responses to novel situations mechanistic anomalies?
0:39:19 - Formalizing "for the normal reason, for any reason"
1:05:22 - How useful is mechanistic anomaly detection?
1:12:38 - Formalizing the Presumption of Independence
1:20:05 - Heuristic arguments in physics
1:27:48 - Difficult domains for heuristic arguments
1:33:37 - Why not maximum entropy?
1:44:39 - Adversarial robustness for heuristic arguments
1:54:05 - Other approaches to defining mechanisms
1:57:20 - The research plan: progress and next steps
2:04:13 - Following ARC's research
The transcript: axrp.net/episode/2023/07/24/episode-23-mechanistic-anomaly-detection-mark-xu.html
ARC links:
Website: alignment.org
Theory blog: alignment.org/blog
Hiring page: alignment.org/hiring
Research we discuss:
Formalizing the presumption of independence: arxiv.org/abs/2211.06738
Eliciting Latent Knowledge (aka ELK): alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge
Mechanistic Anomaly Detection and ELK: alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk
Can we efficiently explain model behaviours? alignmentforum.org/posts/dQvxMZkfgqGitWdkb/can-we-efficiently-explain-model-behaviors
Can we efficiently distinguish different mechanisms? alignmentforum.org/posts/JLyWP2Y9LAruR2gi9/can-we-efficiently-distinguish-different-mechanisms
7/27/2023
2:05:52
Survey, store closing, Patreon
Very brief survey: bit.ly/axrpsurvey2023
Store is closing in a week! Link: store.axrp.net/
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
6/28/2023
4:26
22 - Shard Theory with Quintin Pope
What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode's guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles
Topics we discuss, and timestamps:
0:00:42 - Why understand human value formation?
0:19:59 - Why not design methods to align to arbitrary values?
0:27:22 - Postulates about human brains
0:36:20 - Sufficiency of the postulates
0:44:55 - Reinforcement learning as conditional sampling
0:48:05 - Compatibility with genetically-influenced behaviour
1:03:06 - Why deep learning is basically what the brain does
1:25:17 - Shard theory
1:38:49 - Shard theory vs expected utility optimizers
1:54:45 - What shard theory says about human values
2:05:47 - Does shard theory mean we're doomed?
2:18:54 - Will nice behaviour generalize?
2:33:48 - Does alignment generalize farther than capabilities?
2:42:03 - Are we at the end of machine learning history?
2:53:09 - Shard theory predictions
2:59:47 - The shard theory research community
3:13:45 - Why do shard theorists not work on replicating human childhoods?
3:25:53 - Following shardy research
The transcript
Shard theorist links:
Quintin's LessWrong profile
Alex Turner's LessWrong profile
Shard theory Discord
EleutherAI Discord
Research we discuss:
The Shard Theory Sequence
Pretraining Language Models with Human Preferences
Inner alignment in salt-starved rats
Intro to Brain-like AGI Safety Sequence
Brains and transformers:
The neural architecture of language: Integrative modeling converges on predictive processing
Brains and algorithms partially converge in natural language processing
Evidence of a predictive coding hierarchy in the human brain listening to speech
Singular learning theory explainer: Neural networks generalize because of this one weird trick
Singular learning theory links
Implicit Regularization via Neural Feature Alignment, aka circles in the parameter-function map
The shard theory of human values
Predicting inductive biases of pre-trained networks
Understanding and controlling a maze-solving policy network, aka the cheese vector
Quintin's Research agenda: Supervising AIs improving AIs
Steering GPT-2-XL by adding an activation vector
Links for the addendum on mesa-optimization skepticism:
Quintin's response to Yudkowsky arguing against AIs being steerable by gradient descent
Quintin on why evolution is not like AI training
Evolution provides no evidence for the sharp left turn
Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets
6/15/2023
3:28:21
21 - Interpretability for Engineers with Stephen Casper
Lots of people in the field of machine learning study 'interpretability', developing tools that they say give us useful information about neural networks. But how do we know if meaningful progress is actually being made? What should we want out of these tools? In this episode, I speak to Stephen Casper about these questions, as well as about a benchmark he's co-developed to evaluate whether interpretability tools can find 'Trojan horses' hidden inside neural nets.
Patreon: patreon.com/axrpodcast
Store: store.axrp.net
Ko-fi: ko-fi.com/axrpodcast
Topics we discuss, and timestamps:
00:00:42 - Interpretability for engineers
00:00:42 - Why interpretability?
00:12:55 - Adversaries and interpretability
00:24:30 - Scaling interpretability
00:42:29 - Critiques of the AI safety interpretability community
00:56:10 - Deceptive alignment and interpretability
01:09:48 - Benchmarking Interpretability Tools (for Deep Neural Networks) (Using Trojan Discovery)
01:10:40 - Why Trojans?
01:14:53 - Which interpretability tools?
01:28:40 - Trojan generation
01:38:13 - Evaluation
01:46:07 - Interpretability for shaping policy
01:53:55 - Following Casper's work
The transcript
Links for Casper:
Personal website
Twitter
Electronic mail: scasper [at] mit [dot] edu
Research we discuss:
The Engineer's Interpretability Sequence
Benchmarking Interpretability Tools for Deep Neural Networks
Adversarial Policies beat Superhuman Go AIs
Adversarial Examples Are Not Bugs, They Are Features
Planting Undetectable Backdoors in Machine Learning Models
Softmax Linear Units
Red-Teaming the Stable Diffusion Safety Filter
Episode art by Hamish Doodles
