Interconnects

Nathan Lambert
Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories.

Available Episodes

5 of 74
  • Let me use my local LMs on Meta Ray-Bans
    Full post for images, etc.: https://www.interconnects.ai/p/to-meta-ray-ban-local-ai
    With the Rabbit r1, the Humane pin, the Friend thing, the Sam Altman rumors, Meta Ray-Bans, and everything in between, it is obvious that we are going to get new devices in the near future driven by advancements in AI. Trying some of those that are already public makes this obvious from a functional perspective rather than a marketing perspective.
    Even though many of these devices will have a shelf life drastically shortened by the underlying API access getting turned off when the parent company runs out of money, the call for these devices is very strong. AI is going to be more than a chat window we use for work; we just don't know what that will feel like. AI should be fun, flexible, and available.
    Meta's Ray-Bans were first launched in 2021, long before any of this ChatGPT-inspired interest in AI began. Having tried them, I think the form factor would have caught on eventually, but AI was the catalyst that accelerated adoption. AI expanded our expectations for the range of exciting outcomes that could be coming our way.
    Using the AI in the Ray-Bans is much like using a primitive chatbot. If I had never used ChatGPT, it would have been transformative, but today it feels slightly outdated. We should be more impressed by these devices generally and contextualize the AI they're delivering. The cumulative product excitement feels unexpectedly like what AirPods had on day 1. I was not expecting this fondness.
    The form factor for the Meta Ray-Bans is fantastic and drives this connection. I've been legitimately excited to use them (albeit much more during sunny Seattle summers than now), and they immediately made sense when I took them out of the packaging. My best use has been for outdoor activities: taking photos and videos without needing to fuss with a phone, and communications. An example video is in the full post -- like most things, it has a learning curve -- along with a photo from that outing. Clearly, they're fine.
    What I want to use them for today has nothing to do with AI. In some ways, this makes me more bullish on the form factor, but it makes it clear that Meta is in a precarious position. Ironically, I would've been more reluctant to buy them if not for the excitement about AI.
    As of writing this, I would much rather have "Apple Ray-Bans" because of the seamless integration with the rest of my information ecosystem. However, Apple may not be willing to take the risk to build them (and I will avoid an Apple Vision Pro digression).
    This does not mean the long-term story of many new devices won't be the AI.
    AI, in the recent past (and likely in the near future), left most electronic devices with an eerie, bland sameness. My sunglasses can answer basic questions about my day just like Siri. At the same time, my appliances try to talk to me. The hard-to-visualize step is how this changes (and overcomes the same integration dead ends that agents face). AI in 5 years (or way less) will actually know the context of our lives and be able to execute basic web tasks.
    When the AI is good, Meta Ray-Ban type devices will be indispensable. Reminders, calls, reasoning, integration, all on the go. Much like the sensation products like AirPods provide, AI devices (and services) done right will make us free to be in the world naturally.
    Meta now has a real hill to climb for AI. They just need to focus on building one more useful feature at a time rather than building a god.
    They have a tangible goal and a real product that is going to get better in the normal march of progress. If only we had an ecosystem of people who wanted to do this work and keep hill climbing the AI part for them.
    The AI of the Meta Ray-Bans (and of the other devices I started with) being primarily in the cloud is a drag, but it is needed for these first generations of glasses to maintain battery life. The cloud-centric nature of the AI is the largest perceivable reason Meta cannot open a Software Development Kit (SDK) for the glasses — all the developers would be doing is changing Meta's internal Llama API calls, rather than uploading new and improved models to the glasses.
    AI models in the cloud are consistently the first ones to cross the frontier of new capabilities. As we figure out what we want to use new AI devices for, using the cloud models will make us more likely than not to find useful applications. Now that we have things that people actually like, we need to optimize and specialize these models out of the cloud.
    What's the state of local LMs?
    The AI angle for this post is to prompt the question: What do people actually use local, or on-device, language models for? What are they driving innovation of?
    The local model ecosystem is composed of a distribution of tinkerers, researchers, and those whose use cases API models refuse. Most people doing this are not directly innovating on local models in a way that dictates meaningful improvements to underlying AI innovations. Yes, companies surely monitor progress and observe lessons, but there are far bigger markets at play for why local models are needed in the future of AI than the tinkerers that get visibility.
    Local language models are crucial for maintaining privacy (not everyone can afford fancy inference data centers like Apple), optimizing inference speed, and providing access in situations with no web connectivity. The Meta Ray-Bans stand to benefit from all of these.
    Phrasing the reasoning starting from the frontier cloud models most people are used to, rather than from what we want, it goes as follows: local models shouldn't try to be our general use case model. Outsource that to the cloud. Use local models for efficient, specific tasks out in the world.
    What local model enthusiasts are doing is building an ecosystem around optimization, latency, and task specialty that drives a lot of value. This value is captured by companies with no feedback loops to the tinkerers. Having SDKs and other direct places where those evolving local models can benefit in real ways is the goal. The models themselves will actually get better too — an actual potential feedback loop from open AI models.
    Just about a year ago I wrote a very similar take on local models, on how they have different trade-offs and trajectories. Apple Intelligence, Google's new models / Pixel phones, and the Meta Ray-Bans are showing us that this future is coming.
    What is left to be understood is the manner in which local models are developed for new devices. Will any major technology companies let us run our own models with deep integrations? How can open-source principles and local models synergize?
    Hill climbing with open, local language models
    Giving developers ways to integrate their own AI models into the operating system (OS) hooks used by the Meta Ray-Bans would immediately spawn a platform for local, open-weight language models. I first learned how locked down the Ray-Ban developer ecosystem was because I was excited to try and get our multimodal LM Molmo on them.
    That attempt didn't make it far.
    Other companies, like Apple, could conceivably have SDKs that let users point their language models at OS hooks. Creating operating systems that allow users to integrate certain open models (even only those that are approved by the companies) would completely change the (lack of) incentives for iterating on language models in the open.
    While we still don't have the new Apple Intelligence version of Siri that can plug into multiple applications, we know this works by letting an AI model generate tokens that correspond to actions in other applications. Letting users choose AI models (maybe their own), even if they are only useful in a subset of the tasks, would be wonderful. I would love to sacrifice whatever the default AI situation is on my version of the Ray-Bans and get just the best vision model for explaining my environment, the best model for cooking ideas, or the best conversational model, to push the limits for AI devices in any of these promising directions. It would be so fun to try different AI models on a real device.
    The open language modeling ecosystem desperately needs these types of feedback loops (and it is totally natural for excitement about a type of technological development like this to exist before the proof cases of its value).
    Getting to the point where Meta has an AI SDK for devices along with the leading open language models will make their entire strategy value additive (rather than just destroying the advantages of competitors). In fact, Meta likely needs to do so, or else Apple's product competitor may dominate the market. Only different strategies and feedback loops can dislodge Apple's integration.
    On the modeling side, there's no doubt we have step-change improvements coming to the models used on the Ray-Bans. On ChatBotArena, we have many models with a few billion parameters that beat the first versions of ChatGPT. The same type of performance gain — where a 100X smaller model can match or surpass performance within a few years — will come for the Ray-Bans and all other sorts of AI applications.
    The big picture arc of technology
    Starting in 2025, I'm excited about the breadth and quantity of profound, new technological experiences I'm having. Some of them, like ChatGPT Advanced Voice Mode, haven't really landed for me (even though they're extremely impressive to non-tech, non-AI friends and family). Meta Ray-Bans, Waymos, Codex, and standard ChatGPT all feel like technologies that were immediately obvious as something I needed. I need to get a Starlink hub in one of the remote locations my hobbies bring me to, and I'm sure I can add reusable rockets to the transformations I've embraced.
    The last technologies sparking these joys were the likes of the iPod and the iPad.
    Every person I take to ride a Waymo for the first time has a similar experience of joy.
    This year we may also have new models that solve arbitrary internet tasks for us in the background.
    The future is here, and we're living in a time where it'll be more evenly distributed. Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    10:21
  • (Voiceover) DeepSeek V3 and the actual cost of training frontier AI models
    Original post: https://www.interconnects.ai/p/deepseek-v3-and-the-actual-cost-of
    Chapters:
    00:00 Opening
    03:15 DeepSeek's learning efficiency
    06:49 DeepSeek's compute transparency and reality
    Figures:
    Fig 1: Benchmark Results
    Fig 2: ChatBotArena Results
    Fig 3: Compute Usage Table
    Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    17:06
  • The state of post-training in 2025
    Slides for this post-training talk and slides for the full tutorial on language modeling (with a bit less post-training content and no recording yet). Here are some timestamps for the video:
    00:00 Introduction
    10:00 Prompts & Skill Selection
    14:19 Instruction Finetuning
    21:45 Preference Finetuning
    36:17 Reinforcement Finetuning
    45:28 Open Questions
    52:02 Wrap Up
    Psssst… we just recently released our technical report for OLMo 2 — 2 OLMo 2 Furious, check it out for tons of training details and tips!
    This post has some good content, but if you just want to watch the tutorial on YouTube, it's here.
    I'm far more optimistic about the state of open recipes for, and knowledge of, post-training starting 2025 than I was starting 2024. Last year, one of my first posts was about how open post-training won't match the likes of GPT-4. This is still the case, but now we at least better understand the scope of things we will be working with.
    It's a good time to record an overview of what post-training looks like today. I gave a version of this tutorial talk for the first time in 2023 (at ICML), and it felt like a review of the InstructGPT paper rather than something based on reproduced knowledge in the literature. In 2024, the scientific community made substantial progress in actually training these models and expanding the frontier of knowledge. Doing one of these talks every year feels like a good way to keep tabs on the state of play (whereas last year, I just had a bunch of links to add to the conversation on where to start).
    With the talk, I wanted to add more context on where I see post-training generally. The most important point people need to know, given the excitement around OpenAI's o1 series of models, is that post-training alone is nowhere near a complete enough lens or taxonomy to study training reasoning language models. It's a step.
    Back to processes for all modern AI models. There are a lot of post-training methods to improve models and, more importantly, they can be segmented so the scientific community can make progress on each of them individually. The new state of finetuning stages is satisfying, with three groups of training methods:
    * Instruction finetuning (a.k.a. supervised finetuning),
    * Preference finetuning (the generalization of reinforcement learning from human feedback), and
    * Reinforcement finetuning (the new abstraction for improving performance on specific tasks).
    Some of the long-tail methods like rejection sampling, knowledge distillation, and extensive filtering aren't studied well, but you can still do excellent post-training without them. We have options for studying post-training in 2025.
    Where last year we were settling debates such as "DPO vs. PPO" or "does AI feedback for RLHF work," now we are focused on just making the best practices better. Similarly, the stress around doing research on outputs from foundation model providers, i.e. whether research violates the OpenAI terms of service on training competitor models, has dropped further, and such work is common practice — in fact, distilling from strong models is a fundamental part of successful post-training.
    To summarize the state of post-training, there are a few things to keep in mind:
    1. Post-training techniques are more impactful on the final performance of models
    Some caveats before I toot the horn of post-training as all you need today. Given that "scaling as we know it is ending," this is not entirely a controversial take.
    Finally, it is obviously self-serving for me, as someone who is going to benefit from post-training being more important.
    All of this aside, it's very logical that post-training will be the next domain for scaling model compute and performance. Predicting the next token accurately is not something that a user cares about — correct answers and how the answer is presented are. All through 2024, there were way more discussions of how post-training is more important.
    If we look at the Elo ratings of models on ChatBotArena, we can see that progress has accelerated even though the models haven't been getting noticeably bigger. Pretraining on these architectures is improving, yes, but the biggest and best models are used as tools and supervision for better post-training. Post-training got more popular because there was more low-hanging fruit on model performance. A lot of that potential has been realized and, in doing so, entirely new types of models are being made, akin to o1.
    To interpret these numbers:
    * a 100 Elo margin over another model means ~2/3 win probability over the lower,
    * 200 Elo gives ~76% win probability,
    * 300 Elo gives ~85% win probability, and so on.
    You can play with these numbers here (a short conversion sketch is also included after point 3 below).
    2. Post-training can be very expensive
    While still far cheaper than pretraining due to the price of GPUs, post-training costs have been growing rapidly. If we estimate the costs of post-training the Llama models, we could guess that the all-in costs for the models were about the following (note: numbers are based primarily on a combination of headcount and data costs, with compute driving them even higher):
    * LLaMA (Q1 2023)
    * Llama 2 (Q3 2023), ~$10-20M: 1.4M preference pairs, RLHF, IFT, safety, etc., and other costs not in the paper.
    * Llama 3.1 (Q3 2024), >$50M: similar preference data to Llama 2, a ~200-person post-training team, larger models, etc. The number could be much higher.
    Post-training costs come from large data bills and extensive inference to generate, clean, and verify multiple types of synthetic training data. More complex loss functions, e.g. RL optimizers, use a lot of memory to train but far fewer FLOPs than pretraining for general instruct models. This is all growing rapidly and is expected to change. It culminates in the o1-style models, where the compute spent with post-training loss functions can account for 40% or more of the overall compute of the model. Even for Tülu 3, our major post-training project at Ai2 that didn't buy any human data, I estimate costs of >$1M, which is large for an academic project.
    3. Post-training is less reliant on human data
    While all the frontier laboratories still rely on human data for parts of their post-training pipeline (including both training and evaluation), AI can be substituted at most stages for a "good enough" outcome. For example, given the costs above, they can be slashed by moving from human preference data at ~$5-20 per preference point to far cheaper AI feedback. The optionality of synthetic data, driven by having models that are good enough to act as supervision, makes the pace of post-training progress far higher. In my experience, AI feedback for RLHF only became possible with GPT-4 tier models, and the academic community reaps extreme benefits from the plummeting cost of inference.
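    As a quick illustration of the Elo rules of thumb above, here is a minimal sketch using the standard logistic Elo formula (base 10, scale 400); ChatBotArena's exact computation may differ, but this reproduces the approximate win probabilities listed.
```python
# Minimal sketch: convert an Elo rating margin into an expected win probability
# using the standard logistic Elo formula. Illustrative only; not ChatBotArena code.

def elo_win_probability(margin: float) -> float:
    """Expected win probability for the higher-rated model, given its Elo margin."""
    return 1.0 / (1.0 + 10 ** (-margin / 400.0))

if __name__ == "__main__":
    for margin in (100, 200, 300):
        print(f"+{margin} Elo -> ~{elo_win_probability(margin):.0%} win probability")
    # Prints roughly 64%, 76%, and 85%, matching the rules of thumb above.
```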
    4. Post-training ability is the door to advanced reasoning models
    Doing post-training well and having mastery of the techniques seems crucial to making progress on reasoning models like o1, because the infrastructure for RL finetuning of an instruct model is the same as what is used for large-scale RL training; at least, you want it to be.
    Given the above trends — we know more, it is easier to study, we have cheaper alternatives, etc. — there is cause for optimism about open replications of o1. It should still be expected that the first "replications" of o1 are closer to relatives — scaled-up post-training on reasoning rather than the special pretraining + scaled RL that OpenAI does. We will learn a lot soon.
    The talk on YouTube
    Slides for this post-training talk and slides for the full tutorial on language modeling (with a bit less post-training content and no recording yet). Here are some timestamps for the video:
    * 00:00 Introduction
    * 10:00 Prompts & Skill Selection
    * 14:19 Instruction Finetuning
    * 21:45 Preference Finetuning
    * 36:17 Reinforcement Finetuning
    * 45:28 Open Questions
    * 52:02 Wrap Up
    Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    53:50
  • Quick recap on the state of reasoning
    In 2025 we need to disambiguate three intertwined topics: post-training, reasoning, and inference-time compute. Post-training is going to quickly become muddied with the new Reasoning Language Models (RLMs — is that a good name?), given that loss functions we studied via advancements in post-training are now being leveraged at a large scale to create new types of models. I would not call the reinforcement learning training done for OpenAI's o1 series of models post-training. Training o1 is large-scale RL that enables better inference-time compute and reasoning performance.
    Today, I focus on reasoning. Technically, language models definitely do a form of reasoning. This definition does not need to go in the direction of the AGI debate — we can clearly scope a class of behavior rather than a distribution of explicit AI capability milestones. It'll take work to get to an agreement here. Getting some members of the community (and policymakers) to accept that language models do their own form of reasoning, by outputting and manipulating intermediate tokens, will take time. I enjoy Ross Taylor's definition:
    Reasoning is the process of drawing conclusions by generating inferences from observations.
    This is a talk I gave at NeurIPS at the Latent Space unofficial industry track. I wanted to directly address the question of whether language models can reason and what o1 and the reinforcement finetuning (RFT) API tell us about it. It's somewhat rambly, but it asks the high-level questions on reasoning that I haven't written about yet and is a good summary of my coverage of o1's implementation and the RFT API. Thanks swyx & Alessio for having me again! You can access the slides here (e.g. if you want to access the links on them).
    For more on reasoning, I recommend you read/watch:
    * Melanie Mitchell's series on ARC at AI: A Guide for Thinking Humans: first, second, third, and final. And her post on reasoning proper.
    * Miles Brundage's thread summarizing the prospects of generalization.
    * Ross Taylor's (previous interview guest) recent talk on reasoning.
    * The inference-time compute tag on Interconnects.
    Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts.
    Transcript + Slides
    Nathan [00:00:07]: Hey, everyone. Happy New Year. This is a quick talk that I gave at NeurIPS, at the Latent Space unofficial industry event. Swyx tried to have people talk about the major topics of the year: scaling, open models, synthetic data, agents, etc. And he asked me to fill in a quick slot on reasoning. A couple of notes. This was before o3 was announced by OpenAI, so I think you can take everything I said and run with it with even more enthusiasm and expect even more progress in 2025. And second, there were some recording issues, so I re-edited the slides to match up with the audio, so you might see that they're slightly off. But it mostly reads like a blog post, and it should do a good job getting the conversation started around reasoning on Interconnects in the new year. Happy New Year, and I hope you like this. Thanks.
    I wouldn't say my main research area is reasoning. I would say that I came from a reinforcement learning background into language models, and reasoning is now getting subsumed into that as a method rather than an area. And a lot of this is probably transitioning these talks into more provocative forms to prime everyone for the debate that is why most people are here. And this is called the state of reasoning. This is by no means a comprehensive survey.
    To continue, I wanted to make sure that I was not off base in thinking about this, because there are a lot of debates on reasoning, and I wanted to revisit a very basic definition. This is a dictionary definition: the action of thinking about something in a logical, sensible way, which is actually sufficiently vague that I would agree with it. As we'll see in a lot of this talk, I think people are going crazy about whether or not language models reason. We've seen this with AGI before, and now reasoning kind of seems like the same thing, which to me is pretty ridiculous, because reasoning is a very general skill, and I will provide more support for the argument that these language models are doing some sort of reasoning when you give them problems. I don't need to share a ton of examples of ill-formed arguments for what language models are not doing, but it's tough that this is the case. There are some very credible arguments that reasoning is a poor direction to pursue for language models because language models are not going to be as good at it as humans. But to say that they can't do reasoning, I don't see a lot of proof for, and I'll go through a few examples. And the question is: why should language model reasoning be constrained to look like what humans do? I think language models are very different, and they are stochastic. The stochastic parrots thing is true for many reasons. We should embrace this and we should continue. And I think a big trend of the year is that we're seeing new types of language model reasoning that look less human, and that can be good for separating the discourse from expecting a really narrow type of behavior. I did an interview with Ross Taylor, who was a reasoning lead at Meta, which I thought was a very good education for me on this. This is just a direct pull from the transcript, but essentially what it's saying is: if you do chain of thought on a language model, what it is doing is essentially outputting its intermediate steps. If I were to ask you all a math problem right now, you could do most of them in your head, and you are doing some sort of intermediate storage of variables. Language models have no ability to do this. They are per-token computation devices, where each token is outputted after doing a forward pass, and within that there's no explicit structure to hold these intermediate states. So I think embracing chain of thought and these kinds of intermediate values for the language models is extremely reasonable. And it's showing that they're doing something that actually gets to valuable outputs.
    Nathan [00:04:10]: So this is one of the many ways that we can kind of lead towards o1: language models have randomness built into them. And a lot of what people see as failures in reasoning are these language models following very static chains and making very specific mistakes along the way, with really no ability to correct for that. This is really not something that we see in human reasoning. If a human makes a mistake, they will normally catch it on the next step. But we need to handle language models differently.
    Nathan [00:04:41]: And why o1 is exciting is because it's a new type of language model that is going to maximize on this view of reasoning.
    Which is that chain of thought, and kind of a forward stream of tokens, can actually do a lot to achieve better outcomes when you're doing a reasoning-like ability or reasoning-like action, which is just repeatedly outputting tokens to make progress on some sort of intelligence-defined task. So it's just making forward progress by spending more compute, and the token stream is the equivalent of some intermediate state.
    Nathan [00:05:18]: What o1 is has been a large debate since its release. I'm not going to spend a lot of this talk on it, but the more time I've spent on it, the more I think you should take OpenAI at face value: they are doing very large-scale RL on verifiable outcomes, is what I've added, especially in the context of the RL API that they've released, which I'll talk about more. Most of the reasons to believe in more complicated things like process reward models, self-play, or Monte Carlo tree search are mostly based on previous literature and things that we would have expected advanced reasoning to look like for language models, and not based on evidence that they have given us or on the behavior, whether you're looking at evaluations or how inference is actually done when serving the model. This takes us to replications, or what I would probably call relatives of o1, coming from the community. These are wonderful to see. We are exploring the boundaries of what we can do with chain of thought in models. The two I've highlighted are from DeepSeek and Qwen, and a lot of people in this room have probably seen them. I think that these models are really substantially narrower than the full o1 models from OpenAI. If you use o1, you can use it for a lot more tasks. I was using the DeepSeek model, and it's supposed to be for math or code, but they've tried to keep the model so narrow that even there, if you ask a code question, sometimes it'll say it is only supposed to work on math or code. And a lot of the success of o1 and the future models like this is going to be about being able to handle more tasks and more domains. So SemiAnalysis wrote a post that I haven't read in full, but even if you look at the paywalled headings, you can kind of make some intelligent claims about what o1 is or is not. I think these are two of the things from the table of contents that you can see without paying (I'm due to pay at some point, but I have not). One is incredible amounts of forward passes during training. I think you'll see this as I discuss RL fine-tuning models a bit more in a little bit. When you're doing RL, there are two ways that you see data many times, and that will result in many forward passes. One is that when you're doing RL on a prompt, you can sample many completions to then grade them or use them in different ways to update your policy. So if I ask one math problem, I could look at eight completions and choose the best one, or do some contrastive thing between the best and the worst one, and that kind of gradation can help the RL policy actually learn. And the second is that, because the loss function is more flexible than something like instruction tuning, you can go over the same prompts many more times than you would in instruction tuning or pre-training. So this means they're doing a lot of sampling from the model, which is very different than other types of training we've seen in the past at pre- and post-training. And then the second one is great.
    Thanks for showing everyone this: post-training FLOPs exceed pre-training. I think this pretty clearly says that they're using RL, and they're using a ton of compute for this large-scale RL. At that point, it would probably mean something different, where this is like pre-training RL. And this is something that these early relative models are not going to be doing, because no one has this infrastructure like OpenAI does. It'll take a while to do that, but people will build it.
    Nathan [00:08:50]: OK, this takes us to reinforcement fine-tuning. I would say that this is a hard pivot in the talk: o1 is essentially pre-training-scale RL, extremely big RL, and we don't know all the details of the data, and then OpenAI shows us this new beta API program that they're making, which is just a sprinkle of this. So what can you do with a tiny bit of their infrastructure? I think one of the fine-tuning leads responded to a tweet from Swyx. There was a long tweet that gave a lot of details, but even the first tweet, which I hadn't seen and had like eight likes, said that this API is using the same infrastructure that we used to train o1. That alone is a lot of detail for a random thing on Twitter, and then there were really long details on other aspects of it. It is just a new paradigm for fine-tuning. I have seen some of this work, and I'm pretty optimistic that it'll work for really specific capabilities where answers matter, rather than features of your style of text mattering. Again, kind of like I was hinting at with o1, this reinforcement fine-tuning does many passes over the data, which is why they can say you only need dozens of labeled samples to actually learn from it, which is just very different than previous training regimes. So what happens is that the model gets a reward bonus when the answer is right, and the model learns to reinforce behaviors that get right answers. Later in the talk, I'll highlight a research project that we did that was pretty much doing a very similar thing to target very specific evaluations on open models: you do RL and you give a reward bonus when the answer is right, and that's all you do. The key innovation, and the simplicity, is that modern language models are a strong enough base that just a really gentle RL fine-tuning can add these specific capabilities without degrading the model. I think there's a lot of fear around adding RL to these training regimes, and I'm sure we'll get to that in the future, but one of the biggest challenges for teams, especially on general instruct models like ChatGPT, was just that it's going to destroy the rest of the performance, the base chat abilities you care about. And it really seems like you can just do this out of the box. If OpenAI is going to allow an API, they aren't going to let people train a model that then just gets worse on random other things.
    Nathan [00:11:20]: So this is what the data format looks like. The example they gave is way more complicated than I think it needs to be; you can start with a grade school math problem and just say the correct answer is the correct number. The genes in their example are confusing. But essentially, you have two components, a prompt and an answer, which is different than having a prompt and a completion that you would train on.
    Or, if you're doing preference tuning, you would do a prompt, a chosen completion, and a rejected completion. So it's a new type of data format, and I suspect we'll quickly see things like Hugging Face having more of these. I will highlight that we have some of our own from the specific project that we did. We have examples for math; on the screen is an example for precise instruction following, which is the idea that if you have a prompt, you can say something like "have every sentence start with the letter A," and you can verify that with Python really easily. This is something that we did in our project, and the model gets better at this: you have constrained data, and the RL algorithm learns to change the model just a tiny bit and actually reach these answers.
    Nathan [00:12:23]: A confusing thing for people was these grader models. I think the place these come from is evaluation. There has been a lot of work in evaluation to make answer extraction stable, especially with math. An example that I used in the blog post I wrote today on this is how Llama 3.1 details their evals: for math, they use both SymPy, a Python package, for extraction and LLM-as-a-judge to extract their answers. What the graders are doing is essentially amping this up to a whole other level, where it's a nested structure of configs for doing reward shaping on these verifiable outputs. For math, it can be really easy: you have to handle these five formats, which I came up with in a minute, for how you could represent different numbers and tokens. But as you get to more complicated things and more complicated behaviors, it seems like OpenAI is insinuating that you're going to need more than just a yes/no loss function for your domains. And that seems fine. We already have a bunch of open models, like judge models and Prometheus and other things, that are designed specifically for LLM-as-a-judge, and I see that continuing to become part of this kind of open RL infrastructure.
    Nathan [00:13:41]: OpenAI had a bunch of screenshots. I'm not going to add much commentary on these, but it looks pretty standard. They're going to track how performance changes over time and things like that. You'll be able to look at all the outputs; this is just them making pretty things. And then they have this very generic RL plot. The most standard RL plot has an x-axis of time or trials and a y-axis of reward. Here, reward is an accuracy or a success rate on a certain validation set, and x is how much training was done. That's very similar to what we did in our project. I think another way you can put this is with an RL feedback diagram: if you've seen RL, where you have this agent interacting with the environment, you can squint at it and it'll be familiar. If you haven't, you'll probably be in for more of these things if RL keeps becoming popular, because RL is really formulated as trial-and-error learning. But if you're interested, we're happy to try to have people use our code, which does this for math and some instruction tuning already, and we want to try more complicated graders for things like code. For code quality, a binary outcome doesn't really make sense, which is a good way to think about why you might need to do some reward shaping for how you would grade outputs from a given model.
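    To make the verifiable-rewards idea above concrete, here is a minimal sketch, not the project's actual code, of binary graders for the two cases mentioned: a precise instruction-following constraint checked with plain Python, and a math answer checked by numeric comparison. The function names and the data format are illustrative assumptions.
```python
import re

# Minimal sketch of binary "verifiable reward" graders as described in the talk.
# Illustrative only; real grader configs add answer extraction and reward shaping.

def starts_with_letter_reward(completion: str, letter: str = "A") -> float:
    """Reward 1.0 if every sentence in the completion starts with `letter`."""
    sentences = [s.strip() for s in completion.split(".") if s.strip()]
    ok = all(s.upper().startswith(letter.upper()) for s in sentences)
    return 1.0 if ok else 0.0

def math_answer_reward(completion: str, answer: float) -> float:
    """Reward 1.0 if the last number in the completion matches the reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and abs(float(numbers[-1]) - answer) < 1e-6 else 0.0

# Example datapoint: a prompt paired with a verifiable answer, not a completion.
datapoint = {"prompt": "What is 17 * 3?", "answer": 51.0}
print(math_answer_reward("17 * 3 = 51", datapoint["answer"]))       # 1.0
print(starts_with_letter_reward("Apples are great. Always eat them."))  # 1.0
```
    An RL optimizer would then reinforce completions that earn the bonus; anything richer than a yes/no signal is where the nested grader configs and reward shaping mentioned above come in.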
    And to compare with the plot that OpenAI had, which shows performance improving over time, these are some experiments we ran on various evaluations. The left column is a language model evaluation that we would use in an academic paper, and the right is all the various internal RL statistics, where GSM8K, MATH, and IFEval are all being trained on their training sets. So we have the prompts, which are math questions, and we have the answers, which are numbers, and we're really doing this RL on seeing if the answer is right. And then it generalizes to various math evaluations that we care about. I kind of see this as: we got a tip from an industry lab member to do this a few months early, so we got a head start. I think a lot of people are obviously going to be trying to replicate this now, so it's fun that we have a starting point, and I'm excited to talk about it with people this week. And I think reasoning is worth continuing to work on. You can read the post that I was referencing here, and I'm happy to take any related or hard questions on reasoning, because I kind of opened the floor for that. So thank you. Get full access to Interconnects at www.interconnects.ai/subscribe
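    As a schematic of the trial-and-error loop described in the talk (sample several completions per prompt, grade them with a verifiable reward, reinforce the ones that were right), here is a toy, dependency-free sketch. A real RL finetuning run would use an actual language model and a policy-gradient optimizer; the candidate answers and the update rule below are purely illustrative.
```python
import random

# Toy sketch of RL with a verifiable reward. The "policy" is a categorical
# distribution over candidate answers to a single math prompt; nudging weights
# toward rewarded samples stands in for a real policy-gradient update.

reference_answer = 51              # verifiable answer for "What is 17 * 3?"
candidates = [48, 51, 54, 20]      # hypothetical answers the model might produce
weights = [1.0] * len(candidates)  # unnormalized policy weights

def sample_index() -> int:
    return random.choices(range(len(candidates)), weights=weights, k=1)[0]

def reward(index: int) -> float:
    return 1.0 if candidates[index] == reference_answer else 0.0

for step in range(200):
    # Sample several completions per prompt, as described in the talk.
    sampled = [sample_index() for _ in range(8)]
    for index in sampled:
        # Reinforce sampled answers in proportion to their (binary) reward.
        weights[index] += 0.1 * reward(index)

total = sum(weights)
print({c: round(w / total, 2) for c, w in zip(candidates, weights)})
# Probability mass concentrates on the verifiably correct answer, 51.
```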
    --------  
    16:22
  • (Voiceover) 2024 Interconnects year in review
    Original post: https://www.interconnects.ai/p/2024-interconnects-year-in-review
    Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    6:02

About Interconnects

Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai