Interconnects

Nathan Lambert
Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories.

Available Episodes

5 of 74
  • Let me use my local LMs on Meta Ray-Bans
    Full post for images, etc.: https://www.interconnects.ai/p/to-meta-ray-ban-local-ai
    With the Rabbit r1, the Humane pin, the Friend thing, the Sam Altman rumors, Meta Ray-Bans, and everything in between, it is obvious that we are going to get new devices in the near future driven by advancements in AI. Trying some of those that are already public makes this obvious from a functional perspective rather than a marketing perspective.
    Even though many of these devices will have a shelf life drastically shortened by the underlying API access getting turned off when the parent company runs out of money, the call for these devices is very strong. AI is going to be more than a chat window we use for work; we just don't know what that will feel like. AI should be fun, flexible, and available.
    Meta's Ray-Bans were first launched in 2021, long before any of this ChatGPT-inspired interest in AI began. Having tried them, I think the form factor would have caught on eventually, but AI was the catalyst that accelerated adoption. AI expanded our expectations for the range of exciting outcomes that could be coming our way.
    Using the AI in the Ray-Bans is much like using a primitive chatbot. If I had never used ChatGPT, it would have been transformative, but today it feels slightly outdated. We should be more impressed by these devices generally and contextualize the AI they're delivering. The cumulative product excitement feels unexpectedly like what AirPods had on day 1. I was not expecting this fondness.
    The form factor for the Meta Ray-Bans is fantastic and drives this connection. I've been legitimately excited to use them (albeit much more during sunny Seattle summers than now), and they immediately made sense when I took them out of the packaging. My best use has been for outdoor activities: taking photos and videos without needing to fuss with a phone, and communications. An example video is in the full post -- like most things, it has a learning curve -- along with a photo from that outing. Clearly, they're fine.
    What I want to use them for today has nothing to do with AI. In some ways, this makes me more bullish on the form factor, but it makes it clear that Meta is in a precarious position. Ironically, I would've been more reluctant to buy them if not for the excitement about AI.
    As of writing this, I would much rather have "Apple Ray-Bans" because of the seamless integration with the rest of my information ecosystem. However, Apple may not be willing to take the risk to build them (and I will avoid an Apple Vision Pro digression).
    This does not mean the long-term story of many new devices won't be the AI.
    AI, in the recent past (and likely in the near future), left most electronic devices with an eerie, bland sameness. My sunglasses can answer basic questions about my day just like Siri. At the same time, my appliances try to talk to me. The hard-to-visualize step is how this changes (and overcomes the same integration dead ends that agents face). AI in 5 years (or way less) will actually know the context of our lives and be able to execute basic web tasks.
    When the AI is good, Meta Ray-Ban type devices will be indispensable. Reminders, calls, reasoning, integration, all on the go. Much like the sensation products like AirPods provide, AI devices (and services) done right will make us free to be in the world naturally.
    Meta now has a real hill to climb for AI. They just need to focus on building one more useful feature at a time rather than building a god.
    They have a tangible goal and a real product that is going to get better in the normal march of progress. If only we had an ecosystem of people who wanted to do this work and keep hill climbing the AI part for them.
    The AI of the Meta Ray-Bans (and of the other devices I started with) being primarily in the cloud is a drag, but it is needed for these first generations of glasses to maintain battery life. The cloud-centric nature of the AI is the largest perceivable reason Meta cannot open a Software Development Kit (SDK) for the glasses — all the developers would be doing is changing Meta's internal Llama API calls, rather than uploading new and improved models to the glasses.
    AI models in the cloud are consistently the first ones to cross the frontier of new capabilities. As we figure out what we want to use new AI devices for, using the cloud models will make us more likely than not to find useful applications. Now that we have things that people actually like, we need to optimize and specialize these models out of the cloud.
    What's the state of local LMs?
    The AI angle for this post is to prompt the question: What do people actually use local, or on-device, language models for? What are they driving innovation of?
    The local model ecosystem is composed of a distribution of tinkerers, researchers, and those whose use cases API models refuse. Most people doing this are not directly innovating on local models in a way that dictates meaningful improvements to underlying AI innovations. Yes, companies surely monitor progress and observe lessons, but there are far bigger markets at play for why local models are needed in the future of AI than the tinkerers that get visibility.
    Local language models are crucial for maintaining privacy (not everyone can afford fancy inference data centers like Apple), optimizing inference speed, and providing access in situations with no web connectivity. The Meta Ray-Bans stand to benefit from all of these.
    Phrasing the reasoning starting from the frontier cloud models most people are used to, rather than from what we want, it goes as follows: local models shouldn't try to be our general use case model. Outsource that to the cloud. Use local models for efficient, specific tasks out in the world.
    What local model enthusiasts are doing is building an ecosystem around optimization, latency, and task specialty that drives a lot of value. This value is captured by companies with no feedback loops to the tinkerers. Having SDKs and other direct places where those evolving local models can benefit in real ways is the goal. The models themselves will actually get better too — an actual potential feedback loop from open AI models.
    Just about a year ago I wrote a very similar take on local models, on how they have different trade-offs and trajectories. Apple Intelligence, Google's new models / Pixel phones, and the Meta Ray-Bans are showing us that this future is coming.
    What is left to be understood is the manner in which local models are developed for new devices. Will any major technology companies let us run our own models with deep integrations? How can open-source principles and local models synergize?
    Hill climbing with open, local language models
    Giving developers ways to integrate their own AI models into the operating system (OS) hooks used by the Meta Ray-Bans would immediately spawn a platform for local, open-weight language models. I first learned how locked down the Ray-Ban developer ecosystem was because I was excited to try and get our multimodal LM Molmo on them.
    That attempt didn't make it far.
    Other companies, like Apple, could conceivably have SDKs that let users point their language models at OS hooks. Creating operating systems that allow users to integrate certain open models (even only those that are approved by the companies) would completely change the (lack of) incentives for iterating on language models in the open.
    While we still don't have the new Apple Intelligence version of Siri that can plug into multiple applications, we know this works by letting an AI model generate tokens that correspond to actions in other applications. Letting users choose AI models (maybe their own), even if they are only useful in a subset of the tasks, would be wonderful. I would love to sacrifice whatever the default AI situation is on my version of the Ray-Bans and get just the best vision model for explaining my environment, the best model for cooking ideas, or the best conversational model, to push the limits for AI devices in any of these promising directions. It would be so fun to try different AI models on a real device.
    The open language modeling ecosystem desperately needs these types of feedback loops (and it is totally natural for excitement about a type of technological development like this to exist before the proof cases of its value).
    Getting to the point where Meta has an AI SDK for devices along with the leading open language models will make their entire strategy value additive (rather than just destroying the advantages of competitors). In fact, Meta likely needs to do so, or else Apple's product competitor may dominate the market. Only different strategies and feedback loops can dislodge Apple's integration.
    On the modeling side, there's no doubt we have step-change improvements coming to the models used on the Ray-Bans. On ChatBotArena, we have many models with a few billion parameters that beat the first versions of ChatGPT. The same type of performance gain — where a 100X smaller model can match or surpass performance within a few years — will come for the Ray-Bans and all other sorts of AI applications.
    The big picture arc of technology
    Starting in 2025, I'm excited about the breadth and quantity of profound, new technological experiences I'm having. Some of them, like ChatGPT Advanced Voice Mode, haven't really landed for me (even though they're extremely impressive to non-tech, non-AI friends and family). Meta Ray-Bans, Waymos, Codex, and standard ChatGPT all feel like technologies that were immediately obvious as something I needed. I need to get a Starlink hub in one of the remote locations my hobbies bring me to, and I'm sure I can add reusable rockets to the transformations I've embraced.
    The last technologies sparking these joys were the likes of the iPod and the iPad.
    Every person I take to ride a Waymo for the first time has a similar experience of joy.
    This year we may also have new models that solve arbitrary internet tasks for us in the background.
    The future is here, and we're living in a time where it'll be more evenly distributed. Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    10:21
  • (Voiceover) DeepSeek V3 and the actual cost of training frontier AI models
    Original post: https://www.interconnects.ai/p/deepseek-v3-and-the-actual-cost-of
    Chapters:
    00:00 Opening
    03:15 DeepSeek's learning efficiency
    06:49 DeepSeek's compute transparency and reality
    Figures:
    Fig 1: Benchmark Results
    Fig 2: ChatBotArena Results
    Fig 3: Compute Usage Table
    Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    17:06
  • The state of post-training in 2025
    Slides for this post-training talk and slides for the full tutorial on language modeling (with a bit less post-training content and no recording yet). Here are some timestamps for the video:
    00:00 Introduction
    10:00 Prompts & Skill Selection
    14:19 Instruction Finetuning
    21:45 Preference Finetuning
    36:17 Reinforcement Finetuning
    45:28 Open Questions
    52:02 Wrap Up
    Psssst… we just recently released our technical report for OLMo 2 — 2 OLMo 2 Furious, check it out for tons of training details and tips!
    This post has some good content, but if you just want to watch the tutorial on YouTube, it's here.
    I'm far more optimistic about the state of open recipes for, and knowledge of, post-training starting 2025 than I was starting 2024. Last year, one of my first posts was about how open post-training won't match the likes of GPT-4. This is still the case, but now we at least better understand the scope of things we will be working with.
    It's a good time to record an overview of what post-training looks like today. I gave a version of this tutorial talk for the first time in 2023 (at ICML), and it felt like a review of the InstructGPT paper rather than something based on reproduced knowledge in the literature. In 2024, the scientific community made substantial progress in actually training these models and expanding the frontier of knowledge. Doing one of these talks every year feels like a good way to keep tabs on the state of play (whereas last year, I just had a bunch of links to add to the conversation on where to start).
    With the talk, I wanted to add more context on where I see post-training generally. The most important point people need to know, given the excitement around OpenAI's o1 series of models, is that post-training alone is nowhere near a complete enough lens or taxonomy to study training reasoning language models. It's a step.
    Back to processes for all modern AI models. There are a lot of post-training methods to improve models and, more importantly, they can be segmented so the scientific community can make progress on each of them individually. The new state of finetuning stages is satisfying, with three groups of training methods:
    * Instruction finetuning (a.k.a. supervised finetuning),
    * Preference finetuning (the generalization of reinforcement learning from human feedback), and
    * Reinforcement finetuning (the new abstraction for improving performance on specific tasks).
    Some of the long-tail methods like rejection sampling, knowledge distillation, and extensive filtering aren't studied well, but you can still do excellent post-training without them. We have options for studying post-training in 2025.
    Where last year we were settling debates such as "DPO vs. PPO" or "does AI feedback for RLHF work," now we are focused on just making the best practices better. Similarly, the stress around doing research on outputs from foundation model providers, i.e. whether research violates the OpenAI terms of service on training competitor models, has dropped further, and such work is common practice — in fact, distilling from strong models is a fundamental part of successful post-training.
    To summarize the state of post-training, there are a few things to keep in mind:
    1. Post-training techniques are more impactful on the final performance of models
    Some caveats before I toot the horn of post-training as all you need today. Given that "scaling as we know it is ending," this is not entirely a controversial take.
    Finally, it is obviously self-serving for me, as someone who is going to benefit from post-training being more important.
    All of this aside, it's very logical that post-training will be the next domain for scaling model compute and performance. Predicting the next token accurately is not something that a user cares about — correct answers and how the answer is presented are. All through 2024, there were way more discussions of how post-training is more important.
    If we look at the Elo ratings of models on ChatBotArena, we can see that progress has accelerated even though the models haven't been getting noticeably bigger. Pretraining on these architectures is improving, yes, but the biggest and best models are used as tools and supervision for better post-training. Post-training got more popular because there was more low-hanging fruit on model performance. A lot of that potential has been realized and, in doing so, entirely new types of models are being made, akin to o1.
    To interpret these numbers:
    * a 100 Elo margin over another model means ~2/3 win probability over the lower,
    * 200 Elo gives ~76% win probability,
    * 300 Elo gives ~85% win probability, and so on.
    You can play with these numbers here (a short conversion sketch is also included after point 3 below).
    2. Post-training can be very expensive
    While still far cheaper than pretraining due to the price of GPUs, post-training costs have been growing rapidly. If we estimate the costs of post-training the Llama models, we could guess that the all-in costs for the models were about the following (note: numbers are based primarily on a combination of headcount and data costs, with compute driving them even higher):
    * LLaMA (Q1 2023)
    * Llama 2 (Q3 2023), ~$10-20M: 1.4M preference pairs, RLHF, IFT, safety, etc., and other costs not in the paper.
    * Llama 3.1 (Q3 2024), >$50M: similar preference data to Llama 2, a ~200-person post-training team, larger models, etc. The number could be much higher.
    Post-training costs come from large data bills and extensive inference to generate, clean, and verify multiple types of synthetic training data. More complex loss functions, e.g. RL optimizers, use a lot of memory to train but far fewer FLOPs than pretraining for general instruct models. This is all growing rapidly and is expected to change. It culminates in the o1-style models, where the compute spent with post-training loss functions can account for 40% or more of the overall compute of the model. Even for Tülu 3, our major post-training project at Ai2 that didn't buy any human data, I estimate costs of >$1M, which is large for an academic project.
    3. Post-training is less reliant on human data
    While all the frontier laboratories still rely on human data for parts of their post-training pipeline (including both training and evaluation), AI can be substituted at most stages for a "good enough" outcome. For example, given the costs above, they can be slashed by moving from human preference data at ~$5-20 per preference point to far cheaper AI feedback. The optionality of synthetic data, driven by having models that are good enough to act as supervision, makes the pace of post-training progress far higher. In my experience, AI feedback for RLHF only became possible with GPT-4 tier models, and the academic community reaps extreme benefits from the plummeting cost of inference.
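    As a quick illustration of the Elo rules of thumb above, here is a minimal sketch using the standard logistic Elo formula (base 10, scale 400); ChatBotArena's exact computation may differ, but this reproduces the approximate win probabilities listed.
```python
# Minimal sketch: convert an Elo rating margin into an expected win probability
# using the standard logistic Elo formula. Illustrative only; not ChatBotArena code.

def elo_win_probability(margin: float) -> float:
    """Expected win probability for the higher-rated model, given its Elo margin."""
    return 1.0 / (1.0 + 10 ** (-margin / 400.0))

if __name__ == "__main__":
    for margin in (100, 200, 300):
        print(f"+{margin} Elo -> ~{elo_win_probability(margin):.0%} win probability")
    # Prints roughly 64%, 76%, and 85%, matching the rules of thumb above.
```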
    4. Post-training ability is the door to advanced reasoning models
    Doing post-training well and having mastery of the techniques seems crucial to making progress on reasoning models like o1, because the infrastructure for RL finetuning of an instruct model is the same as what is used for large-scale RL training; at least, you want it to be.
    Given the above trends — we know more, it is easier to study, we have cheaper alternatives, etc. — there is cause for optimism about open replications of o1. It should still be expected that the first "replications" of o1 are closer to relatives — scaled-up post-training on reasoning rather than the special pretraining + scaled RL that OpenAI does. We will learn a lot soon.
    The talk on YouTube
    Slides for this post-training talk and slides for the full tutorial on language modeling (with a bit less post-training content and no recording yet). Here are some timestamps for the video:
    * 00:00 Introduction
    * 10:00 Prompts & Skill Selection
    * 14:19 Instruction Finetuning
    * 21:45 Preference Finetuning
    * 36:17 Reinforcement Finetuning
    * 45:28 Open Questions
    * 52:02 Wrap Up
    Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    53:50
  • Quick recap on the state of reasoning
    In 2025 we need to disambiguate three intertwined topics: post-training, reasoning, and inference-time compute. Post-training is going to quickly become muddied with the new Reasoning Language Models (RLMs — is that a good name?), given that loss functions we studied via advancements in post-training are now being leveraged at a large scale to create new types of models. I would not call the reinforcement learning training done for OpenAI's o1 series of models post-training. Training o1 is large-scale RL that enables better inference-time compute and reasoning performance.
    Today, I focus on reasoning. Technically, language models definitely do a form of reasoning. This definition does not need to go in the direction of the AGI debate — we can clearly scope a class of behavior rather than a distribution of explicit AI capability milestones. It'll take work to get to an agreement here. Getting some members of the community (and policymakers) to accept that language models do their own form of reasoning, by outputting and manipulating intermediate tokens, will take time. I enjoy Ross Taylor's definition:
    Reasoning is the process of drawing conclusions by generating inferences from observations.
    This is a talk I gave at NeurIPS at the Latent Space unofficial industry track. I wanted to directly address the question of whether language models can reason and what o1 and the reinforcement finetuning (RFT) API tell us about it. It's somewhat rambly, but it asks the high-level questions on reasoning that I haven't written about yet and is a good summary of my coverage of o1's implementation and the RFT API. Thanks swyx & Alessio for having me again! You can access the slides here (e.g. if you want to access the links on them).
    For more on reasoning, I recommend you read/watch:
    * Melanie Mitchell's series on ARC at AI: A Guide for Thinking Humans: first, second, third, and final. And her post on reasoning proper.
    * Miles Brundage's thread summarizing the prospects of generalization.
    * Ross Taylor's (previous interview guest) recent talk on reasoning.
    * The inference-time compute tag on Interconnects.
    Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts.
    Transcript + Slides
    Nathan [00:00:07]: Hey, everyone. Happy New Year. This is a quick talk that I gave at NeurIPS, at the Latent Space unofficial industry event. Swyx tried to have people talk about the major topics of the year: scaling, open models, synthetic data, agents, etc. And he asked me to fill in a quick slot on reasoning. A couple of notes. This was before o3 was announced by OpenAI, so I think you can take everything I said and run with it with even more enthusiasm and expect even more progress in 2025. And second, there were some recording issues, so I re-edited the slides to match up with the audio, so you might see that they're slightly off. But it mostly reads like a blog post, and it should do a good job getting the conversation started around reasoning on Interconnects in the new year. Happy New Year, and I hope you like this. Thanks.
    I wouldn't say my main research area is reasoning. I would say that I came from a reinforcement learning background into language models, and reasoning is now getting subsumed into that as a method rather than an area. And a lot of this is probably transitioning these talks into more provocative forms to prime everyone for the debate that is why most people are here. And this is called the state of reasoning. This is by no means a comprehensive survey.
    To continue, I wanted to make sure that I was not off base in thinking about this, because there are a lot of debates on reasoning, and I wanted to revisit a very basic definition. This is a dictionary definition: the action of thinking about something in a logical, sensible way, which is actually sufficiently vague that I would agree with it. As we'll see in a lot of this talk, I think people are going crazy about whether or not language models reason. We've seen this with AGI before, and now reasoning kind of seems like the same thing, which to me is pretty ridiculous, because reasoning is a very general skill, and I will provide more support for the argument that these language models are doing some sort of reasoning when you give them problems. I don't need to share a ton of examples of ill-formed arguments for what language models are not doing, but it's tough that this is the case. There are some very credible arguments that reasoning is a poor direction to pursue for language models because language models are not going to be as good at it as humans. But to say that they can't do reasoning, I don't see a lot of proof for, and I'll go through a few examples. And the question is: why should language model reasoning be constrained to look like what humans do? I think language models are very different, and they are stochastic. The stochastic parrots thing is true for many reasons. We should embrace this and we should continue. And I think a big trend of the year is that we're seeing new types of language model reasoning that look less human, and that can be good for separating the discourse from expecting a really narrow type of behavior. I did an interview with Ross Taylor, who was a reasoning lead at Meta, which I thought was a very good education for me on this. This is just a direct pull from the transcript, but essentially what it's saying is: if you do chain of thought on a language model, what it is doing is essentially outputting its intermediate steps. If I were to ask you all a math problem right now, you could do most of them in your head, and you are doing some sort of intermediate storage of variables. Language models have no ability to do this. They are per-token computation devices, where each token is outputted after doing a forward pass, and within that there's no explicit structure to hold these intermediate states. So I think embracing chain of thought and these kinds of intermediate values for the language models is extremely reasonable. And it's showing that they're doing something that actually gets to valuable outputs.
    Nathan [00:04:10]: So this is one of the many ways that we can kind of lead towards o1: language models have randomness built into them. And a lot of what people see as failures in reasoning are these language models following very static chains and making very specific mistakes along the way, with really no ability to correct for that. This is really not something that we see in human reasoning. If a human makes a mistake, they will normally catch it on the next step. But we need to handle language models differently.
    Nathan [00:04:41]: And why o1 is exciting is because it's a new type of language model that is going to maximize on this view of reasoning.
    Which is that chain of thought, and kind of a forward stream of tokens, can actually do a lot to achieve better outcomes when you're doing a reasoning-like ability or reasoning-like action, which is just repeatedly outputting tokens to make progress on some sort of intelligence-defined task. So it's just making forward progress by spending more compute, and the token stream is the equivalent of some intermediate state.
    Nathan [00:05:18]: What o1 is has been a large debate since its release. I'm not going to spend a lot of this talk on it, but the more time I've spent on it, the more I think you should take OpenAI at face value: they are doing very large-scale RL on verifiable outcomes, is what I've added, especially in the context of the RL API that they've released, which I'll talk about more. Most of the reasons to believe in more complicated things like process reward models, self-play, or Monte Carlo tree search are mostly based on previous literature and things that we would have expected advanced reasoning to look like for language models, and not based on evidence that they have given us or on the behavior, whether you're looking at evaluations or how inference is actually done when serving the model. This takes us to replications, or what I would probably call relatives of o1, coming from the community. These are wonderful to see. We are exploring the boundaries of what we can do with chain of thought in models. The two I've highlighted are from DeepSeek and Qwen, and a lot of people in this room have probably seen them. I think that these models are really substantially narrower than the full o1 models from OpenAI. If you use o1, you can use it for a lot more tasks. I was using the DeepSeek model, and it's supposed to be for math or code, but they've tried to keep the model so narrow that even there, if you ask a code question, sometimes it'll say it is only supposed to work on math or code. And a lot of the success of o1 and the future models like this is going to be about being able to handle more tasks and more domains. So SemiAnalysis wrote a post that I haven't read in full, but even if you look at the paywalled headings, you can kind of make some intelligent claims about what o1 is or is not. I think these are two of the things from the table of contents that you can see without paying (I'm due to pay at some point, but I have not). One is incredible amounts of forward passes during training. I think you'll see this as I discuss RL fine-tuning models a bit more in a little bit. When you're doing RL, there are two ways that you see data many times, and that will result in many forward passes. One is that when you're doing RL on a prompt, you can sample many completions to then grade them or use them in different ways to update your policy. So if I ask one math problem, I could look at eight completions and choose the best one, or do some contrastive thing between the best and the worst one, and that kind of gradation can help the RL policy actually learn. And the second is that, because the loss function is more flexible than something like instruction tuning, you can go over the same prompts many more times than you would in instruction tuning or pre-training. So this means they're doing a lot of sampling from the model, which is very different than other types of training we've seen in the past at pre- and post-training. And then the second one is great.
    Thanks for showing everyone this: post-training FLOPs exceed pre-training. I think this pretty clearly says that they're using RL, and they're using a ton of compute for this large-scale RL. At that point, it would probably mean something different, where this is like pre-training RL. And this is something that these early relative models are not going to be doing, because no one has this infrastructure like OpenAI does. It'll take a while to do that, but people will build it.
    Nathan [00:08:50]: OK, this takes us to reinforcement fine-tuning. I would say that this is a hard pivot in the talk: o1 is essentially pre-training-scale RL, extremely big RL, and we don't know all the details of the data, and then OpenAI shows us this new beta API program that they're making, which is just a sprinkle of this. So what can you do with a tiny bit of their infrastructure? I think one of the fine-tuning leads responded to a tweet from Swyx. There was a long tweet that gave a lot of details, but even the first tweet, which I hadn't seen and had like eight likes, said that this API is using the same infrastructure that we used to train o1. That alone is a lot of detail for a random thing on Twitter, and then there were really long details on other aspects of it. It is just a new paradigm for fine-tuning. I have seen some of this work, and I'm pretty optimistic that it'll work for really specific capabilities where answers matter, rather than features of your style of text mattering. Again, kind of like I was hinting at with o1, this reinforcement fine-tuning does many passes over the data, which is why they can say you only need dozens of labeled samples to actually learn from it, which is just very different than previous training regimes. So what happens is that the model gets a reward bonus when the answer is right, and the model learns to reinforce behaviors that get right answers. Later in the talk, I'll highlight a research project that we did that was pretty much doing a very similar thing to target very specific evaluations on open models: you do RL and you give a reward bonus when the answer is right, and that's all you do. The key innovation, and the simplicity, is that modern language models are a strong enough base that just a really gentle RL fine-tuning can add these specific capabilities without degrading the model. I think there's a lot of fear around adding RL to these training regimes, and I'm sure we'll get to that in the future, but one of the biggest challenges for teams, especially on general instruct models like ChatGPT, was just that it's going to destroy the rest of the performance, the base chat abilities you care about. And it really seems like you can just do this out of the box. If OpenAI is going to allow an API, they aren't going to let people train a model that then just gets worse on random other things.
    Nathan [00:11:20]: So this is what the data format looks like. The example they gave is way more complicated than I think it needs to be; you can start with a grade school math problem and just say the correct answer is the correct number. The genes in their example are confusing. But essentially, you have two components, a prompt and an answer, which is different than having a prompt and a completion that you would train on.
    Or, if you're doing preference tuning, you would do a prompt, a chosen completion, and a rejected completion. So it's a new type of data format, and I suspect we'll quickly see things like Hugging Face having more of these. I will highlight that we have some of our own from the specific project that we did. We have examples for math; on the screen is an example for precise instruction following, which is the idea that if you have a prompt, you can say something like "have every sentence start with the letter A," and you can verify that with Python really easily. This is something that we did in our project, and the model gets better at this: you have constrained data, and the RL algorithm learns to change the model just a tiny bit and actually reach these answers.
    Nathan [00:12:23]: A confusing thing for people was these grader models. I think the place these come from is evaluation. There has been a lot of work in evaluation to make answer extraction stable, especially with math. An example that I used in the blog post I wrote today on this is how Llama 3.1 details their evals: for math, they use both SymPy, a Python package, for extraction and LLM-as-a-judge to extract their answers. What the graders are doing is essentially amping this up to a whole other level, where it's a nested structure of configs for doing reward shaping on these verifiable outputs. For math, it can be really easy: you have to handle these five formats, which I came up with in a minute, for how you could represent different numbers and tokens. But as you get to more complicated things and more complicated behaviors, it seems like OpenAI is insinuating that you're going to need more than just a yes/no loss function for your domains. And that seems fine. We already have a bunch of open models, like judge models and Prometheus and other things, that are designed specifically for LLM-as-a-judge, and I see that continuing to become part of this kind of open RL infrastructure.
    Nathan [00:13:41]: OpenAI had a bunch of screenshots. I'm not going to add much commentary on these, but it looks pretty standard. They're going to track how performance changes over time and things like that. You'll be able to look at all the outputs; this is just them making pretty things. And then they have this very generic RL plot. The most standard RL plot has an x-axis of time or trials and a y-axis of reward. Here, reward is an accuracy or a success rate on a certain validation set, and x is how much training was done. That's very similar to what we did in our project. I think another way you can put this is with an RL feedback diagram: if you've seen RL, where you have this agent interacting with the environment, you can squint at it and it'll be familiar. If you haven't, you'll probably be in for more of these things if RL keeps becoming popular, because RL is really formulated as trial-and-error learning. But if you're interested, we're happy to try to have people use our code, which does this for math and some instruction tuning already, and we want to try more complicated graders for things like code. For code quality, a binary outcome doesn't really make sense, which is a good way to think about why you might need to do some reward shaping for how you would grade outputs from a given model.
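    To make the verifiable-rewards idea above concrete, here is a minimal sketch, not the project's actual code, of binary graders for the two cases mentioned: a precise instruction-following constraint checked with plain Python, and a math answer checked by numeric comparison. The function names and the data format are illustrative assumptions.
```python
import re

# Minimal sketch of binary "verifiable reward" graders as described in the talk.
# Illustrative only; real grader configs add answer extraction and reward shaping.

def starts_with_letter_reward(completion: str, letter: str = "A") -> float:
    """Reward 1.0 if every sentence in the completion starts with `letter`."""
    sentences = [s.strip() for s in completion.split(".") if s.strip()]
    ok = all(s.upper().startswith(letter.upper()) for s in sentences)
    return 1.0 if ok else 0.0

def math_answer_reward(completion: str, answer: float) -> float:
    """Reward 1.0 if the last number in the completion matches the reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and abs(float(numbers[-1]) - answer) < 1e-6 else 0.0

# Example datapoint: a prompt paired with a verifiable answer, not a completion.
datapoint = {"prompt": "What is 17 * 3?", "answer": 51.0}
print(math_answer_reward("17 * 3 = 51", datapoint["answer"]))       # 1.0
print(starts_with_letter_reward("Apples are great. Always eat them."))  # 1.0
```
    An RL optimizer would then reinforce completions that earn the bonus; anything richer than a yes/no signal is where the nested grader configs and reward shaping mentioned above come in.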
    And to compare with the plot that OpenAI had, which shows performance improving over time, these are some experiments we ran on various evaluations. The left column is a language model evaluation that we would use in an academic paper, and the right is all the various internal RL statistics, where GSM8K, MATH, and IFEval are all being trained on their training sets. So we have the prompts, which are math questions, and we have the answers, which are numbers, and we're really doing this RL on seeing if the answer is right. And then it generalizes to various math evaluations that we care about. I kind of see this as: we got a tip from an industry lab member to do this a few months early, so we got a head start. I think a lot of people are obviously going to be trying to replicate this now, so it's fun that we have a starting point, and I'm excited to talk about it with people this week. And I think reasoning is worth continuing to work on. You can read the post that I was referencing here, and I'm happy to take any related or hard questions on reasoning, because I kind of opened the floor for that. So thank you. Get full access to Interconnects at www.interconnects.ai/subscribe
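    As a schematic of the trial-and-error loop described in the talk (sample several completions per prompt, grade them with a verifiable reward, reinforce the ones that were right), here is a toy, dependency-free sketch. A real RL finetuning run would use an actual language model and a policy-gradient optimizer; the candidate answers and the update rule below are purely illustrative.
```python
import random

# Toy sketch of RL with a verifiable reward. The "policy" is a categorical
# distribution over candidate answers to a single math prompt; nudging weights
# toward rewarded samples stands in for a real policy-gradient update.

reference_answer = 51              # verifiable answer for "What is 17 * 3?"
candidates = [48, 51, 54, 20]      # hypothetical answers the model might produce
weights = [1.0] * len(candidates)  # unnormalized policy weights

def sample_index() -> int:
    return random.choices(range(len(candidates)), weights=weights, k=1)[0]

def reward(index: int) -> float:
    return 1.0 if candidates[index] == reference_answer else 0.0

for step in range(200):
    # Sample several completions per prompt, as described in the talk.
    sampled = [sample_index() for _ in range(8)]
    for index in sampled:
        # Reinforce sampled answers in proportion to their (binary) reward.
        weights[index] += 0.1 * reward(index)

total = sum(weights)
print({c: round(w / total, 2) for c, w in zip(candidates, weights)})
# Probability mass concentrates on the verifiably correct answer, 51.
```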
    --------  
    16:22
  • (Voiceover) 2024 Interconnects year in review
    Original post: https://www.interconnects.ai/p/2024-interconnects-year-in-review
    Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    6:02

About Interconnects

Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai