The transcript discusses progress in AI research, focusing on the challenges and advances on the path to Artificial General Intelligence (AGI). The conversation covers the importance of scaling test-time compute, the limitations of pre-training, and the potential for further algorithmic improvements. The guest, a research scientist at OpenAI, emphasizes the need for continued progress and innovation in AI research to overcome existing challenges and move toward AGI.
Back in 2021 I had coffee with Ilya. He was asking me about my AGI timelines, and I told him, to be honest, I think it's going to take a very long time. I told him, look, we're not going to get to superintelligence until we can figure out how to scale inference compute in a very general way, and I think that's an extremely hard research problem. So I thought it would take at least a decade. It took two or three years. I have no doubt that there are others; in fact, I know there are other research questions that aren't solved, but I don't think any of them are going to be harder than the problems we've already solved.

Noam Brown is a research scientist at OpenAI, where he was a key part of their work on o1. Noam is at the forefront of reasoning in LLMs and has a really interesting track record from FAIR, where he worked on problems in diplomacy and poker. We hit on the biggest questions in LLMs today on Unsupervised Learning: whether these models are hitting a wall, how far test-time compute can scale, how Noam defines AGI, and what he's changed his mind on in the last few years of AI research. This was a really fun one to do right after the general release of o1, and I think folks are really going to enjoy it. Without further ado, here's Noam.

Well, Noam, thanks so much for coming on the podcast.

Of course, great to be here.

I've been looking forward to this for a while, and it's certainly well timed, with some exciting launches going on with Shipmas.

Yeah, I'm looking forward to it. We're going to be releasing o1 tomorrow, which I guess by the time this podcast is out will already be out there. I'm pretty excited for it. I think the community is going to love it, but I guess we'll see.

Well, I'd be remiss not to start with what I feel has been the question of the past month: have we hit a wall with model capabilities? There are obviously different parts to that question, so maybe to start, to what extent do you feel there's still more juice to squeeze on scaling pre-training?

So my view on this, and I've been pretty public about it, is that I think there's more room to push across the board, and that includes pre-training. The right way to think about it is that every time you want to scale these models further, there's a cost to that. You look at GPT-2: it cost somewhere between $5,000 and $50,000, depending on how you measure it. You look at GPT-4: obviously there's a lot of improvement, but the most important thing that's changed is the amount of resources that have gone into it. So you go from spending, for frontier models, thousands to tens of thousands of dollars, to hundreds of thousands, to millions, to tens of millions, to, for some labs, possibly hundreds of millions of dollars today, and the models keep getting better. I think that will continue to be true: if you throw more money, more resources, more data into it, you're going to get a better model. The problem is that if you want to 10x it each time, at some point that becomes an intractable cost. If you want to make it better, you do another 10x and now you're talking about billions of dollars; another 10x and you're talking tens of billions of dollars. At some point it's no longer economically worth it to push that further.
So presumably you're not going to spend trillions of dollars on a model. There's no hard wall; it's more of a soft wall, where eventually the economics just don't work out.

Right, and it seems like in many ways you're able to push this forward with test-time compute; there's lower-hanging fruit there from a cost perspective.

Exactly, and this is why I'm really excited about test-time compute, and why a lot of people are excited about it. It's kind of like we're back in the GPT-2 days: when GPT-2 came out and the scaling laws were figured out, it was pretty obvious that if you just scaled this up by 1,000x you'd get a better model, and you could do that. It's a little harder now to scale things up by 1,000x in pre-training, but with test-time compute we're still pretty early, so we have a lot of runway to scale it further, and there's a lot more low-hanging fruit for algorithmic improvements. So I think there's just a lot of exciting stuff to be done in that direction. That's not to say pre-training is done; it's just that there's so much more headroom to push the test-time compute paradigm further. And I should say, going back to pre-training for a second, it's not like there are two more orders of magnitude you can push and then you're done. There's still Moore's law; I think costs are going to continue to come down. It's just a question of how quickly you can scale. There was this huge overhang where it was very easy to scale very quickly, and that's becoming a little less true.

I realize this is probably an overly broad question, but how high is the ceiling on test-time compute? How do you think about where that could go?

Again, I think about it in terms of dollar value. How much does a ChatGPT query cost today? Ballpark, a penny. What could you spend on a query that you care a lot about; what would you be willing to pay? I think there are some problems out there that people would be willing to pay a lot of money for, and I'm not talking about a dollar or five dollars; I'm talking about a million dollars for some of the most important problems that society cares about. How many orders of magnitude is that? That's about eight orders of magnitude. So I think there's a lot of room to push it further, and I also think there's a lot of room for algorithmic improvements. It's not just that we dump more money into the query and get a better output; we can actually improve this paradigm further and make the scaling a lot better.
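A minimal sketch of the back-of-the-envelope math above, using only the ballpark figures mentioned in the conversation (roughly a penny per query today versus a hypothetical million-dollar query):

```python
import math

cost_today = 0.01            # rough cost of a ChatGPT query today, in dollars
cost_high_value = 1_000_000  # hypothetical price for a query on a critical problem

# Headroom between the two, in orders of magnitude
headroom = math.log10(cost_high_value / cost_today)
print(headroom)  # 8.0
```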
You know, one thing I thought was interesting: maybe a month ago Sam Altman tweeted that we basically know what we've got to do to build AGI, and I think you tweeted that his view matches the median view of OpenAI researchers today. Can you say more about that? Because obviously there are so many people now saying we've hit a wall. What do you think they're missing?

I feel like we've been pretty open about this: we see things continuing to progress pretty rapidly. That's my opinion, and I think Sam expressed his opinion. I've heard some people say that Sam is just trying to create hype or something, and I'm kind of surprised by that, because we're saying the same thing. I think it's a common opinion in the company that things are going to progress quickly.

And do you think pre-training and test-time compute alone get you most of the way there, or is there also this algorithmic bucket as well?

It's not by any means that we're done. It's not like we've cracked the code to superintelligence and now we just have to, you know...

That would be pretty cool if you came on the podcast and announced that you had, though.

But okay, the way I think about it: back in late 2021, I had coffee with Ilya Sutskever, and he was asking me about my AGI timelines. I told him, to be honest, I think it's going to take a very long time; I'm pretty skeptical that we'll get there within the next 10 years. The main reason I gave him was that we don't have a general way of scaling inference compute, of scaling test-time compute. I saw how much of a difference that made when it came to games, and it wasn't there in language models in a very general way. To me it seemed kind of silly that we were going to get to superintelligence just by scaling pre-training, because you look at these models and, yes, they're doing pretty smart things, but back then they couldn't even draw a tic-tac-toe board. Yes, you get to GPT-4 and suddenly they can draw the board and make mostly legal moves, but sometimes they still make illegal moves, and they make suboptimal decisions in tic-tac-toe. I have no doubt that if we scaled pre-training another order of magnitude or two it would start playing tic-tac-toe really well, but if that's the state of things, where we're spending tens of billions of dollars to train a model and it can barely play tic-tac-toe, that's pretty far from superintelligence. So I told him, look, we're not going to get to superintelligence until we can figure out how to scale inference compute in a very general way, and I think that's an extremely hard research problem that's going to take probably at least a decade to figure out. To my surprise, by the way, he agreed with me; he agreed that scaling pre-training alone would not get to superintelligence. I didn't realize it at the time, but he was also thinking very carefully about this scaling test-time compute direction. So I thought it would take at least a decade; it took two or three years. And I thought that was the hardest unsolved research question at the time. I have no doubt that there are others; in fact, I know there are other problems, other research questions, that aren't solved, but I don't think any of them are going to be harder than the problems we've already solved. For that reason, I think things will continue to progress.

Obviously you've had a massive impact with this test-time compute work, and your research career had been in search and planning, in games like poker and diplomacy. From others' accounts, it seems like when you joined OpenAI you were pretty clear that this was the direction to push in, and it seems to have really paid off. I'm curious how much consensus there was around that approach when you joined; maybe talk about getting the research organization oriented behind it.
Yeah, it's interesting. When I went on the job market and was interviewing at a bunch of places, people in general were quite receptive to the idea. Among the frontier research labs, for the most part, everybody believed that pre-training alone, the current paradigm, would not get us to superintelligence, and that something else was needed. So there was a lot of receptiveness to the idea that maybe we need to figure out how to scale test-time compute. Some labs were more bought into it than others, and I was actually kind of surprised that OpenAI was really, really on board with this, because they're the ones that pioneered large-scale pre-training and had scaled it further than anybody else. But they were very on board with it, and I didn't know at the time, when I was talking with them, that they had also been thinking about this for a while before I joined. When I did join, it's interesting, because the motivation was different: the motivation they had in mind was more about overcoming the data wall, not so much about figuring out how to scale test-time compute. But the techniques, the agendas, ended up being pretty compatible. So it actually wasn't too hard. When we started, it was still an exploratory research direction. There were some people working on it, but it wasn't like half the company was dedicated to some large-scale effort by any means. A few months after I joined, I and various other people were trying things, many of which didn't work. But there was one thing that one person tried that ended up getting some signs of life, and people said, oh, that seems interesting, maybe we should try some more things, and we got more and more signs of life. Eventually the leadership recognized that there was actually something here that seemed different and valuable, and that we should really scale it up. I was supportive of that, but others were too, and I think it's a testament to OpenAI and its organizational excellence that it was able to recognize there was a lot of potential here and was willing to invest a lot to scale it up. I think it's an underappreciated point that in many ways it's really surprising that something like o1 came out of OpenAI: it's disruptive to the paradigm that OpenAI itself pioneered. I think it's a really good sign that OpenAI isn't getting trapped in the innovator's dilemma and is willing to invest in a risky direction, and I think in this case it's going to pay off.

Yeah, it's really interesting, because if the script had continued to play out of just scaling pre-training and raising more money to do that, OpenAI is in a great position to do that, and any orthogonal approach is something different, so it's cool that it came out of the same place.
Obviously your original timeline was, hey, it's going to take 10 years to do this, and you did it in two. What was the first thing you saw that made you think, okay, actually this might be way faster than I thought?

First of all, it's not just me; it was me and a lot of other people who managed to do it in a shorter period of time than I predicted. What's the first thing I saw? When I joined, we had a lot of discussions about the kinds of behavior we would like the model to exhibit. That included things like: we want to see it try different strategies to solve a problem if one strategy isn't working out; we want to see it take a hard problem that involves many steps and break it down into smaller pieces that it can tackle one by one; we want to see it recognize when it's making mistakes and correct them, or avoid making them in the first place. There were a lot of discussions around how to get each of those individual behaviors, and that kind of bothered me, the fact that we would even try to tackle them individually, because ideally we'd just get something that figures all of this out on its own. We got the initial signs of life, and then one of the things we tried, which I was a big fan of and advocated for, was: why don't we just have it think for longer? And when we had it think for longer, it would just do these kinds of things emergently. It wasn't like, oh, suddenly we have o1, but there were indications that it was doing the things we wanted, the things we were strategizing about how to enable, and it was figuring out on its own that it should be doing them. It was also clear that we could scale it a lot further. So for me that was the big moment: we just had it think for longer and suddenly you see a qualitative difference, behavior we thought we would have to somehow add to the model, and it figured it out on its own. Of course the performance was better, but the performance wasn't that much better; it was really seeing the qualitative change, seeing those behaviors, that gave me the conviction that this was going to be a big deal. That was probably October 2023.

Wow, and it got out pretty fast after that.

Could have been faster, I guess.

How would you contextualize for our listeners today where planning in an o1-type model is helpful, and where you should stick with GPT-4o because o1 isn't as helpful? And how do you expect that to change going forward, given you're constantly working on improving this?

I think eventually there's a single model. Right now we're in this state where GPT-4o is better for many things and o1 is better for many things. Certainly o1 is more intelligent, so if you have a very hard problem, o1 is extremely good for that. I've talked with researchers at universities; I have a friend who's a professor who loves o1 and is a real power user, because it can tackle hard research questions that would normally require somebody with a PhD to handle.
For some tasks, creative writing might be one of them, though actually I'm not sure: I know that for something like creative writing, 4o is better than o1-preview; I'm not sure what the comparison is for o1. But certainly the big benefit of 4o is that you just get a faster response. So if you want a response immediately and it's not a very hard reasoning task, 4o is a reasonable thing to try. But I should say that where we want to end up is a single model: you just ask it everything, and if it requires a lot of deep thinking it can do that, and if it doesn't, it responds immediately with a quite good response.

What does the intersection of multimodal models and these models look like going forward?

So o1 takes images as input. I think that's going to be pretty exciting; it's going to be exciting to see what people do with that. I don't see any blockers to these models being as multimodal as 4o and the other models.

One of the fascinating parts of o1 is that a lot of the previous work you'd done in reasoning was built on reasoning specific to a particular problem: as I understand it, Go used Monte Carlo tree search, which maybe wasn't as relevant for poker. One of the things that's so impressive about what you built is that you scaled inference compute generally. Could you talk a little about what's required to do that, versus some of the more specific work done in the past toward specific types of problems?

Well, I can't go into details about the actual technique, but I think the important thing is that it requires a change in mindset. When I was a PhD student, and afterwards, once I saw how much of a difference scaling test-time compute made in poker, I thought: okay, this is great, but unfortunately it only works in poker, so how do we extend this algorithm to handle more and more domains? So there's a question of how you get this technique to work for both poker and Go, or poker and diplomacy, or something like that. We developed techniques that worked in Hanabi, we developed techniques that worked in diplomacy, and one of the things I was considering was just trying to get this algorithm to play as many games as possible: come up with an algorithm as similar as possible to what was done in poker, but able to work more broadly. I think the diplomacy work actually convinced me that that's the wrong way to think about it, that you really need to start from the end point, which is: okay, we have this extremely general domain, and language is actually a really good example of this, where you have such breadth. Instead of trying to extend a technique that worked in one domain to do more and more domains and eventually do everything, we should start from everything and figure out some way to scale test-time compute. Initially, of course, it's not going to scale very well; it's not going to be a very good technique for scaling test-time compute, but then can you make it scale better and better?
I think that change in mindset, the diplomacy work is really what convinced me to make it, because we tried to take the techniques we developed for poker and Go and apply them to diplomacy, to the actual full, general game of diplomacy. We managed to apply them to diplomacy with some constraints on what it could actually do, and there was a ceiling to how much you could achieve: we only got to human level, strong human-level performance, in diplomacy, and it was pretty clear that if we pushed that paradigm a lot further, we weren't going to get to superhuman performance. So to actually tackle the full game of diplomacy and reach superhuman performance, it was clear we needed something that would work for pretty much anything. And so I thought, okay, you've just got to jump to the end point and try to tackle everything.

It's so interesting. You mentioned that you expect everything to converge on a single model. In what time frame, in the medium term, do you think we have one model that rules them all? There are obviously lots of folks out there building specialized models for different use cases, legal models or healthcare models and so on. Do you think building your own model makes sense?

It's a good question; I get asked this a lot, and I don't have a great answer, but one thing I have been thinking about is this: you can ask o1 to multiply two large numbers and it can do it. It'll work through the arithmetic, figure out how to carry the digits and all that, actually multiply the two large numbers, and tell you the answer. It doesn't make any sense for it to do that. Really, what it should do is call a calculator tool, or write a Python script that multiplies the two numbers, run the script, and tell you the output. So I think that calculator tool is one extreme end of the spectrum: very specialized, very simple, but very fast and cheap. On the other end of the spectrum you have something like o1, which is very general and very capable, but also pretty expensive. I think it's quite possible you'll see a lot of things that essentially act as tools in between those two extremes, which o1 or a model like o1 can use to save itself, and save the user, a lot of cost.

It's really interesting that the tools don't end up being capability-enhancing; they're more there so you don't need massive compute costs to solve something that could be solved much more easily.

Yeah, and it's also entirely possible that some of these tools just do a flat-out better job than o1. I think about it the same way I'd think about how a human would act: you could ask a human to do something, but maybe they're better off just using a calculator, or some other kind of specialized machine.
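As a minimal sketch of the delegation described here, this is the sort of short script a reasoning model might write and run instead of carrying digits through the arithmetic itself; the operands below are made up purely for illustration:

```python
# The kind of throwaway script a model could generate and execute instead of
# doing long multiplication token by token. The numbers are arbitrary examples.
a = 48_302_918_475_610_221
b = 92_837_465_019_283_746

# Python integers are arbitrary precision, so the exact product is a single
# cheap operation rather than a long chain of in-model arithmetic steps.
print(a * b)
```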
favorites this is Jacob hope you're enjoying the episode with Noom just wanted to take a quick break to let you know that on unsupervised learning we have conversations like this every week with Founders and AI researchers if if you're enjoying the episode please consider sharing with a friend and subscribing without further Ado back to Noom any other kind of unexpected use cases that you've seen in the wild or personal favorites I think one thing I'm really excited for is to see how 01 is used for coding um I think 01 preview like people were pretty impressed uh its coding ability but it was good in some ways for coding and and not as great for others and so you know it wasn't like strictly dominant in term among models for coding um I think that 01 is going to do a lot better and I'm pretty excited to see um how how that changes the field yeah um if that changes the field and um yeah I I I'm yeah I'm just really curious to see like you know we we I use 01 internally um other people do we've had some people play around with it give us feedback but I don't think we really know how it gets used until we actually deploy in the wild yeah how do you use it I use it for a lot of coding tasks or like you know if I yeah have something and and fre what I'll do is like if I have like something that's pretty easy I'll give it to 40 but if I have something that I know is really hard or that I need to write a lot of code I'll just give it to 01 and like have it just do the whole thing on its own um and also frequently if like I have a tough problem that like for whatever reason 40 isn't getting um I'll just give it to 01 and it'll usually give me an answer it's not doing corei research yet 01 is not doing corei research you know you mentioned on the path to o1 obviously there were some things that you saw you know um Milestones that were really meaningful around uh you know kind of the ability to reason things as you think about obviously you're continuing to work on this class of models like what are the Milestones that are meaningful to you going forward things that if you saw as as you guys contined to scale up that would be you know important to you like Milestones is in like among among benchmarks or something or I mean it could be specific benchmarks or like you know even just how you think about like the next set of capabilities that are important in you know that you that you'd hope that like an O2 would have uh I'm really excited to see these models um become more agentic um I think I think a lot of people are so I I think one of the major challenges one of the major barriers to actually like achieving a agents people have been talking about agents for a while ever since chbt came out people were always talking about agents and like you know they would come to me and like ask like oh why are you working on agents and like my feeling was that they were just the models are too brittle that if you have a a long Horizon task and there's a lot of intermediate steps you need the reliability and you need the uh coherence to be able to like have the model figure out that it needs to do these individual steps and also like execute on them and you know yes people tried to like prompt the models to be able to do that and you could kind of do it but it was always like kind of fragile um and not General enough and the cool thing about o1 is that I think it's a real proof of concept that like you can give it a really hard problem and it can figure out the intermediate steps on its own and it can figure out 
So the fact that it's able to do things that are completely outside the realm of what something like 4o can do without really excessive prompting is, I think, a good proof of concept that it can start doing agentic things. So yes, I'm excited for that direction.

There are obviously a lot of folks today working on agents, and I think they basically take the current limitations of models and find ways around them, whether they chain six model calls together to check outputs, or find some smaller fine-tuned model that just checks whether something ties exactly back to the original data source. It feels like there's all this orchestration and scaffolding built to make it work. Does some of that persist, or does it eventually all just become part of the underlying model?

Okay, so there's this great essay called The Bitter Lesson.

I knew we couldn't get through this podcast without The Bitter Lesson coming up.

You know, I'm surprised: whenever I give talks at various AI events, sometimes I'll poll people and ask how many have read The Bitter Lesson, and surprisingly few have.

I think anyone who's listened to a podcast with you, or follows you on Twitter, has been exposed to The Bitter Lesson many times.

Great. Okay, so for those who haven't: I think it's a great essay, and I highly encourage people to read it. It was written by Richard Sutton, one of the founders of the field of RL, and he says that basically every time, look at the history of chess for example, the way people tried to tackle the problem was to code their human knowledge into the systems and try to get them to do human-like things, and the techniques that ended up working really well were the ones that scaled with more compute and more data. I think the same is true now with these language models. We've reached a certain level of capability, and it's really tempting to try to push it: there are things the models are just unable to do, and you'd like them to be able to do those things, so there's a big incentive to add a bunch of scaffolding and prompting tricks to push them a little bit further. You encode a lot of your human knowledge in order to get the models to go a little further. But what's ultimately going to work in the long run is a technique that scales well with more data and more compute, and there's a question of whether those scaffolding techniques scale well with more data and more compute; I think the answer is no. I think something like o1 scales really well with more data and more compute, so in the long run, a lot of those scaffolding techniques that push the frontier a little bit further are going to go away.

I think that's an interesting question for builders today: you could solve a here-and-now problem with that and then evolve over time with what's required.

Yeah, it's a tricky thing, especially for startups.
They probably face a lot of demand for some task, and there's something that's just out of reach of the models, and they think: okay, if I invest a lot into the scaffolding and customization to make it able to do those things, then I'll have a company that can do the thing nobody else can do. But I think it's important, and this is actually one of the reasons we're telling people these models are going to progress, and progress quickly, that you don't want to be in a position where the model capabilities improve and suddenly the models can just do that thing out of the box, and now you've wasted six months building scaffolding or some specialized agentic workflow that the models handle out of the box.

Talking about what's happening in the broader LLM space, beyond test-time compute, what other research areas are you paying attention to?

I was really excited by Sora; I think a lot of people were. I thought it was really cool. I wasn't keeping too up to date on the state of video models, so when I saw it I was pretty surprised at how capable it was.

You obviously cut your teeth in academia, and I think there's a question a lot of folks are thinking about now: the role of academia in AI research today, given access to a completely different level of compute. How do you think about the role of academia today?

Yeah, it's a really tough question. I've talked to a bunch of PhD students, and they're in a tough situation: they want to help push the frontier further, and that's hard to do in a world where so much depends on data and compute. If you don't have those resources, it's hard to push the frontier forward. There's a temptation among some PhD students to do what I said shouldn't be done: add their human domain knowledge, add these little tricks, to try to push the frontier a little further. So you take a frontier model, you add some clever prompting or something, you push it a little further, and you get 0.1% higher than everybody else on some eval. I don't blame the students so much as I think academia incentivizes this: it's prestigious to have a paper accepted to a prestigious conference, and it's much easier to get a paper accepted if you can show you're at least slightly better than everybody else on some eval. So the incentive structure is set up in a way that encourages that behavior, at least in the short term, but in the long term that ends up not being the most impactful research. My suggestion is: don't try to compete with the frontier industry research labs on frontier capabilities. There's a lot of other research that can be done, and I've seen really impactful research done this way. One example is investigating novel architectures or novel approaches that scale well. If you can show the scaling trends, show that it's a promising path as you throw more data and more compute at it, then that is good research, even if it's not getting state-of-the-art performance on some eval, and people are going to pay attention to it.
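As a rough illustration of what showing the scaling trends can look like in practice, a minimal sketch rather than anything prescribed in the conversation; the compute and loss numbers below are invented purely for illustration:

```python
import numpy as np

# Hypothetical measurements for a proposed method: training compute vs. validation loss.
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs
loss = np.array([3.10, 2.55, 2.12, 1.78])     # validation loss at each budget

# Fit a power law, loss ~ a * C^(-b), via linear regression in log-log space.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
a, b = 10 ** intercept, -slope

print(f"fitted trend: loss ~ {a:.1f} * C^(-{b:.3f})")
# A consistently steeper exponent than a baseline architecture is the kind of
# evidence that holds up even without a state-of-the-art number on any eval.
```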
It might not be that people who casually pay attention to the field pick up on it, and it might not make it into the news cycle, but it will have an impact if it shows promising trends. I guarantee you that industry research labs look at those kinds of papers, and if they see something showing a promising trend line, they're willing to put in the resources to see if it actually pays off at large scale.

What evals are still meaningful to you? When you're playing around with a new model, what are you looking at?

There are a lot of vibes questions that I ask; I'm sure everybody has a go-to vibes question. My go-to is really tic-tac-toe.

Always games. I guess that makes sense.

Yeah, it's shocking to see how challenging it is for some of these models to play tic-tac-toe. I joke that it's because there aren't enough five-year-olds on the internet discussing tic-tac-toe strategy on Reddit; we haven't populated the world with tons of tic-tac-toe data. And I just see how these models do with the kinds of day-to-day questions that I have. It's pretty cool to see the progress going from 4o to o1-preview to o1.

You mentioned that since 2021 you changed your mind, and then showed it, on what was possible with test-time compute. Anything in the last year that you've changed your mind on in the AI research world?

I should say it wasn't that I changed my mind in 2021; I was pretty bought into this basically from when we got the poker results in early 2017. For language models, I started thinking about it more around 2020, 2021.

Sorry, I meant more that in 2021 you thought it would take 10 years to scale this, and it took two. Anything in the last year that you've done a 180 on?

I think the main thing I've changed my perspective on is how quickly things would progress. Like I said, I've been in the AI field for a pretty long time by today's standards; I started grad school in 2012. I saw the deep learning revolution happen, and I saw people talking very seriously about AGI and superintelligence back in 2015, 2016, 2017. My view at the time was that just because AlphaGo is superhuman at Go, it doesn't mean we're going to get to superintelligence anytime soon, and I think that was actually the correct assessment. People didn't look enough at the limitations of AlphaGo: it can play Go, it can even play chess and shogi, but it can't play poker, and nobody actually had a good idea of how to make it more general than that. Two-player zero-sum games are these very ideal situations where you can do unlimited self-play and keep hill-climbing in a direction that gets you to superhuman performance, and that's not true of the real world. So I was on the more skeptical end, though probably more optimistic than the average AI researcher that we could make progress toward very, very intelligent models that would change the world.
But compared to people at OpenAI or some of these other places, I was on the more skeptical end, and my perspective on that has changed quite a bit. Seeing the ability to scale test-time compute in a very general way changed my mind, and I became increasingly optimistic. Actually, I think the conversation I had with Ilya back in 2021 was the start of that; he kind of convinced me that, yes, we don't have the entire paradigm figured out, but maybe it's not as far away as 10 years; maybe we can get there sooner. Seeing that actually happen changed my perspective, and I think things are going to happen faster than I originally thought.

Obviously there are a bunch of folks out there trying to compete with Nvidia; I think Amazon recently has been pretty aggressively investing in Trainium and having Anthropic use it. What do you think about some of these other hardware efforts?

I'm pretty excited to see the investment in hardware. I think one of the cool things about o1 is that it really changes the way people should be thinking about hardware. Before, people had this mindset that there were going to be massive pre-training runs, but that inference was going to be pretty cheap and very scalable. I don't think that's going to be the case. I think we're going to see a major shift toward inference compute, and if there are ways to optimize around inference compute, that's going to be a big win. So I think there's an opportunity for a lot of creativity on the hardware side to adapt to this new paradigm.

Hitting on some questions outside of LLMs: I feel like your work on diplomacy is incredibly interesting; obviously it's a game that involves negotiation, predicting how others will act, and so on. It's hard not to think about the implications of that for simulating society to test policies, or even having AI as part of a government in some way. How have you thought about this, and what are your intuitions about the role these models will play in those parts of society as they get better and better?

Well, I think there are really two questions there, but to answer one of them: one of the directions I'm pretty excited about for these models is using them for a lot of social science experiments, and also things like neuroscience. I think you can learn a lot about humans by looking at these models that were trained on vast amounts of human data and are able to imitate humans quite well. And of course, the great thing about them is that they're much more scalable and cheaper than hiring a bunch of humans to do these experiments. So I'm curious to see how the social sciences use these models to do cool research in their fields.

What are some ways you could imagine that happening?

I'm not a social scientist, so I haven't thought about this all that carefully, but economics, for example; there's a lot there.

You did work at the Fed before, right?

I did work at the Fed, yeah. I guess game theory is actually a good one.
When I was in undergrad, I took part in some of these experiments where they would bring in a few undergrads, pay them a small amount of money, and have them do small game theory experiments to see how rational they are, how they respond to incentives, how much they care about making money versus, say, getting revenge on people who wronged them. A lot of these things you can do now with AI models. It's not obvious that it would translate, that it would be a match for human behavior, but that's something that can be quantified: you could actually see whether, in general, these models do the things humans would do. And then, for a much more expensive experiment, you could maybe extrapolate and say, okay, this isn't cost-effective to do with human subjects, but we can use this AI model. Or there are ethical concerns: maybe you can't run an experiment because it isn't ethical to do with humans, but you could do it with AI models. So one example is the ultimatum game. Are you familiar with that?

No.

Okay, so in the ultimatum game you have two participants, call them A and B. A has, say, $1,000 and has to give some percentage of that to B, and then B can decide whether to accept that split or to decide that neither player gets anything. So if A has $1,000 and offers $200 to B: if B accepts, B gets $200 and A gets $800; if B rejects, both of them get zero. There are experiments showing that if people are offered less than roughly 30%, they'll reject. Of course, there's the question: if it's a small amount of money, that's pretty understandable; if it's $10 and you're only offered $3, you might just be annoyed at the person and reject to spite them. Are you still going to do that if it's $10,000 and you're being offered $3,000? That's kind of a different question. It's, of course, super expensive to actually run that experiment, so the way they've done it historically is to go to a very poor community in a different country, offer people what to them is a very large amount of money, and see how they act differently. But even then, you can only push that so far. So with AI models, maybe you could actually get some insight into how people would react to these kinds of situations that are cost-prohibitive to study.
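A minimal sketch of what running this on a language model might look like; the ask_model function is a hypothetical placeholder for whatever chat API one would use, and the payoff rules follow the description above:

```python
def ask_model(prompt: str) -> str:
    """Hypothetical placeholder for a real chat-completion call."""
    raise NotImplementedError("wire this up to a chat API of your choice")

def ultimatum_round(pot: int, offer: int) -> tuple[int, int]:
    """Play one round with the model as responder B; returns (A's payoff, B's payoff)."""
    reply = ask_model(
        f"You are player B in the ultimatum game. Player A has ${pot} and offers "
        f"you ${offer}. If you accept, you keep ${offer} and A keeps ${pot - offer}. "
        "If you reject, you both get nothing. Answer with one word: ACCEPT or REJECT."
    )
    if reply.strip().upper().startswith("ACCEPT"):
        return pot - offer, offer
    return 0, 0

# Sweep offers at a stake that would be expensive to test on human subjects,
# and compare the model's rejection threshold to the roughly 30% reported for humans:
# for offer in range(500, 10_500, 500):
#     print(offer, ultimatum_round(10_000, offer))
```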
It also makes me think, for neuroscience and these other things, that a common complaint about the social sciences is that all these experiments are done on college kids who need credit in their intro psych class. So you're getting exposure to a broader swath of society: the internet these models were trained on is probably a broader swath of society than most of these experiments, which are basically 19-year-olds at top institutions.

Yeah, that's a great point. And I should also say: if you're doing these experiments with GPT-3.5, GPT-3.5 is not going to do a great job of imitating how an actual human would behave in a lot of these settings. But this is a very quantifiable thing you can actually measure, how closely these models match what humans would do, and I suspect, though I haven't actually looked at these experiments myself, that as the models become more capable, they do a better job of imitating how actual humans behave in these settings.

And then, obviously, your work on diplomacy was focused on an AI player among a bunch of humans. How, if at all, does that change? I feel like we're about to enter a world where we have AI agents interacting with other AI agents, negotiating and whatnot. How does that change the underlying work you have to do to make a really effective AI agent? Is it literally the same problem?

I think one of the things I'm really excited about with LLMs is that there was always this question in AI about how you get AIs to even communicate with each other. There's a whole field of AI called emergent communication, where people would try to teach AIs to be able to communicate with each other, and that problem is now effectively solved, because you have a language built in, one that, conveniently, humans also use. So a lot of these problems are conveniently answered out of the box, and it's quite possible that maybe you don't need to change that much.

What do you think is happening in the AI robotics space? Where do you think that goes in the next few years?

I think in the long term it makes a lot of sense. I did a master's in robotics; I didn't actually work with robots that much, but I was in the program and had a lot of friends working in robotics, and one of the main takeaways I got is that hardware is hard, and it takes longer to iterate on hardware than on software. So I suspect robotics is going to take a little while to progress, just because iterating on actual physical robots is hard and expensive. But I think there's going to be progress.

Obviously you're about to release o1 into the wild, and people are going to build all sorts of things on top of it that neither of us could possibly imagine. But are there areas that you feel are underexplored applications today, or places where you wish there were more builders messing around with these models?

I'm really excited to see these models advancing scientific research. We've been in kind of a weird state up until now, where the models were broadly very capable but weren't necessarily surpassing expert humans in hardly any domain. Increasingly, as time goes on, that's going to stop being true, and we're going to start seeing the models surpass what expert humans can do, first in just a few narrow domains, but then in more and more of them. That opens up the possibility that you can actually advance the frontier of human knowledge and use these models not as replacements for researchers, but as a partner you can use to do things that were otherwise not possible, or do them a lot faster. So I think that's the application I'm most excited about. It's not something that has really been possible yet, but I think we're going to start seeing it happen.

Do you think it's possible with this current set of models?

I don't know, and that's actually one of the reasons I'm excited to see o1 released.
Because I'm a researcher in one domain, but I'm not a researcher in all these different domains, and I don't know if it will be able to improve the state of chemistry research, or the state of biology research, or theoretical mathematics. Getting the model into the hands of these people and seeing what they can do with it will give us some feedback on where it's at in those domains.

You mentioned that it might start more narrowly before expanding out. Any intuitions on the narrow subset of things that might be particularly well suited to it, or is that for the community to find out as they mess around with it?

I think it's for the community to find out. From o1-preview, it looks like it does particularly well on math and coding; those are very impressive results. It's improving things pretty broadly, but we're seeing quite noticeable progress on those two. I wouldn't be surprised if that continues to be true: the performance improves very broadly, but because math and coding are already ahead, they keep progressing more quickly. But I think it's going to be a broad improvement across the board.

Well, Noam, it's been a fascinating conversation. We always like to end with a quick-fire round where we get your quick take on things. Maybe to start: what's one thing that's overhyped and one thing that's underhyped in the AI world today?

Oh jeez. It's supposed to be a quick-fire round, and that's a hard question.

This is where I stuff all my overly broad questions.

I guess overhyped, I would say, are a lot of these prompting techniques and scaffolding techniques that, like I said, I think are going to be done away with in the long term. Underhyped: I'm a huge fan of o1, I've got to say o1. For people paying attention to the field, it has been a big update; for the broader world, I don't know if people have recognized yet what it means, to the extent that they should. I'll go with those.

Hopefully the release tomorrow starts getting at that.

Yeah, we'll see.

Do you think model progress in 2025 will be more, less, or about the same as in 2024?

I think we will see progress accelerate.

How do you define AGI?

I've been trying to shift away from using that term as much as possible. I think there are going to be a lot of things that an AI will not be able to do that humans can do, for a long time, and I think that's the ideal scenario, especially things like physical tasks, where I think humans will have an edge for a very long time. So an AI that can accelerate human productivity and make our lives easier is, I think, the more important idea than AGI.

Well, Noam, I always like to leave the last word to our guest, and I feel like there are a million places you could point people, to your work, to what's going on at OpenAI, but the floor is yours. Anything you want to say to our listeners, or anything you want to call out?

Yeah, I guess the main thing is, to the skeptics out there: I get it. I've been in this field for a long time.
I was very skeptical about the state of things and the hype around the progress in AI. I recognized that AI was going to progress, but I thought it would take much longer to even reach this point. I think it's really important to recognize that where we are right now is complete science fiction compared to even five years ago, let alone ten years ago. The progress has been astounding, and there are reasonable concerns: are we going to hit a wall, is progress going to stop. But I think it's important to recognize that the test-time compute paradigm, in my opinion, really addresses a lot of those concerns. So for people who are still skeptical of the progress in AI, I would just recommend: take a look for yourself. We've been pretty transparent, with the blog posts and our results, about where things are and where we see things going, and I think the evidence is pretty clear.

Well, Noam, this has been absolutely fascinating, a real pleasure of this job to get to sit down with you. Thanks so much for taking the time.

Of course, thanks.

A huge thanks again to Noam for a fascinating conversation. If you enjoyed it, please consider subscribing and sharing with a friend; we're always trying to get the word out about the podcast. We have a bunch of great conversations coming up with leading AI researchers and founders, and 2025 is going to be an incredible guest lineup. Thanks so much for listening, and I'll see you next week.