In this episode of "Navigating Forward," Kevin McCall is joined by Dr. Kalyan Basu to discuss generative AI and its future impact. Kevin and Kalyan cover the challenges in model training, the potential of synthetic data, and how businesses can prepare for the changing landscape of AI. Dr. Basu, with his extensive experience from academia to Microsoft and as a technical advisor, provides practical insights into the progression from large to small language models, multimodal experiences, and the advancements in AI accessibility. Stay tuned to learn about the intersection of policy, legislation, and AI development from a seasoned expert.
00;00;03;08 - 00;00;34;26
Narrator
Welcome to Navigating Forward, brought to you by Launch Consulting, where we explore the ever-evolving world of technology, data, and the incredible potential of artificial intelligence. Our experts come together with the brightest minds in AI and technology, discovering the stories behind the latest advancements across industries. Our mission: to guide you through the rapidly changing landscape of tech, demystifying complex concepts and showcasing the opportunities that lie ahead.
00;00;34;29 - 00;00;45;19
Narrator
Join us as we uncover what your business needs to do now to prepare for what's coming next. This is Navigating Forward.
00;00;45;21 - 00;01;05;10
Kevin McCall
And today, I'm delighted to have Dr. Kalyan Basu with us here to talk about the new frontiers in generative AI. There's obviously activity every single day in this domain, but today we're going to go beyond this quarter or even this year and talk about challenges that will potentially take years to push through. So Kalyan, thank you so much for joining us.
00;01;05;11 - 00;01;06;29
Kevin McCall
Why don't you introduce yourself?
00;01;07;01 - 00;01;35;27
Kalyan Basu
Yeah. Thank you, Kevin. It's a pleasure. And it's a pleasure because this is a topic very dear to my heart, one I've spent almost a lifetime in. To introduce myself, I'm going to break it into three phases. I started my life in academics and research, you know, as a teacher in the top universities in India. The next part of my life was mostly at Microsoft.
00;01;35;27 - 00;01;36;24
Kalyan Basu
A long stint at...
00;01;36;24 - 00;01;38;01
Kevin McCall
Microsoft where we met.
00;01;38;03 - 00;01;46;17
Kalyan Basu
Where we met. And a lot of that stint was in AI and, you know, the precursors to AI, because things have been changing so fast.
00;01;46;18 - 00;01;49;20
Kevin McCall
Even the definition of AI has dramatically changed lately.
00;01;49;24 - 00;02;10;17
Kalyan Basu
Totally, totally. And then this third phase of my life, which I think is the happiest phase for me, is as a technical advisor and as a consultant to startups and other companies that really wish to unlock the great value of this technology. So that's where I am, and I'm looking forward to the discussion.
00;02;10;20 - 00;02;37;21
Kevin McCall
Perfect. Well, I want to thank you, because when we worked together, all of the best, most interesting, most influential papers that you sent over to me really hit the mark. So thank you again for the time you took to curate those reading lists for me personally. So now, before we get to the big topic of today, the new frontiers in generative AI, I want to, maybe a little bit comically, go back to a list, a top-five list that I was discussing with customers late last year.
00;02;37;21 - 00;03;13;28
Kevin McCall
I just want to read you this list and get your reaction to it. They were arguably guesses I had regarding extensions of the activity in 2023 into 2024. So if I were to give you the top five: clearly, generative AI captured the public's imagination with OpenAI, ChatGPT, etc. last year. And while generative AI in 2023 was really about chatbots and LLMs, I think in 2024 we're going to see a lot bigger and more widespread breakthroughs in the generation of audio, video, speech, etc. So I talked about that as number one.
00;03;14;05 - 00;03;34;12
Kevin McCall
Number two was while 2023 was really the year of the large language model, the frontier language model, I think 2024 is going to be more the year of the small language model, right? So I think we're going to see more activity there. Third was leading chat bots were supported by these frontier large language models. But most of these modes of interaction were very narrow.
00;03;34;12 - 00;04;06;26
Kevin McCall
Right. They were single-modal experiences: text in, text out. In 2024, I think we're going to see much more multimodal experiences, right, in not just the digital world but also the physical world. And then the fourth was: leading AI technology to date has really required deep technology pros and specialized domain experts. But I think we're going to see dramatic expansion in AI consumability, you might say, where we're going to see an explosion of empowerment and scale for office workers and professionals.
00;04;06;28 - 00;04;21;05
Kevin McCall
And the fifth is this, I think, inevitable expansion of a higher level of urgency and maturity in things like policy and legislation, you know, a focus on IP and the implications of that. So that seemed like a reasonable list. It feels like ancient history now, doesn't it?
00;04;21;11 - 00;04;46;00
Kalyan Basu
Well, it's at least medieval. It's not ancient. But yeah, it's reasonable. But, you know, even though we use those words, the meanings have shifted from under our feet. And I think we're now looking at frontiers and challenges which you might still describe in those kinds of words, but they mean different things and they mean different approaches.
00;04;46;06 - 00;04;47;16
Kalyan Basu
Yeah. This is so. Yeah, part of the.
00;04;47;16 - 00;05;11;04
Kevin McCall
Excitement of working where we do right now. Yeah. Okay. Well, let's move beyond that list and jump back to our primary topic today. Let's start with more effective approaches to scaling model training. Set up the problem for me, and let's talk about how we might be able to fundamentally better scale model training in the future.
00;05;11;07 - 00;05;33;06
Kalyan Basu
This is a good on-ramp to this brave new frontier, I think, because it's still very relevant to the world that you described. I'm going to try to sketch some of the biggest challenges I see, and I kind of break them up into three basic parts: data challenges, context challenges, and scale challenges.
00;05;33;06 - 00;05;50;20
Kalyan Basu
Okay. And all of them have to do with fine-tuning. I think the pre-training stuff, you know, the frontier models and the big companies are going to handle a lot of that for us. But the fine-tuning world, which is where a lot of the effort is, is going to see a lot of changes. So let's look at these three categories, okay?
00;05;50;21 - 00;05;51;18
Kevin McCall
Sounds good.
00;05;51;21 - 00;05;57;23
Kalyan Basu
So data I mean, weird as it may sound, we're running out of data to train these models.
00;05;57;28 - 00;05;59;00
Kevin McCall
This seems strange.
00;05;59;00 - 00;06;13;28
Kalyan Basu
It is. It's just mind-blowing. I mean, I think that this touches on pre-training, but it's one of the biggest challenges, because we are running out of data, particularly in the fine-tuning stage, when you need a lot of rich data to do task-specific fine-tuning.
00;06;14;00 - 00;06;25;10
Kevin McCall
Yeah. I mean, pre-training often is trillions of tokens. Yeah. But inevitably the fine-tuning phases, the instruction-tuning phases, require highly curated supervised training data.
00;06;25;10 - 00;06;44;04
Kalyan Basu
But even in the pre-training phase, we are running out of data. I mean, there are only so many tokens that the internet has, right? And then there are a lot of, you know, ring-fenced, walled-garden areas of the internet, which make it difficult. So I think one of the biggest frontiers of the next couple of years is going to be synthetic data generation.
00;06;44;10 - 00;06;45;15
Kevin McCall
Okay. Tell me more about that.
00;06;45;20 - 00;07;02;18
Kalyan Basu
So synthetic data generation is just data generation, but, you know, done artificially. So your actual training data will be generated by a class of LLMs, which are going to create this data for training the real, live LLMs.
00;07;02;18 - 00;07;08;00
Kevin McCall
And does that include some of the work we did, for example, in high-fidelity digital simulation?
00;07;08;02 - 00;07;31;08
Kalyan Basu
Oh yeah. In fact, simulation is right up on that list. Simulations, particularly for the multimodal models, will become a main source of synthetic data generation. But even for language models you're going to have a lot of synthetic data. And for that you need models of a certain class and character. So you're going to see a lot of that.
00;07;31;10 - 00;07;50;27
Kalyan Basu
Okay. Let's look at context. Context is actually a big constraint. I mean, even the biggest models have 1 million token contexts, right? And those context windows are not going to cut it, because the moment you get into enterprise scenarios, product manuals, you know, they're going to just exhaust that context window.
00;07;50;29 - 00;08;09;10
Kevin McCall
This seems amazing to me, because it wasn't very long ago that we had models with 2K or 8K context windows, and we went to sliding windows of context. Now some of the new models are 128K, 256K, et cetera. That's so much information. It feels interesting to consider that that still might not be nearly enough to solve certain problems.
00;08;09;10 - 00;08;35;24
Kalyan Basu
Exactly, exactly. And so that's going to be the challenge. What we're going to see is many different tactics for context efficiency. We're going to see sliding-window contexts. We're going to see striped attention, as you know, which means that you're really not looking at the whole context at one single point in time, but you're striping it. And you're going to see advanced strategies like Hyena, you know, which is a whole new class of model architectures.
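To make the sliding-window idea mentioned here concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not something shown in the episode: the function name `sliding_window_mask` and the sizes are hypothetical, and the mask is causal (a position never looks ahead). Each position attends only to itself and the few positions just before it, instead of the whole context.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Hypothetical helper: causal attention restricted to the last `window`
    positions, position i itself included.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Assumed toy sizes: 6 tokens, each seeing at most 3 positions.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Work grows roughly as seq_len * window rather than seq_len squared.
attended_pairs = sum(sum(row) for row in mask)
full_causal_pairs = 6 * 7 // 2
print(attended_pairs, "attended pairs vs", full_causal_pairs, "with full causal attention")
```

The trade-off is that information from outside the window only reaches a token indirectly, through stacked layers, which is one reason the other tactics discussed here exist.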
00;08;35;28 - 00;08;37;05
Kevin McCall
Okay. Tell me more about Hyena.
00;08;37;08 - 00;08;59;24
Kalyan Basu
Hyena is very mathematically principled. It actually has this whole hierarchy of operators called Toeplitz operators. And it's been developed by a whole cast of luminaries. You know, Bengio is one of the writers of that paper. So you're going to see a lot of interesting developments in this Hyena structure for context efficiency.
00;08;59;28 - 00;09;22;12
Kalyan Basu
Okay. Then you're going to see a whole lot of parameter-efficient fine-tuning strategies. And you've heard this, you know, noise about LoRA. LoRA was a fantastic step forward: low-rank adaptation, where you're not fine-tuning the whole model, you're fine-tuning a kind of ancillary weight structure for it. And you're going to see a lot of advances.
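The low-rank adaptation idea described here can be sketched in a few lines of plain Python. This is a hedged illustration, not any particular library's API: the class name `LoRALinear`, the helper `matvec`, and the sizes (a 64 by 64 layer, rank 4) are all assumptions made for the example. The frozen weight stays untouched, and only the two small matrices `A` and `B` would be trained.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical layer computing W x + (alpha / rank) * B (A x)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weights, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapter
        # initially leaves the layer's output unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


rng = random.Random(1)
frozen = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(64)]
layer = LoRALinear(frozen, rank=4)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
print(layer.trainable_params(), "trainable values vs", 64 * 64, "frozen values")
```

In this toy setting the adapter holds 512 trainable values against 4,096 frozen ones, which is the "ancillary structure" point in miniature.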
00;09;22;12 - 00;09;36;17
Kalyan Basu
Then we got QLoRA and various other variants of that. Okay. And then finally, I think we're going to see a move away from the reinforcement learning layer of fine-tuning today. I mean...
00;09;36;19 - 00;09;38;03
Kevin McCall
It's using like RL.
00;09;38;06 - 00;09;45;07
Kalyan Basu
You know, RLHF and RLAIF, and, you know, that's inefficient. Reinforcement learning is fundamentally sample inefficient, right?
00;09;45;07 - 00;09;46;24
Kevin McCall
Very famously sample...
00;09;46;26 - 00;09;53;20
Kalyan Basu
Famously, famously. You know, particularly the model-free ones. So we're going to see things like direct preference optimization.
00;09;53;23 - 00;09;55;26
Kevin McCall
approaches. Thank you for sending me that paper. Yeah.
00;09;56;02 - 00;10;12;10
Kalyan Basu
Okay. Yeah. And we're going to talk a little bit more about that. But you know, those kinds of things are going to become more prominent. So yeah. So that's a broad brush across all of these, you know, very tactical but very important improvements. And if you don't do that, we're not going to unlock the power of this technology.
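As a rough illustration of why direct preference optimization avoids a separate reinforcement learning loop, here is a minimal Python sketch of the per-example DPO loss as it is commonly written. The function names and the log-probability numbers are hypothetical, chosen only for the example; a real setup would compute these log-probabilities from a trainable policy model and a frozen reference model.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Loss for one preference pair (a chosen and a rejected response).

    The loss shrinks as the policy raises the chosen response's
    log-probability, relative to the reference model, by more than it
    raises the rejected one's.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return neg_log_sigmoid(beta * (chosen_margin - rejected_margin))


# Made-up sequence log-probabilities for one preference pair.
untrained = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy identical to reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)    # policy now favors the chosen answer
print(untrained, improved)
```

The point of the sketch is that the preference pair feeds an ordinary supervised-style loss directly, with no reward model to fit and no rollouts to sample.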
00;10;12;10 - 00;10;36;03
Kevin McCall
Sure, sure. So there are several interesting things you just said. One of the themes that I'm hearing over and over again in many of the examples you just offered is this theme of sample efficiency, right? And many, many researchers, of course, you know, Yann LeCun famously, talk about the fundamental differences in sample efficiency between how humans learn and how these models learn.
00;10;36;06 - 00;10;55;17
Kevin McCall
So let me throw you kind of an amusing example, maybe, to illustrate this. And you tell me how this fits into the experience. When I was nine years old, being the breakthrough thinker that I am, I made up a new game, and I offered to beta test this game in front of all my best friends in New Stanton, Pennsylvania.
00;10;55;22 - 00;11;23;11
Kevin McCall
And the name of the game was Throw the Metal Bucket Up in the Air. Okay. And when I beta tested this game, I had to go to the hospital. And after coming back from the hospital, after kind of reflecting on what I learned in that one-shot learning experience, you might say, I then crossed a number of other brilliant game ideas off of my list, like Throw the Lug Wrench Up in the Air, Throw the Hacksaw Up in the Air.
00;11;23;11 - 00;11;44;01
Kevin McCall
And there were others. So, if I think about that learning experience, I learned in one shot, you might say, that maybe that game idea was a bad one. I learned zero-shot, you might say, about a whole bunch of other games that were bad. I generalized that knowledge as well. And clearly my friends observed this whole thing go down, and they learned in kind of a zero-shot manner.
00;11;44;03 - 00;12;04;00
Kevin McCall
I realize it's kind of, you know, silly. But humans learn in an insanely sample-efficient manner, right, compared to these models. And so is it fair to say that some of these themes you've offered help us at least move up this curve toward far, far more sample-efficient learning?
00;12;04;00 - 00;12;19;17
Kalyan Basu
We are going to talk a lot more about this. This is a good example, because conceptually it kind of tells us that we need a different structure of learning. And it's not learning where you just throw data at something and it learns, right? We need certain richer...
00;12;19;17 - 00;12;20;15
Kevin McCall
Representations.
00;12;20;15 - 00;12;35;26
Kalyan Basu
Some, you know, rich semantic representations and priors. We're going to talk about that. But I'll just mention one thing in this regard. You know, recently there was an interesting model in a paper from Microsoft, "Textbooks Are All You Need." So instead of...
00;12;35;29 - 00;12;36;11
Kevin McCall
Back.
00;12;36;16 - 00;12;48;27
Kalyan Basu
And those guys. And, you know, the idea there is that instead of training with all this stuff on the internet, let's just train with these high-quality, curated examples.
00;12;48;28 - 00;12;49;24
Kevin McCall
This was part of the Phi...
00;12;49;28 - 00;12;55;09
Kalyan Basu
And this is the Phi. This is the Phi models, you know, underscores the Phi. But those are small models. I'm going to talk about small, which...
00;12;55;09 - 00;12;55;22
Kevin McCall
We're going to talk.
00;12;55;22 - 00;13;11;18
Kalyan Basu
About. Yeah, we'll talk about later. But yes, but this is all in the spirit of you know, let's give very semantically compact and rich, examples so that the models can learn really fast and not do this whole trial in the wild kind of thing. Right?
00;13;11;18 - 00;13;37;26
Kevin McCall
Sure, sure. Yep. Okay. That makes good sense. Okay. Let's pivot this in a different direction now. So given how much it's in the public spotlight, let's talk about large language model problems. Now everyone's very aware of LLM hallucinations and the like. But I'd like to expand the aperture a bit and talk about model misbehavior that goes way beyond hallucinations.
00;13;37;28 - 00;13;48;21
Kevin McCall
You and I have talked about alignment and grounding of models. Take me through a summary of model limitations in this domain that go way beyond hallucinations.
00;13;48;23 - 00;14;18;29
Kalyan Basu
Yeah. And it's good to kind of clarify a taxonomy here. For me, there are two categories: alignment and grounding. Those are good categories to kind of break up the problem space. Alignment is all about human preferences. You're trying to make sure that the model outputs are aligned to preferences of culture, preferences of their be, preferences of, you know, cultural preferences.
00;14;19;01 - 00;14;22;06
Kevin McCall
As in overall values, value systems. Okay.
00;14;22;06 - 00;14;30;07
Kalyan Basu
Yeah. And, you know, a lot of that has been tackled with reinforcement learning, you know, just giving a lot of RLHF.
00;14;30;07 - 00;14;30;26
Kevin McCall
And. Yeah.
00;14;30;28 - 00;14;39;07
Kalyan Basu
So I think I already mentioned that one big jump there was that, you know, this human feedback is not going to scale, right?
00;14;39;09 - 00;14;39;23
Kevin McCall
Indeed.
00;14;39;23 - 00;15;10;16
Kalyan Basu
Yeah. So one branch of this next-gen stuff is looking at AI-generated preferences. So you're not actually giving human preferences. This is RLAIF, which is reinforcement learning from AI inputs, and you're using AI to create these preferences in the first place, right? That's gained steam. There are other approaches using reinforcement learning. RAFT is a fine-tuning approach based on reinforcement learning and somewhat different from the RLHF aspect.
00;15;10;16 - 00;15;14;24
Kalyan Basu
So you've got a whole lot of stuff going on with reinforcement learning.
00;15;14;24 - 00;15;28;11
Kevin McCall
But the goal here is the same. It's really to tackle this expensive and important challenge of realigning the model with what we want and need the model to do from a personal, organizational, let's just say a value system preference.
00;15;28;11 - 00;15;44;16
Kalyan Basu
Yeah. And I think realigning is a good word. But there's also a complementary set of approaches, which is guardrails. So, you know, when you teach kids, for instance, you're obviously wanting them to align, but you're also trying to put hard guardrails on things they shouldn't do.
00;15;44;16 - 00;15;44;26
Kevin McCall
Good.
00;15;44;26 - 00;15;50;28
Kalyan Basu
Parents do that. You know, your bucket example is an interesting one. Maybe they should have...
00;15;51;01 - 00;15;51;20
Kevin McCall
Didn't have the.
00;15;51;20 - 00;16;14;20
Kalyan Basu
Guardrails. They didn't have the guardrail. Yeah. So you have to penalize the models. And this comes back to the reinforcement learning aspect. You have to penalize the models for actually going too close to the guardrail. So, you know, you have this notion of safety which is built into the model. And here some people are starting to look at safety approaches from totally different domains, like, you know, physical safety.
00;16;14;22 - 00;16;24;28
Kalyan Basu
You know, control system safety, statistical dynamics. So we are bringing in those concepts for guardrails. So that's the whole alignment bucket, okay?
00;16;24;28 - 00;16;48;27
Kevin McCall
And does this include, like, Yann LeCun's famous example of, you know, how the sample efficiency is just pitiful when it comes to, let's say, an autonomous vehicle learning in simulation that it shouldn't drive too close to the cliff? It requires 20,000 samples just to realize that it's fallen off the cliff, and then it needs another 20,000 samples to start to figure out how to avoid driving off the cliff.
00;16;48;27 - 00;16;54;19
Kevin McCall
We're talking about really much more sample-efficient training that allows it to not have to go through those 40,000 samples.
00;16;54;19 - 00;17;03;13
Kalyan Basu
That's a good example that actually is related to this issue, but it actually goes into the next topic that we're going to discuss in more depth, which is the world models topic.
00;17;03;16 - 00;17;04;00
Kevin McCall
Okay.
00;17;04;00 - 00;17;34;24
Kalyan Basu
But I'll tell you how, you know, this to me is more in the grounding direction, because right now you're really trying to ground these models with phenomenal understanding, or phenomenological understanding, of the world and its constraints, right? That comes to maturity with the world models ideology. But just at this level, at the grounding level, you're talking about things like, you know, what is the semantic space that the whole model operates within, right?
00;17;34;27 - 00;18;02;01
Kalyan Basu
It has to do a lot with, you know, reducing hallucinations. And I think we are still very early in that game, so I won't go too deep into it. Everyone's talking about it, but actual solutions are really sparse there. We're having grounding through extraneous knowledge, for example, RAG. RAG is a strategy that allows you to ground to data universes, particularly from the enterprise space.
00;18;02;02 - 00;18;03;17
Kalyan Basu
Yeah. You pull...
00;18;03;17 - 00;18;11;10
Kevin McCall
The kind of factual lookup, exactly, out of the hands of the large language model. So you let it do what it's good at, and let a separate retrieval model do what it's good at.
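The division of labor described here, where a retrieval step handles the factual lookup and the language model only writes the answer, can be sketched as follows. This is a deliberately simplified illustration: the documents are invented, the function names are hypothetical, and plain word overlap stands in for the embedding-based retriever a production system would more likely use.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Assemble a grounded prompt; the model call itself is out of scope here."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Invented product-manual snippets standing in for an enterprise data source.
docs = [
    "To reset the pressure valve on the pump, hold the red button for five seconds.",
    "The warranty covers parts and labor for two years.",
    "Clean the intake filter every month.",
]
query = "How do I reset the pressure valve?"
print(build_prompt(query, docs, k=1))
```

The prompt that reaches the model contains only the retrieved passage, so the answer is tied to the enterprise data rather than to whatever the model memorized in pre-training.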
00;18;11;10 - 00;18;34;14
Kalyan Basu
Absolutely. Because it's going to be disastrous in enterprise scenarios if the model is ungrounded to the reality of enterprises and enterprise data universes. Then you have grounding through knowledge graphs. You know, you need to have some understanding of the relationships and constraints of the basic objects and the entities in your world. So people are using knowledge graphs as a way to ground things.
00;18;34;21 - 00;18;37;06
Kevin McCall
For you, which is sometimes hard to find in the training data.
00;18;37;08 - 00;18;44;16
Kalyan Basu
Right? It's hard to find. It has to be extracted and has to be constructed. So this is a pre-phase, right?
00;18;44;19 - 00;18;48;19
Kevin McCall
It's actually a concentrated set of data. Exactly. That allows more efficient learning.
00;18;48;19 - 00;19;03;27
Kalyan Basu
Exactly. So people are examining those approaches. I think we are in for an exciting ride here, because grounding is a wide-open topic. And that's really going to be the world models revolution, or a part of the world models revolution, that we'll talk about soon.
00;19;03;29 - 00;19;30;10
Kevin McCall
Okay. Yeah. So let's now pivot to world models. Now, I feel like, based on the two things we've just talked about, we're more ready now to tackle what I feel like is the biggest and arguably most exciting thing going on right now, which is the work around world models. Now, it feels like a lot has changed in this space since you sent me that Schmidhuber paper a number of years ago.
00;19;30;10 - 00;19;33;08
Kevin McCall
It feels like eons. When was that? 20...
00;19;33;08 - 00;19;33;17
Kalyan Basu
18?
00;19;33;18 - 00;19;37;15
Kevin McCall
2018? Somehow it feels much older than that. But a lot's.
00;19;37;15 - 00;19;38;29
Kalyan Basu
Changed since ancient.
00;19;39;01 - 00;19;55;29
Kevin McCall
It feels ancient, right? Almost as ancient as the "Attention Is All You Need" paper, but I digress. We'll come back to that too. Yeah. So what are the big rocks that we need to move as a community in order to achieve a real breakthrough in world models? When are we going to see the "Attention Is All You Need"
00;19;55;29 - 00;19;57;26
Kevin McCall
moment in world models?
00;19;57;28 - 00;20;25;11
Kalyan Basu
So, world models itself. In my humble opinion, this is a paradigm-shifting moment. And we have to talk a lot more about it, but I think it's a paradigm shift, because if you cut through the hype of AGI and you're talking about common-sense intelligence, which is arguably the hardest thing to endow models with, you're really talking about world modeling.
00;20;25;14 - 00;20;45;21
Kalyan Basu
You're talking about knowledge of the world, a lot of which is intrinsic. It's implicit. It's almost, I would venture to say, genetically coded in some sense, because you don't explicitly learn it, you know, by falling off a cliff 50 million times to know that a cliff is not something good that you should get to the edge of.
00;20;45;22 - 00;21;10;16
Kalyan Basu
Right? Right. So I think there's going to be a lot of focus on learning causality, on learning relationships between entities, physical and phenomenal entities in your world, and on learning counterfactuals. I'm going to talk about this to some extent. Counterfactuals are things which arguably are not there in the world, but you need to have knowledge of them.
00;21;10;16 - 00;21;22;13
Kalyan Basu
Sure. Things which, you know, things which don't change, things where you say that if something had changed, what would have happened? It's a counterfactual in that sense.
00;21;22;15 - 00;21;44;18
Kevin McCall
Something that's really simple for humans to deduce and understand. But you don't see a lot of this in the training data. Yeah, right. It's in this large common-sense bucket. Yeah. Even with the trillions of tokens that are consumed by some of these models, a lot of this causality information is not explicitly represented in the training data.
00;21;44;25 - 00;21;45;11
Kevin McCall
Is that fair?
00;21;45;11 - 00;21;53;07
Kalyan Basu
That's fair. It's the "duh" moment for human beings, right? You don't even want to talk about it, because it's simply understood. Yeah.
00;21;53;07 - 00;21;58;22
Kevin McCall
A nine month old understands object permanence and gravity and things like that.
00;21;58;25 - 00;22;00;17
Kalyan Basu
Object permanence is a great example.
00;22;00;17 - 00;22;04;16
Kevin McCall
Yeah, but, you don't see a lot. Yeah. Expressed in the training data on.
00;22;04;16 - 00;22;28;07
Kalyan Basu
Exactly, exactly. And, you know, in the old AI world, prior to the AI winter, we talked about something called the frame problem. This was a big thing in the old AI world, where there's a whole frame of propositions, logical propositions, which hold of the world, and they do not change, even though there are actions and changes going on in the narrower space of your reasoning.
00;22;28;10 - 00;22;50;18
Kalyan Basu
That's not easy for these models to understand. So you still have these problems of the frame. And then, finally, physics, the knowledge of physics. Like, you know, people have been critiquing the Sora models for video generation recently from OpenAI. And it's very clear that they don't really have a fine understanding of physical reality, you
00;22;50;18 - 00;22;52;09
Kevin McCall
Know, interesting. Okay.
00;22;52;09 - 00;22;56;15
Kalyan Basu
Yeah. So this is, again, a part of the world model problem statement.
00;22;56;17 - 00;22;57;23
Kevin McCall
Right? Sure, sure.
00;22;57;23 - 00;23;20;17
Kalyan Basu
So this is all about the problem statement. But I think what we are going to see is that we need to look at semantic priors. A prior, say a prior distribution or a prior set of conditions on any probability distribution, is what actually holds before you start inferencing with specific things. So, okay, I said prior knowledge.
00;23;20;17 - 00;23;42;13
Kevin McCall
Yeah. So if I understand what you're saying, what you're suggesting is that it will be of critical importance to take some of this information, the priors, the causality, the counterfactuals, and somehow distill it into information that can be easily consumed by these models to dramatically accelerate their learning. Am I on the right track?
00;23;42;13 - 00;23;46;22
Kalyan Basu
Not just accelerate. Ground learning in common-sense reasoning at...
00;23;46;22 - 00;23;47;03
Kevin McCall
The same.
00;23;47;03 - 00;24;00;07
Kalyan Basu
Time, at the same time, right? Yeah. So so that's the main point. I mean, the acceleration is almost a byproduct, which is great, but the real stuff is that you're having things which are grounded in the knowledge of the world, which all human beings have.
00;24;00;07 - 00;24;00;26
Kevin McCall
Right, right.
00;24;00;26 - 00;24;24;03
Kalyan Basu
Okay. So I think that's going to be very important. Semantic priors is a big, fuzzy thing. Part of it is biasing. You know, there's been a lot of work on inductive biases, which essentially is the structure of the domain itself in terms of relationships and constraints. Can that be reflected in the probability distribution that you use for inference? That's one of the things of semantic priors.
00;24;24;03 - 00;24;34;27
Kalyan Basu
But that's really important. I think we are seeing this idea come to fruition, come to kind of a point, with, you know, Yann LeCun's ideas of JEPA.
00;24;34;28 - 00;24;37;12
Kevin McCall
Yeah. You sent me that paper, too. Thank you. The JEPA paper.
00;24;37;12 - 00;25;08;10
Kalyan Basu
Yeah. And I think that's a very important philosophical stake here, because, you know, he's talking about doing reasoning over pixel space. He's obviously kind of exaggerating a little bit, because all really good models work on the latent space. But he's looking at the latent space as a semantic latent space. And he says that if you basically look at inferences in that rich semantic latent space, you get a lot more in terms of grounding and this world model, richer understanding.
00;25;08;11 - 00;25;09;09
Kalyan Basu
Richer understanding.
00;25;09;09 - 00;25;24;08
Kevin McCall
Seriously. Oh, and so just, just to paraphrase this, instead of predicting the next token like a large language model does when it learns through masking, it really is about predicting what the next frame may look like in a video based on its understanding of prior frames.
00;25;24;09 - 00;25;34;25
Kalyan Basu
Well, it would go even one step further. So instead of frames, they would actually get semantic representations of frames and then predict semantic graph representations.
00;25;34;25 - 00;25;38;10
Kevin McCall
Yeah, because it was doing all the masking in the actual representation space.
00;25;38;10 - 00;25;48;26
Kalyan Basu
Yeah. So now you're really trying to say, what's the loss, so to speak, at that semantic representation space between frames?
00;25;48;26 - 00;25;49;28
Kevin McCall
Right, right.
00;25;50;00 - 00;26;09;23
Kalyan Basu
But of course there's a deeper problem, that is: is self-supervision everything that you need? And I think that would be the next question that this direction would tackle. You know, you might actually need even more than that. Self-supervision doesn't have the priors and the bias aspects that I talked about.
00;26;09;28 - 00;26;33;05
Kalyan Basu
So you may actually need enrichment of that semantic space even before you compute the loss due to reconstruction or prediction, right? Okay. So, yeah, a lot of this is coming to a head there. But bringing this all back, world modeling is going to be, in my opinion, a really big moment. Proper world modeling.
00;26;33;05 - 00;26;36;11
Kalyan Basu
Right? For the next advancement of these...
00;26;36;14 - 00;26;52;20
Kevin McCall
It feels... would it be too dramatic to say that it feels like an existential requirement in order to achieve, like, the next decade's, or maybe even what might end up being the next millennium's, breakthroughs in AI?
00;26;52;23 - 00;27;17;12
Kalyan Basu
I think you have the license to be dramatic here because, yeah, I mean, self-driving, for instance, is impossible without world modeling. Impossible. I wouldn't even get into a car which doesn't have a world model. Sure. Yeah. You know, if you're looking at physical manufacturing assembly lines, you're looking at physical moving things. I wouldn't even get in the way of a warehouse robot which doesn't have a world model of some kind.
00;27;17;13 - 00;27;22;05
Kalyan Basu
Right, right. So I think it's in some sense deeply, deeply existential here.
00;27;22;07 - 00;27;36;17
Kevin McCall
Okay. Got it. So, beyond the JEPA research and the paper, or papers now, I guess, that Yann LeCun published, what else might a caveman like me pay attention to in order to really stay in tune with the progress in world models?
00;27;36;20 - 00;28;06;05
Kalyan Basu
I think there's good work going on in the neuro-symbolic, you know, realm of thought. I mean, you know, the Austrian school particularly, it's talking about, you know, neuro-symbolic inference, overlaying model inference of the kind we know with, you know, logical inference or symbolic constraining of inferences. So I think neuro-symbolic is a very good way to actually bring in all those constraints.
00;28;06;07 - 00;28;09;00
Kalyan Basu
Yeah. And we'll see where this ends up.
00;28;09;03 - 00;28;18;14
Kevin McCall
Okay. Very good. It still feels to me, and correct me if I'm wrong, like the world model moment is still a ways out. Is that fair?
00;28;18;16 - 00;28;32;18
Kalyan Basu
I think so. I think it's in the alchemic stage, where, you know, there's a lot of good thinking, but I don't think we've actually seen a really efficient moment, like, you know, we've seen with the Transformers.
00;28;32;18 - 00;28;33;16
Kevin McCall
Right, right.
00;28;33;18 - 00;28;46;07
Kalyan Basu
I think, and we're going to talk about this a little later, but I think with state space modeling, we are getting a little bit deeper into that and coming to a point where we can say, this is probably the "Attention Is All You Need" kind of moment.
00;28;46;07 - 00;28;47;05
Kevin McCall
That moment. Right? Yeah.
00;28;47;05 - 00;28;51;09
Kalyan Basu
Yeah, yeah. But it might be next year. It might be two years later. Who knows?
00;28;51;10 - 00;29;10;00
Kevin McCall
Okay. Yeah, definitely keep an eye on that. So then let's talk about a more near-term breakthrough that is distinct from the three topics that we just covered. What would you bring up as another exciting, another important area of development that's complementary with those topics we just covered?
00;29;10;03 - 00;29;31;10
Kalyan Basu
Yeah. I mean, you know, there are going to be overlaps of approaches as they evolve and emerge from the, you know, the existing ones into new ones. But if you press me on that, I think there are a couple of interesting ideas. One is mixture of experts, and this is what, you know, the Mistral folks, right, have actually brought into the world.
00;29;31;10 - 00;29;32;18
Kevin McCall
In the creation of Mixtral.
00;29;32;20 - 00;29;56;08
Kalyan Basu
Yeah, actually. And, you know, I think it's a really good moment because one thing we cannot ignore is that most of these models are highly inefficient in terms of power efficiency, in terms of just the, you know, the scalability of the graph itself, the neural graph. What I think the Mistral, or Mixtral, advances do for us is allow us to divide and conquer.
00;29;56;15 - 00;30;20;18
Kalyan Basu
It allows us to break, you know, the input space into conceptually distinct subspaces and let loose, you know, more specialized models, or even sometimes do ensembling of models. So I think the mixture of experts movement will actually progress a lot, lot more than we've seen. So I would take that as a big shifter.
00;30;20;20 - 00;30;28;08
Kalyan Basu
I'm actually a big fan of structured state space models, and this is in some sense regressing.
00;30;28;09 - 00;30;29;19
Kevin McCall
You sent me a paper on this, too.
00;30;29;19 - 00;30;32;04
Kalyan Basu
I did, and this may have been the Mamba paper.
00;30;32;04 - 00;30;34;17
Kevin McCall
The Mamba paper? Yes, that was it, I think.
00;30;34;17 - 00;31;07;01
Kalyan Basu
And it might appear to be regressing behind the "Attention Is All You Need" paper, because it goes back to RNNs, but it actually enriches RNNs with the notion of a state space, which is a richer construct than simply a latent space, you know, a kind of rolling up of latent space state. There's a lot here, but I think that's going to be a big movement forward, particularly as we enrich this state space with the kind of semantic priors and things that we talked about.
00;31;07;04 - 00;31;28;06
Kalyan Basu
It also has dynamic and physical, I mean, some kind of causality built into the model itself. So, you know, this gives me hope that it's actually a great step forward and more into the world modeling direction that we've talked about. But that's definitely different. And I think I would also mention this idea of continual learning and unlearning.
00;31;28;06 - 00;31;51;20
Kalyan Basu
And, you know, some people are just starting to kind of scratch the surface of that. And the idea is that, you know, when we look at the way that our biological minds learn, it's very different from these models, because we are constantly overlaying layers of learning where the semantics of the words keep changing, but we never keep forgetting or resetting our learning.
00;31;51;20 - 00;32;13;23
Kalyan Basu
And right now, most of the models do that. You know, they have catastrophic forgetting. And as you get more specialized into semantic domains, you're going to have advances in either unlearning what you have learned. And sometimes that's a positive thing, and sometimes it's a negative thing, in the sense that you have to unlearn bad things and you have to relearn good things.
00;32;13;25 - 00;32;27;20
Kalyan Basu
But you also have to continuously learn and update your weights, so to speak, or your biases. And this is something where, you know, we have a very rigid difference between training and inference. Yeah, yeah.
00;32;27;22 - 00;32;39;04
Kevin McCall
And before you even get to inference, this seems to have obvious implications, though, for the one or more fine tuning phases of these models. Is that fair?
00;32;39;04 - 00;32;44;24
Kalyan Basu
Actually, you know, the title of this section of our talk, I called it the post fine tuning.
00;32;44;24 - 00;32;46;03
Kevin McCall
The post fine tuning.
00;32;46;05 - 00;32;47;14
Kalyan Basu
It is postmodern.
00;32;47;14 - 00;32;48;24
Kevin McCall
Clarify that for me.
00;32;48;24 - 00;33;03;07
Kalyan Basu
Yeah. So I think fine tuning itself as a distinct phase will ramify, evolve, and become a lot richer than what it is. It's going to have all these interesting phases, multiple phases.
00;33;03;07 - 00;33;16;24
Kevin McCall
Right. So today there's this big, conspicuous pre-training phase: expensive, time consuming, trillions of tokens. And now there's an obviously much more narrow, much more surgical fine tuning phase, which essentially says...
00;33;16;24 - 00;33;17;04
Kalyan Basu
Yeah.
00;33;17;07 - 00;33;19;08
Kevin McCall
Right. Yeah. It's expensive for different reasons.
00;33;19;12 - 00;33;26;23
Kalyan Basu
Right now, you know, fine tuning is also a monolithic phase, and it's quite expensive, for different reasons, because...
00;33;26;23 - 00;33;30;24
Kevin McCall
Of the curated data set. Yeah, the supervised data sets you normally need to do it well. Yeah.
00;33;30;24 - 00;33;42;28
Kalyan Basu
Yeah. I think this world of this second phase is going to just ramify into richer and more calibrated and nuanced phases of continual training.
00;33;43;00 - 00;33;43;17
Kevin McCall
Okay.
00;33;43;17 - 00;34;02;09
Kalyan Basu
And you're going to see a gradual merging of training and inference happening in this post fine tuning world. Some of the things I talked about, including structured state space models, you know, and some of these things about continual learning and unlearning, are something that you're going to see more of in this post fine tuning world.
00;34;02;10 - 00;34;23;20
Kevin McCall
Okay, so beyond this theme of more continuous learning, you've also used another word with me. You've used the word "visceral" for the representation of some of the information, a more subtle, more nuanced representation of information that is relevant here. Could you go into a little bit more detail on that for me?
00;34;23;22 - 00;34;50;16
Kalyan Basu
Yeah, I mean, visceral, that word is dramatic. And for the right reasons, because when human beings look at particular scenes or video, or even sometimes read very evocative descriptions, they kind of extract the richness of the scene that they're looking at or reading about in ways that our models simply don't understand, that our models cannot reach.
00;34;50;18 - 00;34;58;29
Kevin McCall
Today. Yeah, there are kind of inherent limitations in how these models not just learn, but how they can comprehend certain things.
00;34;58;29 - 00;35;25;26
Kalyan Basu
Yeah. Yes. And one of the ways of trying to do this is through elicitation. Like, you know, you have these approaches where you ask the model to kind of reflect or retrospect, to introspect on its own inference process. You have things like chain of thought or tree of thought, and those approaches where you're trying to extract real semantic salience from just the words.
00;35;26;00 - 00;35;27;02
Kalyan Basu
And this is where...
00;35;27;02 - 00;35;30;01
Kevin McCall
Which is kind of a fool's errand, in a way.
00;35;30;03 - 00;35;45;21
Kalyan Basu
It is, in some sense, yes. And so you have to get beyond that. But I think this is a good starting point, because when you make the model reflect upon itself, it does try to extract some semantic sense, but then you have to enrich it a lot.
00;35;45;23 - 00;36;11;17
Kevin McCall
Sure. So would it, would it be fair? I'll use a personal example and tell me if this is a fair illustration of some of the inherent limitations of the information we use today to train these models. So as a father, if you sent me an email message and said, Kevin, I understand you have a daughter, tell me about your daughter.
00;36;11;20 - 00;36;33;26
Kevin McCall
If I were to talk about my daughter in an email I send back to you, I feel as if that email would have to approach infinite length in order to adequately describe how much I adore her and some of the, you know, wonderful attributes that she has. I feel like I would need to keep adding to that email before I would feel comfortable sending it.
00;36;33;27 - 00;37;06;25
Kevin McCall
Now let's pivot this in a different way. If instead of sending you this infinitely long email, I were to send you an email of moderate length, but I were to attach 32 pictures and 64 videos with audio of her riding her bike, you know, dancing, you know, doing other things, I think I would feel more comfortable sending you that email and feeling like those images and all the video and audio captured her essence and captured how I feel about my daughter.
00;37;06;25 - 00;37;12;02
Kevin McCall
Is that a fair representation of some of the things we're talking about?
00;37;12;02 - 00;37;36;17
Kalyan Basu
Yeah, yeah. I mean, I think it's a very good kind of example, because this notion of visceral, it is really about semantic salience. And words tend to diffuse semantic salience unless they're very powerful words, and, you know, some litterateurs, etc., I think they have this skill, but most of us don't. And for us, pictures and videos give a much more semantically salient...
00;37;36;17 - 00;37;38;18
Kevin McCall
Much richer representation.
00;37;38;21 - 00;38;04;06
Kalyan Basu
So I think what we are hinting at with this example is the compactness of encodings. We are very early in looking at richer representation learning. This kind of brings us back to some of the ideas we've talked about before. And how do we understand what's the best representation to use in different contexts?
00;38;04;06 - 00;38;06;26
Kevin McCall
Certainly super relevant in this world model.
00;38;06;26 - 00;38;43;09
Kalyan Basu
Absolutely. It's super relevant. It's very relevant in this post fine tuning world, because we have to, you know, encode these in certain ways which are semantically very relevant, very salient. And, you know, this is where things like the structured state space model come in, and a more general idea of what is referred to in the theory as the information bottleneck, because we have to ditch, get rid of redundant information, which all input formats have, and really retain that information which is directly relevant in inference or, you know, predictions.
00;38;43;11 - 00;38;55;11
Kalyan Basu
And this is the world that I think your example is really pointing towards. But this is still, you know, we are very much early in this whole process of learning this.
00;38;55;11 - 00;39;08;21
Kevin McCall
And that feels like a theme that is common across all of these domains, across all of these topics: it still feels like we're very early in trying to crack the code on a lot of these. Yeah.
00;39;08;23 - 00;39;12;26
Kalyan Basu
And if we talk next year, I'm sure this will become attention too, right?
00;39;12;28 - 00;39;33;13
Kevin McCall
Inevitably. Well, hopefully we talk again before next year. Maybe we revisit this at the end of this year. So, always a pleasure to see you, Kalyan. So nice to see you again. Thank you so much for being here. I certainly hope we continue this conversation moving forward.
00;39;33;17 - 00;39;36;16
Kalyan Basu
You're welcome. And likewise, looking forward to that.