Ilya (00:01.404)
Well, it is my pleasure to welcome Valerii Babushkin today on the podcast. Valerii has been kind of everywhere. He's currently a senior director at BP. Previous to that, he worked at Blockchain.com. He was a senior staff engineer over at Meta before my time there. He's worked for Alibaba. He's worked for Yandex, which, for those who are not familiar with it, is kind of Google plus
Uber plus a whole bunch of other companies all at once, and Yandex really gave even Google a run for its money in certain regions of the world early on. So Valerii has a ton of experience. He's been in leadership roles; he's led organizations of hundreds of engineers. And recently, for some reason, he's written a book. Usually this is the point where I hold up the book; unfortunately, I only have a digital copy right now.
I think it's out now, but yeah, it's a book about ML system design. We'll definitely talk about that in a few minutes too. But, Valerii, I just want to give you a little moment to highlight, not necessarily the resume, but the transition points: when you went from one place or one job title to another, what moved you to do that?
And how do you think about your career and your story so far?
Valerii (01:29.422)
Well, there were many, many transition points. First of all, if you take a look at my career, you can see that I moved from leading a team of around 140 people and being a senior director to working as an IC, as an engineer. I moved from Russia to the UK to work at Meta.
And then I moved back to management at Blockchain.com, and now I'm managing an even larger team at British Petroleum, BP. So some of these transition points are being an IC, being a manager, being an IC, being a manager, and they overlap with moving from one country to another.
So, for example, let's take what I think was the biggest transition. When I was in Russia, I worked at X5 Retail Group and Alibaba. I was VP of machine learning at Alibaba Russia, which is basically AliExpress Russia, a big chunk of AliExpress; by that time it was around 35% of the total GMV of AliExpress.
And I moved to Facebook, so a stark difference: completely different roles, completely different companies. The biggest drive for me was to try to compete with the best people in the world. Or to work together. I mean, not necessarily compete; let's say compare. It's not because...
Ilya (03:12.475)
Mm-hmm.
Valerii (03:14.158)
In competition, you also compare yourself, right? But that's not your main goal; you want to win. Here, you want to work with the best people and understand where you are compared to them. When you work in any country, in a local business, it might be the best business in that region, in that country, but it's still part of a bigger system. If you work in Russia,
what, 2% of the global population is from Russia, and that means the pool of candidates, the pool of people, is just 2%. Yes, Russia has a strong engineering culture, so if you apply some weights, it would probably be bigger than 2%, but still, you work within a local pool. So I wanted to work in, and to see how things are done at, a global player.
And there aren't that many companies in the world which you can truly say are that global. Actually, as far as I know, and I think it was the same five years ago, there are only two companies in the world which you can say have billions of customers or users. It's Google and, by that time, Facebook; now it's called Meta.
Ilya (04:34.534)
Mm-hmm.
Valerii (04:35.138)
I don't think anyone else. You have WeChat or whatever, but it's a billion and something, so it's not billions. And it's still very big, but it's not truly global, if you think about it, because it works within the boundaries of a single country or region. So I wanted to see how I can fare and where I am if I compare myself to a pool of people from anywhere, from everywhere.
Ilya (04:42.193)
Hmm.
Ilya (04:47.324)
Mm-hmm.
Valerii (05:04.046)
And also people who passed a specific bar and work on something with a high impact, something that people are actually using. So that was basically a mixture of vanity, curiosity, a drive for adventure, and some luck. I think this is the combination which helps you to try different things
and not to become complacent, because probably the worst thing that can happen to your career, if you want to advance it, is becoming complacent.
Ilya (05:49.276)
Yeah, no, I totally agree, especially in machine learning, right? I often talk to people about this and I say, if you get complacent for too long, the new grads are going to know more about the techniques we actually use than you do. So yeah, it's an interesting transition, because oftentimes when people talk to me, especially about going above E6, above that staff level, they say: so I basically have to choose
one side or another; I have to be a manager or I have to be on the technical side. You've jumped back and forth, not once, but a couple of times. And I want to get your perspective on that, on the career. Are you really that locked in? And if you're making those jumps, are you sacrificing things in between? Do you think if you'd stuck to the management ladder, you could have been further along? Or does it even matter how far you are? You talked about comparisons, you know.
Valerii (06:45.23)
No, you're stuck and you have to sacrifice, but you can pick what you sacrifice. You can always pick both engineering and management, but then you need to sacrifice time. You can technically work like 12, 14, 16 hours a day, right? And, to be honest, I think it's easier
to come back to management from engineering than to do the reverse, for the simple reason that even as an engineer at a high level, especially at a high level, you still do a lot of the stuff that managers do. You speak with people, you try to resolve conflicts sometimes, you're working with stakeholders, you're trying to find the projects, you are to some extent managing the team. Like at Meta, you know, if you're
E6 plus, you're not a manager, but you have a team which you are managing as a tech lead. You go to calibration; it's like a mashup. So the roles are very, very related. Now, if you're a manager, you do all of this, and you do more of it; you also work like a therapist for your team. Actually, as a tech lead, you also work as a therapist for your team, but as a manager you focus on that, right? So that's why I think it's harder to go back to engineering
Ilya (07:48.378)
I went to calibration, so.
Valerii (08:13.486)
from management, and it's relatively easier to go back to management from engineering at a high position, because you kind of work nearby, with these people, directly. Now, if you still want to be
in shape and have the opportunity to go back to engineering, you need to sacrifice something. You need to have time, so you need to sacrifice other activities, which can be whatever: going out, watching movies, playing games, spending time with your loved ones. There is no secret sauce here. Of course, you can be more efficient to an extent, but you probably can't be a hundred times more efficient
and do in a minute what would take 100 minutes. In terms of remembering something and applying your knowledge: you need to apply it to stay in shape, you need to work on some stuff, so it still requires time. So that's the answer. You can do both; you need to sacrifice something. It's usually easier to go back from engineering to management, for the reasons I just mentioned. But then you need to... Go on. No, go on.
Ilya (09:20.444)
So let's talk about... sorry, go ahead. So let's talk about the harder transition that you've made, right? Especially when you went into Meta, because Meta E7, for people who don't understand, that's an insane level to onboard at, because an E7 basically needs to draw up plans for the next few years, needs to set up an entire area like you did. And you're coming into it with the best people in the world; everybody has context you don't have yet.
So in addition to changing job functions, you have to learn all the things that are uniquely Meta, or uniquely Facebook back when you were there, and that alone is really hard. So what tips would you give somebody transitioning from management back into IC, given that transition that you've made?
Valerii (10:13.144)
Don't go to Meta. I mean, what you mentioned is actually very specific to Meta, right? Because Meta is a very specific place. First, you don't have job titles at Meta. There are many things which some people can do just because of the social capital they've built over the years: they know people, they know stuff, they know this, they know that. And this informal capital, as a new joiner at a high level, you don't have it. The thing with Meta is that it doesn't give you enough time
Ilya (10:15.099)
Okay.
Valerii (10:42.574)
to ramp up, unlike many other places. And at this level, especially, you don't have any official leverage. You don't have a title; you can't tell people what to do. If you're an intern at Meta and you're a software engineer, your title is software engineer. If you're a principal engineer, your title is also software engineer, so you have the same titles. Over time, people probably realize who is who, but again, you don't have this time. You just join as you are.
So you don't know the system, you don't know the people; that's hard. So Meta is not a very good place for new joiners at E6-plus level. I remember I was a member of the six-plus level group, and people who had been at Meta for many, many years were admitting: we aren't very good at hiring six-plus people from outside, not because we're hiring bad people, but because our internal processes are built such that you don't have the social capital you need, the kind you build over the years.
And I don't know if it's true or not, but I've heard from a friend of mine who is at Meta that the attrition rate within 18 months of joining at E7 or above is 72%. Well, maybe the sample size is not that big; I don't know how many people Meta hires at this level, probably not many. But that's a problem specific to Meta:
you don't have leverage, you don't have an official title, you don't have time to ramp up, everything. And Meta inside is also something like a social network; it obviously runs on social capital. One advantage, one good thing when I joined, was that it was a completely new team, user data privacy. And in a new field it's obviously easier. So it was a new manager, a new team, new team members,
a greenfield, and it's usually easier to do things in a greenfield. That's what actually helped me. But again, you'd have to work probably 14 hours a day for the first, I don't know, couple of months or maybe a year. So you have to put in some work if you can't compensate with your social capital. I mean, it's another thing if you're somebody like John Carmack
Ilya (12:58.149)
Yeah.
Valerii (13:06.062)
or a Nobel Prize winner and you join, so you have this external credibility already. Carmack is no longer there, and he, as I understand, joined because his company, Oculus, had been bought. But the vast majority of people, like 99.9%, don't have this external credibility. Maybe if you are the creator of Python and you join
Ilya (13:11.43)
Who's no longer there, by the way.
Valerii (13:35.082)
Meta, yes, fine, people would probably realize that you have this capital. So it's hard. You have to be lucky enough, either with the project, or maybe with team members, or maybe with the manager. And on top of that, you have to put in extra work. But again, the numbers speak for themselves: 72% of people who join Meta at higher levels leave within 18 months.
Ilya (13:58.246)
Yeah, no, and that's been my experience too, you know. For me it was that I wasn't willing to sacrifice that much time with the family after a while. And you know, the compensation was amazing; I probably will never make that much money again. But it's like you said, it's all a trade-off. It's all knowing what you're trading off for what. I would love to talk to you a little bit more about onboarding at higher levels. You've done this several times.
So you've started at a new company, you've got an offer, congratulations, that's what everybody cares about, right? And then reality hits: now I actually have to do the job, and I have to do it at a very high level. How do you approach that?
Valerii (14:43.842)
Well, the general approach I have is that once you join, you need to spend the first four to six to eight weeks just speaking with people, looking at the code and documents, and listening, to gather context, as much context as possible. You need to speak with the tech people, you need to speak with the
product managers, you need to speak with the business, if you have a business side, and you need to read as much as you can. And the most important thing is to understand what is important. It might sound a bit strange, but the vast majority of people, at many different levels, have no idea what is important. That was especially evident at Meta, because at Meta you have a performance review every six months, and people are terrified.
Because the performance review can really affect your compensation and your staying at Meta. Look, two days ago a couple of thousand people were let go, were fired, at Meta because of performance review results. So it's terrifying, even more terrifying now. And why is it terrifying? Because the vast majority of people don't understand whether the work they've done is important or not. Is it valuable or not? Is it impactful or not? That's why they try to do as much stuff as possible.
To some extent, like sifting gold out of the water, you're trying to do a huge volume of work in the hope that some of it is valuable. This is ridiculous. This is not the way of the samurai. The way of the samurai is to find a point, a proper fulcrum, where you can apply your force. That means you need to understand what is most important. And there are some caveats, because:
most important for whom? For the company, for your manager, for your skip manager? Those are not necessarily the same thing. And still: what is important in the short term? How feasible is it, given your current limitations? Because, okay, you just joined, and you don't have
Valerii (17:04.43)
much of the trust; you need to build this trust, you need to build relationships, you need to get context, and you need to understand what is important and how doable it is. Something might be super important but not doable within a specific timeframe; it will take three years to fix. That's probably not the best way to apply your resources. You need to find something which is relatively easy to do, which is relatively important, and which can help you build reputation and social capital, and you can go from there. So in a sense, you're trying
Ilya (17:19.825)
Yeah.
Valerii (17:34.218)
to optimize for feasibility and impact. If you think about it, that's the nature of prioritization: you have a limited set of resources, an unlimited set of desires, and you want to maximize some output. So you need to know what the output is, what is important, and by that point you've already gathered the context.
So: what is important. Then, you need to be very transparent, so everybody understands what you are working on and why. First, it will help you understand whether it's really important or not. Second, it will help them understand what you're working on, and it will prevent many situations potentially perilous to you, of the kind: what is this new guy working on? We have no idea. I really like a regime of
transparency, where people know the current projects, the why, et cetera, and it's clear. And then it's also easier to explain to people. You always have requests from many different people: can you do this, can you do that? Transparency also helps you to fend them off: look, we're working on that, because of this; later we can review your request. That's the general framework.
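The "feasibility and impact" framing can be made concrete with a toy prioritization score. This is only an illustrative sketch; the project names and 1-10 scores below are hypothetical, not anything Valerii prescribes:

```python
# Toy sketch of prioritizing by impact and feasibility.
# Scores are hypothetical 1-10 judgments, invented for illustration.
projects = [
    {"name": "cut cloud spend",    "impact": 9, "feasibility": 7},
    {"name": "rewrite legacy ETL", "impact": 6, "feasibility": 2},
    {"name": "fix flaky CI tests", "impact": 4, "feasibility": 9},
]

def score(p):
    # A simple product: high-impact work you can actually finish wins;
    # important-but-infeasible work sinks to the bottom.
    return p["impact"] * p["feasibility"]

for p in sorted(projects, key=score, reverse=True):
    print(f'{p["name"]}: {score(p)}')
```

Note how the legacy rewrite, despite decent impact, ranks last: exactly the "super important but will take three years" case described above.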
Ilya (18:50.161)
Yeah.
I'm gonna have to press on that a little bit. So you said, try to understand what's important by talking to people. But then you said most people don't understand what's important, right? So how do you sift through that? Especially when you start, you don't have the context. I get that you're trying to build it as quickly as possible, but while building it you're trying to prioritize. What kinds of things do you look for to say, this is going to be feasible?
Valerii (19:14.722)
Well, okay, I see what you're saying. First, most engineers don't understand what is important or not. Business people are closer to the business; they kind of understand. If you ask many engineers why they are working on something, they probably wouldn't manage to tell you why. Maybe: my manager told me to do that, or: this is on our roadmap.
But there is presumably some end goal to your activity, right? Usually, if you speak with business stakeholders, they have a pretty decent set of metrics, so you can kind of understand what's important for the company. The most important thing might be growth, users per day, or it might be revenue, or profit, or cutting costs. Those are pretty easy metrics to understand in terms of importance.
If you can see that we're spending $20 million on infrastructure which you know can be squeezed down to two, it's easy to explain why that is important. But if you have an engineer or data scientist working on some kind of EDA for a completely different thing, or a data engineer working on completely different things, those things can be disjointed.
What I'm saying about performance reviews is that it's usually the engineers who are very terrified; they don't understand how important their work is, et cetera. Because if you imagine you're the head of Instagram, you probably understand what is important, right? No problem. But if you are a senior engineer at Instagram, you might work on some stuff which might even be important, but nobody understands why.
So you can't explain it; that's a bit different. As a senior leader, you're trying to understand, from the business or the product manager or whoever you have, what matters. And of course, it depends on whom you ask. If you work in customer support and you're a customer support representative, you would say the most important thing right now is to, I don't know, reduce the time from a ticket being opened to the ticket being successfully resolved, or NPS, or whatever. If you go to the product manager for a specific
Valerii (21:33.406)
area, that product manager would say that right now the most important thing is to increase our monthly active user base, or to increase retention, or reduce churn. You go to the CFO, and the CFO will probably tell you that the biggest issue right now is that we're spending X million dollars on cloud.
Right. Then you try to understand what is feasible and what is not. Of course, some things are very easy: everybody understands money, right? We've already listed three goals here. If you go and tell anyone in the company that you managed to reduce infrastructure costs by $10 million, everybody would understand; $10 million is very easy to appreciate.
If you reduce the average time for a ticket to be resolved from two hours to 90 minutes, so what? I mean, maybe the customers were already happy. You can then try to derive something from that. Monthly active user base is probably somewhere in between, right? Everybody understands it, but you are not sure there is actually any real value in going from 120 minutes to 90 minutes.
Ilya (22:41.276)
Mm-hmm.
Valerii (22:49.976)
But you kind of understand the value of going from 2 million active users per month to 3 million active users per month. Unless those are bots.
Ilya (22:55.718)
Mm-hmm.
Yeah, a little detail.
Valerii (23:00.206)
But again, there is an assumption that this extra million are also good users, who will do whatever we need them to do: use our app, generate some revenue. Or maybe it's just a good number. But they will also create some load on our infrastructure, and we will actually increase the cost of infrastructure. So we're back to the CFO, who wants to decrease the cost of infrastructure.
Ilya (23:20.476)
Yeah.
Yeah, no, that's terrific. And in fact, that reminds me of working with multiple cross-functional partners, where everybody has their own metric that they're trying to push. And as a technical lead, it's your job to make sure that all of those are aligned and moving in the same direction, sort of. So I think one thing that you're bringing up here is the importance of communication. What are some things that you've seen that make
communication, especially from engineering, better? Because I know how to make it worse. But yeah, how do you make sure that communication flows well?
Valerii (24:04.696)
I think there are so many times people assume something implicitly instead of being explicit. One of the biggest flaws of engineers, and of people overall, is not being explicit. And the best way to be explicit is to put something on paper, in written form. Why? Because even if four people meet and discuss something, and then someone formulates it in three lines and everybody says yes,
two months later, or two weeks later, they'll start to diverge. So there are two very important things here. Be explicit, and make as few assumptions as possible when you deal with people. And these assumptions might be anything: oh, I thought you would do this; oh, I thought that's what was important; no, I thought you realized that's actually not that important. Many different things. So,
be very explicit. And for many people in tech, I've found, it's very hard to be explicit. It's very hard to say: okay, you will do A, I'll do B, you'll do C. You don't want to be pushy; you want to be nice and friendly, et cetera. And some people think that if you're very explicit, you're not nice.
So, first, we agreed that it's very important to be explicit and to make as few assumptions as possible, because assumptions have a tendency to be wrong, and if you build on wrong assumptions, you drift further and further from the point of optimum, from what you wanted to build. The second thing is that you want to be explicit in writing,
to put it in written form, whatever it is: email, document, chat message, something you can then reuse and refer to. Partly that's why I wrote a book about machine learning system design and about how important it is to create a design document which you can later use and refer to, and update, and solicit feedback on, and provide feedback on, and reread. Because you can even, to some extent, diverge from your previous self. Any person
Valerii (26:18.414)
who has written code in their life can probably remember a situation where you literally don't understand why you wrote this code three months ago, and you spend time trying to figure out why you did that. It's usually easier if your code is well documented, well formatted, you have proper variable names, et cetera, and ideally if you have a decent set of documentation describing why specific things were done in specific ways.
So that's one of the biggest problems. And it's not actually just about engineers; the same happens at the leadership level, right? People meet, they discuss something, everybody is happy, and then it turns out that everybody understood it differently. Even when you put it on paper, people sometimes understand it differently, but at least you can always refer back. And even when you write something, you need to be very explicit: we need to do A. Okay, cool, we're doing it.
Ilya (27:03.003)
Yeah.
Valerii (27:18.318)
Who will be working on it? So, okay, you had a meeting. There are different types of meetings, of course; I usually bucket them into three categories. The first is just to socialize. The second is to find some information. But the majority of meetings need to have action points at the end:
who is doing what, and by when. And not that many meetings actually have an action point list at the end, follow-ups. That's a problem: what was the purpose of this meeting? Not to mention that many meetings are pointless, in the sense that they could easily have been replaced by messaging. And again, when you're messaging, when you send a written message, for many people it's actually harder, because you need to formulate.
Ilya (28:07.793)
Mm-hmm.
Valerii (28:16.728)
You need to be concise, right? Instead of providing a flurry of words, a word salad, which people tend to do when they speak to each other, when you write, you have to be concise. I think it was Churchill who once said: sorry, I don't have much time, that's why my letter is long,
Ilya (28:34.79)
Mm-hmm.
Valerii (28:46.38)
because you probably could formulate the same thing in a more concise way, but writing naturally forces you to be more concise, more precise, et cetera. So I think one of the big advantages is having a proper writing culture. And its importance is hard to overestimate.
Ilya (29:09.178)
Yeah, no, that's terrific. And in fact, you bring me right to the point that I wanted to get to out of this. I think the thing that stuck out to me in the book, that is quite unique, I don't see it in other ML system design books, I don't even see it in DDIA, right, in the things we use for design elsewhere, is the focus on the design doc. And truly, as somebody who's designed really, really complex systems: when I was at Adobe, one of the projects that I was leading from the beginning
had like 80 people working on it in the end, 80 engineers. That's an insane number of people to manage, especially across multiple time zones, multiple regions. And that's when I learned the importance of documentation, but I learned it through trial and error, right, and through a lot of things that went wrong. So, assuming that most of the audience has not read the book yet (they will go out and buy the machine learning system design book right after this), can you give us a little bit of an understanding of:
what makes a good design doc, first of all? Yeah.
Valerii (30:14.254)
That's a hard question. I mean, first, you see, a good design doc has two competing properties. First, it needs to be as short as possible. Second, it needs to be as comprehensive as possible.
And third, a design doc is a living thing; it has to be updated frequently. Because as soon as you design something and put it on paper, you solicit feedback from people. They provide feedback; you have to rewrite 30% of your document, or 20, or 60. And once you start to implement, you also have to, for example, provide code pointers. Say you have a design doc for a machine learning system, and
you have a chapter where you describe why you picked a specific loss function and a specific metric. Maybe once you implement it, you have to provide code pointers to the implementation, or you realize it's not the best choice for a specific reason and you have to update it. So it has to be concise, easy to read, comprehensive, and
relevant, or current, whatever word you want to use: it has to represent reality as it is now, not the reality of a year ago or two years ago. Though even that is not useless, because it still gives you some ideas. So then, a good machine learning design document. The book is structured into 16
chapters; every four chapters are grouped into a part, or whatever you call that, a mega-chapter. And starting from chapter five, every chapter corresponds to a section you would expect to have in a proper machine learning system design document, starting from: what's the problem we're trying to solve, why, and what's the possible gain if we solve it.
Valerii (32:20.066)
Why can't we buy a solution? Why do we need to build? Ideally, one of the metrics of a good design document is how many projects you never start after outlining the design document. Because you may realize there is no problem to solve, or it's too expensive to solve, or you don't need to build anything because you can buy it. And I think it's much better to spend two weeks writing a document and canceling the project than to spend six months
building the project and then canceling it. So, literally, one of the success metrics of a proper design document is how many projects never started because, while outlining the design document, we realized there is no sense in doing them. So: what is the problem, why are we trying to solve it, what's the possible gain, and why can't we buy it, or why don't we want to buy it? Then you can say, okay, here's how we plan to solve it, and we start down this machine learning route: okay, loss function, metrics.
Valerii (33:16.462)
There are many different metrics you can list: offline, online, proxy metrics, and the actual metric you want to move. For example, let's take building a spam classifier. For a spam classifier, you obviously need to have a loss function. And you need some kind of offline metric: it might be recall at a given specificity, or precision at a given recall, or recall at a given precision.
These are offline metrics you can calculate on past history. Then you launch your model in production and run an A/B test; your online metric might be the number of spam complaints, or the number of spammers banned, or the number of spammers appealing their bans. Which is again not your end goal, because your end goal is that users are not being
bothered or harassed by spam, right? So let's say you used to receive one spam message a month, and now it's only one per two months. That's probably not a big difference for you. You could say that you improved all the metrics by 100%, and you can prove it: you ran an A/B test, you improved. But so what? Did you actually achieve anything? You probably spent a lot of time, but so what?
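To make the offline metric mentioned above concrete, here is a minimal brute-force sketch of recall at a given specificity for a hypothetical spam classifier. The labels and scores are toy data invented for illustration; a production pipeline would more likely derive this from a library routine such as scikit-learn's `roc_curve`:

```python
# Recall at a given specificity: the best recall the classifier can reach
# while keeping its false-positive rate acceptably low.
# y_true: 1 = spam, 0 = not spam; y_score: model's spam score.

def recall_at_specificity(y_true, y_score, min_specificity=0.99):
    best_recall = 0.0
    for thr in sorted(set(y_score)):          # try each score as a cutoff
        tp = sum(t == 1 and s >= thr for t, s in zip(y_true, y_score))
        fn = sum(t == 1 and s < thr for t, s in zip(y_true, y_score))
        tn = sum(t == 0 and s < thr for t, s in zip(y_true, y_score))
        fp = sum(t == 0 and s >= thr for t, s in zip(y_true, y_score))
        specificity = tn / (tn + fp) if (tn + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if specificity >= min_specificity:    # threshold is acceptable
            best_recall = max(best_recall, recall)
    return best_recall

# Toy history: 4 spam and 4 ham messages with hypothetical model scores.
y_true  = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(recall_at_specificity(y_true, y_score, min_specificity=1.0))  # 0.75
```

This is exactly an "offline" metric in the sense above: it is computed on past, labeled history, before any model reaches production.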
So you see, there is a chain: the loss function, then the offline metric, then the online metric. But even the online metric is a proxy for the actual thing you want to achieve, which is probably not even reducing spam, because you could always allocate this time and these engineers to solving something more important. This comes back to understanding what is more important and what the end goal is. Because if your end goal reads as "I need to maximize my
recall at a given specificity", okay, but that's a proxy metric of a proxy metric. That's why all of this needs to be in a good design doc: it helps you understand why you're working on it and what you're trying to solve. You're not trying to solve a binary classification problem; the binary classification problem is just an approximation of the real problem, of the real thing you're trying to achieve. So a design document
Valerii (35:34.862)
connects all these things together and helps you to build systems step by step. Okay, I know why I picked this loss function, why I picked this metric, why I picked these online metrics. This is what my dataset looks like. This is how I will collect it. These are the important data checks. This is my validation scheme and how I will validate the results. This is how I will build my first baseline, because you probably don't want to spend time on the most sophisticated solution from the first iteration. Or, if you already have some system, that existing system is your baseline: as soon as you build something, ship it, and it works, it's your new baseline. But if you're building from scratch, you probably want to compare: okay, how much better is my super sophisticated system than a baseline that says everything is spam, or nothing is spam? What's my margin? Because you might have an accuracy of, let's say, 99%, and then 99.9%
Ilya (36:14.332)
Yeah.
Valerii (36:33.486)
But it's not just 0.9%. It's actually a tenfold improvement. Because instead of having 1% errors, and 1% of, say, a hundred million is a million, you would now have ten times fewer, which is 100,000. So you reduce something from a million to 100,000, ten times. That's why it's very important to compare, not just say, oh, 99 looks good, 99.9 looks better. You need to understand
Ilya (36:37.606)
Right.
Valerii (37:02.936)
what you're comparing against and where you started from.
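The arithmetic behind that point is worth writing down, since headline accuracy hides it. A small sketch (the traffic number is illustrative, not from the episode): compare error counts, not accuracy percentages.

```python
def error_count(accuracy, n_items):
    """Number of mistakes implied by an accuracy on n_items."""
    return round((1 - accuracy) * n_items)

n = 100_000_000                 # e.g. 100M classified messages (made-up volume)
before = error_count(0.990, n)  # 1,000,000 mistakes at 99% accuracy
after = error_count(0.999, n)   # 100,000 mistakes at 99.9% accuracy
improvement = before / after    # 10.0 -- ten times fewer errors, not "+0.9%"
```

The same reasoning applies to any saturated metric: near the ceiling, report the reduction in the error rate, because that is what users and costs actually scale with.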
Ilya (37:06.786)
It's interesting because I see machine learning engineers especially get into this pattern, and I fall into it as well: the numbers are improving, I'm happy, right? But I like the concept of stepping away from that and asking: in reality, even if, like you say, this spam classifier improved by a hundred percent, okay, but does that do anything? I think we don't ask ourselves that question enough. And I see this in careers too, because theoretically, after senior engineer, you could stay at that level, like there's no
Valerii (37:38.882)
Well, you're supposed to stay at your level.
Ilya (37:42.426)
Yeah. And so people are like, no, I'm gonna try for the next level. And I'm like, do you understand that the job is different? It's okay, if that's what you wanna do, great, go for it, right? But the job is very different; senior staff engineers and principal engineers do very different things. So yeah, that's terrific. So let me take that a step beyond the design doc then.
So the next step, after you've created the design doc and you're ready to implement, but before you implement, is that you probably want to get it reviewed. So now look at it from the perspective of somebody coming from the outside, right? Say, a senior director who maybe wasn't in on the fine detail of it. When you review a design doc, what do you look for, and what feedback can you provide that will be most useful to the team?
Valerii (38:34.766)
So first of all, a design document is constructed in a way that makes the first couple of chapters relevant to everybody. Even as a senior director, no matter what, you can go and read: okay, what's the problem? Why is it a problem? What's the potential value if we solve this problem? And will we solve this problem, and how? And the second chapter: can we buy it, or can we build it, and what do we need to build it? If you think about it, that's interesting for everybody, right? So when you create the document, you actually start by creating those first two chapters, which everybody can review, and you try to invite as many people as possible who can provide you valuable feedback. Because first of all, someone will pay for the team. Probably many engineers don't realize that, but it's always someone within the company who is paying
Ilya (39:30.652)
Mm-hmm.
Valerii (39:30.798)
for their work. There is some internal accounting or whatever, but in the end it's money. Unfortunately, money doesn't just drop from helicopters, though there was a person at the Federal Reserve who proposed this idea and never implemented it, unfortunately or fortunately. And the second thing: you probably can't have a single person who is a pro at every stage, starting from
Valerii (39:56.302)
picking a proper loss function and metrics, through creating proper ETL pipelines and data checks, to building validation, integrating the system, et cetera, et cetera. So you try to involve as many specialists in each area as possible, sometimes by intentionally tagging or inviting them, or by listing the questions you don't have answers for and inviting everybody to review, so people can pick what to comment on. Literally in my book I have a chapter which lays out what
I think is a good way to write the doc. It has to have a clear structure people can navigate. Instead of five or ten pages that are hard to read straight through, there is a clear structure: okay, I'm interested only in this part, or I want to see the loss function, or I want to see why we do this at all. Then you can ask questions. That's why it's very, very good to have something like a Google Doc. My preferred tool is Quip; I really like Quip. In my time at Meta and Facebook we were using Quip, but I know that it's been...
Ilya (40:53.936)
Discontinued, yeah.
Valerii (40:55.158)
Yes, put on the shelf. But even though it's been put on the shelf, many people kept using it for years, because it's so good. You have version control; of course, it's not as good as GitHub or something like that, but it's still something. You can see comments, you can ask questions, you can see the previous comments. But your goal is to make something very structured, so everybody can go in and pick the subsection they're interested in, or where they think they
Ilya (40:59.568)
Yeah.
Valerii (41:24.782)
are good in that specific area. And the second thing: the worst thing you can do is be defensive. Your goal is actually to solicit as much feedback as possible. And the worst feedback you can receive is "looks good to me" or "it's okay." Because is it actually okay? Is it actually good? Or did someone just say so, so that you won't bother them anymore? That's the worst type of feedback you can receive.
Ilya (41:58.246)
So as a reviewer, what's the feedback that you want to give them?
Valerii (42:04.294)
Well, it depends. I personally like to review three sections. First, the value case: how did we come up with this number? Can you show me some calculations? Who told you that? Especially if it's a big number. Because it's important, right? If it's wrong, everything is wrong. My next favorite section is the loss function and metrics, because people are usually very bad at picking metrics and loss functions.
And this is super important, right? Because if you are improving the wrong thing...
Ilya (42:37.148)
And you're
Valerii (42:37.934)
Yeah. Most of the time people just put down the metrics they know: oh, it's a regression task, okay, I know there are two loss functions, let's pick one. But why? I mean, you can literally show that the choice of loss function can drastically change the output of your model, everything else being fixed. And the same with the metric, right? Because a metric here is actually a proxy metric, so a flawed one. And the third thing I like to check is
validation, because it's very connected to the metric. You can have the proper set of metrics and the proper loss function and whatever, but still get false results because you have some data leak, or a wrong validation scheme. So if you think about it, all three things are very much connected to each other: what's the value case, how we measure that we are getting close to that value, and what we will use to measure it.
So those are my three favorite things. And then I can take a look at what kind of solution we're actually trying to build. Is it some specific model? Why is it so complicated? Why don't we try something simpler like this? But that's more optional. So those are my favorite chapters to review.
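The claim that the loss function alone can drastically change the model can be shown with one tiny example (made-up data, not from the conversation): for a single constant prediction, squared error is minimized by the mean and absolute error by the median, so with an outlier the two losses pick very different "best" models from the same data.

```python
import statistics

targets = [1.0, 1.0, 1.0, 1.0, 100.0]  # same data, one outlier

# Optimal constant prediction under each regression loss:
best_under_mse = statistics.mean(targets)    # squared error -> mean: 20.8
best_under_mae = statistics.median(targets)  # absolute error -> median: 1.0
```

Everything else being fixed, just swapping MSE for MAE moves the "right" answer from 20.8 to 1.0, which is exactly the kind of choice a design doc should justify.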
Ilya (43:50.588)
That's terrific.
Ilya (43:54.606)
Yeah. And what kind of comments do you usually leave? Like, this number is wrong because of X, Y, Z?
Valerii (43:58.702)
Usually I ask: okay, you picked this metric and this loss function. Why? I mean, it's a very frustrating question, "why?", especially if you can't answer it. And what alternatives did you review? Was there actually any thought process behind it? So: why this loss function? Why this metric? And who told you this is the value case? Can I see the calculation behind it?
With validation it's easier: I can just show that it's wrong, or say I think it's correct. The key thing for validation is to understand how the system will be applied; your validation has to be as close to the real application as possible. Plus it needs to be a scheme, not something done once, because otherwise you can obviously overfit to a specific validation split. So here I usually say either that it's good, or that this is not good and why it's not good. With the loss function and metrics, I either ask why, or if I see that it's wrong, I say: this is wrong because of that; think about what you need to do.
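One common concrete instance of "validation as close to the real application as possible": if the model will score future events, split by time rather than randomly, so nothing from the future leaks into training. A minimal sketch; the tuple layout and names are invented for illustration:

```python
def time_based_split(rows, cutoff):
    """rows: (timestamp, features, label) tuples.
    Train strictly before the cutoff, validate at/after it,
    mirroring how the model will actually be applied."""
    train = [r for r in rows if r[0] < cutoff]
    valid = [r for r in rows if r[0] >= cutoff]
    return train, valid

events = [(1, "x1", 0), (2, "x2", 1), (3, "x3", 0), (4, "x4", 1)]
train, valid = time_based_split(events, cutoff=3)
# train holds days 1-2, valid holds days 3-4; a random split would mix them
# and quietly leak future information into training.
```

Making the cutoff a parameter also lets you roll it forward over several windows, which is the "scheme, not a one-off split" point from above.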
Ilya (45:07.77)
Yeah, no, that's terrific. I think I have a somewhat similar opinion about asking why. And people are like, I got him, I figured it out: he's going to ask why, so I'm just going to put it in the document. And then I ask the why of the why. It doesn't really matter how many levels deep you go; I'm going to ask the next level to make sure that you've considered it. And people sometimes don't like me for that. But I'm like, hey, don't like me on the design doc review; you'll like me later. You pay for it now or you pay for it later; it's got to come out somewhere. I do want to switch gears a little bit. You mentioned proper validation and leakage, and that actually brings us very nicely to your past as a Kaggle Grandmaster. I would love for you to talk a little bit about that. What got you into it in the first place? And is it really valuable for people to do Kaggle-type competitions?
Valerii (46:05.806)
Okay, so let's start from the final part of the question. I think it's really valuable for people to do Kaggle competitions. I don't think it's really valuable for people to become Kaggle competition Grandmasters. A number of skills can be learned through Kaggle competitions, like building a proper validation, and Kaggle is very good at teaching that: trying different things and trying to squeeze out as much as possible, because again,
Ilya (46:16.092)
Hahaha
Valerii (46:35.502)
your baseline is other people, and they're constantly improving. So you have to squeeze, and you have to really think hard about everything: feature engineering, model ensembles, how you can extract additional signal, sometimes proper metrics, maybe the loss function, et cetera, et cetera. Now, why did I do it? Well, because many people around me were doing Kaggle, and I'm also
very drawn to competitions, and I thought it was something worth doing, so I started to compete. And I'm very competitive, so I decided, okay, I need to get the highest rank. And as soon as I got it, I thought, okay, now I need to go from top 20 or top 30 to top 10. And then I thought, no, no, no, I need to break this addiction. So with Kaggle, there is this cohort of people who are really competitive and
easily addicted to things like competition. So that's how it happened.
Ilya (47:37.658)
So when you see an engineer who comes and applies for a job with you, who is, you know, maybe not a Grandmaster, but who has some Kaggle competitions to show, what do you look for? Are you necessarily looking for a certain placement?
Valerii (47:52.686)
Look, it really depends. First, if it's a Kaggle Grandmaster: you don't have that many Grandmasters in the world. Out of the two or three million people who have competed on Kaggle over the course of, what, 17 years, you have fewer than 300 competition Grandmasters. So it really shows you something, like determination. This person can work long hours for free. A good quality,
Ilya (48:19.836)
Yeah.
Valerii (48:20.59)
right, to work on a problem. So it's a good set of, not even skills, right? Because being able to work on a specific problem for a long time and not give up is not a skill; it's a feature of your character. So it shows you that, and it definitely shows you something technical. Now, if someone is not a Kaggle competition Grandmaster, even if they're a Master, it doesn't tell you much,
unless it's a Master who won a gold medal solo, in something very relevant to what you're hiring this person for. So for experienced people it doesn't matter much, because you already have experience: if you worked at Meta for X years, that's probably more relevant than your silver medal on Kaggle. Now, if you're at the beginning of your career, an intern or a junior, it's something that can help you stand out and something
to discuss at your interview. When you're scanning a CV, you're looking for something to have a discussion about, something that can distinguish you from others. So for early careers it's a good example; for more experienced people, unless it's something like the top 1% of the top 1% of the top 1%, it doesn't matter much.
Ilya (49:44.784)
Yeah, and it's really hard to get there. There are other ways to demonstrate some of those things that are considerably easier in some ways. So, yeah.
Valerii (49:54.83)
But look, maybe it's the wrong analogy, but if you're hiring a courier, you don't necessarily need to hire an Olympic champion in the marathon, right?
Ilya (50:09.948)
I wish people thought that way, but every time I'm part of hiring, it's like, just get the single best person. And I'm like, you don't have the money to pay the best person; what are you asking for here? But yeah, I totally agree with you. Let's talk a little bit about this topic then. Let's talk about the various levels, especially staff, senior staff, and then
distinguished, or whatever you want to call them; principal, I guess, is next. What do you use as a differentiator between them? At that point you've definitely led projects with multiple teams, you've written design docs, you've launched products, maybe even at big companies. But what do you look for to say, this person is definitely staff, this person is definitely principal?
Valerii (50:42.664)
Principal.
Valerii (51:08.61)
Well, first we need to realize that there are different archetypes within staff-level engineers. There is even a book about staff-level engineers. I haven't finished it because I didn't like it much, but that's just my impression; I think it's worth buying just for the couple of pages describing four different archetypes and providing their imaginary schedules
Valerii (51:38.198)
over the week. So there is this archetype used at Facebook for level seven, the coding machine: basically, you have super deep expertise in one very specific domain. For example, right now everybody's talking about LLM optimization, inference optimization, training optimization. If you're super good and you can code at the CUDA level, or even below the CUDA level, and do this super deep optimization,
right, then you can be considered senior staff or principal or whatever. I think that, to some extent, is relatively easy to understand: if you can do things that few other people can do, if you're one of 10 people in the world who can do it, or a hundred, or a thousand out of 10 million, that speaks for itself. But you know that the majority of staff engineers, senior staff engineers, and principal engineers are not demigods.
They are very good, but they're not the kind of people where only 10 or 100 in the world can do what they do. So the internal framework I use, apart from technical expertise, is the ability to be the glue and to technically manage a large scope. The usual framework is the following. A staff engineer can manage two, three, four teams
and the project which involves them, and deliver this project, acting as a tech lead, as an architect or designer, as a coder, sometimes coming in and solving the hardest parts. A senior staff engineer manages that at a bigger scope, at the scope of, you could say, an organization. An organization is not the whole company; in Meta terms it might be, for example, internal infrastructure
or the whole integrity organization. So you take not four teams but maybe 25 teams, and you again become this glue and provide technical leadership and oversight. Again, technical leadership: not just coming in and talking, but actually understanding how things work, and that if something is done wrongly here, it will backfire there. And again, you can solve problems that
Valerii (54:04.042)
nobody else within this organization can solve from a technical point of view. So you're not one in a million or one in a hundred thousand, but you're one of a hundred or maybe two hundred. And the same logic applies further to principal engineer: an even bigger project involving even more people, even harder to organize, even harder to navigate; there are more engineers, so the problems are harder.
Again, sometimes people get to this level differently. Let's say you worked in ads and you did something that increased revenue by a hundred million. You'd probably be promoted. But if you...
Ilya (54:46.076)
Oh, a hundred million is nothing in ads these days. So we're...
Valerii (54:48.782)
For one person, I think that's a good reason to be promoted. Because if some change delivered specifically by yourself increased revenue by a hundred million, I think that's a very strong case for receiving "greatly exceeds expectations," or even "redefines expectations."
Ilya (54:55.409)
Hehehe.
Ilya (55:05.212)
Mm-hmm.
Ilya (55:08.708)
I don't think it's RE. GE, yeah maybe, but not RE.
Valerii (55:12.046)
Well, again, it depends on whether you've done it repeatedly: if you did 100 million, then 100 million, then 50 million, then 50 million, and 50 million, because it adds up, right? And what was 50 million a year ago is probably 55 million now, because the overall revenue base is growing. So again, there are different flavors to it, but basically it's scope,
Ilya (55:23.696)
Yeah, it adds up.
Valerii (55:41.646)
scale, and complexity. And that's why you have way fewer E7s than M2s. For those who don't know what that means: E7 is basically a senior staff engineer, and people at the same level on the management ladder are called M2, senior managers. The ratio of staff engineers to engineering managers is more or less one to one, as far as I know; at least those were the open stats.
If you go a level up, the number of senior managers is two or three times the number of senior staff engineers. The number of directors is five or six times the number of principal engineers, because obviously you don't need that many engineers at those levels. But if you have engineers, you need someone to run the teams: if you have 100 or 200 engineers, you need a director, right? Then you have senior managers and managers below them, so roughly for every hundred people you need a director,
and you have, what, 50,000 engineers at Meta. You can do the math and realize how many directors that implies, around 500 directors. You probably don't have 500 principal engineers.
Ilya (56:52.764)
Yeah, and if you go to distinguished, you basically just don't have distinguished engineers. There are only a few dozen; I think all of ads had one when I was there.
Valerii (56:57.836)
Yeah, it's a rarity to have a distinguished engineer. Everybody has seen a VP, but not everybody has seen a distinguished engineer.
Ilya (57:11.12)
Yeah, they're a little bit of a unicorn. I've got one more question for you, and then a quick question after that. This one is the main question of this podcast. Reflecting back on all the people you've worked with and all the people you're working with now: what makes the best ML engineers? What's the differentiator between the best ML engineers and the merely good ones?
Valerii (58:10.242)
They aren't satisfied until they understand. I think that's the biggest difference. Because the best are all good engineers: they work hard, they work a lot, they know a lot, they learn a lot, all that. But they aren't satisfied until they understand. And this understanding comes in many forms: is this the best result we could get? I don't know, I don't understand. Let's go further.
Let's try this, let's try that. So they experiment. In ML you never truly understand, and that forces you to go further and further and to think outside the box. But they're not satisfied until they understand.
Ilya (58:54.064)
Yeah, no, that's terrific. And this is where the five or ten whys come in: you just keep asking why, and you want to make sure that you understand. Thank you very much. Just a final question to leave off with: how can people find out more about you? Obviously there's the book; is there a preferred way you'd like people to procure it?
Valerii (59:14.407)
LinkedIn is pretty good. People can easily find me there and send me a message. I think that's the way.
Ilya (59:20.73)
Yeah, and I'll have a link to that, and I'll have a link to the book. Thank you very much, thank you for taking the time to talk. Thanks.
Valerii (59:26.51)
Thank you.
Valerii (59:32.984)
Thank you very much. You have a good day ahead of you. Take care.