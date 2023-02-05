
Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Or...

  Simplifying Cloud Migration Strategy at Tidal with David Colebatch
    David Colebatch, CEO at Tidal.cloud, joins Corey on Screaming in the Cloud to discuss how Tidal is demystifying cloud migration strategy. David and Corey discuss the pros and cons of a hybrid cloud migration strategy, and David reveals the approach that Tidal takes to ensure they’re setting their customers up for success. David also discusses the human element of cloud migration initiatives, and how to overcome roadblocks when handling the people side of migrations.

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: LANs of the late ’90s and early 2000s were a magical place to learn about computers, hang out with your friends, and do cool stuff like share files, run websites & game servers, and occasionally bring the whole thing down with some ill-conceived software or network configuration. That’s not how things are done anymore, but what if we could have a ’90s-style LAN experience along with the best parts of the 21st century internet? (Most of which are very hard to find these days.) Tailscale thinks we can, and I’m inclined to agree. With Tailscale I can use trusted identity providers like Google, or Okta, or GitHub to authenticate users, and automatically generate & rotate keys to authenticate devices I’ve added to my network. I can also share access to those devices with friends and teammates, or tag devices to give my team broader access. And that’s the magic of it: your data is protected by the simple yet powerful social dynamics of small groups that you trust. Try now - it’s free forever for personal use.
I’ve been using it for almost two years personally, and am moderately annoyed that they haven’t attempted to charge me for what’s become an essential-to-my-workflow service.

Corey: Have you listened to the new season of Traceroute yet? Traceroute is a tech podcast that peels back the layers of the stack to tell the real, human stories about how the inner workings of our digital world affect our lives in ways you may have never thought of before. Listen and follow Traceroute on your favorite platform, or learn more about Traceroute at origins.dev. My thanks to them for sponsoring this ridiculous podcast.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Every once in a while at The Duckbill Group, I like to branch out and try something a little bit different before getting smashed, vocally, right back into the box I find myself in for a variety of excellent reasons. One of these areas has been, for a while, the idea of working on migrations, getting folks into cloud. There’s a lot of cost impact to it, but there are also a lot of things that I generally consider to be unpleasant nonsense with which to deal. My guest today sort of takes a different philosophy to this. David Colebatch is the CEO and founder of Tidal.cloud. David, thank you for joining me.

David: Oh, thanks for having me, Corey.

Corey: Now, cloud migrations tend to be something that is, I want to say, contentious, and for good reason. You have all the cloud providers who are ranting that cloud is the way and the light, as if they’ve just found religion, and yeah, the fact that it basically turns into a money-printing machine for them has nothing to do with their newfound advocacy for this approach. Now, I do understand that we all have positions that we come from that shape our perspective. You do run and did found a cloud migration company. What’s your take on it?
Is this as big as the cloud providers say it is, is it overhyped, or is it underhyped?

David: I think it’s probably in the middle of this stage of the hype cycle. But the reason that Tidal exists and why I founded it was that many customers were approaching cloud just for cloud’s sake, you know, and they were looking at cloud as a place to park VMs. And our philosophy as software engineers at Tidal is that customers were missing out on all the new capabilities that cloud provided. You know, cloud is a new paradigm in compute. And so, our take on it is that the customer should not look at cloud as a place to migrate to, but rather as a place to transform to and embrace all the new capabilities that are on offer.

Corey: I’ve been saying for a while that if you sit there and run a total cost analysis for going down the path of a cloud migration, you will not save money in the short term, call it five years or whatnot. So, if you’re migrating to the cloud, in the common case it should be for a capability story, not because it’s going to save you money off of what you’re currently doing in the data center. Agree, disagree, or it’s complicated?

David: It’s complicated, but you’re right in one case: you need to work backwards from the outcomes. I think that much is pretty simple and clear, but many teams overlook that. And again, when you look at cloud for the sake of cloud, you generally do overlook that. But when we work with customers and they log in to our platform, what we find is that they’re often articulating their intent as, “I want to improve business agility, I want to improve staff productivity,” and it’s less about just moving workloads to the cloud. Anyone can run a VM somewhere.
And so, I think, when we work backwards from what the customer is trying to achieve and we look at TCO holistically, not just about how much a computer costs to run and operate in a colo facility, but holistically from a staff productivity perspective as well, then the business case for cloud becomes very profound.

Corey: I’ve been saying for a while that I can make a good-faith Total Cost of Ownership analysis (or TCO analysis) in either direction, so tell me what outcome you want and I can come up with a very good-faith answer that gives you what you want. I don’t think I’ve seen too many TCO analyses, especially around cloud migrations, that were not justification exercises. They were very rarely open questions. It was, “We’ve decided what we want to do. Now, let’s build a business case to do that thing.” Agree, disagree?

David: [laugh]. Agree. I’ve seen that. Yeah, we, again, like to understand the true picture of total cost of ownership on-premises first, and many customers, depending on who you’re engaging with, but on the IT side, might actually shield a few of those costs or they might just not know them. And I’m talking about things like facilities, insurance costs, utility bills, and things like that, that might not bubble up.

We need to get all those cards on the table in order to conduct a full TCO analysis. And then on the cloud side, we need to look at multiple scenarios per workload. So, we want to understand that lift-and-shift base case that many people come from, but also that transformative migration case which says, I might be running in a server-ful architecture today on-premises, but based on the source code and database analysis that we’ve done, we can see an easy lift to things like Lambda and serverless frameworks on the cloud.
And so, when you take that transformative approach, you may spend some time upfront doing that transformation, or if it’s a tight fit, it might be really easy; it might actually be faster than reverse-engineering firewall rules and doing a lift-and-shift. And in that case, you can save up to 97% in annual OPEX, which is a huge savings, of course.

Corey: You said the magic words, lift-and-shift, which means all right, the gloves come off. Let’s have this conversation.

David: Oh yeah.

Corey: I work on AWS bills for a living. Cloud cost and architecture are fundamentally the same thing, and when I start looking at a company’s monthly bill, I can start to see the architectural patterns emerge with no further information than what’s shown in the exploded bill view, at least at a high level. It starts to be indicative of different things. And you can generally tell, on some level, when companies have come from a data center environment or at least a data center mentality, in what they’ve built. And I’ve talked to a number of companies where they have effectively completely lifted their data center into the cloud, and the only real change that they have gotten in terms of value for it has been that machines are going down a lot less, because before, the hard drives failed and they were really bad at replacing hard drives.

Now, for companies in that position who have that challenge, yeah, the value is there and it’s apparent because I promise, whoever you are, the cloud providers are better at replacing failed hard drives than you are, full stop. And if that’s the value proposition you want, great, but it also feels like that is just scratching the surface of what the benefit of cloud providers can be.

David: Absolutely. I mean, we look at cloud as a way to unlock new ways of working and it’s totally aligned with the new distributed product team approach that many enterprises are pursuing.
You know, the rise of Agile and DevOps has sort of facilitated this movement away from single choke points of IT service delivery, like we used to have with ITIL, into much more modern ways of working. And so, I imagine when you’re looking at those cloud bills, you might see a whole host of workloads centered into one or two accounts, like they’ve just replicated a data center into one or two accounts and lifted-and-shifted a bunch of EC2 to it. And yeah, that is not the most ideal architectural pattern to follow in the cloud. If you’re working backwards from, “I want to improve staff productivity; I want to improve business agility,” you need to do things like limit your blast radius and have a multi-account strategy that supports that.

Corey: We’ve seen this as well in born-in-the-cloud companies, too, because for a long time, that was AWS’s guidance: put everything in a single AWS account. The end. And then just, you know, get good with IAM issues. Like, “Well okay, I found that developer environments impacted production.” Then, “Sounds like a skill issue.”

Great, but then you also have things that cannot be allocated, like service quotas. When you have something in development run amok and exhaust service quotas for number of EC2 get instance info requests, suddenly, load balancers don’t anymore and auto-scaling is kind of aspirational when everything explodes on you. It’s the right path, but very often, people got there through following the best advice that AWS offers. I am in the middle of a migration myself from the quote-unquote “legacy” AWS account, where I built a bunch of stuff in 2016, into its own dedicated account, and honestly, it’s about as challenging as some data center moves that I’ve done historically.

David: Oh, absolutely. I mean, the cobwebs build up over time and you have a lot of dependencies on services you completely forget about.

Corey: “How do I move this S3 bucket to another account?” “That’s the neat part. You don’t.”

David: [laugh].
We shouldn’t just limit that to AWS. I mean, the other cloud providers have similar issues to deal with through their older cloud adoption frameworks, which are now playing out. And some of those guidance points were due to technology limitations in the underlying platform, too, and so, you know, at the time, that was the best way to go to cloud. But as customers have demanded more agility and more control over their blast radiuses and enabling self-service teams, this has forced everyone to sort of come along and embrace this multi-account strategy. Where the challenge is, with a lot of our enterprise clients, and especially in the public...

Corey: Embrace it or you’ll be made to embrace it.

David: Yeah [laugh]. We see it with our enterprise accounts that were early adopters; they certainly have that issue with too much concentration on one or two accounts. But public sector accounts as well, which we’re seeing a lot of momentum in, come from a place where they’re heavily regulated and follow heavy architectural standards which dictate some of these things. And so, in order for those clients to be successful in the cloud, they have to have real leadership and real champions that are able to, sort of, forge through some of those issues and break outside of the mold in order to demonstrate success.

Corey: On some level, when I see a lift that failed to shift, it’s an intentional choice in some cases where the company has decided to improve their data center environment at the cost of their cloud environment. And it feels, on some level, like it’s a transitional step, but then the question that I always have is, was this the grand plan? So, I guess my question for you is, when you see a company that has some workloads in a data center and some living in the cloud provider in what most people call hybrid, is that outcome intentional or is it accidental, where midway through, they realize that some workloads are super hard to migrate?
They have a mainframe and there is no AWS/400 available for their use, so they’re going to give up halfway, declare victory, and yep, we’re hybrid now. How did they get there?

David: I think it’s intentional quite often, in that they see hybrid cloud as a stepping stone to going full cloud. And this just comes down to project scoping and governance, too. So, many leaders will draw a ring around the workloads that are easy to migrate and they’ll claim success at the end of that and move on to another job quite often. But the visionary leaders will actually chart a course that has a hundred percent adoption, full data center closure, off the mainframe, off AS/400, you know, refactored usually, but they’ll chart that course at a rate of change that the organization can accept. Because, you know, cloud being a new paradigm, cloud requiring new ways of working, they can’t just ram that kind of change through in their enterprise in one or two years; they really need to make sure that it’s being absorbed and adopted and embraced by the teams and not alienating the whole company as they go through. And so, I do see it as intentional, but that stepping stone that many companies take is also an okay thing in my mind.

Corey: And to be clear, I should bound what I’m saying from the perspective that I’m talking about this from a platonic ideal perspective. I am not suggesting that, “Oh, this thing that you built at your company is crappy,” I mean, any more so than anything else is. I’ve never yet seen any infrastructure that the people running it would step back and say, “This is amazing and perfect.” Everyone thinks it’s a burning dumpster fire of sadness and regret and I’m not entirely sure that they’re wrong.

I mean, designing an architecture, cloud or otherwise, on a whiteboard is relatively straightforward, for a junior employee, even. The problem is most people don’t get to start from scratch and build that thing.
There’s existing stuff that needs to be migrated in and most of us don’t get the luxury of taking two years of downtime for that service while we wind up rebuilding it from scratch. So, it’s one of those “how do you rebuild a car without taking it off the highway to do it” type of questions.

David: Well, you want to have a phased migration approach, quite often. Your business can’t stop and start because you’re doing a migration, so you want to build momentum with the early adopters that are easy to migrate and don’t require big interruptions to business. And then for those mission-critical workloads that do need to migrate (and you mentioned mainframe and AS/400 before), they might be areas where you introduce, like, a strangler fig pattern: you know, draw a ring around it, start replicating some services into cloud, and then phase that migration over a year or two, depending on your timeline and scale. And so, we’re very much pragmatic in this business in that we want to make sure we’re doing everything for the right reasons, for the business-led reasons, and fitting migrations around business objectives and strategies is super critical to success.

Corey: What I’m curious about is, when we talk about migrations (in fact, when I invited you on the show, it was like, well, Tidal Migrations), one thing I love about calling it that, for the domain in some cases as well as other things, is, “Huh, says right on the tin what it is. Awesome.” But it’s migrations, which I assumed to be, you know, from data centers into cloud. That’s great. But then you’ve got the question of, is that what your work looks like? Is it migrations in the other direction? Is cloud repatriation a thing that people are doing, and no one ever bothered to demonstrate that to me? Is it cloud to cloud? What are you migrating from and to?

David: Well, that’s great. And we actually dropped migrations from the name.

Corey: Oh, my apologies.
Events, once again, outpace me.

David: Tidal.cloud is our URL and essentially, Corey, the business of migration is something that’s only becoming increasingly frequent. Customers are not just migrating from on-premises data centers to cloud, they’re also migrating in between their cloud accounts like you are, but also from one cloud provider to another. And our business hypothesis here at Tidal is that that innovation cycle is continuing to shrink, and so whereas when I was in the data center automation business, we used to have a 10- and 15-year investment cycle, now customers have embraced continuous delivery of their applications and so there’s this huge shift of investment horizons, bringing it down to almost an annual event for many of the applications that we touch.

Corey: You are in fact correct. Tidal.cloud does have a banner at the top that says, “Tidal Migrations is now Tidal.” Yep, you’re correct, not that I’m here to, like, incorrect you on the name of your own company, for God’s sake. That’s a new level of mansplaining I dare not delve into.

But it does say, “Migration made modern,” right at the top, which is great because there’s a sense that I’ve always had that lift-and-shift is poo-pooed as a bad approach to migrating, but I’ve done it other ways and it becomes disastrous. I’ve always liked the approach of: take something in a data center, migrate it into cloud, in the process changing as few things as possible, and then just get it stable and working there, and step two becomes the transformation. Because if you try and transform while it moves, yeah, that gets you a little closer to the outcome in theory, but when things don’t work right (and they’re computers; let’s not kid ourselves, nothing works right), it’s a question now of: was it my changes? Is it the cloud environment? Is there an unknown dependency that assumes things in the data center that are not true in cloud?
It becomes very hard to track down the why of these things.

David: There’s no one-size-fits-all for migration. It’s why we have the seven-hour assessment capabilities. You know, one application, like you’ve just talked about, might be better to lift and shift than modernize; there might be real business reasons for doing that. But what we’ve seen over the years is that customers generally have one migration budget. Now, IT gets one migration budget and they get to the end of a job in a lift-and-shift scenario and the business says, “Well, what changed? Nothing, my apps still run the same, I don’t notice any new capabilities.” And IT then says, “Yeah, yeah. Now, we need the modernization budget to finish.” And they say, “No, no, no. We’ve just given you a bunch of money. You’re not getting any more.”

And so, that’s where quite often the migration as a lift-and-shift kind of stalls and you see an exodus of talent out of those organizations; people leave to go on to the next migration project elsewhere and that organization really didn’t embrace any of the cloud-native changes that were required. We’d like to really say that (and you saw this on our header, “Migration made modern”) we’d like to dispel the myth that you can either migrate or modernize. It’s really not an either/or. There’s a full spectrum of our methods, like replatform, refactor, and rehosting, in the middle there. And when we work backwards from customers, we want to understand their core objectives for going to cloud, their intent, their, “Why cloud?”

We want to understand how it aligns with the cloud value framework, so business agility gains, staff productivity gains; total cost of ownership is important, of course. And then for each of their application workloads, choose the right 6R based on those business outcomes. And it can seem like a complicated or comprehensive problem, but if you automate it like we do, you can get very consistent results very quickly.
And that’s really the accelerant that we give customers to accelerate their migration to cloud.

Corey: One thing that I’ve noticed (and maybe this makes me cynical) is that when I see companies doing lift-and-shift, often they will neglect to do the shift portion of it. Because there’s a compelling reason to do a migration to get out of a data center and into a cloud, and often that is a data center contract expiry coming up. But companies are very rarely going to invest the time, energy, and money (which all become the same thing, effectively, at company scale) in refactoring existing applications if they’re not already broken.

I see that all the time in my work. I don’t make recommendations to folks very often that have the form, “Oh, just migrate this entire application to serverless and you’ll save 80% or more on it.” And it’s, “That’s great, but that’s 18 months’ worth of work and it doesn’t actually get us closer to our business milestones, so yeah, we’re not going to do that.” Cost directly is very rarely a compelling reason to make a migration, but when you’re rebuilding something for business purposes, factoring cost concerns into it seems to be a much better way to gain adoption and traction of those ideals.

David: Yeah, yeah. Counterpoint on that: when we look at a portfolio of applications, like, hundreds or thousands of applications in an enterprise, and we do this type of analysis on them with the customers, what we’ve learned is that they may refactor and replatform 10, 20% of their workloads, they may rehost 40%, and they’ll often turn off the rest, retire them, not migrate them.
And many of our enterprise customers that we’ve spoken to have gone through rationalizations as they’ve gone to cloud and saved, you know, 59%, just turned off that 59% of the infrastructure. And the apps that they do end up refactoring and modernizing are the ones where either there’s a very easy path for them, like, the code is super compatible and written in a way that’s fitting with Lambda and so they’ve done that, or they’ve got, like you said, business needs coming up. So, the business is already investigating making some changes to the application, they already want to embrace CI/CD pipelines where they haven’t today. And for those applications, what we see teams doing is actually building new in the cloud and then managing that as an application migration, like, cutting over to that.

But in the scheme of an entire portfolio of hundreds or thousands of applications, that might be 5, 10, 20% of the portfolio. It won’t be all of them. And that’s why we say there’s a full spectrum of migration methods and we want to make sure we apply the right ones to each workload.

Corey: Yeah, I want to be clear that there are different personas. I find that most of my customers tend to fall into two buckets. The first is that you have the born-in-the-cloud SaaS companies, and that’s the world I come from, where you have basically one workload that’s 80% of your application spend, your revenue, et cetera. Like, they are not a customer, but take Datadog as an example. Like, the Datadog monitoring application suite would be a good example of this, and then you have a bunch of longtail stuff.

Conversely, you’ve got a large enterprise that might be spending $100 million or so every year, but their largest single application is a couple million bucks because it just has thousands upon thousands of them. And at that point, it becomes much more of a central IT planning problem.
In one of those use cases, spending significant effort refactoring and rebuilding things, from an optimization perspective, can pay dividends. In other cases, it tends not to work in quite the same way, just because the economies of scale aren’t there. Do you find that most of your customers fall into one of those two buckets? Do you take a different view of the world? How do you see the market?

David: Same view, we do. Enterprise customers are generally the areas that we find the most fit with. The ISVs, you know, that have one or two primary applications, born in the cloud, they don’t need to do portfolio assessments. And with the enterprise customers, the central IT bit used to be a blocker and impediment for cloud. We’re increasingly seeing more interest from central IT who is trying to lead their organization to cloud, which is great, that’s a great sign.

But in the past, it had been more of a business-led conversation where one business unit within an enterprise wants to branch away from central IT, and so they take it upon themselves to do an application assessment, they take it upon themselves to get their own cloud accounts, you know, a shadow IT move, in a way. And that had a lot of success because the business would always tie it back to business outcomes that they were trying to achieve. Now, with IT doing mass migration, mass portfolio assessment, this does require them to engage deeply with the business areas and sometimes we’re seeing that happening for the very first time. There’s no longer IT at the end of a chain, but rather it’s a joint partnership as they go to cloud, which is really cool to see.

Corey: When I go to Tidal.cloud, you have a gif (yes, that’s how it’s pronounced; I’m not going to take debates on that matter) at the top of your site showing a command line tool that runs an analyze command on an application. What are you looking at to establish an application or workload’s suitability for migration?
Because I have opinions on this, but you have, you know, a business around this and I’m not going to assume that my strongly-held opinions informed by several weeks of work are going to trump, you know, the thing that your entire company is built around.

David: Thanks, Corey. Yeah, you’re looking at our command-line utilities there. It’s an accompanying part of our product suite. We have a web application and the command-line utilities are what customers use behind their firewall to analyze their applications. The data points that we look at are infrastructure, as you can imagine: you might plug into VMware and discover VMs that are running, we’ll look for non-x86 workloads on the network.

So, infrastructure is sort of bread and butter; everyone does that. Where Tidal differentiates is going up the stack, analyzing source code, analyzing database technologies, and looking at the schema usage within your on-premises database, for example, which features and functionality you’re using, and then how that fits to more cloud-native database offerings. And then we’ll look at the technology age as well. And when you combine all of those technology factors together, we sort of form a view of what the migration difficulty to cloud will be on various migration outcomes, be it rehost, replatform, or refactor.

The other thing that we add there is on the business side and the business intent. So, we want to understand from leadership what their intent is with cloud, and there are some levers they pull in the Tidal platform there. But then we also want to understand from each application owner how they think about their applications, what the value of those applications is to them and what their forward-looking plans are.
We capture all these things in our tool, we then run it through our recommendation engine, and that’s how we come up with a bespoke migration plan per client.

Corey: One of the challenges I have in the cost arena around a lot of these tools is that, “Oh, we’re going to look at your various infrastructure-as-code situations and see what that’s going to cost you for a given change.” It’s like, sure, that’s not hard from a baseline of, “I want to spin up ten more EC2 instances.” Yes, that is the tricky part of cloud economics known as basic arithmetic. The problem I see is that, okay, and then they’re going to run Kubernetes, which has no sense of zone affinity, so it’s going to wind up putting nondeterministic amounts of traffic across an AZ boundary and that’s going to spike data transfer in some use cases, but none of these tools have any conception as to what those workloads look like. Now, that’s a purely cost perspective, but that does have architectural approaches. Do you factor things like that in when you move up the stack?

David: Absolutely. And really understanding, on a Tidal inventory basis, what the intent is of each of those workloads really does help you, from a cloud economics basis, to work out how much is reasonable in terms of cloud costs. So, for example, in Tidal, if you’re doing app assessment, you’re capturing any revenue to the business that it generates, any staff productivity that it creates. And so, you’ve got the income side of that application workload. When you map that to on-premises costs and then later to cloud costs, your FinOps job becomes a lot easier because now you have the business context of those workloads too.

Corey: So, one of the things that I have found is that you can judge the actual success of a project by how many people who work at the company claimed credit for it on LinkedIn, whereas conversely, when things don’t work out super well, it’s sort of a crickets moment.
I’m curious as to your perspective on whether there is such a thing as a migration failure, or is it simply a, “Oh, we’re going to iterate on this in a new direction. We’ve replaced a failing part, which turned out, from our perspective, to be our CIO, but we have a new one who’s going to move us into cloud in the proper time and space.” We go through more of those things than some people do underwear. My God. But is there such a thing as a failed cloud migration?

David: There absolutely is. And I get your point that success has many fathers. You know, when clients have brought us in for that success party at the end, you don’t recognize everybody there. But you know, failure can be, you know, you’ve missed on time, scope, or budget, and by those measures, I think 76% of IT projects were failing in 2018, when we ran those numbers.

So absolutely, by those metrics, there are failed cloud migrations. What tends to happen is people claim success on the workloads that did migrate. They may then kick it out into a new project scope, the organizational change bit. So, we’ve had many customers who viewed the cloud migration as a lift-and-shift exercise and failed to execute on the organizational change and then months later realized, oh, that is important in order for my day two operations to really hum, and so then have embarked on that under a separate initiative.
So, there’s certainly a lot of rescoping that goes on with these things.

And what we like to make sure we’re teaching people (and we do this for free) is those lessons learned and pitfalls with cloud early on, because we don’t want to see all those headlines of failed projects under that; we want to make sure that customers are armed with, “Here are the things you should consider to execute on as you go to cloud.”

Corey: Do you ever run an analysis on a workload when a customer is asking, “So, how should we go about migrating this?” And your answer is, “You should absolutely not”?

David: Well, all applications can go to cloud, it’s just a matter of how much elbow grease you want to put into it. And so, the absolutely not call comes from when that app doesn’t provide any utility to the business or maybe it has a useful life of six more months and the data center is going to be alive for seven. So, that’s when those types of judgment calls come in. Other times we’ve seen, you know, there’s already a replacement initiative underway by the business. IT wasn’t aware of it, but through our process and methodology, they engaged with the business for the first time and learned about it. And so, that helps them to avoid needing to migrate workloads because the business is already moving to Salesforce, for example.

Corey: I imagine you’re also relatively used to the sinking realization that customers often have when they’re used to data center thinking and you ask them a question, like, “How many gigabytes a month does your application server send back and forth to your database server?” And their response, very reasonably, is, “Why on earth would I know the answer to that quest... oh, God. You mean, that’s how it bills?” It’s the sense that everything is different in cloud, sometimes subtly, sometimes massively.
But it’s a different way of thinking.So, I guess my last real big question for you on this is, moving technology is relatively straightforward but migrating people is very challenging. How do you find that the people and the processes that have grown up in data center environments, with people whose identities are inextricably linked to the technology they work on, respond to being faced with the idea that it is now time to pick up and move these things into an environment where things that were incredibly valuable guardrails in a data center environment no longer serve you well?David: Yeah. The people side of cloud migration is the more challenging part. It’s actually one of the reasons we introduced a service offering around people change management. The general strategy is sort of the Kotter change process of creating that guiding coalition, the people who want to do something different, get them outside of IT, reporting out to the executives directly, so they’re unencumbered by the traditional processes. And once they start to demonstrate some success of a new way of working, a new paradigm, you kind of sell that back into the organization in order to drive that change.It’s getting a lot easier to position those organizational change aspects with customers. There are enough horror stories out there of people that did not take that approach. And quite rightly. I mean, it’s tough to imagine, as a customer, like, if I’m applying my legacy processes to cloud migration, why would I expect to get anything but a legacy result? You know, and most of the customers that we talk to that are going to cloud want a transformational outcome, they want more business agility and greater staff productivity, and so they need to recognize that that doesn’t come without change to people and change to the organization. 
It doesn’t mean you have to change the people out individually, but skilling the way we work, those types of things, are really important to invest in and I’d say even more so than the technology aspects of any cloud migration.Corey: David, I really want to thank you for taking the time to talk to me about something that is, I’d say near and dear to my heart, except I’m trying desperately not to deal with it more than I absolutely have to. If people want to learn more, where’s the best place for them to find you?David: Sure. I mean, tidalcloud.com is our website. I’m also on Twitter @dcolebatch. I like to tweet there a little bit, increasingly these days. I’m not on Bluesky yet, though, so I won’t see you there. And also on LinkedIn, of course.Corey: And we will, of course, put links to that in the [show notes 00:29:57]. Thank you so much for your time. I really appreciate it.David: Thanks, Corey. Great to be here.Corey: David Colebatch, CEO and founder of Tidal.cloud. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that you will then struggle to migrate to a different podcast platform of your choice.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
    5/16/2023
    32:39
  Doing What You Love in Cloud with Nate Avery
    Nate Avery, Outbound Product Manager at Google, joins Corey on Screaming in the Cloud to discuss what it’s like working in the world of tech, including the implications of AI technology on the workforce and the importance of doing what you love. Nate explains why he feels human ingenuity is so important in the age of AI, as well as why he feels AI will make humans better at the things they do. Nate and Corey also discuss the changing landscape of tech and development jobs, and why it’s important to help others throughout your career while doing something you love. About NateNate is an Outbound Product Manager at Google Cloud focused on our DevOps tools. Prior to this, Nate has 20 years of experience designing, planning, and implementing complex systems integrating custom-built and COTS applications. Throughout his career, he has managed diverse teams dedicated to meeting customer goals. With a background as a manager, engineer, Sys Admin, and DBA, Nate is currently working on ways to better build and use virtualized computer resources in both internal and external cloud environments. Nate was also named a Cisco Champion for Datacenter in 2015.Links Referenced: Google Cloud: https://cloud.google.com/devops Not Your Dad’s IT: http://www.notyourdadsit.com/ Twitter: https://twitter.com/nathaniel_avery LinkedIn: https://www.linkedin.com/in/nathaniel-avery-2a43574/ TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: It’s easy to **BEEP** up on AWS. Especially when you’re managing your cloud environment on your own!Mission Cloud un **BEEP**s your apps and servers. Whatever you need in AWS, we can do it. 
Head to missioncloud.com for the AWS expertise you need. Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn, and my guest today is Nate Avery, who’s an outbound product manager over at Google Cloud. Nate, thank you for joining me.Nate: Thank you for having me. This is really a pretty high honor. I’m super thrilled to be here.Corey: One of my questions that I have about any large company when I start talking to them and getting to know people who work over there, pretty quickly emerges, which is, “What’s the deal with your job title?” And it really doesn’t matter what the person does, what the company is, there’s always this strange nuance that tends to wind up creeping into the company. What is an outbound product manager and what is it you say it is you do here?Nate: Okay. That’s an interesting question because I’ve been here for about a year now and I think I’m finally starting to figure it out. Sure, I should have known more when I applied for the job, [laugh] but there’s what’s on the paper and then there’s what you do in reality. And so, what it appears to be, where I’m taking this thing now, is I talk to folks about our products and I try to figure out what it is they like, what it is they don’t like, and then how do we make it better? I take that information back to our engineers, we huddle up, and we figure out what we can do, how to do it better, how to set the appropriate targets when it comes to our roadmaps. We look at others in the industry, where we are, where they are, where we think we can maybe have an advantage, and then we try to make it happen. That’s really what it is.Corey: One of the strange things that happens at big companies, at least from my perspective, given that I’ve spent most of my career in small ones, is that everyone has a niche. There are very few people at large companies whose job description is yeah, I basically do everything. Where do you start? 
And where do you stop because Google Cloud, even bounding it to that business unit, is kind of enormous? You’ve [got 00:02:47] products that are outbound that you manage. And I feel like I should also call out that a product being outbound is not the same thing as being outgoing. I know that people are always wondering, what’s Google going to turn off next, but Google Cloud mostly does the right thing in that respect. Good work.Nate: [laugh]. Nice. So, the products I focus on are the DevOps products. So, those are Cloud Build, Cloud Deploy, Artifact Registry, Artifact Analysis. I also work with some of our other dev tooling such as Cloud Workstations. That’s in public preview right now, but maybe by the time this goes to air, it’ll actually be in general availability.And then I also will talk about some of our other lesser-known tools like Skaffold or maybe on occasion, I’ll throw out something about minikube. And also, Cloud Code, which is a really deep browser plugin for your IDE that gives you access to lots of different Google tools. So yeah, that’s sort of my area.Corey: Well, I’m going to start with the last thing you mentioned, where you have Cloud Code as an IDE tooling and a plug-in for it. I’m relatively new to the world of IDEs because I come from the world of grumpy Unix admins; you never know what you’re going to be remoting into next, but it’s got VI on it, so worst case, you’ll have that. So, I grew up using that, and as a result, that is still my default. I’ve been drifting toward VS Code a fair bit lately, as I’ve been regrettably learning JavaScript and TypeScript, just because having a lot of those niceties is great. But what’s really transformative for me has been a lot of the generative AI offerings from various companies around hey, how about we just basically tab-complete your code for you, which is astonishing. 
I know people love to argue about that and then they go right back to their old approach of copying and pasting their code off Stack Overflow.Nate: Yeah. That’s an interesting one. When it works, it works and it’s magical. And those are those experiences where you say, “I’m going to do this thing forever and ever; I’m never going to go back.” And then when it doesn’t work, you find yourself going back and then you maybe say, “Well, heck, that was horrible. Why’d I ever even go down this path?”I will say everyone’s working on something along those lines. I don’t think that that’s much of a secret. And there are just so many different avenues for getting there. And I think that this is so early in the game that where we are today isn’t where we’re going to be.Corey: Oh, just—it’s accelerating. Watching the innovation right now in the generative AI space is incredible. My light bulb moment that finally got me to start paying attention to this and viewing it as something other than hype that people are trying to sell us on conference stages was when I used one of them to spit out just, from a comment in VS Code, “Write a Python script that will connect to AWS pricing API and tell me what something costs, sorted from most to least expensive regions.” Because doing that manually would have taken a couple hours because their data structures are a sad joke and that API is garbage. And it sat and spun for a second and then it did it. But if I tell that story as, “This is the transformative moment that opened my eyes,” I sound incredibly sad and pathetic.Nate: No, I don’t think so. I think that what it does, is it… one, it will open up more eyes, but the other thing that it does is you have to take that to the next level, which is great. That’s great work, gone. Now that I have this information, what do I do with it? 
That’s really where we need to be going and where we need to think about what this AI revolution is going to allow us to do, and that’s to actually put this stuff into context.That’s what humans do, which the computers are not always great at. And so, for instance, I see a lot of posts online about, “Hey, you know, I used to do job X, where I wrote up all these things,” or, “I used to write a blog and now because of AI, my boss wants me to write, you know, five times the output.” And I’m thinking, “Well, maybe the thing that you’re writing doesn’t need to be written if it can be easily queried and generated on the fly.” You know? Maybe those blog posts just don’t have that much value anymore. So, what is it that we really should concentrate on in order to help us do better stuff, to have a higher order of importance in the world? That’s where I think a lot of this really will wind up going is… you know, just as people, we’ve got to be better. And this will help us get there.Corey: One area of nuance on this, though, is—you’re right when I talked about this with some of my developer friends, some of their responses were basically to become immediately defensive. Like, “Sure, it’s great for the easy stuff, but it’s not going to solve the high-level stuff that senior engineers are good at.” And I get that. This ridiculous thing that I had to do is not a threat to a senior engineer, but it is arguably a threat to someone I find on Upwork or Fiverr or whatnot to go and write this simple script for me.Nate: Oh yeah.Corey: Now, the concern that I have is one of approachability and accessibility because senior engineers don’t form fully created from the forehead of some God somewhere that emerges from Google. They start off as simply people who have no idea what they’re doing and have a burning curiosity about something, in many cases. Where is the next generation going to get the experience of writing a lot of that small-scale stuff, if it’s done for them? 
And I know that sounds alarmist, and oh, no, the sky is falling, and are the children going to be all right, as most people my age start to get into. But I do wonder what the future holds.Nate: That’s legit. That’s a totally legit question because it’s always kind of hanging out there. I look at what my kids have access to today. They have freaking Oracle, the Oracle at Delphi on their phone; you know, and—Corey: If Oracle the database were on their phone, I would hate to imagine what the cost of raising your kids to adulthood would be.Nate: Oh, it’s mighty, mighty high [laugh]. But no, they have all of this stuff at their hands and then even just in the air, right? There’s ambient computing, there’s any question you want answered, you could speak it into the air and it’ll come out. And it’ll be, let’s just say, I don’t know, at least 85% accurate. But my kids still ask me [laugh].Corey: Having my kids, who are relatively young, still argue and exhaust their patience on a robot with infinite patience instead of me who has no patience? Transformative. “How do I spell whatever it is?” “Ask Alexa,” becomes a story instead of, “Look it up in the dictionary,” like my parents used to tell me. It’s, “If I knew how to spell it, I wouldn’t need to look it up in the dictionary, but I don’t, so I can’t.”Nate: Right. And I would never need to spell it again because I have the AI write my whole thing for me.Corey: That is a bit of concern for me when—some of the high school teachers are freaking out about students writing essays with this thing. And, yeah, on the one hand, I absolutely see this as alarmism, where, oh, no, I’m going to have to do my job, on some level. But the reason you write so many of those boring, pointless essays in English class over the course of the K through 12 experience is ideally, it’s teaching you how to frame your discussions, how to frame an argument, how to tell a compelling story. 
And, frankly, I think that’s something that a lot of folks in the engineering cycle struggle with mightily. You’re a product slash program manager at this point; I sort of assume that I don’t need to explain to you that engineers are sometimes really bad at explaining what they mean.Nate: Yeah. Dude, I came up in tech. I’m… bad at it too sometimes [laugh]. Or when I think I’m doing a great job and then I look over and I see a… you know, the little blanky, blanky face, it goes, “Oh. Oh, hold on. I’ll recalibrate that for you.” It’s a thing.Corey: It’s such a bad trope that they have now decided that they are calling describing what you actually mean slash want is now an entire field called prompt engineering.Nate: Dude, I hate that. I don’t understand how this is going to be a job. It seems to be the most ridiculous thing in the world. If you say, “I sit down for six hours a day and I ask my computer questions,” I got to ask, “Well, why?” [laugh]. You know? And really, that’s the thing. It gets back—Corey: Well, most of us do that all day long. It’s just in Microsoft Excel or they use SQL to do it.Nate: Yeah… it is, but you don’t spend your day asking the question of your computer, “Why.” Or really, most of us ask the question, “How?” That’s really what it is we’re doing.Corey: Yeah. And that is where I think it’s going to start being problematic for some folks who are like, “Well, what is the point of writing blog posts if Chat-GIPITY can do it?” And yes, that’s how I pronounce it: Chat-GIPITY. And the response is, “Look, if you’re just going to rehash the documentation, you’re right. There’s no point in doing it.”Don’t tell me how to do something. Tell me why. Tell me when I should consider using this tool, that tool, why this is interesting to me, why it exists. 
Because the how, one way or another, there are a myriad ways to find out the answer to something, but you’ve got to care first and convincing people to care is something computers still have not figured out.Nate: Bingo. And that gets back to your question about the engineers, right? Yeah. Okay. So sure, the little low-level tasks of, “Hey I need you to write this API.” All right, so maybe that stuff does get farmed out.However, the overall architecture still has to be considered by someone, someone still has to figure out where and how, and when things should be placed and the order in which these things should be connected. That never really goes away. And then there’s also the act of creation. And by creation, I mean, just new ideas, things that—you know, that stroke of creativity and brilliance where you just say, “Man, I think there’s a better way to do this thing.” Until I see that from one of these generative AI products, I don’t know if anyone should truly feel threatened.Corey: I would argue that people shouldn’t necessarily feel threatened regardless because things always change; that’s the nature of it. I saw a headline on Hacker News recently where it said that 90% of my skills are worthless, but 10% of them are worth 10x what they were. And I think that there’s a lot of truth to that because it’s, if you want a job where you never have to—you don’t have to keep up with the continuing field, there are options. Not to besmirch them, but accountants are a terrific example of this. Yes, there’s change to accountancy rules, but it happens slowly and methodically. You don’t go on vacation for two years as an accountant—or a sabbatical—come back and discover that everything’s different and math doesn’t work the way it once did. Computers on the other hand, it really does feel like it’s keep up or you never will.Nate: Unless you’re a COBOL guy and you get called back for Y2K.Corey: Oh, of course. 
And I’m sure—and now you’re sitting around, you’re waiting because when the epoch time problem hits in 2038, you’re going to get your next call out. And until then, it’s kind of a sad life. You’re the Maytag repair person.Nate: Yeah. I’m bad at humor, by the way, in case you have noticed. So, you touched on something there about the rate of change and how things change and whether or not these generative AI models are going to be able to—you know, just how far can they go? And I think that there’s a—something happened over the last week or so that really got me thinking about this. There was a posting of a fake AI-generated song, I think from Drake.And say what you want about cultural appropriation, all that sort of thing, and how horrible that is, what struck me was the idea that these sorts of endeavors can only go so far because in any genre where there’s language, and current language that morphs and changes and has subtlety to it, the generative AI would have to somehow be able to mimic that. And not to say that it could never get there, but again, I see us having some situations where folks are worried about a lot of things that they don’t need to worry about, you know, just at this moment.Corey: I’m curious to figure out what your take is on how you see the larger industry because for a long time—and yes, it’s starting to fade on some level, because it’s not 2006 anymore, but there was a lot of hero worship going on with respect to Google, in particular. It was the mythical city on the hill where all the smart people went and people’s entire college education was centered around the idea of, well, I’m going to get a job at Google when I graduate or I’m doomed. And it never seems to work out that way. I feel like there’s a much more broad awareness these days that there’s no one magical company that has the answers and there are a lot of different paths. 
But if you’re giving guidance to someone who’s starting down that path today, what would it be?Nate: Do what you love. Find something that you love, figure out who does the thing that you love, and go there. Or go to a place that does a thing that you love poorly. Go there. See if you can make a difference. But either way, you’re working on something that you like to do.And really, in this business, if you can’t get in the door at one of those places, then you can make your own door. It’s becoming easier and easier to just sort of shoehorn yourself into this space. And a lot of it, yeah, there’s got to be talent, yeah, you got to believe in yourself, all that sort of thing, but the barriers to entry are really low right now. It’s super easy to start up a website, it costs you nothing to have a GitHub account. I really find it surprising when I talk to my younger cousins or someone else in that age range and they start asking, like, “Well, hey, how do I get into business?”And I’m like, “Well, what’s your portfolio?” You know? And I ask them, “Do you want to work for someone else? Or would you like to at least try working for yourself first?” There are so many different avenues open to folks that you’re right, you don’t have to go to company X or you will never be anything anymore. That said, I am at [laugh] one of the bigger companies and there are some brilliant people here. I bump into them and it’s kind of wild. It really, really is.Corey: Oh, I want to be very clear, despite the shade that I throw at Google—and contemporary peers in the big tech company space—there are an awful lot of people who are freaking brilliant. And more importantly, by far, a lot of people who are extraordinarily kind.Nate: Yeah. Yeah. So, all right, in this business, there’s that whole trope about, “Yeah, they’re super smart, but they’re such jerks.” It doesn’t have to be that way. It really doesn’t. 
And it’s neat when you run into a place that has thousands of people who do not fit that horrible stereotype out there of the geek who can’t, you know, who can’t get along well with others. It’s kind of nice.But I also think that that’s because the industry itself is opening up. I go on to Twitter now and I see so many new faces and I see folks coming in, you know, for whatever reason, they’re attracted to it for reasons, but they’re in. And that’s the really neat part of it. I used to worry that I didn’t see a lot of young people being interested in this space. But I’m starting to notice it now and I think that we’re going to wind up being in good hands.Corey: The kids are all right, I think, is a good way of framing it. What made you decide to go to Google? Again, you said you’ve been there about a year at this point. And, on some level, there’s always a sense in hindsight of, well, yeah, obviously someone went from this job to that job to that job. There’s a narrative here and it makes sense, but I’ve never once in my life found that it made sense and was clear while you’re making the decision. It feels like it only becomes clear in hindsight.Nate: Yes, I am an extremely lucky person. I am super fortunate, and I will tell a lot of people, sometimes I have the ability to fall ass-backwards into success. And in this case, I am here because I was asked. I was asked and I didn’t really think that I was the Google type because, I don’t know what I thought the Google type was, just, you know, not me.And yet, I… talked it out with some folks, a really good, good buddy of mine and [laugh] I’ll be darned, you know, next thing, you know, I’m here. So, gosh, what can I say except, don’t limit yourself [laugh]. 
We do have a tendency to do that and oh, my God, it’s great to have a champion and what I’d like to do now, now that you mention it and it’s been something that I had on my mind for a bit is, I’ve got to figure out how to, you know how to start, you know, giving back, paying it forward, whatever the phrase it is you want to use? Because—Corey: I like, “Send the elevator back down.”Nate: Send the elevator back down? There you go, right? If that escalator stopped, turn it back on.Corey: Yeah, escalator; temporarily, stairs.Nate: Yes. You know, there are tons of ways up. But you know, if you can help someone, just go ahead and do it. You’d be surprised what a little bit of kindness can do.Corey: Well, let’s tie this back to your day job for a bit, on some level. You’re working on, effectively, developer tools. Who’s the developer?Nate: Who’s the developer? So, there’s a general sense in the industry that anyone who works in IT or anyone who writes code is a developer. Sometimes there’s the very blanket statement out there. I tend to take the view that a developer is the person who writes the code. That is a developer, that’s [unintelligible 00:21:52] their job title. That’s the thing that they do.The folks who assist developers, the folks who keep the servers up and running, they’re going to have a lot of different names. They’re DevOps admins, they’re platform admins, they’re server admins. Whatever they are, rarely would I call them developers, necessarily. So, I get it. We try to make blanket statement, we try to talk to large groups at a time, but you wouldn’t go into your local county hospital and say that, “I want to talk to the dentist,” when you really mean, like, a heart surgeon.So, let’s not do that, you know? We’re known for our level of specificity when we discuss things in this field, so let’s try to be a little more specific when we talk about the folks who do what they do. 
Because I came up on that ops track and I know the type of effort that I put in, and I looked at folks across from me and I know the kind of hours that they put in, I know all of the blood, sweat, and tears and sleepless nights and answering the pagers at four in the morning. So, let’s just call them what they are, [laugh] right? And it’s not to say that calling them a developer is an insult in any way, but it’s not a flex either.Corey: You do work at a large cloud company, so I have to assume that this is a revelation for you, but did you know that words actually mean things? I know, it’s true. You wouldn’t know it from a lot of the product names that wind up getting scattered throughout the world. The trophy for the worst one ever though, is Azure DevOps because someone I was talking to as a hiring manager once thought that they listed that as a thing they did on their resume and was about to can the resume. It’s, “Wow, when your product name is so bad that it impacts other people’s careers, that’s kind of impressively awful.”But I have found that back when the DevOps movement was getting started, I felt a little offput because I was an operations person; I was a systems administrator. And suddenly, people were asking me about being a developer and what it’s like. And honestly, on some level, I felt like an imposter, just because I write configuration files; I don’t write code. That’s very different. Code is something smart people write and I’m bad at doing that stuff.And in the fullness of time, I’m still bad at it, but at least now I’m enthusiastically bad at it. And, on some level, brute force also becomes a viable path forward. But it felt like it was gatekeeping, on some level, and I’ve always felt like the terms people use to describe what I did weren’t aimed at me. I just was sort of against the edge.Nate: Yeah. 
And it’s a weird thing that happens around here, how we get to these points, or… or somehow there’s an article that gets written and then all of a sudden, everyone’s life is changed in an industry. You go from your job being, “Hey, can you rack and stack the server?” To, “Hey, I need you to write this YAML code that’s going to virtually instantiate a server and also connect it to a load balancer, and we need these done globally.” It’s a really weird transition that happens in life.But like you said, that’s part of our job: it morphs, it changes, it grows. And that’s the fun of it. We hope that these changes are actually for the better and that they’re going to make us more productive and they’re going to make our businesses thrive and do things that they couldn’t do before, like maybe be more resilient. You know, you look at the number of customers—customers; I think of them as customers—who had issues because of that horrible day on 9/11 and, you know, their business goes down the tube because there wasn’t an adequate DR or COOP strategy, you know? And I know, I’m going way back in the wayback, but it’s real. And I knew people who were affected by it.Corey: It is. And the tide is rising. This gets back to what we were talking about where the things that got you here won’t necessarily get you there. And Cloud is a huge part of that. These days, I don’t need to think about load balancers, in many cases, or all of the other infrastructure pieces because Google Cloud—among other companies, as well, lots of them—have moved significantly up the stack.I mean, people are excited about Kubernetes in a whole bunch of ways, but what an awful lot of enterprises are super excited about is suddenly, a hard drive failure doesn’t mean their application goes down.Nate: [Isn’t that 00:26:24] kind of awesome?Corey: Like, that’s a transformative moment for them.Nate: It totally is. You know, I get here and I look at the things that people are doing and I kind of go, “Wow,” right? 
I’m in awe. And to be able to contribute to that in some way by saying, “Hey, you know what would be cool? How about we try this feature?” Is really weird, [laugh] right?It’s like, “Wow, they listened to me.” But we think about what it is we’re trying to do and a lot of it, strangely enough, is not just helping people, but helping people by getting out of the way. And that is huge, right? You know, because you just want it to work, but more than it just working, you want it to be seamless. What’s easier than putting your key in the ignition and turning it? Well, not having to use a key at all.So, what are those types of changes that we can bring to these different types of experiences that folks have? If you want to get your application onto a Kubernetes cluster, it shouldn’t be some Herculean feat.Corey: And running that application responsibly should not require a team of people, each making a quarter million bucks a year, just to be able to do it safely and responsibly. There’s going to be a collapsing down of what you have to know in order to run these things. I mean, web servers used to be something that required a month of your life and a fair bit of attention to run. Now, it’s a checkbox in a cloud console.Nate: Yeah. And that’s what we’re trying to get it to, right? Why isn’t everything a checkbox? Why can’t you say, “Look, I wrote my app. I did the hard part.” Let’s—you know, I just need to see it go somewhere. You know? Make it go and make it stay up. And how can I do that?And also, here’s a feature that we’re working on. Came out recently and we want folks to try it. It’s a Cloud Deploy feature that works for Cloud Run as well as it does for GKE. And it’s… I know it’s going to sound super simple: it’s our canary deployment method. 
But it’s not just canary deployment, but also we can tie it into parallel deployment.And so, you can have your new version of your app stood up alongside your old version of the app and we can roll it out incrementally in parallel around the world and you can have an actual test that says, “Hey, is this working? Is it not working?” If it does, great, let’s go forward. If it doesn’t, let’s roll back. And some of the stuff sounds like common sense, but it’s been difficult to pull off.And now we’re trying to do it with just a few lines of YAML. So, you know, is it as simple as it could be? Well, we’re still looking at that. But the features are in there and we’re constantly looking at what we can do to iterate and figure out what the next thing is.Corey: I really want to thank you for taking the time to speak with me. If people want to learn more, where’s the best place for them to find you?Nate: Best place for them to find me used to be my blog, it’s Not Your Dad’s IT. However, I’ve been pretty negligent there since doing this whole Google thing, so I would say, just look me up on Twitter at @nathaniel_avery, look me up on Google. You can go to a pretty cool search engine and [laugh]—Corey: Oh, that’s right. You guys have a search engine now. Good work.Nate: That’s what I hear [laugh].Corey: Someday maybe it’ll even come to Google Docs.Nate: [laugh]. Yes, so yeah, that’s where to find me. You know, just look me up at Nathaniel Avery. I think that handle works for almost everything, Twitter, LinkedIn, wherever, and reach out.If there’s something you like about our DevOps tools, let me know. If there’s something you hate about our DevOps tools, definitely let me know. Because the only reason we’re doing this is to try and help people. And if we’re not doing that, then we need to know. We need to know why it isn’t working out.And trust me, I talk to these engineers every day. 
That’s the thing that really keeps them moving in the morning is knowing that they’re doing something to make things better for folks. Real quick, I’ll close out, and I think I may have mentioned this on some other podcasts. I come from the ops world. I was that guy who had to help get a deployment out on a Friday night and it lasted all weekend long and you’re staring there at your phone at some absurd time on a Sunday night and everyone’s huddled together and you’re trying to figure out, are we going to rollback or are we going to go forward? What are we going to do by Monday?Corey: I don’t miss those days.Nate: Oh, oh God no. I don’t miss those days either. But you know what I do want? I took this job because I don’t want anyone else to have those days. That’s really what it is. We want to make sure that these tools give folks the ability to deploy safely and to deploy with confidence and to take that level of risk out of the equation, so that folks can, you know, just get back to doing other things. You know, spend that time with your family, spend the time reading, spend that time prompting ChatGPT with questions, [laugh] whatever it is you want to do, but you shouldn’t have to sit there and wonder, “Oh, my God, is my app working? And what do I do when it doesn’t?”Corey: I really want to thank you for being as generous with your time and philosophy on this. Thanks again. I’ve really enjoyed our conversation.Nate: Thank you. Thank you. I’ve been a big fan of your work for years.Corey: [laugh]. Nate Avery, outbound product manager at Google Cloud. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
    5/11/2023
    33:15
  Cutting Costs in Cloud with Everett Berry
    Everett Berry, Growth and Open Source at Vantage, joins Corey at Screaming in the Cloud to discuss the complex world of cloud costs. Everett describes how Vantage takes a broad approach to understanding and cutting cloud costs across a number of different providers, and reveals which providers he feels generate large costs quickly. Everett also explains some of his best practices for cutting costs on cloud providers, and explores what he feels the impact of AI will be on cloud providers. Corey and Everett also discuss the pros and cons of AWS savings plans, why AWS can’t be counted out when it comes to AI, and why there seems to be such a delay in upgrading instances despite the cost savings.
    About Everett: Everett is the maintainer of ec2instances.info at Vantage. He also writes about cloud infrastructure and analyzes cloud spend. Prior to Vantage Everett was a developer advocate at Arctype, a collaborative SQL client acquired by ClickHouse. Before that, Everett was cofounder and CTO of Perceive, a computer vision company. In his spare time he enjoys playing golf, reading sci-fi, and scrolling Twitter.
    Links Referenced: Vantage: https://www.vantage.sh/ Vantage Cloud Cost Report: https://www.vantage.sh/cloud-cost-report Everett Berry Twitter: https://twitter.com/retttx Vantage Twitter: https://twitter.com/JoinVantage
    Transcript
    Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. 
This is Screaming in the Cloud.Corey:  LANs of the late 90’s and early 2000’s were a magical place to learn about computers, hang out with your friends, and do cool stuff like share files, run websites & game servers, and occasionally bring the whole thing down with some ill-conceived software or network configuration. That’s not how things are done anymore, but what if we could have a 90’s style LAN experience along with the best parts of the 21st century internet? (Most of which are very hard to find these days.) Tailscale thinks we can, and I’m inclined to agree. With Tailscale I can use trusted identity providers like Google, or Okta, or GitHub to authenticate users, and automatically generate & rotate keys to authenticate devices I've added to my network. I can also share access to those devices with friends and teammates, or tag devices to give my team broader access. And that’s the magic of it, your data is protected by the simple yet powerful social dynamics of small groups that you trust.Try now - it's free forever for personal use. I’ve been using it for almost two years personally, and am moderately annoyed that they haven’t attempted to charge me for what’s become an essential-to-my-workflow service.Corey: Have you listened to the new season of Traceroute yet? Traceroute is a tech podcast that peels back the layers of the stack to tell the real, human stories about how the inner workings of our digital world affect our lives in ways you may have never thought of before. Listen and follow Traceroute on your favorite platform, or learn more about Traceroute at origins.dev. My thanks to them for sponsoring this ridiculous podcast. Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. This seems like an opportune moment to take a step back and look at the overall trend in cloud—specifically AWS—spending. And who better to do that than this week, my guest is Everett Berry who is growth in open-source over at Vantage. 
And they’ve just released the Vantage Cloud Cost Report for Q1 of 2023. Everett, thank you for joining me. Everett: Thanks for having me, Corey. Corey: I enjoy playing slap and tickle with AWS bills because I am broken in exactly that kind of way where this is the thing I’m going to do with my time and energy and career. It’s rare to find people who are, I guess, similarly afflicted. So, it’s great to wind up talking to you, first off. Everett: Yeah, great to be with you as well. Last Week in AWS and in particular, your Twitter account, are things that we follow religiously at Vantage. Corey: Uh-oh [laugh]. So, I want to be clear because I’m sure someone’s thinking it out there, that, wait, Vantage does cloud cost optimization as a service? Isn’t that what I do? Aren’t we competitors? And the answer that I have to that is not by any definition that I’ve ever seen that was even halfway sensible. If SaaS could do the kind of bespoke consulting engagements that I do, we would not sell bespoke consulting engagements because it’s easier to click button: receive software. And I also will point out that we tend to work once customers are at a certain point at scale that in many cases is a bit prohibitive for folks who are just now trying to understand what the heck’s going on the first time finance has some very pointed questions about the AWS bill. That’s how I see it from my perspective, anyway. Agree? Disagree? Everett: Yeah, I agree with that. I think the product solution, the system of record that companies need when they’re dealing with Cloud costs ends up being a different service than the one that you guys provide. And I think actually the two work in concert very well, where you establish a cloud cost optimization practice, and then you keep it in place via software and via sort of the various reporting tools that Vantage provides. So, I completely agree with you. 
In fact, in the hundreds of customers and deals that Vantage has worked on, I don’t think we have ever come up against Duckbill Group. So, that tells you everything you need to know in that regard.Corey: Yeah. And what’s interesting about this is that you have a different scale of visibility into the environment. We wind up dealing with a certain profile, or a couple of profiles, in our customer base. We work with dozens of companies a year; you work with hundreds. And that’s bigger numbers, of course, but also in many cases at different segments of the industry.I also am somewhat fond of saying that Vantage is more focused on going broad in ways where we tend to focus on going exclusively deep. We do AWS; the end. You folks do a number of different cloud providers, you do Datadog cost visibility. I’ve lost track of all the different services that you wind up tracking costs for.Everett: Yeah, that’s right. We just launched our 11th provider, which was OpenAI and for the first time in this report, we’re actually breaking out data among the different clouds and we’re comparing services across AWS, Google, and Azure. And I think it’s a bit of a milestone for us because we started on AWS, where I think the cost problem is the most acute, if you will, and we’ve hit a point now across Azure and Google where we actually have enough data to say some interesting things about how those clouds work. But in general, we have this term, single pane of glass, which is the idea that you use 5, 6, 7 services, and you want to bundle all those costs into one report.Corey: Yeah. And that is something that we see in many cases where customers are taking a more holistic look at things. But, on some level, when people ask me, “Oh, do you focus on Google bills, too,” or Azure bills in the early days, it was, “Well, not yet. 
Let’s take a look.” And what I was seeing was, they’re spending, you know, millions or hundreds of millions, in some cases, on AWS, and oh, yeah, here’s, like, a $300,000 thing we’re running over on GCP is a proof-of-concept or some bizdev thing. And it’s… yeah, why don’t we focus on the big numbers first? The true secret of cloud economics is, you know, big numbers first rather than alphabetical, but don’t tell anyone I told you that.Everett: It’s pretty interesting you say that because, you know, in this graph where we break down costs across providers, you can really see that effect on Google and Azure. So, for example, the number three spending category on Google is BigQuery and I think many people would say BigQuery is kind of the jewel of the Google Cloud empire. Similarly for Azure, we actually found Databricks showing up as a top-ten service. Compare that to AWS where you just see a very routine, you know, compute, database, storage, monitoring, bandwidth, down the line. AWS still is the king of costs, if you will, in terms of, like, just running classic compute workloads. And the other services are a little bit more bespoke, which has been something interesting to see play out in our data.Corey: One thing that I’ve heard that’s fascinating to me is that I’ve now heard from multiple Fortune 500 companies where the Datadog bill is now a board-level concern, given the size and scale of it. And for fun, once I modeled out all the instance-based pricing models that they have for the suite of services they offer, and at the time was three or $400 a month, per instance to run everything that they’ve got, which, you know, when you look at the instances that I have, costing, you know, 15, 20 bucks a month, in some cases, hmm, seems a little out of whack. And I can absolutely see that turning into an unbounded growth problem in kind of the same way. I just… I don’t need to conquer the world. I’m not VC-backed. 
I am perfectly content at the scale that I’m at—Everett: [laugh].Corey: —with the focus on the problems that I’m focused on.Everett: Yeah, Datadog has been fascinating. It’s been one of our fastest-growing providers of sort of the ‘others’ category that we’ve launched. And I think the thing with Datadog that is interesting is you have this phrase cloud costs are all about cloud architecture and I think that’s more true on Datadog than a lot of other services because if you have a model where you have, you know, thousands of hosts, and then you add-on one of Datadogs 20 services, which charges per host, suddenly your cloud bill has grown exponentially compared to probably the thing that you were after. And a similar thing happens—actually, my favorite Datadog cost recommendation is, when you have multiple endpoints, and you have sort of multiple query parameters for those endpoints, you end up in this cardinality situation where suddenly Datadog is tracking, again, like, exponentially increasing number of data points, which it’s then charging to you on a usage-based model. And so, Datadog is great partners with AWS and I think it’s no surprise because the two of them actually sort of go hand-in-hand in terms of the way that they… I don’t want to say take ad—Corey: Extract revenue?Everett: Yeah, extract revenue. That’s a good term. And, you know, you might say a similar thing about Snowflake, possibly, and the way that they do things. 
Like oh, the, you know, warehouse has to be on for one minute, minimum, no matter how long the query runs, and various architectural decisions that these folks make that if you were building a cost-optimized version of the service, you would probably go in the other direction. Corey: One thing that I’m also seeing, too, is that I can look at the AWS bill—and just billing data alone—and then say, “Okay, you’re using Datadog, aren’t you?” Like, “How did you know that?” Like, well, first, most people are; secondly, CloudWatch is your number two largest service spend right now. And it’s the downstream effect of hammering all the endpoints with all of the systems. And is that data you’re actually using? Probably not, in some cases. It’s, everyone turns on all the Datadog integrations the first time and then goes back and resets and never does it again. Everett: Yeah, I think we have this set of advice that we give Datadog folks and a lot of it is just, like, turn down the ingestion volume on your logs. Most likely, logs from 30 days ago that are correlated with some new services that you spun up—like you just talked about—are potentially not relevant anymore, for the kind of day-to-day cadence that you want to get into with your cloud spending. So yeah, I mean, I imagine when you’re talking to customers, they’re bringing up sort of like this interesting distinction where you may end up in a meeting room with the actual engineering team looking at the actual YAML configuration of the Datadog script, just to get a sense of like, well, what are the buttons I can press here? And so, that’s… yeah, I mean, that’s one reason cloud costs are a pretty interesting world is, on the surface level, you may end up buying some RIs or savings plans, but then when you really get into saving money, you end up actually changing the knobs on the services that you’re talking about. Corey: That’s always a fun thing when we talk to people in our sales process. 
It’s been sord—“Are you just going to come in and tell us to buy savings plans or reserved instances?” Because the answer to that used to be, “No, that’s ridiculous. That’s not what we do.” But then we get into environments and find they haven’t bought any of those things in 18 months.Everett: [laugh].Corey: —and it’s well… okay, that’s step two. Step one is what are you using you shouldn’t be? Like, basically measure first then cut as opposed to going the other direction and then having to back your way into stuff. Doesn’t go well.Everett: Yeah. One of the things that you were discussing last year that I thought was pretty interesting was the gp3 volumes that are now available for RDS and how those volumes, while they offer a nice discount and a nice bump in price-to-performance on EC2, actually don’t offer any of that on RDS except for specific workloads. And so, I think that’s the kind of thing where, as you’re working with folks, as Vantage is working with people, the discussion ends up in these sort of nuanced niche areas, and that’s why I think, like, these reports, hopefully, are helping people get a sense of, like, well, what’s normal in my architecture or where am I sort of out of bounds? Oh, the fact that I’m spending most of my bill on NAT gateways and bandwidth egress? Well, that’s not normal. That would be something that would be not typical of what your normal AWS user is doing.Corey: Right. There’s always a question of, “Am I normal?” is one of the first things people love to ask. And it comes in different forms. But it’s benchmarking. It’s, okay, how much should it cost us to service a thousand monthly active users? It’s like, there’s no good way to say that across the board for everyone.Everett: Yeah. I like the model of getting into the actual unit costs. 
I have this sort of vision in my head of, you know, if I’m Uber and I’m reporting metrics to the public stock market, I’m actually reporting a cost to serve a rider, a cost to deliver an Uber Eats meal, in terms of my cloud spend. And that sort of data is just ridiculously hard to get to today. I think it’s what we’re working towards with Vantage and I think it’s something that with these Cloud Cost Reports, we’re hoping to get into over time, where we’re actually helping companies think about well, okay, within my cloud spend, it’s not just what I’m spending on these different services, there’s also an idea of how much of my cost to deliver my service should be realized by my cloud spending.Corey: And then people have the uncomfortable realization that wait, my bill is less a function of number of customers I have but more the number of engineers I’ve hired. What’s going on with that?Everett: [laugh]. Yeah, it is interesting to me just how many people end up being involved in this problem at the company. But to your earlier point, the cloud spending discussion has really ramped up over the past year. And I think, hopefully, we are going to be able to converge on a place where we are realizing the promise of the cloud, if you will, which is that it’s actually cheaper. And I think what these reports show so far is, like, we’ve still got a long ways to go for that.Corey: One thing that I think is opportune about the timing of this recording is that as of last week, Amazon wound up announcing their earnings. And Andy Jassy has started getting on the earnings calls, which is how you know it’s bad because the CEO of Amazon never deigned to show up on those things before. And he said that a lot of AWS employees are focused and spending their time on helping customers lower their AWS bills. 
And I’m listening to this going, “Oh, they must be talking to different customers than the ones that I’m talking to.” Are you seeing a lot of Amazonian involvement in reducing AWS bills? Because I’m not and I’m wondering where these people are hiding.Everett: So, we do see one thing, which is reps pushing savings plans on customers, which in general, is great. It’s kind of good for everybody, it locks people into longer-term spend on Amazon, it gets them a lower rate, savings plans have some interesting functionality where they can be automatically applied to the area where they offer the most discount. And so, those things are all positive. I will say with Vantage, we're a cloud cost optimization company, of course, and so when folks talk to us, they often already have talked to their AWS rep. And the classic scenario is, that the rep passes over a large spreadsheet of options and ways to reduce costs, but for the company, that spreadsheet may end up being quite a ways away from the point where they actually realize cost savings.And ultimately, the people that are working on cloud cost optimization for Amazon are account reps who are comped by how much cloud spending their accounts are using on Amazon. And so, at the end of the day, some of the, I would say, most hard-hitting optimizations that you work on that we work on, end up hitting areas where they do actually reduce the bill which ends up being not in the account manager’s favor. And so, it’s a real chicken-and-egg game, except for savings plans is one area where I think everybody can kind of work together.Corey: I have found that… in fairness, there is some defense for Amazon in this but their cost-cutting approach has been rightsizing instances, buy some savings plans, and we are completely out of ideas. Wait, can you switch to Graviton and/or move to serverless? 
And I used to make fun of them for this but honestly that is some of the only advice that works across the board, irrespective in most cases, of what a customer is doing. Everything else is nuanced and it depends.That’s why in some cases, I find that I’m advising customers to spend more money on certain things. Like, the reason that I don’t charge percentage of savings in part is because otherwise I’m incentivized to say things like, “Backups? What are you, some kind of coward? Get rid of them.” And that doesn’t seem like it’s going to be in the customer’s interest every time. And as soon as you start down that path, it starts getting a little weird.But people have asked me, what if my customers reach out to their account teams instead of talking to us? And it’s, we do bespoke consulting engagements; I do not believe that we have ever had a client who did not first reach out to their account team. If the account teams were capable of doing this at the level that worked for customers, I would have to be doing something else with my business. It is not something that we are seeing hit customers in a way that is effective, and certainly not at scale. You said—as you were right on this—that there’s an element here of account managers doing this stuff, there’s an [unintelligible 00:15:54] incentive issue in part, but it’s also, quality is extraordinarily uneven when it comes to these things because it is its own niche and a lot of people focus in different areas in different ways.Everett: Yeah. And to the areas that you brought up in terms of general advice that’s given, we actually have some data on this in this report. In particular Graviton, this is something we’ve been tracking the whole time we’ve been doing these reports, which is the past three quarters and we actually are seeing Graviton adoption start to increase more rapidly than it was before. 
And so, for this last quarter Q1, we’re seeing 5% of our costs that we’re measuring on EC2 coming from Graviton, which is up from, I want to say 2% the previous quarter, and, like, less than 1% the quarter before. The previous quarter, we also reported that Lambda costs are now majority on ARM among the Vantage customer base.And that one makes some sense to me just because in most cases with Lambda, it’s a flip of a switch. And then to your archival point on backups, this is something that we report in this one is that intelligent tiering, which we saw, like, really make an impact for folks towards the end of last year, the numbers for that were flat quarter over quarter. And so, what I mean by that is, we reported that I think, like, two-thirds of our S3 costs are still in the standard storage tier, which is the most expensive tier. And folks have enabled S3 intelligent tiering, which moves your data to progressively cheaper tiers, but we haven’t seen that increase this quarter. So, it’s the same number as it was last quarter.And I think speaks to what you’re talking about with a ceiling on some cost optimization techniques, where it’s like, you’re not just going to get rid of all your backups; you’re not just going to get rid of your, you know, Amazon WorkSpaces archived desktop snapshots that you need for some HIPAA compliance reason. Those things have an upper limit and so that’s where, when the AWS rep comes in, it’s like, as they go through the list of top spending categories, the recommendations they can give start to provide diminishing returns.Corey: I also think this is sort of a law of large numbers issue. When you start seeing a drop off in the growth rate of large cloud providers, like, there’s a problem, in that there are only so many exabyte scale workloads that can be moved inside of a given quarter into the cloud. You’re not going to see the same unbounded infinite growth that you would expect mathematically. 
And people lose their minds when they start to see those things pointed out, but the blame that oh, that’s caused by cost optimization efforts, with respect, bullshit it is. I have seen customers devote significant efforts to reducing their AWS bills and it takes massive amounts of work and even then they don’t always succeed in getting there.It gets better, but they still wind up a year later, having spent more on a month-by-month basis than they did when they started. Sure they understand it better and it’s organic growth that’s driving it and they’ve solved the low hanging fruit problem, but there is a challenge in acting as a boundary for what is, in effect, an unbounded growth problem.Everett: Yeah. And speaking to growth, I thought Microsoft had the most interesting take on where things could happen next quarter, and that, of course, is AI. And so, they attributed, I think it was, 1% of their guidance regarding 26 or 27% growth for Q2 Cloud revenue and it attributed 1% of that to AI. And I think Amazon is really trying to be in the room for those discussions when a large enterprise is talking about AI workloads because it’s one of the few remaining cloud workloads that if it’s not in the cloud already, is generating potentially massive amounts of growth for these guys.And so, I’m not really sure if I believe the 1% number. I think Microsoft may be having some fun with the fact that, of course, OpenAI is paying them for acting as a cloud provider for ChatGPT and further API, but I do think that AWS, although they were maybe a little slow to the game, they did, to their credit, launch a number of AI services that I’m excited to see if that contributes to the cost that we’re measuring next quarter. We did measure, for the first time, a sudden increase on those new [Inf1 00:20:17] EC2 instances, which are optimized for machine learning. 
And I think if AWS can have success moving customers to those the way they have with Graviton, then that’s going to be a very healthy area of growth for them.Corey: I’ll also say that it’s pretty clear to me that Amazon does not know what it’s doing in its world of machine-learning-powered services. I use Azure for the [unintelligible 00:20:44] clients I built originally for Twitter, then for Mastodon—I’m sure Bluesky is coming—but the problem that I’m seeing there is across the board, start to finish, that there is no cohesive story from the AWS side of here’s a picture tell me what’s in it and if it’s words, describe it to me. That’s a single API call when we go to Azure. And the more that Amazon talks about something, I find, the less effective they’re being in that space. And they will not stop talking about machine learning. Yes, they have instances that are powered by GPUs; that’s awesome. But they’re an infrastructure provider and moving up the stack is not in their DNA. But that’s where all the interest and excitement and discussion is going to be increasingly in the AI space. Good luck.Everett: I think it might be something similar to what you’ve talked about before with all the options to run containers on AWS. I think they today have a bit of a grab bag of services and they may actually be looking forward to the fact that they’re these truly foundational models which let you do a number of tasks, and so they may not need to rely so much on you know, Amazon Polly and Amazon Rekognition and sort of these task-specific services, which to date, I’m not really sure of the takeoff rates on those. We have this cloud costs leaderboard and I don’t think you would find them in the top 50 of AWS services. But we’ll see what happens with that.AWS I think, ends up being surprisingly good at sticking with it. 
I think our view is that they probably have the most customer spend on Kubernetes of any major cloud, even though you might say Google at first had the lead on Kubernetes and maybe should have done more with GKE. But to date, I would kind of agree with your take on AI services and I think Azure is… it’s Azure’s to lose for the moment.Corey: I would agree. I think the future of the cloud is largely Azure’s to lose and it has been for a while, just because they get user experience, they get how to talk to enterprises. I just… I wish they would get security a little bit more effectively, and if failing that, communicating with their customers about security more effectively. But it’s hard for a leopard to change its spots. Microsoft though has demonstrated an ability to change their nature multiple times, in ways that I would have bet were impossible. So, I just want to see them do it again. It’s about time.Everett: Yeah, it’s been interesting building on Azure for the past year or so. I wrote a post recently about, kind of, accessing billing data across the different providers and it’s interesting in that every cloud provider is unique in the way that it simply provides an external endpoint for downloading your billing data, but Azure is probably one of the easiest integrations; it’s just a REST API. However, behind that REST API are, like, years and years of different ways to pay Microsoft: are you on a pay-as-you-go plan, are you on an Azure enterprise plan? So, there’s all this sort of organizational complexity hidden behind Azure and I think sometimes it rears its ugly head in a way that stringing together services on Amazon may not, even if that’s still a bear in and of itself, if you will.Corey: Any other surprises that you found in the Cloud Cost Report? I mean, looking through it, it seems directionally aligned with what I see in my environments with customers. 
Like for example, you’re not going to see Kubernetes showing up as a line item on any of these things just because—Everett: Yeah.Corey: That is indistinguishable from a billing perspective when we’re looking at EC2 spend versus control plane spend. I don’t tend to [find 00:24:04] too much that’s shocking me. My numbers are of course, different percentage-wise, but surprise, surprise, different companies doing different things doing different percentages, I’m sure only AWS knows for sure.Everett: Yeah, I think the biggest surprise was just the—and, this could very well just be kind of measurement method, but I really expected to see AI services driving more costs, whether it was GPU instances, or AI-specific services—which we actually didn’t report on at all, just because they weren’t material—or just any indication that AI was a real driver of cloud spending. But I think what you see instead is sort of the same old folks at the top, and if you look at the breakdown of services across providers, that’s, you know, compute, database, storage, bandwidth, monitoring. And if you look at our percentage of AI costs as a percentage of EC2 costs, it’s relatively flat, quarter over quarter. So, I would have thought that would have shown up in some way in our data and we really didn’t see it.Corey: It feels like there’s a law of large numbers things. Everyone’s talking about it. It’s very hype right now—Everett: Yeah.Corey: But it’s also—you talk to these companies, like, “Okay, we have four exabytes of data that we’re storing and we have a couple 100,000 instances at any given point in time, so yeah, we’re going to start spending $100,000 a month on our AI adventures and experiments.” It’s like, that’s just noise and froth in the bill, comparatively.Everett: Exactly, yeah. And so, that’s why I think Microsoft’s thought about AI driving a lot of growth in the coming quarters is, we’ll see how that plays out, basically. 
The one other thing I would point to is—and this is probably not surprising, maybe, for you having been in the infrastructure world and seeing a lot of this, but for me, just seeing the length of time it takes companies to upgrade their instance cycles. We’re clocking in at almost three years since the C6 series instances have been released and we’re just now seeing C6 and R6 start to edge above 10% of our compute usage. I actually wonder if that’s just the stranglehold that Intel has on cloud computing workloads because it was only last year around re:Invent that the C6in and the Intel version of the C6 series instances had been released. So, I do think in general, there’s supposed to be a price-to-performance benefit of upgrading your instances, and so sometimes it surprises me to see how long it takes companies to get around to doing that. Corey: Generation 6 to 7 is also 6% more expensive in my sampling. Everett: Right. That’s right. I think Amazon has some work to do to actually make that price-to-performance argument, sort of the way that we were discussing with gp2 versus gp3 volumes. But yeah, I mean, other than that, I think, in general, my view is that we’re past the worst of it, if you will, for cloud spending. Q4 was sort of a real letdown, I think, in terms of the data we had and the earnings that these cloud providers had and I think Q1 is actually everyone looking forward to perhaps what we call out at the beginning of the report, which is a return to normal spend patterns across the cloud. Corey: I think that it’s going to be an interesting case. One thing that I’m seeing that might very well explain some of the reluctance to upgrade EC2 instances has been that a lot of those EC2 instances are databases. And once those things are up and running and working, people are hesitant to do too much with them. 
One of the [unintelligible 00:27:29] roads that I’ve seen of their savings plan approach is that you can migrate EC2 spend to Fargate to Lambda—and that’s great—but not RDS. You’re effectively leaving a giant pile of money on the table if you’ve made a three-year purchase commitment on these things. So, all right, we’re not going to be in any rush to migrate to those things, which I think is AWS getting in its own way. Everett: That’s exactly right. When we encounter customers that have a large amount of database spend, the most cost-effective option is almost always basically bare-metal EC2, even with the overhead of managing the backup, restore, and scalability of those things. So, in some ways, that’s a good thing because it means that you can then take advantage of the, kind of, heavy committed use options on EC2, but of course, in other ways, it’s a bit of a letdown because, in the ideal case, RDS would scale with the level of workloads and the economics would make more sense, but it seems that is really not the case. Corey: I really want to thank you for taking the time to come on the show and talk to me. I’ll include a link in the [show notes 00:28:37] to the Cost Report. One thing I appreciate is the fact that it doesn’t have one of those gates in front of it of, your email address, and what country you’re in, and how can our salespeople best bother you. It’s just, here’s a link to the PDF. The end. So, thanks for that; it’s appreciated. Where else can people go to find you? Everett: So, I’m on Twitter talking about cloud infrastructure and AI. I’m @retttx, that’s R-E-T-T-T-X. And then of course, Vantage also did quick hot-takes on this report with a series of graphs and explainers in a Twitter thread and that’s @JoinVantage. Corey: And we will, of course, put links to that in the [show notes 00:29:15]. Thank you so much for your time. I appreciate it. Everett: Thanks, Corey. Great to chat. Corey: Everett Berry, growth in open-source at Vantage.
I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment that will increase its vitriol generation over generation, by approximately 6%. Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
    5/9/2023
    31:59
  Operating in the Kubernetes Cloud on Amazon EKS with Eswar Bala
    Eswar Bala, Director of Amazon EKS at AWS, joins Corey on Screaming in the Cloud to discuss how and why AWS built a Kubernetes solution, and what customers are looking for out of Amazon EKS. Eswar reveals the concerns he sees from customers about the cost of Kubernetes, as well as the reasons customers adopt EKS over ECS. Eswar gives his reasoning on why he feels Kubernetes is here to stay and not just hype, as well as how AWS is working to reduce the complexity of Kubernetes. Corey and Eswar also explore the competitive landscape of Amazon EKS, and the new product offering from Amazon called Karpenter.
    About Eswar
    Eswar Bala is a Director of Engineering at Amazon and is responsible for Engineering, Operations, and Product strategy for Amazon Elastic Kubernetes Service (EKS). Eswar leads the Amazon EKS and EKS Anywhere teams that build, operate, and contribute to the services customers and partners use to deploy and operate Kubernetes and Kubernetes applications securely and at scale. With a 20+ year career in software, spanning multimedia, networking and container domains, he has built greenfield teams and launched new products multiple times.
    Links Referenced:
      Amazon EKS: https://aws.amazon.com/eks/
      kubernetesthemuchharderway.com: https://kubernetesthemuchharderway.com
      kubernetestheeasyway.com: https://kubernetestheeasyway.com
      EKS documentation: https://docs.aws.amazon.com/eks/
      EKS newsletter: https://eks.news/
      EKS GitHub: https://github.com/aws/eks-distro
    Transcript
    Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Corey: It’s easy to **BEEP** up on AWS.
Especially when you’re managing your cloud environment on your own! Mission Cloud un **BEEP**s your apps and servers. Whatever you need in AWS, we can do it. Head to missioncloud.com for the AWS expertise you need. Corey: Welcome to Screaming in the Cloud, I’m Corey Quinn. Today’s promoted guest episode is brought to us by our friends at Amazon. Now, Amazon is many things: they sell underpants, they sell books, they sell books about underpants, and underpants featuring pictures of books, but they also have a minor cloud computing problem. In fact, some people would call them a cloud computing company with a gift shop that’s attached. Now, the problem with wanting to work at a cloud company is that their interviews are super challenging to pass. If you want to work there but can’t pass the technical interview, for a long time the way to solve that has been, “Ah, we’re going to run Kubernetes so we get to LARP as if we worked at a cloud company but don’t.” Eswar Bala is the Director of Engineering for Amazon EKS and is going to basically suffer my slings and arrows about one of the most complicated, and I would say overwrought, best practices that we’re seeing industry-wide. Eswar, thank you for agreeing to subject yourself to this nonsense. Eswar: Hey, Corey, thanks for having me here. Corey: [laugh]. So, I’m a little bit unfair to Kubernetes because I wanted to make fun of it and ignore it. But then I started seeing it in every company that I deal with in one form or another. So yes, I can still sit here and shake my fist at the tide, but it’s turned into, “Old Man Yells at Cloud,” which I’m thrilled to embrace, but everyone’s using it. So, EKS is approaching the five-year mark since it was initially launched. What is EKS other than Amazon’s own flavor of Kubernetes? Eswar: You know, the best way I can define EKS is, EKS is just Kubernetes. Not Amazon’s version of Kubernetes.
It’s just Kubernetes that we get from the community and offer it to customers to make it easier for them to consume. So, EKS. I’ve been with EKS from the very beginning when we thought about offering a managed Kubernetes service in 2017. And at that point, the goal was to bring Kubernetes to enterprise customers. So, we had many customers telling us that they wanted us to make their life easier by offering a managed version of Kubernetes that they were actually beginning to [erupt 00:02:42] at that time period, right? So, my goal was to figure out what that service looks like and which customer base we should be targeting the service towards. Corey: Kelsey Hightower has a fantastic learning tool out there in a GitHub repo called, “Kubernetes the Hard Way,” where he talks you through building the entire thing, start to finish. I wound up forking it and doing that on top of AWS, and you can find that at kubernetesthemuchharderway.com. And that was fun. And I went through the process and my response at the end was, “Why on earth would anyone ever do this more than once?” And we got that sorted out, but now it’s—customers aren’t really running these things from scratch. It’s like the Linux from Scratch project. Great learning tool; probably don’t run this in production in the same way that you might otherwise because there are better ways to solve for the problems that you will have to solve yourself when you’re building these things from scratch. So, as I look across the ecosystem, it feels like EKS stands in the place of the heavy, undifferentiated lifting of running the Kubernetes control plane so customers functionally don’t have to. Is that an effective summation of this? Eswar: That is precisely right. And I’m glad you mentioned, “Kubernetes the Hard Way,” I was a big fan of that when it came out.
And anyone who did that tutorial, and also your tutorial, “Kubernetes the Harder Way,” would walk away thinking, “Why would I pick this technology when it’s super complicated to set up?” But then you see that customers love Kubernetes and you see that reflected in the adoption, even in the 2016, 2017 timeframe. And the reason is, it made life easier for application developers in terms of offering web services that they wanted to offer to their customer base. And because of all the features that Kubernetes brought on, application lifecycle management, service discovery, and then it evolved to support various application architectures, right, in terms of stateless services, stateful applications, and even daemon sets, right, like for running your logging and metrics agents. And these are powerful features, at the end of the day, and that’s what drove Kubernetes. And because it’s super hard to get going to begin with and then to operate, the day-two operator experience is super complicated. Corey: And the day one experience is super hard and the day two experience of, “Okay, now I’m running it and something isn’t working the way it used to. Where do I start,” has been just tremendously overwrought. And frankly, more than a little intimidating. Eswar: Exactly. Right? And that exactly was our opportunity when we started in 2017. And when we started, there was a question of, okay, should we really build a service when you have an existing service like ECS in place? And by the way, like, I did work in ECS before I started working in EKS from the beginning. So, the answer then was, it was about giving what customers want. And there’s space for many container orchestration systems, right? ECS was the AWS service at that point in time. And our thinking was, how do we give customers what they wanted? They wanted a Kubernetes solution. Let’s go build that.
But we built it in a way that we remove the undifferentiated heavy lifting of managing Kubernetes. Corey: One of the weird things that I find is that everyone’s using Kubernetes, but I don’t see it in the way that I contextualize the AWS universe, which of course, is on the bill. That’s right. If you don’t charge for something in AWS land, and preferably a fair bit, I don’t tend to know it exists. Like, “What’s an IAM and what might that possibly do?” Always a reassuring thing to hear from someone who’s often called an expert in this space. But you know, if it doesn’t cost money, why do I pay attention to it? The control plane is what EKS charges for, unless you’re running a bunch of Fargate-managed pods and containers to wind up handling those things. So, it mostly just shows up as an addenda to the actual big, meaty portions of the bill. It just looks like a bunch of EC2 instances with some really weird behavior patterns, particularly with regard to auto-scaling and crosstalk between all of those various nodes. So, it’s a little bit of a murder mystery, figuring out, “So, what’s going on in this environment? Do you folks use containers at all?” And the entire Kubernetes shop is looking at me like, “Are you simple?” No, it’s just I tend to disregard the lies that customers say, mostly to themselves because everyone has this idea of what’s going on in their environment, but the bill speaks. It’s always been a little bit of an investigation to get to the bottom of anything that involves Kubernetes at significant points of scale. Eswar: Yeah, you’re right. Like if you look at EKS, right, like, we started with managing the control plane to begin with. And managing the control plane is a drop in the bucket when you actually look at the costs in terms of operating a Kubernetes cluster or running a Kubernetes cluster.
When you look at how our customers use and where they spend most of their cost, it’s about where their applications run; it’s actually the Kubernetes data plane and the amount of compute and memory that the applications end up using that end up driving 90% of the cost. And beyond that is the storage, beyond that is the networking cost, right, and then after that is the actual control plane cost. So, the problem right now is figuring out, how do we optimize our costs for the application to run on? Corey: On some level, it requires a little bit of understanding of what’s going on under the hood. There have been a number of cost optimization efforts that have been made in the Kubernetes space, but they tend to focus around stuff that I find relatively, well, I call it banal because it basically is. You’re looking at the idea of, okay, what size instances should you be running, and how well can you fill them and make sure that all the resources per node wind up being taken advantage of? But that’s also something that, I guess from my perspective, isn’t really the interesting architectural point of view. Whether or not you’re running a bunch of small instances or a few big ones or some combination of the two, that doesn’t really move the needle on any architectural shift, whereas ingesting a petabyte a month of data and passing 50 petabytes back and forth between availability zones, that’s where it starts to get really interesting as far as tracking that stuff down. But what I don’t see is a whole lot of energy or effort being put into that. And I mean, industry-wide, to be clear. I’m not attempting to call out Amazon specifically on this. That’s [laugh] not the direction I’m taking this in. For once. I know, I’m still me.
But it seems to be just an industry-wide issue, where zone affinity for Kubernetes has been a very low priority item, even on project roadmaps on the Kubernetes project. Eswar: Yeah, Kubernetes does provide the ability for customers to restrict their workloads within a particular [unintelligible 00:09:20], right? Like, there are constraints that you can place on your pod specs that end up driving applications towards a particular AZ if they want, right? You’re right, it’s still left to the customers to configure. Just because there’s a configuration available doesn’t mean the customers use it. If it’s not defaulted, most of the time, it’s not picked up. That’s where it’s important for service providers—like EKS—to not only provide visibility by means of reporting that’s available using tools like [Kubecost 00:09:50] and Amazon Billing Explorer, but also provide insights and recommendations on what customers can do. I agree that there’s a gap today, for example in EKS, in terms of that. Like, we’re slowly closing that gap and it’s something that we’re actively exploring. How do we provide insights across all the resources customers end up using from within a cluster? That includes not just compute and memory, but also storage and networking, right? And that’s what we are actually moving towards at this point. Corey: That’s part of the weird problem I’ve found is that, on some level, you get to play almost data center archaeologist when you start exploring what’s going on in these environments. I found one of the only reliable ways to get answers to some of this stuff has been oral tradition of, “Okay, this Kubernetes cluster just starts hurling massive data quantities at 3 a.m. every day. What’s causing that?” And it leads to, “Oh, no no, have you talked to the data science team,” like, “Oh, you have a data science team. A common AWS billing mistake.” And exploring down that particular path sometimes pays dividends.
But there’s no holistic way to solve that globally. Today. I’m optimistic about tomorrow, though. Eswar: Correct. And that’s where we are spending our efforts right now. For example, we recently launched our partnership with Kubecost, and Kubecost is now available as an add-on from the Marketplace that you can easily install and provision on EKS clusters, for example. And that is a start. And Kubecost is amazing in terms of features, in terms of the insight it offers, right, looking into the compute, the memory, and the optimizations and insights it provides you. And we are also working with the AWS Cost and Usage Reporting team to provide a native AWS solution for the cost reporting and the insights aspect as well in EKS. And it’s something that we are going to be working on really closely, to solve the networking gaps in the near future. Corey: What are you seeing as far as customer concerns go, with regard to cost and Kubernetes? I see some things, but let’s be very clear here, I have a certain subset of the market that I spend an inordinate amount of time speaking to and I always worry that what I’m seeing is not holistically what’s going on in the broader market. What are you seeing customers concerned about? Eswar: Well, let’s start from the fundamentals here, right? Customers really want to get to market faster, whatever services and applications that they want to offer. And they want to have it cheaper to operate. And if they’re adopting EKS, they want it cheaper to operate Kubernetes in the cloud. They also want high performance, they also want scalability, and they want security and isolation. There’s so many parameters that they have to deal with before they put their service on the market and continue to operate. And there’s a fundamental tension here, right? Like they want cost efficiency, but they also want to be available in the market quicker and they want performance and availability.
Developers have uptime, SLOs, and SLAs to consider and they want the maximum possible resources. And on the other side, you’ve got financial leaders and the business leaders who want to look at the spending and worry about, like, okay, are we allocating our capital wisely? And are we allocating where it makes sense? And are we doing it in a manner that there’s very little wastage and aligned with our customer use, for example? And this is where the actual problems arise from [unintelligible 00:13:00]. Corey: I want to be very clear that for a long time, one of the most expensive parts about running Kubernetes has not been the infrastructure itself. It’s been the people to run this responsibly, where it’s the day two, day three experience where for an awful lot of companies it’s like, oh, we’re moving to Kubernetes because, I don’t know, we read it in an in-flight magazine or something and all the cool kids are doing it, which honestly during the pandemic is why suddenly everyone started making better IT choices because their execs were not being exposed to airport ads. I digress. The point, though, is that as customers are figuring this stuff out and playing around with it, it’s not sustainable that every company that wants to run Kubernetes can afford a crack SRE team that is individually incredibly expensive and collectively staggeringly so. It seems that the real cost is the complexity tied to it. And EKS has been great in that it abstracts an awful lot of the control plane complexity away. But I still can’t shake the feeling that running Kubernetes is mind-bogglingly complicated. Please argue with me and tell me I’m wrong. Eswar: No, you’re right. It’s still complicated. And it’s a journey towards reducing the complexity. When we launched EKS, we launched only with managing the control plane to begin with.
And that’s where we started, but customers had the complexity of managing the worker nodes. And then we evolved to manage the Kubernetes worker nodes in terms of two products: we’ve got Managed Node Groups and Fargate. And then customers moved on to installing more agents in their clusters before they actually installed their business applications, things like Cluster Autoscaler, things like Metric Server, critical components that they have come to rely on, but that don’t drive their business logic directly. They are supporting aspects of driving core business logic. And that’s how we evolved into managing the add-ons to make life easier for our customers. And it’s a journey where we continue to reduce the complexity, making it easier for customers to adopt Kubernetes. And once you cross that chasm—and we are still trying to cross it—once you cross it, you have the problem of, okay so, adopting Kubernetes is easy. Now, we have to operate it, right, which means that we need to provide better reporting tools, not just for costs, but also for operations. Like, how easy it is for customers to get to the application level metrics and how easy it is for customers to troubleshoot issues, how easy it is for customers to actually upgrade to newer versions of Kubernetes. All of these challenges come up beyond day one, right? And those are initiatives that we have in flight to make it easier for customers [unintelligible 00:15:39]. Corey: So, one of the things I see when I start going deep into the Kubernetes ecosystem is, well, Kubernetes will go ahead and run the containers for me, but now I need to know what’s going on in various areas around it. One of the big booms in the observability space, in many cases, has come from the fact that you now need to diagnose something in a container you can’t log into and that incidentally stopped existing 20 minutes before you got the alert about the issue, so you’d better hope your telemetry is up to snuff.
Now, yes, that does act as a bit of a complexity burden, but on the other side of it, we don’t have to worry about things like failed hard drives taking systems down anymore. That has successfully been abstracted away by Kubernetes, or you know, your cloud provider, but that’s neither here nor there these days. What are you seeing as far as, effectively, the sidecar pattern, for example of, “Oh, you have too many containers and need to manage them? Have you considered running more containers?” Sounds like something a container salesman might say. Eswar: So, running containers demands that you have really solid observability tooling, things that you’re able to troubleshoot—successfully—debug without the need to log into the containers themselves. In fact, that’s an anti-pattern, right? You really don’t want to have the ability to SSH into a particular container, for example. And to be successful at it demands that you publish your metrics and you publish your logs. All of these are things that a developer needs to worry about today in order to adopt containers, for example. And it’s on the service providers to actually make it easier for the developers not to worry about these. And all of these are available automatically when you adopt a Kubernetes service. For example, in EKS, we are working with our managed Prometheus service teams inside Amazon, right—and also CloudWatch teams—to easily enable metrics and logging for customers without having to do a lot of heavy lifting. Corey: Let’s talk a little bit about the competitive landscape here. One of my biggest competitors in optimizing AWS bills is Microsoft Excel, specifically, people are going to go ahead and run it themselves because, “Eh, hiring someone who’s really good at this, that sounds expensive. We can screw it up for half the cost.” Which is great.
It seems to me that one of your biggest competitors is people running their own control plane, on some level. I don’t tend to accept the narrative that, “Oh, EKS is expensive,” when that winds up being, what, 35 bucks or 70 bucks or whatever it is per control plane per cluster on a monthly basis. Okay, yes, that’s expensive if you’re trying to stay completely within a free tier perhaps, but if you’re running anything that’s even slightly revenue-generating or a for-profit company, you will spend far more than that just on people’s time. I have no problems—for once—with the EKS pricing model, start to finish. Good work on that. You’ve successfully nailed it. But are you seeing significant pushback from the industry of, “Nope, we’re going to run our own Kubernetes management system instead because we enjoy pain, corporately speaking.” Eswar: Actually, we are in a good spot there, right? Like, at this point, customers who choose to run Kubernetes on AWS by themselves and not adopt EKS just fall into one main category, so—or two main categories: number one, they have an existing technical stack built on running Kubernetes themselves and they’d rather maintain that and not move to EKS. Or they demand certain custom configurations of the Kubernetes control plane that EKS doesn’t support. And those are the only two reasons why we see customers not moving into EKS and preferring to run their own Kubernetes clusters on AWS. [midroll 00:19:46] Corey: It really does seem, on some level, like there’s going to be a… I don’t want to say reckoning because that makes it sound vaguely ominous and that’s not the direction that I intend for things to go in, but there has to be some form of collapsing of the complexity that is inherent to all of this because the entire industry has always done that.
An analogy that I fall back on because I’ve seen this enough times to have the scars to show for it is that in the ’90s, running a web server took about a week of spare time and an in-depth knowledge of GCC compiler flags. And then it evolved to ah, I could just unzip a tarball of precompiled stuff, and then RPM or Deb became a thing. And then Yum, or something else, or I guess apt over in Debian land to wind up wrapping around that. And then you had things like Puppet where it was ensure installed. And now it’s docker run. And today, it’s a checkbox in the S3 console that proceeds to yell at you because you’re making a website public. But that’s neither here nor there. Things don’t get harder with time. But I’ve been surprised by how I haven’t yet seen that sort of geometric complexity collapsing around Kubernetes to make it easier to work with. Is that coming or are we going to have to wait for the next cycle of things? Eswar: Let me think. I actually don’t have a good answer to that, Corey. Corey: That’s good, at least because if you did, I’d worry that I was just missing something obvious. That’s kind of the entire reason I ask. Like, “Oh, good. I get to talk to smart people and see what they’re picking up on that I’m absolutely missing.” I was hoping you had an answer, but I guess it’s cold comfort that you don’t have one off the top of your head. But man, is it confusing. Eswar: Yeah. So, there are some discussions in the community out there, right? Like, is Kubernetes the right layer to interact with? And there is some tooling that’s built on top of Kubernetes, for example, Knative that tries to provide a serverless layer on top of Kubernetes, for example.
There are also attempts at abstracting Kubernetes completely and providing tooling that just completely removes any sort of Kubernetes API out of the picture and maybe a specific CI/CD-based solution that takes it from the source and deploys the service without even showing you that there’s Kubernetes underneath, right? All of these are evolutions that are being tested out there in the community. Time will tell whether these end up sticking. But what’s clear here is the gravity around Kubernetes. All sorts of tooling that gets built on top of Kubernetes, all the operators, all sorts of open-source initiatives that are built to run on Kubernetes. For example, Spark, for example, Cassandra, so many of these big, large-scale, open-source solutions are now built to run really well on Kubernetes. And that is the gravity that’s pushing Kubernetes at this point. Corey: I’m curious to get your take on one other space that I would consider interestingly competitive. Now, because I have a domain problem, if you go to kubernetestheeasyway.com, you’ll wind up on the ECS marketing page. That’s right, the worst competition in the world: the people who work down the hall from you. If someone’s considering using ECS, Elastic Container Service versus EKS, Elastic Kubernetes Service, what is the deciding factor when a customer’s making that determination? And to be clear, I’m not convinced there’s a right or wrong answer. But I am curious to get your take, given that you have a vested interest, but also presumably don’t want to talk complete smack about your colleagues. But feel free to surprise me. Eswar: Hey, I love ECS, by the way. Like I said, I started my life in AWS in ECS. So look, ECS is a hugely successful container orchestration service. I know we talk a lot about Kubernetes, I know there’s a lot of discussions around Kubernetes, but I want to make it a point that, like, ECS is a hugely successful service.
Now, what determines which way customers go? If customers are… if the customer’s tech stack is entirely on AWS, right, they use a lot of AWS services and they want an easy way to get started in the container world that has really tight integration with other AWS services without them having to configure a lot, ECS is the way, right? And customers have actually seen terrific success adopting ECS for that particular use case. Whereas EKS customers, they start with, “Okay, I want an open-source solution. I really love Kubernetes. I lo—or, I have a tooling that I really like in the open-source land that really works well with Kubernetes. I’m going to go that way.” And those kind of customers end up picking EKS. Corey: I feel like, on some level, Kubernetes has become the default API across a wide variety of environments. AWS obviously, but on-prem, other providers. It seems like even the traditional VPS companies out there that offer just rent-a-server in the cloud somewhere are all also offering, “Oh, and we have a Kubernetes service as well.” I wound up backing a Kickstarter project that runs a Kubernetes cluster with a shared backplane across a variety of Raspberries Pi, for example. And it seems to be almost everywhere you look. Do you think that there’s some validity to that approach of effectively whatever it is that we’re going to wind up running in the future, it’s going to be done on top of Kubernetes or do you think that that’s mostly hype-driven these days? Eswar: It’s definitely not hype. Like we see the proof in the kind of adoption we see. It’s becoming the de facto container orchestration API.
And with all the tooling, open-source tooling that’s continuing to build on top of Kubernetes, the CNCF tooling ecosystem that’s spawned to support Kubernetes adoption, all of this is solid proof that Kubernetes is here to stay and is a really strong, powerful API for customers to adopt. Corey: So, four years ago, I had a prediction on Twitter, and I said, “In five years, nobody will care about Kubernetes.” And it was in February, I believe, and every year, I wind up updating and incrementing a link to it, like, “Four years to go,” “Three years to go,” and I believe it expires next year. And I have to say, I didn’t really expect when I made that prediction for it to outlive Twitter, but yet, here we are, which is neither here nor there. But I’m curious to get your take on this. But before I wind up just letting you savage the naive interpretation of that, my impression has been that it will not be that Kubernetes has gone away. That is ridiculous. It is clearly in enough places that even if they decided to rip it out now, it would take them ten years, but rather that it’s going to slip below the surface level of awareness. Once upon a time, there was a whole bunch of energy and drama and debate around the Linux virtual memory management subsystem. And today, there’s, like, a dozen people on the planet who really have to care about that, but for the rest of us, it doesn’t matter anymore. We are so far past having to care about that having any meaningful impact in our day-to-day work that it’s just, it’s the part of the iceberg that’s below the waterline. I think that’s where Kubernetes is heading. Do you agree or disagree? And what do you think about the timeline? Eswar: I agree with you; that’s a perfect analogy. It’s going to go the way of Linux, right? It’s here to stay; it’s just going to get abstracted out if any of the abstraction efforts are going to stick around. And that’s where we’re testing the waters there.
There are many, many open-source initiatives there trying to abstract Kubernetes. All of these are yet to gain ground, but there are some reasonable efforts being made.
And if they are successful, they just end up being a layer on top of Kubernetes. Many of the customers, many of the developers, don’t have to worry about Kubernetes at that point, but a certain subset of us in the tech world will need to deal with Kubernetes, and most likely teams like mine that end up managing and operating their Kubernetes clusters.
Corey: So, one last question I have for you is that if there’s one thing that AWS loves, it’s misspelling things. And you have an open-source offering called Karpenter, spelled with a K, that is an extension of that tradition. What does Karpenter do and why would someone use it?
Eswar: Thank you for that. Karpenter is one of my favorite launches in the last year.
Corey: Presumably because you were terrible at the spelling bee back when you were a kid. But please tell me more.
Eswar: [laugh]. So, Karpenter is an open-source, flexible, and high-performance cluster auto-scaling solution. So basically, when your cluster needs more capacity to support your workloads, Karpenter automatically scales the capacity as needed. For people that know the Kubernetes space well, there’s an existing component called Cluster Autoscaler that fills this space today. And it’s our take on, okay, so what if we could reimagine the capacity management solution available in Kubernetes? And can we do something better? Especially for cases where we expect terrific performance at scale to enable cost efficiency and optimization use cases for our customers, and most importantly, provide a way for customers not to pre-plan a lot of capacity to begin with.
Corey: This is something we see a lot, in the sense of very bursty workloads where, okay, you’re going to have steady-state load. Cool. Buy a bunch of savings plans, get things set up the way you want them, and call it a day. 
But when it’s bursty, there are challenges with it. Folks love using Spot, but in the event of a sudden capacity shortfall, the question is, can we spin up capacity to backfill it within the two minutes of warning that we get? And if the answer is no, then it becomes a bit of a non-starter.
Customers have had to build an awful lot of those things around EC2 instances that handle a lot of that logic for them in ways that are tuned specifically for their use cases. I’m encouraged to see there’s a Kubernetes story around this that starts to remove some of that challenge from the customer side.
Eswar: Yeah. So, the burstiness is where complexity comes [here 00:29:42], right? Like many customers for steady state, they know what their capacity requirements are, they set up the capacity, they can also reason out what is the effective capacity needed for good utilization for economical reasons, and they can actually pre-plan that and set it up. But once burstiness comes in, which it inevitably does at [unintelligible 00:30:05] applications, customers worry about, “Okay, am I going to get the capacity that I need in the time that I need to be able to service my customers? And am I confident of it?”
If I’m not confident, I’m going to actually allocate capacity beforehand, assuming that I’m going to actually get the burst that I needed. Which means you’re paying for resources that you’re not using at the moment. And the burstiness might happen and then you’re on the hook to actually reduce the capacity for it once the peak subsides at the end of the [day 00:30:36]. And this is a challenging situation. And this is one of the use cases that we targeted Karpenter towards.
Corey: I find that the idea that you’re open-sourcing this is fascinating because of two reasons. One, it does show a willingness to engage with the community that… again, it’s difficult. When you’re a big company, people love to wind up taking issue with almost anything that you do. 
But for another, it also puts it out in the open, on some level, where, especially when you’re talking about cost optimization and decisions that affect cost, it’s all out in public. So, people can look at this and think, “Wait a minute, it’s not—what is this line of code that means if it’s toward the end of the month, crank it up because we might need to hit our numbers.” Like, there’s nothing like that in there. At least I’m assuming. I’m trusting that other people have read this code because honestly, that seems like a job for people who are better at that than I am. But that does tend to breed a certain element of trust.
Eswar: Right. It’s one of the first things that we thought about when we said, okay, so we have some ideas here to actually improve the capacity management solution for Kubernetes. Okay, should we do it out in the open? And the answer was a resounding yes, right? I think there’s a good story here that actually enables not just AWS to offer these ideas out there, right, and we want to bring it to all sorts of Kubernetes customers.
And one of the first things we did is to architecturally figure out all the core business logic of Karpenter, which is, okay, how to schedule better, how quickly to scale, what are the best instance types to pick for this workload. All of that business logic was abstracted out from the actual cloud provider implementation. And the cloud provider implementation is super simple. It’s just creating instances, deleting instances, and describing instances. And it’s something that we baked in from the get-go so it’s easier for other cloud providers to come in and add their support to it. And we as a community actually can take these ideas forward in a much faster way than just AWS doing it.
Corey: I really want to thank you for taking the time to speak with me today about all these things. 
If people want to learn more, where’s the best place for them to find you?
Eswar: The best place to learn about EKS, right, as EKS evolves, is our documentation; we have an EKS newsletter that you can go subscribe to, and you can also find us on GitHub, where we share our product roadmap. So, those are great places to learn about how EKS is evolving and also to share your feedback.
Corey: Which is always great to hear, as opposed to, you know, in the AWS Console, where we live, waiting for you to stumble upon us, which, yeah. No, it’s good to have a lot of different places for people to engage with you. And we’ll put links to that, of course, in the [show notes 00:33:17]. Thank you so much for being so generous with your time. I appreciate it.
Eswar: Corey, really appreciate you having me.
Corey: Eswar Bala, Director of Engineering for Amazon EKS. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice telling me why, when it comes to tracking Kubernetes costs, Microsoft Excel is in fact the superior experience.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
    5/5/2023
    34:29
  Learning eBPF with Liz Rice
    Liz Rice, Chief Open Source Officer at Isovalent, joins Corey on Screaming in the Cloud to discuss the release of her newest book, Learning eBPF, and the exciting possibilities that come with eBPF technology. Liz explains what got her so excited about eBPF technology, and what it was like to write a book while also holding a full-time job. Corey and Liz also explore the learning curve that comes with kernel programming, and Liz illustrates why it’s so important to be able to explain complex technologies in simple terminology.
About Liz
Liz Rice is Chief Open Source Officer with eBPF specialists Isovalent, creators of the Cilium cloud native networking, security and observability project. She sits on the CNCF Governing Board, and on the Board of OpenUK. She was Chair of the CNCF's Technical Oversight Committee in 2019-2022, and Co-Chair of KubeCon + CloudNativeCon in 2018. She is also the author of Container Security, and Learning eBPF, both published by O'Reilly.
She has a wealth of software development, team, and product management experience from working on network protocols and distributed systems, and in digital technology sectors such as VOD, music, and VoIP. When not writing code, or talking about it, Liz loves riding bikes in places with better weather than her native London, competing in virtual races on Zwift, and making music under the pseudonym Insider Nine.
Links Referenced:
Isovalent: https://isovalent.com/
Learning eBPF: https://www.amazon.com/Learning-eBPF-Programming-Observability-Networking/dp/1098135121
Container Security: https://www.amazon.com/Container-Security-Fundamental-Containerized-Applications/dp/1492056707/
GitHub for Learning eBPF: https://github.com/lizRice/learning-eBPF
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. 
This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Our returning guest today is Liz Rice, who remains the Chief Open Source Officer with Isovalent. But Liz, thank you for returning, suspiciously closely timed to when you have a book coming out. Welcome back.
Liz: [laugh]. Thanks so much for having me. Yeah, I’ve just—I’ve only had the physical copy of the book in my hands for less than a week. It’s called Learning eBPF. I mean, obviously, I’m very excited.
Corey: It’s an O’Reilly book; it has some form of honeybee on the front of it as best I can tell.
Liz: Yeah, I was really pleased about that. Because eBPF has a bee as its logo, so getting a [early 00:01:17] honeybee as the O’Reilly animal on the front cover of the book was pretty pleasing, yeah.
Corey: Now, this is your second O’Reilly book, is it not?
Liz: It’s my second full book. So, I’d previously written a book on Container Security. And I’ve done a few short reports for them as well. But this is the second, you know, full-on, you can buy it on Amazon kind of book, yeah.
Corey: My business partner wrote Practical Monitoring for O’Reilly and that was such an experience that he got entirely out of observability as a field and went running to AWS bills as a result. So, my question for you is, why would anyone do that more than once?
Liz: [laugh]. I really like explaining things. And I had a really good reaction to the Container Security book. I think already, by the time I was writing that book, I was kind of interested in eBPF. And we should probably talk about what that is, but I’ll come to that in a moment.
Yeah, so I've been really interested in eBPF for quite a while and I wanted to be able to do the same thing in terms of explaining it to people. 
A book gives you a lot more opportunity to go into more detail and show people examples and get them kind of hands-on than you can do in a, you know, 40-minute conference talk. So, I wanted to do that. I will say I have written myself a note to never do a full-size book while I have a full-time job because it’s a lot [laugh].
Corey: You do have a full-time job and then some. As we mentioned, you’re the Chief Open Source Officer over at Isovalent, you are on the CNCF governing board, you’re on the board of OpenUK, and you’ve done a lot of other stuff in the open-source community as well. So, I have to ask, taking all of that together, are you just allergic to things that make money? I mean, writing the book as well on top of that. I’m told you never do it for the money piece; it’s always about the love of it. But it seems like, on some level, you’re taking it to an almost ludicrous level.
Liz: Yeah, I mean, I do get paid for my day job. So, there is that [laugh]. But so, yeah—
Corey: I feel like the only way to really write a book is to do it as part of what someone else is paying you for, viewing it as a marketing exercise. It pays dividends, but those dividends don’t, from what I’ve heard everyone say, pay off as royalties on book sales.
Liz: Yeah, I mean, it’s certainly, you know, not a bad thing to have that income stream, but it certainly wouldn’t make you—you know, I’m not going to retire tomorrow on the royalty stream unless this podcast gets loads and loads of people to buy the book [laugh].
Corey: Exactly. And I’m always a fan of having such [unintelligible 00:03:58]. I will order it while we’re on the call right now having this conversation because I believe in supporting the things that we want to see more of in the world. So, explain to me a little bit about what it is. 
Whenever you’re talking about “Learning X” in a title, I find that that’s often going to be much more approachable than arcane nonsense deep-dive things.
One of the O’Reilly books that changed my understanding was Linux Kernel Internals, or Understanding the Linux Kernel. Understanding was kind of a heavy lift at that point because it got very deep very quickly, but I absolutely came away understanding what was going on a lot more effectively, even though I was so slow I needed a tow rope on some of it. When you have a book that starts with “Learning,” though, I imagine it assumes starting at zero with, “What’s eBPF?” Is that directionally correct, or does it assume that you know a lot of things you don’t?
Liz: Yeah, that’s absolutely right. I mean, I think eBPF is one of these technologies that is starting to be, particularly in the cloud-native world, you know, it comes up; it’s quite a hot technology. What it actually is, so it’s an acronym, right? eBPF. That acronym is almost meaningless now.
So, it stands for extended Berkeley Packet Filter. But I feel like it does so much more than filtering, we might as well forget that altogether. And it’s just become a term, a name in its own right if you like. And what it really does is it lets you run custom programs in the kernel so you can change the way that the kernel behaves, dynamically. And that is… it’s a superpower. It’s enabled all sorts of really cool things that we can do with that superpower.
Corey: I just pre-ordered it as a paperback on Amazon and it shows me that it is now number one new release in Linux Networking and Systems Administration, so you’re welcome. I’m sure it was me that put it over the top.
Liz: Wonderful. Thank you very much. Yeah [laugh].
Corey: Of course, of course. 
Writing a book is one of those things that I’ve always wanted to do, but never had the patience to sit there and do it or I thought I wasn’t prolific enough, but over the holidays, this past year, my wife and business partner and a few friends all chipped in to have all of the tweets that I’d sent bound into a series of leather volumes. Apparently, I’ve tweeted over a million words. And… yeah, oh, so I have to write a book 280 characters at a time, mostly from my phone. I should tweet less was really the takeaway that I took from a lot of that.
But that wasn’t edited, that wasn’t with an overall theme or a narrative flow the way that an actual book is. It just feels like a term paper on steroids. And I hated term papers. Love reading; not one to write it.
Liz: I don’t know whether this should make it into the podcast, but it reminded me of something that happened to my brother-in-law, who’s an artist. And he put a piece of video on YouTube. And for unknowable reasons, if you mistyped YouTube, and you spelt it U-T-U-B-E, the page that you would end up at from Google search was a YouTube video and it was, in fact, my brother-in-law’s video. And people weren’t expecting to see this kind of art movie about matches burning. And he just had the worst comment—like, people were so mean in the comments. And he had millions of views because people were hitting this page by accident, and he ended up—
Corey: And he committed the cardinal sin: never read the comments. Never break that rule. As soon as you do that, it doesn’t go well. I do read the comments on various podcast platforms on this show because I always tell people to insult it all they want, just make sure you leave a five-star review.
Liz: Well, he ended up publishing a book with these comments, like, one comment per page, and most of them are not safe for public consumption comments, and he just called it Feedback. 
It was quite something [laugh].
Corey: On some level, it feels like O’Reilly books are a little insulated from the general population when it comes to terrible nonsense comments, just because they tend to be a little bit more expensive than the typical novel you’ll see in an airport bookstore, and again, even though it is approachable, Learning eBPF isn’t exactly the sort of title that gets people to think that, “Ooh, this is going to be a heck of a thriller slash page-turner with a plot.” “Well, I found the protagonist unrelatable,” is not the sort of thing you’re going to wind up seeing in the comments because people thought it was going to be something different.
Liz: I know. One day, I’m going to have to write a technical book that is also a murder mystery. I think that would be, you know, quite an achievement. But yeah, I mean, it’s definitely aimed at people who have already come across the term, want to know more, and particularly if you’re the kind of person who doesn’t want to just have a hand-wavy explanation that involves boxes and diagrams, but if, like me, you kind of want to feel the code, and you want to see how things work and you want to work through examples, then that’s the kind of person who might—I hope—enjoy working through the book and end up with a possible mental model of how eBPF works, even though it’s essentially kernel programming.
Corey: So, I keep seeing eBPF in an increasing number of areas; a bunch of observability tools, a bunch of security tools all tend to tie into it. And I’ve seen people do interesting things as far as cost analysis with it. 
The problem that I run into is that I’m not able to wind up deploying it universally, just because when I’m going into a client engagement, I am there in a purely advisory sense, given that I’m biasing these days for both SaaS companies and large banks, that latter category is likely going to have some problems if I say, “Oh, just take this thing and go ahead and deploy it to your entire fleet.” If they don’t have a problem with that, I have a problem with their entire business security posture. So, I don’t get to be particularly prescriptive as far as what to do with it.
But if I were running my own environment, it is pretty clear by now that I would have explored this in some significant depth. Do you find that it tends to be something that is used primarily in microservices environments? Does it effectively require Kubernetes to become useful on day one? What is the onboard path where people would sit back and say, “Ah, this problem I’m having, eBPF sounds like the solution.”
Liz: So, when we write tools that are typically going to be some sort of infrastructure, observability, security, networking tools, if we’re writing them using eBPF, we’re instrumenting the kernel. And the kernel gets involved every time our application wants to do anything interesting because whenever it wants to read or write to a file, or send or receive network messages, or write something to the screen, or allocate memory, or all of these things, the kernel has to be involved. And we can use eBPF to instrument those events and do interesting things. And the kernel doesn’t care whether those processes are running in containers, under Kubernetes, just running directly on the host; all of those things are visible to eBPF.
So, in one sense, it doesn’t matter. 
But one of the reasons why I think we’re seeing eBPF-based tools really take off in cloud-native is that you can, by applying some programming, you can link events that happened in the kernel to specific containers in specific pods in whatever namespace and, you know, get the relationship between an event and the Kubernetes objects that are involved in that event. And then that enables a whole lot of really interesting observability or security tools and it enables us to understand how network packets are flowing between different Kubernetes objects and so on. So, it’s really having this vantage point in the kernel where we can see everything and we didn’t have to change those applications in any way to be able to use eBPF to instrument them.
Corey: When I see the stories about eBPF, it seems like it’s focused primarily on networking and flow control. That’s where I’m seeing it from a security standpoint, that’s where I’m seeing it from a cost allocation aspect. Because, frankly, out of the box, from a cloud provider’s perspective, Kubernetes looks like a single-tenant application with a really weird behavioral pattern, and some of that crosstalk gets very expensive. Is there a better way than either using eBPF and/or VPC flow logs to figure out what’s talking to what in the Kubernetes ecosystem, or is BPF really your first port of call?
Liz: So, I’m coming from the perspective of working for the company that created the Cilium networking project. And one of the reasons why I think Cilium is really powerful is because it has this visibility—it’s got a component called Hubble—that allows you to see exactly how packets are flowing between these different Kubernetes identities. So, in a Kubernetes environment, there’s not a lot of point having network flows that talk about IP addresses and ports when what you really want to know is, what’s the Kubernetes namespace, what’s the application? 
Defining things in terms of IP addresses makes no sense when they’re just being refreshed and renewed every time you change pods. So yeah, Kubernetes changes the requirements on networking visibility and on firewalling as well, on network policy, and, I think, you don’t have to use eBPF to create those tools, but eBPF is a really powerful and efficient platform for implementing those tools, as we see in Cilium.
Corey: The only competitor I found to it that gives a reasonable explanation of why random things are transferring multiple petabytes between each other in the middle of the night has been oral tradition, where I’m talking to people who’ve been around there for a while. It’s, “So, I’m seeing this weird traffic pattern at these times of day. Any idea what that might be?” And someone will usually perk up and say, “Oh, is it—” whatever job that they’re doing. Great. That gives me a direction to go in.
But especially in this era of layoffs and as environments exist for longer and longer, you have to turn into a bit of a data center archaeologist. That remains insufficient, on some level. And on some level, I’m annoyed with trying to understand or needing to use tooling like this that is honestly this powerful and this customizable, and yes, on some level, this complex in order to get access to that information in a meaningful sense. But on the other, I’m glad that that option is at least there for a lot of workloads.
Liz: Yeah. I think, you know, that speaks to the power of this new generation of tooling. And the same kind of applies to security forensics, as well, where you might have an enormous stream of events, but unless you can tie those events back to specific Kubernetes identities, which you can use eBPF-based tooling to do, then how do you—the forensics job of tying back where did that event come from, what was the container that was compromised, it becomes really, really difficult. 
And eBPF tools—like Cilium has a sub-project called Tetragon that is really good at this kind of tying events back to the Kubernetes pod, or if we want to know what node it was running on, what namespace, or whatever. That’s really useful forensic information.
Corey: Talk to me a little bit about how broadly applicable it is. Because from my understanding from our last conversation, when you were on the show a year or so ago, if memory serves, one of the powerful aspects of it was very similar to what I’ve seen some of Brendan Gregg’s nonsense doing in his kind of various talks where you can effectively write custom programming on the fly and it’ll tell you exactly what it is that you need. Is this something that can be instrumented once and then effectively used for basically anything, [OTEL 00:16:11]-style, or instead, does it need to be effectively custom configured every time you want to get a different aspect of information out of it?
Liz: It can be both of those things.
Corey: “It depends.” My least favorite but probably the most accurate answer to hear.
Liz: [laugh]. But I think Brendan did a really great—he’s done many talks talking about how powerful BPF is and built lots of specific tools, but then he’s also been involved with bpftrace, which is kind of like a language for—a high-level language for saying what it is that you want BPF to trace out for you. So, a little bit like, I don’t know, awk but for events, you know? It’s a scripting language. So, you can have this flexibility.
And with something like bpftrace, you don’t have to get into the weeds yourself and do kernel programming, you know, in eBPF programs. But also there’s gainful employment to be had for people who are interested in that eBPF kernel programming because, you know, I think there’s just going to be a whole range of more tools to come, you know? 
I think we’re, you know, we’re seeing some really powerful tools with Cilium and Pixie and [Parker 00:17:27] and Kepler and many other tools and projects that are using eBPF. But I think there’s also a whole load more to come as people think about different ways they can apply eBPF and instrument different parts of an overall system.
Corey: We’re doing this over audio only, but behind me on my wall is one of my least favorite gifts ever to have been received by anyone. Mike, my business partner, got me a thousand-piece puzzle of the Kubernetes container landscape where—
Liz: [laugh].
Corey: This diagram is psychotic and awful and it looks like a joke, except it’s not. And building that puzzle was maddening—obviously—but beyond that, it was a real primer in just how vast the entire container slash Kubernetes slash CNCF landscape really is. So, looking at this, I found that the only reaction that was appropriate was a sense of overwhelmed awe slash frustration, I guess. It’s one of those areas where I spend a lot of time focusing on drinking from the AWS firehose because they have a lot of products and services because their product strategy is apparently, “Yes,” and they’re updating these things in a pretty consistent cadence. Mostly. And even that feels like it’s multiple full-time jobs shoved into one.
There are hundreds of companies behind these things and all of them are in areas that are incredibly complex and difficult to go diving into. eBPF is incredibly powerful, I would say ridiculously so, but it’s also fiendishly complex; at least shoulder-surfing behind people who know what they’re doing with it has been breathtaking, on some level. How do people find themselves in a situation where doing a BPF deep dive makes sense for them?
Liz: Oh, that’s a great question. So, first of all, I’m thinking, is there an AWS jigsaw as well, like the CNCF landscape jigsaw? There should be. And how many pieces would it have? 
[It would be very cool 00:19:28].
Corey: No, because I think the CNCF at one point hired a graphic designer and it’s unclear that AWS has done such a thing because their icons for services are, to be generous here, not great. People have built flashcards for, “What service does this logo represent?” Haven’t a clue, in almost every case, because I don’t care in almost every case. But yeah, I’ve toyed with the idea of doing it. It’s just not something that I’d ever want to have my name attached to, unfortunately. But yeah, I want someone to do it and someone else to build it.
Liz: Yes. Yeah, it would need to refresh every, like, five minutes, though, as they roll out a new service.
Corey: Right. Because given that it appears from the outside to be impenetrable, it’s similar to learning vi in some cases, where oh, yeah, it’s easy to get started with to do this trivial thing. Now, step two, draw the rest of the freaking owl. Same problem there. It feels off-putting just from a perspective of you must be at least this smart to proceed. How do you find people coming to it?
Liz: Yeah, there is some truth in that, in that beyond kind of Hello World, you quite quickly start having to do things with kernel data structures. And as soon as you’re looking at kernel data structures, you have to sort of understand, you know, more about the kernel. And if you change things, you need to understand the implications of those changes. So, yeah, you can rapidly say that eBPF programming is kernel programming, so why would anybody want to do it? The reason why I do it myself is not because I’m a kernel programmer; it’s because I wanted to really understand how this is working and build up a mental model of what’s happening when I attach a program to an event. And what kinds of things can I do with that program?
And that’s the sort of exploration that I think I’m trying to encourage people to do with the book. 
But yes, there is going to be, at some point, a pretty steep learning curve that’s kernel-related, but you don’t necessarily need to know everything in order to really have a decent understanding of what eBPF is, and how you might, for example—you might be interested to see what BPF programs are running on your existing system and learn why and what they might be doing and where they’re attached and what use could that be.
Corey: Falling down that, looking at the process table once upon a time was a heck of an education, one week when I didn’t have a lot to do and I didn’t like my job in those days, where, “Oh, what is this Avahi daemon that’s constantly running? mDNS forwarding? Who would need that?” And sure enough, that tickled something in the back of my mind when I wound up building out my networking box here on top of BSD, and oh, yeah, I want to make sure that I can still have discovery work from the IoT subnet over to wherever it is that my normal devices live. Ah, that’s what that thing’s always running for. Great for that one use case. Almost never needed in other cases, but awesome. Like, you fire up a Raspberry Pi. It’s, “Why are all these things running when I just want to have an embedded device that does exactly one thing well?” Ugh. Computers have gotten complicated.
Liz: I know. It’s like when you get those pop-ups on—well, certainly on Mac, you get pop-ups occasionally that say such-and-such a daemon wants extra permissions, and you think, I’m not hitting that yes button until I understand what that daemon is. And it turns out it’s related to something completely innocuous that you’ve actually paid for, but just under a different name. Very annoying. So, if you have some kind of instrumentation like tracing or logging or security tooling that you want to apply to all of your containers, one of the things you can use is a sidecar container approach. And in Kubernetes, that means you inject the sidecar into every single pod. And—
Corey: Yes. 
Of course, the answer to any Kubernetes problem appears to be, have you tried running additional containers?
Liz: Well, right. And there are challenges that can come from that. And one of the reasons why you have to do that is because if you want a tool that has visibility over that container that’s inside the pod, well, your instrumentation has to also be inside the pod so that it has visibility because your pod is, by design, isolated from the host it’s running on. But with eBPF, well, eBPF is in the kernel and there’s only one kernel, however many containers we’re running. So, there is no kind of isolation between the host and the containers at the kernel level.
So, that means if we can instrument the kernel, we don’t have to have a separate instance in every single pod. And that’s really great for all sorts of resource usage; it means you don’t have to worry about how you get those sidecars into those pods in the first place, you know that every pod is going to be instrumented if it’s instrumented in the kernel. And then for service mesh, service mesh usually uses a sidecar as a Layer 7 proxy injected into every pod. And that actually makes for a pretty convoluted networking path for a packet to sort of go from the application, through the proxy, out to the host, back into another pod, through another proxy, into the application.
What we can do with eBPF, we still need a proxy running in userspace, but we don’t need to have one in every single pod because we can connect the networking namespaces much more efficiently. So, that was essentially the basis for sidecarless service mesh, which we did in Cilium, and Istio is now using a similar sort of approach with Ambient Mesh. So that, again, you know, avoiding having the overhead of a sidecar in every pod. 
So that, you know, seems to be the way forward for service mesh as well as other types of instrumentation: avoiding sidecars.

Corey: On some level, avoiding things that are Kubernetes staples seems to be a best practice in a bunch of different directions. It feels like it’s an area where you start to get aligned with the idea of service meesh—yes, that’s how I pluralize the term service mesh and if people have a problem with that, please, it’s imperative you not send me letters about it—but this idea of discovering where things are in a variety of ways within a cluster, where things can talk to each other, when nothing is deterministically placed, it feels like it is screaming out for something like this.

Liz: And when you think about it, Kubernetes does sort of already have that at the level of a service, you know? Services are discoverable through native Kubernetes. There’s a bunch of other capabilities that we tend to associate with service mesh like observability or encrypted traffic or retries, that kind of thing. But one of the things that we’re doing with Cilium, in general, is to say, but a lot of this is just a feature of the networking, the underlying networking capability. So, for example, we’ve got a next-generation mutual authentication approach, which is using SPIFFE IDs between an application pod and another application pod. So, it’s like the equivalent of mTLS.

But the certificates are actually being passed into the kernel and the encryption is happening at the kernel level. And it’s a really neat way of saying we don’t need… we don’t need to have a sidecar proxy in every pod in order to terminate those TLS connections on behalf of the application. We can have the kernel do it for us and that’s really cool.

Corey: Yeah, at some level, I find that it still feels weird—because I’m old—to have this idea of one shared kernel running a bunch of different containers.
I got past that just by not requiring that [unintelligible 00:27:32] workloads need to run isolated having containers run on the same physical host. I found that, for example, running some stuff, even in my home environment for IoT stuff, things that I don’t particularly trust run inside of KVM on top of something as opposed to just running it as a container on a cluster. Almost certainly stupendous overkill for what I’m dealing with, but it’s a good practice to be in to start thinking about this. To my understanding, this is part of what AWS’s Firecracker project starts to address a bit more effectively: fast provisioning, but still being able to use different primitives as far as isolation boundaries go. But, on some level, it’s nice to not have to think about this stuff, but that’s dangerous.

Liz: [laugh]. Yeah, exactly. Firecracker is a really nice way of saying, “Actually, we’re going to spin up a whole VM,” but we don’t ne—when I say ‘whole VM,’ we don’t need all of the things that you normally get in a VM. We can get rid of a ton of things and just have the essentials for running that Lambda or container service, and it becomes a really nice lightweight solution. But yes, that will have its own kernel, so unlike, you know, running multiple kernels on the same VM where—sorry, running multiple containers on the same virtual machine where they would all be sharing one kernel, with Firecracker you’ll get a kernel per instance of Firecracker.

Corey: The last question I have for you before we wind up wrapping up this episode harkens back to something you said a little bit earlier. This stuff is incredibly technically nuanced and deep.
You clearly have a thorough understanding of it, but you also have what I think many people do not realize is an orthogonal skill of being able to articulate and explain those complex concepts simply and approachably, in ways that make people understand what it is you’re talking about, but also don’t feel like they’re being spoken to in a way that’s highly condescending, which is another failure mode. I think it is not particularly well understood, particularly in the engineering community, that there are—these are different skill sets that do not necessarily align congruently. Is this something you’ve always known or is this something you’ve figured out as you’ve evolved your career that, oh I have a certain flair for this?

Liz: Yeah, I definitely didn’t always know it. And I started to realize it based on feedback that people have given me about talks and articles I’d written. I think I’ve always felt that when people use jargon or they use complicated language or they, kind of, make assumptions about how things are, it quite often speaks to them not having a full understanding of what’s happening. If I want to explain something to myself, I’m going to use straightforward language to explain it to myself [laugh] so I can hold it in my head. And I think people appreciate that.

And you can get really—you know, you can get quite in-depth into something if you just start, step by step, build it up, explain everything as you go along the way. And yeah, I think people do appreciate that. And I think people, if they get lost in jargon, it doesn’t help anybody. And yeah, I very much appreciate it when people say that, you know, they saw a talk or they read something I wrote and it meant that they finally grokked whatever that concept was that I was trying to explain. I will say at the weekend, I asked ChatGPT to explain DNS in the style of Liz Rice, and it started off, it was basically, “Hello there.
I’m Liz Rice and I’m here to explain DNS in very simple terms.” I thought, “Okay.” [laugh].

Corey: Every time I think I’ve understood DNS, there’s another level to it.

Liz: I’m pretty sure there is a lot about DNS that I don’t understand, yeah. So, you know, there’s always more to learn out there.

Corey: There certainly is. I really want to thank you for taking time to speak with me today about what you’re up to. Where’s the best place for people to find you to learn more? And of course, to buy the book.

Liz: Yeah, so I am Liz Rice pretty much everywhere, all over the internet. There is a GitHub repo that accompanies the book that you can find on GitHub: lizRice/learning-eBPF. So, that’s a good place to find some of the example code, and it will obviously link to where you can download the book or buy it because you can pay for it; you can also download it from Isovalent for the price of your contact details. So, there are lots of options.

Corey: Excellent. And we will, of course, put links to that in the [show notes 00:32:08]. Thank you so much for your time. It’s always great to talk to you.

Liz: It’s always a pleasure, so thanks very much for having me, Corey.

Corey: Liz Rice, Chief Open Source Officer at Isovalent. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
    5/2/2023
    33:59

