Sharon Florentine

Union’s Andrew Dye describes his journey to MLOps

Union.ai Systems Engineer Andrew Dye sat down with Demetrios Brinkmann and David Aponte of MLOps Community for an MLOps Coffee Session to discuss navigating the world of machine learning, bridging the gap between firmware and MLOps, what Andrew wished he’d known sooner, making sense of abstractions, Union.ai tooling and more. 

Andrew is a software engineer at Union and a contributor to Flyte, a production-grade data and ML orchestration platform. Before that, he was a tech lead for ML infrastructure at Meta, where he focused on ML training reliability. 

Andrew started as a low-level systems engineer working on Microsoft's HoloLens AR headset in 2014. In this role, he worked on custom silicon that was used to do real-time tracking for AR experiences. He had no knowledge of the layers above or below him, but he gained an understanding of the complexity that was happening at the hardware layer and how it enabled advanced experiences with data and computation. As he worked on this project, he became curious about machine learning and began to study it more deeply. He was interested in the learning curve for others who came to the field without any prior knowledge, particularly in the context of the rapidly evolving field and its potential applications in the metaverse.

Oh, and Andrew prefers his coffee black, but loves an Americano when he can find one.

The MLOps Outsider

When you think of machine learning, machine learning operations (MLOps) and MLOps roles, you probably expect that MLOps engineers work at a consistently high level — that these folks are the “rocket scientists” of the software engineering world, with extensive, deep and broad experience in the space that stretches back years. But in fact, to solve some of the space’s biggest problems, the opposite is true. As Andrew explained, he’s actually got a fairly low-level systems engineering background rooted in firmware and is a relative newbie to the MLOps space.

“Out of the gate, I'm a complete MLOps outsider here,” Andrew said. “In 2014, I started working on HoloLens — for those not familiar with that, it’s the AR headset at Microsoft — and this was really my first kind of bumping into the ML space. I was a low-level systems engineer; I was working on custom silicon sitting between state-of-the-art hardware on one side and, on the other, complex algorithms doing real-time tracking for AR and a bunch of cool experiences on top. I knew nothing about either layer above or below me,” Andrew said. His work involved using custom silicon with a custom instruction set to do real-time positional tracking. While there was a lot of complexity at the hardware layer, it was necessary to empower the real-time, social positional experiences at the layers above, he said.

“I remember days sitting in my office playing this bug-shooting game; like bugs are coming out of your walls, and it was wild to see how the power of images and data and the computation could just enable these really awesome experiences,” he said. It was there that he first encountered machine learning.

The ML learning curve

When Andrew moved to Meta in 2017, he switched gears and began to focus more on the ML space. The key to ascending the learning curve was to learn by osmosis from those around him, he said; to tap into the expertise and knowledge of the experts in the space.

“It was highly iterative,” Andrew said. “I knew nothing from the beginning and I’d just been kind of soaking things up since.”

For those not familiar with the way Meta works, Andrew explained that once you're hired as an engineer, you progress through a boot camp and a team-finding process. It’s up to the individual which teams to engage with and which focus area to choose.

“Based on my exposure to ML and computer vision, I knew that I was interested in exploring that space more, but really didn't know what I was looking for. I ultimately found a team that was responsible for distributed training and what would eventually become the AI infrastructure organization, and it seemed just like the perfect mashup between the ML space that I was interested in learning more about and the systems problems I was skilled in, comfortable with and passionate about,” he said. That formed an ideal “Goldilocks zone” for Andrew to jump into. He initially began with a narrow focus, but was able to grow through exploration and experimentation with different facets of the technology at his own pace.

What he wished he knew then

To flatten the learning curve, Andrew surrounded himself with smart people and tried to absorb as much as he could. But he also was aware of just how much he still didn’t know. 

“I really saw this from two angles: one, going up the stack toward the ML space — and I just had no idea. I took the Andrew Ng course on Coursera to give me some more familiarity around ‘What is deep learning?’ and ‘What are neural nets?’ and I knew just enough,” he explained. But there’s a difference between academic, hypothetical understanding and being able to put that knowledge into practice, even when it comes to the basics. Andrew didn’t let himself get too hung up on what he didn’t know and instead focused on trusting his gut and applying his existing knowledge to the role.

“A lot of that's pretty academic and simple, and what I was working on was doing distributed training using model parallelism across tens of nodes,” he said. “Things got way more complex really quickly and built on a lot of these basics that I really didn't quite understand, so I was just kind of flying by the seat of my pants — leaning on others where I could but also trusting that I'd figure it out eventually. I didn't get too hung up on thinking ‘I gotta know all this!’ because the entire surface is not knowable,” he said.

Interestingly, Andrew said, he hadn’t worked at a service layer before, having come from operating low-level systems. So he was also learning classic service-oriented architecture concepts and pieces at the same time. The talent surrounding him at Meta provided him with tons of examples and modeled what he should aspire to know and to be, he explained.

“Meta, as a leader in the space, had a ton of amazing experts that I had an opportunity to learn from. I was really relying a ton on my intuition and examples that I found, and trying to mimic what I could and what I thought was right as we moved forward and tried to scale things,” he said.

The problem of scale

Scale is a problem that Andrew tackled at Microsoft, at Meta and currently at Union, but in different ways. Scale as it applied to HoloLens had more to do with local processing and latencies for those using AR, he explained.

“It's really critical to get your positional updates down so you can update the renderings, and so you're talking about optimizing latency at all sorts of layers.” At Microsoft, “the chip we were working on had a number of cores, and we did explicit cache management to pass messages across cores to avoid any sort of latencies,” he said. At Meta, scale problems manifested in a more traditional sense.

“More data, bigger models — more complexity and more compute power,” he said. “When I first joined Meta in 2017, I had an opportunity to work very tangentially on a paper that was training on the ImageNet-1K dataset in an hour, which was state of the art at the time and notable work in the academic research space. I was working on some fault-tolerance features to move that project forward — but that was 256 GPUs; P100s, if I recall.”

That was interesting to Andrew because it demonstrated how the theoretical problems he was solving could actually be applied in a very practical way that impacted the end user.

“It was interesting to apply the distributed learnings from building the fault tolerance mechanisms and see how it actually maps to the researcher who's leveraging this,” he said. “Investing in reliability and these things are cool from a computer science standpoint, but the end goal is to train this model.”

More than meets the eye

At the same time, the transformer ML model was gaining traction in the space, and models started getting bigger and bigger. It became necessary to distribute those models across multiple devices, and Andrew found himself tackling scheduling and resource constraints.

“Even at Meta we were constrained in terms of how many GPUs we could get access to. We were faced with questions like: How do you schedule fairly so that someone who wants to do a really big training job and wants tons of GPUs can optimize their use? You don't want to hold back resources and wait to run until enough are available, because then you end up wasting resources and hurting the net efficiency of the cluster,” he explained. Meta used intricate scheduling algorithms to make this work, but Andrew said it was an interesting case study in the cascading fallout from the problems of scale and the heterogeneity of the workloads.

“It's important to have systems that anticipate this heterogeneity. In isolation, some of these problems look like there's one set of configurations strung through them, but … what are the network behaviors? Where should you locate these things? Where is the data? There's a bunch of interesting problems,” he said.

Managing search

Another challenge in the ML space is anticipating failures and identifying behavioral patterns. This is important, Andrew said, because of the rapid pace of change and the constant influx of new ideas and iteration.

“It’s really obvious in retrospect, but anticipating failures and behavioral patterns, instrumenting those and then logging them is paramount. It’s really hard in the ML space because it moves so quickly and you're trying out new ideas and throwing them away, trying something else out — it's really hard to find the point at which it's worth investing in instrumentation or visibility or tests,” he said. It’s not unheard of to get stuck in a feedback loop where you’re iterating repeatedly because it’s impossible to diagnose a particular problem. Keeping observability and the notion of failure front and center is key, especially at the infrastructure layer where things are a little bit more stable, he said.

“Ensuring that you're investing in observability means that you can derive patterns from runs without needing to iterate to learn what went wrong. There's opportunities to look at patterns across workloads, as well,” he said. 

There are a number of proprietary and open-source tools for instrumentation. Tabular logging is great for the verbose details of the various states you're passing through; less commonly used but helpful are time-series data points, he said.

“So, just emitting counters as various events happen, or watching latencies so you can look at standard deviations. You can use those to glean insights that can give you your next experimentation direction or point you to the next thing to try,” Andrew said.
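
To make that concrete, here is a minimal sketch of that pattern in Python. The helper names are hypothetical, not from any particular library; a real system would emit these values to a time-series or metrics backend rather than hold them in memory.

```python
import time
import statistics
from collections import defaultdict

# Hypothetical in-process metrics store, purely for illustration.
counters = defaultdict(int)
latencies = defaultdict(list)

def emit_counter(event: str) -> None:
    """Record that a named event happened (e.g., a step completing)."""
    counters[event] += 1

class timed:
    """Context manager that records wall-clock latency for a named operation."""
    def __init__(self, name: str):
        self.name = name
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    def __exit__(self, *exc):
        latencies[self.name].append(time.perf_counter() - self.start)

# Usage inside a training loop:
for step in range(100):
    with timed("batch_load"):
        time.sleep(0.001)  # stand-in for loading a batch
    emit_counter("steps_completed")

# Looking at standard deviations can surface stragglers or drift.
mean = statistics.mean(latencies["batch_load"])
stdev = statistics.stdev(latencies["batch_load"])
print(f"batch_load: mean={mean:.4f}s stdev={stdev:.4f}s")
```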

Union.ai: The next chapter

Now in his role as a systems engineer at Union, Andrew said his view of how things are done, what tools are used and who is responsible has broadened, and he relishes the expanded sense of community around ML and MLOps.  

“Coming from Meta, I feel like I had this narrow perspective on how things were done, what tools you used — there was one way to do things. And there were layers; I was at times several steps removed from ML engineers or data scientists,” he said. 

When he started engaging with the wider ML ecosystem and ultimately connected with Union, he was amazed at the community and the number of different personas working collaboratively in the space.

“Whether they're low-level systems engineers or ML app folks or ML engineers, they are engaged together and working in a very collaborative way,” he said. “Flyte’s community really embodies this, and it's really awesome to see the number of companies that have adopted Flyte and the conversations happening in their Slack channel.” 

He is thrilled to see the numerous scenarios contributors to Flyte are looking to enable and the conversations about how to support Flyte.

“Bridging the gap through communication is really powerful because it lets each side kind of understand other perspectives and the challenges in the space,” Andrew said. Another thing he finds interesting about Flyte in particular is that the tool itself facilitates this kind of collaboration because different teams and personas build different parts of an orchestration pipeline that are eventually stitched together.

“Someone can do the data input for a pipeline that's been curated by someone else deeply familiar with data and features; then someone else is optimizing the performance and behaviors at a lower level, and someone else is doing the training. The tool kind of becomes this medium to actually facilitate that communication,” he said.

“There are a lot of different skills working together for a common goal — and that’s what makes it MLOps, not the individual components. By themselves they're kind of separate, but once they are put together, working on a common product or for a common goal, it becomes easier not to know everything when you have a strong team to count on and bounce ideas off of,” he said.

It’s exactly the kind of space where collaborative learning can take place and where knowledge and information sharing happens by default, he said.

“Maybe this other person has an area of expertise, and maybe you have different expertise, so you need this kind of broad coverage of the sort of skill sets that you would need to manage all these things,” he said.

But as the tooling changes, will that also change the roles and the responsibilities of the individuals working on the project? Andrew said he thinks it’s too soon to tell. From a tooling perspective, he said, it's important to be opinionated on some pieces and less opinionated on others, where things are less settled.

Flyte, he said, puts strong typing front and center. That's an opinion the tool has, but it also becomes a constraint within which everyone has to operate. As an orchestration tool, Flyte is unopinionated about, for instance, how you get a task done, and is highly extensible, supporting a variety of plugins to interact with tools both legacy and modern.
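
For readers who haven't seen Flyte code, here is a minimal sketch of what that strong typing looks like in practice, using flytekit's task and workflow decorators (the task names and logic are illustrative):

```python
from typing import List
from flytekit import task, workflow

@task
def normalize(values: List[float]) -> List[float]:
    # Inputs and outputs are declared with Python type hints;
    # Flyte enforces them at the boundaries between tasks.
    total = sum(values)
    return [v / total for v in values]

@task
def top_value(values: List[float]) -> float:
    return max(values)

@workflow
def pipeline(values: List[float]) -> float:
    # Wiring a typed output into a typed input: a mismatch here is
    # caught when the workflow is compiled, not midway through a run.
    return top_value(values=normalize(values=values))
```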

When it comes to the way ideas develop, coming to a consensus and forming an opinion is important because that shapes the direction of things to come. For example, in the distributed training space, data parallelism was originally the standard way to train models. Then, as models grew, they hit the parameter limits of data parallelism, which ushered in pipeline parallelism; then came tensor parallelism. But it's only now that there’s a consensus on the definitions of these things and on when and where it makes sense to use each. This can be challenging for folks entering the space and trying to learn the field without a set of solid definitions.
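
To make the terminology concrete: in data parallelism, every device holds a full copy of the model and trains on a different slice of the data, while pipeline and tensor parallelism split the model itself across devices. Below is a rough sketch of the data-parallel case using PyTorch's DistributedDataParallel wrapper; the toy model, CPU setup and gloo backend are purely illustrative.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int) -> None:
    # Each process holds a full replica of the model (data parallelism).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Each rank trains on its own shard of the data; DDP averages
    # gradients across ranks during backward().
    inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(train, args=(2,), nprocs=2)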

Andrew said it’s important not to overinvest in a certain space until it has stabilized a bit, and he added that this is especially important when it comes to abstractions.

“When building on top of these very complex systems that you are orchestrating, it's critically important to recognize that ML engineers and researchers are working at higher-level areas of abstraction, and getting those abstractions right is really, really difficult.” Sometimes a simple layer of abstraction can become infinitely more complex than intended, so engineers must be intentional about which layers of abstraction they expose, how they expose them and when it makes sense to jump to a lower layer in the stack. Building a one-size-fits-all solution is an unattainable goal, but being intentional when building tooling and abstractions — who you're targeting, what capabilities you want to offer and when it's appropriate to work at a different layer of the stack — is the key to reaching consensus.

The framing is the hardest part

The most challenging part of any cutting-edge solution is framing the problem it’s going to solve. Finding the right problem to work on can make it easier to determine how to frame the solution and start building out the abstractions that will allow the technology to do its best work. Currently, Andrew is working on UnionML, a layer built on top of Flyte and Flytekit that offers a higher-level abstraction for easily instantiating models and serving inference. Another area of focus is removing more of the infrastructure management burden from teams. Flyte itself is an amazing tool, Andrew said, but as a production-grade system it must be deployed, configured and managed — in other words, there’s a lot of overhead.

“For companies that are equipped and prepared to take that on, it’s super powerful, but a lot of companies don't want to manage infrastructure. They really want to just get started and iterate and experiment. At Union, we're doing what we can — starting with Union Cloud — to get folks started as fast as they can and remove all of that burden. There’s a Flyte cluster running; you can hit it, execute your workloads and not worry about managing them,” Andrew said.
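
As for what that higher-level UnionML abstraction looks like, its quickstart pairs a Dataset with a Model and binds them together with decorated functions. The sketch below follows that general pattern; treat the exact names and signatures as illustrative, since the API may have evolved.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from unionml import Dataset, Model

# A Dataset describes how data is read and split, and which column is the target.
dataset = Dataset(name="digits_dataset", test_size=0.2, shuffle=True, targets=["target"])
# A Model binds an estimator class to that dataset.
model = Model(name="digits_classifier", init=LogisticRegression, dataset=dataset)

@dataset.reader
def reader() -> pd.DataFrame:
    return load_digits(as_frame=True).frame

@model.trainer
def trainer(estimator: LogisticRegression, features: pd.DataFrame, target: pd.DataFrame) -> LogisticRegression:
    return estimator.fit(features, target.squeeze())

@model.predictor
def predictor(estimator: LogisticRegression, features: pd.DataFrame) -> list:
    return [float(x) for x in estimator.predict(features)]

@model.evaluator
def evaluator(estimator: LogisticRegression, features: pd.DataFrame, target: pd.DataFrame) -> float:
    return float(estimator.score(features, target.squeeze()))
```

From there, the quickstart trains locally with a call along the lines of model.train(hyperparameters={"C": 1.0}), with Flyte underneath handling the path to production.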

The competition

One of the major issues inherent in working with machine learning is that complexity can spike and grow exponentially. Andrew said it’s fairly easy for someone to ideate and ultimately productionize their workloads with just some simple scripts, a notebook and some YAML files to stitch everything together. But then things snowball.

“All of a sudden, it's like, ‘Whoa, what did I build? What am I managing here?’ I’m new to a lot of the tooling here — this is actually the first time I've used Kubernetes — and I'm learning this on the fly as an infrastructure engineer. And I'm like, ‘Whoa, this is complicated!’

“I can't imagine a data scientist or a middleware engineer having to think through a lot of these pieces and configuration,” Andrew said. “What Flyte offers and what Union is focused on is getting the engineer from early idea to production seamlessly. It's a tool that you can start to do simple things with. You can run locally; you can iterate; and it’s built with principles that enable it to go from that to production grade, where all the caveats and corner cases have been thought through, handled and built into the product.”
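
That “run locally, iterate, then promote” flow is visible in flytekit itself: a @workflow function is still an ordinary Python callable. A small sketch, assuming the pipeline workflow from the earlier example lives in a hypothetical example.py:

```python
# Local iteration: no cluster needed, just call the workflow function.
from example import pipeline  # hypothetical module holding the earlier sketch

if __name__ == "__main__":
    print(pipeline(values=[3.0, 1.0, 2.0]))

# Promoting the same file to a cluster is a CLI invocation rather than a
# rewrite, e.g. (flags illustrative):
#   pyflyte run --remote example.py pipeline --values '[3.0, 1.0, 2.0]'
```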

The fact that users don’t have to make a major leap, and don’t need to rewrite their product in an entirely new tool, is what sets Flyte apart. Currently, Flyte is used by more than 80 companies; there are a number of additional investors, and even more companies have built their own in-house solutions, Andrew said, including Meta.

While other tools like Airflow and Kubeflow exist, they don’t precisely fit the category of ML orchestration as Flyte does, Andrew said. Flyte is a more targeted, purpose-built approach and improves on the experience of a tool like Airflow.

Best practices

In many respects, the distributed training space is like high-performance computing (HPC) mixed with deep learning, and it shares many of the learnings from systems engineering. For better or worse, users are bound by some of these principles, and they are emerging as best practices.

His experience in the distributed training space and at Meta taught Andrew the importance of version management, repeatability and cacheability, and he said there are opportunities to extend these capabilities within Flyte.

“This is something that we had in the orchestration system at Meta. I think I kind of took it for granted, or it seemed obvious that it's desirable — and when chatting with [Union.ai CEO] Ketan and others at Union when I first met them, they said this is one of the first things they read up on. Not only was there forethought in doing this, but the way Ketan spoke to this was, ‘Oh, yeah. Of course we would have this — why would you do this any other way? Why would people not leverage this?’ And, interestingly, it's not something that I saw leveraged as broadly in my time at Meta,” Andrew said.
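
In Flyte, that cacheability is surfaced directly in the task API. A minimal sketch (the task name and logic are illustrative):

```python
import pandas as pd
from flytekit import task

@task(cache=True, cache_version="1.0")
def featurize(raw_csv: str) -> pd.DataFrame:
    # With cache=True, re-running the workflow with identical inputs
    # returns the stored output instead of recomputing it. Bumping
    # cache_version deliberately invalidates earlier cached results.
    df = pd.read_csv(raw_csv)
    return df.dropna()
```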

That puts Union ahead of the pack, which he believes is due to the team’s experience at Lyft and their familiarity with productionizing workloads there. He added that there are many features already included and many more planned.

“I think users don't even know how powerful it is yet, and don't even know what they will need when they move from this idea phase to production-grade, larger scale and start dealing with some of these problems,” he said.

The future of Flyte

Andrew has seen the future; he knows what’s coming; and he, along with the rest of the team, is doing his best to prepare Union.ai and Flyte for it. When ML models reach a certain scale, certain challenges inevitably arise, and Union has lots of opportunities to solve for them as compute scales up and jobs become bigger. That said, Union is a young company for now and is focusing on what it does really well — the orchestration layer, Andrew said.

“What the future holds for Union — I think there's a lot of room for growth. We have a ton of really awesome and smart people at the company that are highly plugged into what's going on in the industry,” he said. “I think that's wise: You do what you're best at or what you're good at. There's too many things going on in this space to take a stab at them all. … It probably wouldn't be wise for a company to make that their business strategy.”

Want to listen to the full conversation? Find it here. We will talk more about Machine Learning in Production and related topics in an upcoming blog post series about Orchestration for Data and ML. If you have any questions or comments, please reach out to us at feedback@union.ai.

About Union.ai

Union.ai helps organizations deliver reliable, reproducible and cost-effective machine learning and data orchestration built around open-source Flyte. Flyte is a one-of-a-kind workflow automation platform that simplifies the journey of data scientists and machine learning engineers from ideation to production. Some of the top companies, including Lyft, Spotify, GoJek and more, rely on Flyte to power their data and ML products. Based in Bellevue, Wash., Union.ai was started by founding engineers of Flyte and is the leading contributor to Flyte.
