Union.ai

Event

October 12, 2022

MLOps AMA Session with Ketan Umare

Sandra Youssef

Union.ai CEO/co-founder and Flyte TSC Chair Ketan Umare was recently featured on an MLOps Community “Ask Me Anything” session. The discussion mainly focused on ML pipelines and ML models, top challenges, maintainability and reproducibility, ML orchestration tools, and monitoring data quality.

Based on that session, we’ve compiled this list of questions and answers to give you a quick look at our perspective on MLOps.

What are the top three biggest challenges when it comes to reproducing ML pipelines?

Objectives/outcomes: Not clearly defined.
Data: The inability to access the same data over time and/or to understand the properties of the dataset.
Infrastructure/code/configuration: The artifacts used to produce models are incorrectly versioned and not in a consumable, reproducible state.

What does it mean for ML pipelines to be recoverable, and how can we ensure that they are?

It means that a pipeline run can be recovered from past results, or resumed from where it left off, after a catastrophic system-level failure.
A few tactics for recoverability are:
~Granular caching, where outputs of nodes don’t have to be re-computed.
~Intra-task (intra-node) checkpointing, where progress within a task can be resumed (e.g., with long-running model training tasks). Workflows are great at macro checkpointing, but the system should also support intra-task checkpointing.
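Both tactics can be sketched in plain Python. This is an illustration only, not Flyte's API: `cached_node`, `train_with_checkpoints`, the JSON-on-disk layout, and the `fail_at` crash switch are all hypothetical names and choices made for the example.

```python
import hashlib
import json
import os


def cached_node(cache_dir, name, version, inputs, compute):
    """Return a node's stored output for these inputs, or compute and store it."""
    # The key covers the node name, a code version, and the inputs,
    # so changing any of them forces a recompute.
    payload = json.dumps({"name": name, "version": version, "inputs": inputs}, sort_keys=True)
    key = hashlib.sha256(payload.encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), True  # cache hit
    result = compute(**inputs)
    with open(path, "w") as f:
        json.dump(result, f)
    return result, False  # cache miss


def train_with_checkpoints(checkpoint_path, total_epochs, fail_at=None):
    """Toy training loop that resumes from the last completed epoch."""
    state = {"epoch": 0, "loss": 1.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    for epoch in range(state["epoch"], total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated crash")  # hypothetical failure for the demo
        state = {"epoch": epoch + 1, "loss": state["loss"] * 0.5}
        # Persist progress after every epoch so a restart skips finished work.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state
```

Calling `cached_node` twice with the same name, version, and inputs returns the stored result the second time. Calling `train_with_checkpoints` again after a crash picks up at the first unfinished epoch instead of epoch 0.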

What is the difference between an ML pipeline and an ML model?

Pipelines go beyond training models.
They include other steps like data wrangling, feature engineering, training, evaluating, and even deploying models.
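As a rough illustration of the distinction, here is a toy pipeline in plain Python in which training is just one step among several. The step functions, the one-weight "model", and the in-memory `registry` standing in for deployment are all assumptions made for this sketch.

```python
def wrangle(rows):
    """Drop records with missing values."""
    return [r for r in rows if r["x"] is not None and r["y"] is not None]


def engineer_features(rows):
    """Turn raw records into (feature, label) pairs."""
    return [(float(r["x"]), float(r["y"])) for r in rows]


def train(pairs):
    """Fit y = w * x by least squares; the 'model' is just the weight w."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return {"w": num / den}


def evaluate(model, pairs):
    """Mean absolute error of the model on the given pairs."""
    return sum(abs(model["w"] * x - y) for x, y in pairs) / len(pairs)


def deploy(model, registry):
    """Stand-in for deployment: record the model in an in-memory registry."""
    registry["latest"] = model


def pipeline(rows, registry, max_error=0.5):
    """Wire the steps together; the model is only the output of one of them."""
    pairs = engineer_features(wrangle(rows))
    model = train(pairs)
    # Evaluated on the training pairs only to keep the sketch short.
    error = evaluate(model, pairs)
    if error <= max_error:
        deploy(model, registry)
    return model, error
```

The model here is the dictionary returned by `train`. The pipeline is everything around it, including the decision of whether the model is good enough to deploy.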

Why is it hard to maintain and reproduce ML models?

ML models have many dependencies, including code, data, config, random seeds, and software dependencies, just to name a few. This makes maintaining them difficult because changing any one of these will change the model, potentially significantly if it is sensitive to one of those dependencies.
There are well-established tools for some of these dependencies (e.g. Git for code versioning) but standard tools for versioning and managing data are still in their infancy.
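One way to make these dependencies explicit is to hash them together into a single identifier stored next to the model. The sketch below assumes a hypothetical `model_fingerprint` helper and a JSON-serializable config; it is not a standard tool, just an illustration of the idea.

```python
import hashlib
import json


def model_fingerprint(code_version, data_bytes, config, seed, packages):
    """Hash everything the trained model depends on into one identifier."""
    record = {
        "code_version": code_version,  # e.g., a Git commit SHA
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "config": config,              # hyperparameters and other settings
        "seed": seed,                  # random seed used for training
        "packages": packages,          # pinned package name -> version
    }
    # sort_keys gives a canonical encoding, so dictionary order does not matter.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

If any one input changes (a different seed, one edited row of data, a bumped package version), the fingerprint changes, which makes it obvious that two models were not produced under the same conditions.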

What is the most important way to make an internal ML platform successful in a company?

Platforms are hard! There is no way to satisfy everyone. Gradually define the audience and improve.
It is important to work very closely with users to help them succeed. Every platform serves its users, and keeping their success in mind is critical.
Change is inevitable, so platforms must be flexible.
From a technology POV, platform operators should be able to control the platform and deliver delightful experiences, without constant upgrades to user code.

How do the most successful ML teams take models from development to production? Do they develop on notebooks and then re-write this code into a production pipeline, or do they start from the beginning with a pipeline that they use as part of model development?

Most successful ML teams learn to work together and think of the outcome first. Getting everybody on the same page through the tooling helps.
Building an ML product is a team sport.
Notebooks pose some problems, but these are mostly an issue of inadequate tooling.
Following good software practices helps avoid problems like lost code, lost models, non-reproducible models, and a lack of shared understanding.
The metaphorical wall between engineers and data scientists should be demolished to deliver products successfully.
The process of moving code from notebook to production is still evolving.

Should we separate concerns about a platform and its users? Why or why not?

Platform and user concerns should be separated.
Platforms tend to think about infrastructure, whereas users prioritize business goals over infrastructure.
The separation is critical. Otherwise, every ML engineer will have to become a distributed systems engineer and vice versa, which is not practical.

Should ML be tested just like software engineering?

One way of looking at ML models is that they are function approximators, where they approximate the function specified by the data (input-output pairs in the case of supervised learning).
Short answer: yes!
Long answer: If you have a function, it would make sense to test it. There are three broad ways to frame this:
~Testing the code that trains the model: This can be done with typical testing frameworks like pytest. Given a mock dataset (e.g. Pandera can help to synthesize training data from a schema), test the model training function and see whether it successfully produces a trained model artifact. This is useful but not ideal because the mock data will likely fall outside the distribution of the real data of interest.
~Testing a trained model in the lab: This is where the challenge is. In ML, a test set is analogous to test cases in the typical software engineering sense, except that it’s a fuzzy kind of acceptance test because you can never get every case to pass. The best approach is to collect a baseline and make sure the model doesn’t regress below it.
~Testing a trained model in production: This is the most challenging, and is analogous to regressions that might happen in traditional software functions due to incorrect assumptions about the world, as evidenced by the many model observability startups that have emerged in the last few years.
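The first two framings can be sketched as pytest-style tests. Everything here is a toy assumption: the threshold classifier, the `MOCK_DATA` and `HELD_OUT` sets, and the `BASELINE_ACCURACY` value are made up for illustration and stand in for a real training function, test set, and recorded baseline.

```python
def train_threshold_model(samples):
    """Fit a 1-D classifier that predicts 1 when x >= threshold.

    The threshold is the midpoint between the two class means.
    """
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    return {"threshold": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2}


def accuracy(model, samples):
    """Fraction of samples the model labels correctly."""
    correct = sum(1 for x, y in samples if int(x >= model["threshold"]) == y)
    return correct / len(samples)


# Hypothetical mock training data and held-out test set.
MOCK_DATA = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
HELD_OUT = [(0.2, 0), (0.3, 0), (0.45, 0), (0.55, 0), (0.7, 1), (0.8, 1)]
BASELINE_ACCURACY = 0.80  # hypothetical score recorded from an earlier model


def test_training_produces_model():
    """Test the training code: does it yield a usable model artifact?"""
    model = train_threshold_model(MOCK_DATA)
    assert "threshold" in model
    assert 0.0 < model["threshold"] < 1.0


def test_model_does_not_regress():
    """Test the trained model: a fuzzy acceptance test against a baseline."""
    model = train_threshold_model(MOCK_DATA)
    assert accuracy(model, HELD_OUT) >= BASELINE_ACCURACY
```

Note that the second test passes even though one held-out example is misclassified. The assertion is about not falling below the baseline, not about every case passing.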

Is Airflow enough for ML? Is it the right tool?

Airflow was designed to orchestrate data workflows. It is useful to run ETL/ELT tasks due to the ease with which it connects to a standard set of third-party sources to achieve data orchestration. ML practitioners have additional requirements:
~Resource-intensive tasks: Airflow doesn’t provide a direct way to handle this, but a KubernetesPodOperator can be used to overcome resource constraints by containerizing the code.
~Data lineage: Airflow is not a data-aware platform; it simply constructs the flow but is oblivious to the inner details of passing around the data.
~Versioning: There have been some indirect techniques to handle the versioning of Airflow code, but it still isn’t an obvious feature. Moreover, Airflow doesn’t support revisiting the directed acyclic graphs (DAGs) and re-running the workflows on demand.
~Infrastructure automation: ML code needs high processing power. However, it is difficult to keep resource utilization high in Airflow, because capacity sits idle when workflows aren’t running or when only a few tasks need compute.
~Caching: Airflow doesn’t yet support caching task outputs to help expedite executions and avoid redundant use of resources.
~Checkpointing and integration with ML frameworks: Airflow doesn’t support this.
~Multiple executions: Scheduling multiple executions of the same pipeline, ad hoc or in parallel, isn’t something Airflow provides.

How different is Flyte from other orchestrators like Airflow, Luigi, and Kedro? What was the main motivation to start developing a new tool?

Flyte began in 2017, after we failed with Airflow. None of the other tools existed.
We took an ML-centric approach focused on reproducibility, maintainability, and flexibility.
We took a team-centric approach because, at Lyft, we had large teams running mission-critical workloads on Flyte.
When we started 5 years ago, Kubernetes was new, but we had faith that it would become a standard for ML, and eventually for data systems. This is why we are top contributors to Spark on K8s and Flink on K8s.
But we also realized that Kubernetes is extremely hard for end users to use. It was not designed for stateful, large-scale workloads. Flyte does not scale by chance. It scales by design, by intent and by experience.
Most importantly, it abstracts K8s away while still depending on it.

What do you think are the main advantages and disadvantages of using Flyte versus Kubeflow Pipelines?

DX
UX
Maintainability
Scaling
Performance
Community

What have been some of your biggest learnings from open-sourcing Flyte?

Open-sourcing is harder than you think. It is not about the best product. It is about how to communicate and how to build a community.
Open source today has helped the development world advance. (Remember the world of proprietary operating systems and different toolchains? Linux and Git helped change all that.)
Open source is transformative — if it is truly free.
Building a business on open source is extremely hard. The world of quick gratification turns many open-source products into empty shells that cost money to do anything real with.
We at Union.ai want to change this, and we believe we can build a more sustainable company by doing so.

What exactly does someone get if they pay for the product that Union offers? Is it a managed version of Flyte?

Union Cloud is a managed version of the Flyte data and ML workflow orchestrator. It frees data and ML teams from infrastructure constraints and setup. Users of Union Cloud benefit from:
~RBAC
~Admin dashboard visibility for more efficient internal management
~Same frontend / management across multiple cloud deployments
~Super simple scaling innovations
~An upcoming vertically integrated platform

What do you think of the trade-offs between building out the open-source capabilities of Flyte and creating a wedge that will make people pay for the product Union created?

We think about building the best product in open source. Users of Flyte love it and are willing to pay for it, but that alone is not a business strategy. So here are the real differences:
We realize that if it is hard for people to install Flyte, they give up. Some features will just make it even harder, so we just take the burden away by offering that in Union Cloud.
Let me illustrate this with an example: We spent about 3 months building full SSO support in open-source Flyte. It is one of the only products that supports a full OAuth2 authorization server. This is because we believe that authentication is not an option, but a necessity. Security is not an option. But a lot of new users stumble on setting up OAuth2. It is hard, and sometimes they switch to a hosted solution from a competitor or get frustrated.
If we now add a more scalable distributed database, it would make life even harder. We believe the OSS has to fit the median population with knobs in either direction.
Flyte is also completely open source and very general in architecture. Parts of it can be changed so that it supports simultaneous multi-cloud and hybrid deployments, different engines, and the creation of higher-level services.
We help teams that do not have infra know-how get going super-fast, without a lot of deviation from open-source software.

Have you seen a difference since Flyte was granted “graduated” status by the Cloud Native Computing Foundation? Is the tool taken more seriously now, for example?

We do see differences. There are more than 80 companies that have deployed Flyte to production.

You built the open-source software first, and then launched the company. How does GTM change if you do it the other way around?

Open-sourcing after the company starts is hard. You have to spend VC dollars to build open source, which is really hard to justify.
In addition, the motivation for open-sourcing is likely less than ideal. The product is not open-sourced to benefit society; it’s really just the sort of GTM move that has given OSS tools a bad rep.

Do bigger customers necessarily have many custom requirements when it comes to ML?

The wonderful part of building Union on Flyte is that Flyte is used by really large customers.
So many things are also built into it, like SSO, security, etc.
But we try to minimize customizations, since wanting to use open-source Flyte is the #1 criterion today.
This is how we ensure that our small team does not get distracted.

You spoke about how pipeline complexity can creep up on you fast when you came on the podcast. What are some of the wildest pipelines you've seen? (At Lyft or anywhere)

We have seen a lot of wild pipelines, not only at Lyft but at Spotify, Stripe, Blackshark, etc.
~Lyft: surge pricing was a difficult problem. It was a blend of multiple models, around five, each looking for different signals. A team of about 6-7 data scientists, ML engineers, and software engineers was building about 600 inter-connected and linked pipelines. Also, mapping at Lyft took the base map to final nav using multiple interrelated pipelines.
~Spotify: their financial forecast had 40 data analysts collaborate over Flyte to deliver and simulate different financial models.
~Blackshark: a pipeline with 10 million nodes is used to render the world.

Do you think Data Scientists need to know a bit about K8s in order to be able to have an impact at work? (Given that the company they work at is using K8s heavily)

No, they should not have to know about Kubernetes or even infrastructure.
What they need to know is the coding language of choice, good programming hygiene, and maybe a little bit of Docker.

Do you think monitoring data quality will become more important in the future? More and more frameworks and tools seem to focus on that lately.

Let’s divide the question into two parts:
Is data quality important? Absolutely. It has always been and always will be. But the form it will take may change.
Why are there so many tools? Solutions depend on two things:
~Market perception: The market perceives that data quality is important.
~Ease of building: It is much easier to build data quality monitoring, because its design language is pretty similar to that of application performance monitoring (APM).
Monitoring data quality today may not be mature enough. I think an important question that we need to ask is: Once I measure quality, what do I do with this knowledge? If I see data drift, should I always retrain? This depends on the outcomes (the ground truth), and in some cases these are hard to obtain or arrive very late.
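To make the "what do I do with this knowledge?" question concrete, here is a minimal sketch that separates measuring drift from deciding to retrain. The drift score (a mean shift in units of the reference standard deviation), the thresholds, and the function names are all assumptions for illustration, not a recommended monitoring design.

```python
import statistics


def mean_shift_score(reference, live):
    """Absolute difference in means, in units of the reference standard deviation.

    Assumes the reference sample has at least two values and is not constant.
    """
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(live) - statistics.mean(reference)) / ref_std


def next_action(drift_score, live_accuracy, baseline_accuracy,
                drift_threshold=1.0, tolerance=0.02):
    """Decide what to do with a drift measurement.

    live_accuracy is None while ground-truth labels have not arrived yet.
    The threshold and tolerance defaults are arbitrary illustrative values.
    """
    drifted = drift_score > drift_threshold
    if live_accuracy is None:
        # No outcomes yet: drift alone is only a signal to investigate.
        return "investigate" if drifted else "ok"
    if live_accuracy < baseline_accuracy - tolerance:
        return "retrain"
    # Outcomes are still fine, so drift by itself does not force a retrain.
    return "ok"
```

In this sketch a large drift score with no labels yet only triggers an investigation, and a large drift score with healthy accuracy triggers nothing. Retraining is tied to the outcomes, once they are available.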

Do you think it is good practice to have data scientists who only work on models, then pass the model to someone else in charge of deploying that model to production?

It’s not ideal, but:
~Currently, it is often the only good practice: Deploying and monitoring production models requires system and infrastructure knowledge that data scientists may lack.
~As we build a managed Flyte platform, we encourage ML teams, especially data scientists and ML engineers, to deploy their models.
~We want Flyte to be connected to all experimentation and deployment platforms.

What can data scientists do in their day-to-day lives to develop good programming hygiene?

At a high level, it’s very important to be clear about the difference between REPL-like exploration and software engineering.
REPL-like exploration helps you iterate and build domain knowledge about topics like data and models. The code that you write is secondary to human understanding of that topic.
Software engineering is the act of writing code that will last. It’s also the act of promoting human understanding of the code itself: Is it well documented? Are functions and variables named clearly? Are they unit tested? Does it explicitly lay out external dependencies? Can I build it from scratch?
One more thing: Try to learn different programming paradigms — for example, functional programming.