Protein Folding: An Example of Bioinformatics with Union
Protein folding has profound implications for understanding disease, drug development, and fundamental life processes. Every protein begins as a chain of amino acids but must fold into a specific three-dimensional structure to function properly. This structure determines how the protein binds, catalyzes reactions, and performs structural roles in cells. When proteins misfold, it can lead to diseases like Alzheimer's, Parkinson's, and various cancers, making protein structure prediction crucial for both basic research and medical applications.
The protein folding problem has been notoriously difficult for several reasons. The number of possible conformations a protein chain can adopt is astronomical: even a small protein of 100 amino acids could theoretically fold in more ways than there are atoms in the universe. ColabFold has democratized structure prediction, but it still relies on free tools like Google Colab, which imposes runtime constraints and is prone to disconnections, making it unreliable for large-scale studies. Additionally, MMseqs2’s hosted server can become a bottleneck during peak usage, and relying on it limits customization with proprietary data. These challenges underscore the need for robust, flexible platforms to support advanced protein folding research.
Protein Folding with Union
Union addresses these challenges with an end-to-end AI development platform tailored for production-grade bioinformatics workflows. With Union, you can:
Provision Flexible Infrastructure
On Union, you can provision virtually any infrastructure and experience no inherent runtime constraints or interruptions. Even better, you can specify which accelerator you want instead of relying on what’s available. You can, of course, still rely on the hosted MMseqs2 server for generating multiple sequence alignments (MSAs), or provision your own (as detailed in our Protein Folding and Bioinformatics white paper offered on this page).
Optimize Workflows
Leverage Union’s task-oriented orchestration to dynamically scale workloads while reducing idle costs. Scaling down to zero is a prudent cost-saving measure, but scaling back up from zero requires re-provisioning. Formalizing the stages of the pipeline as discrete tasks makes this trade-off much easier to manage. One task can populate the databases, another can run MMseqs2, and a third can run the actual structure prediction. These tasks can be assembled flexibly into a workflow, chaining inputs and outputs to implicitly form an execution graph. Finally, the workflow can be executed on a schedule tuned to match the cadence of input generation.
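To make the three-stage structure concrete, here is a minimal sketch in plain Python. It does not use Union's SDK: every function name, the toy database, and the 80% identity threshold are hypothetical stand-ins chosen for illustration. The point is only that passing each stage's output to the next is what defines the execution graph.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alignment:
    query_id: str
    hits: List[str]


def populate_databases() -> Dict[str, str]:
    """Stage 1 (stand-in): return a toy 'database' of reference sequences."""
    return {"ref1": "MKTAYIAKQR", "ref2": "MKTAYLAKQR", "ref3": "GGGGGGGGGG"}


def run_alignment(query_id: str, query: str, db: Dict[str, str]) -> Alignment:
    """Stage 2 (stand-in for an MSA search): keep references that match
    the query at 80% or more of the query's positions."""
    hits = []
    for ref_id, ref in db.items():
        matches = sum(a == b for a, b in zip(query, ref))
        if matches / max(len(query), 1) >= 0.8:
            hits.append(ref_id)
    return Alignment(query_id, sorted(hits))


def predict_structure(alignment: Alignment) -> str:
    """Stage 3 (stand-in): summarise what a predictor would receive."""
    return f"{alignment.query_id}: predicted from {len(alignment.hits)} aligned sequences"


def folding_workflow(query_id: str, query: str) -> str:
    """Chain the stages; each output feeds the next stage's input."""
    db = populate_databases()
    alignment = run_alignment(query_id, query, db)
    return predict_structure(alignment)


if __name__ == "__main__":
    print(folding_workflow("q1", "MKTAYIAKQR"))
```

On an orchestration platform, each of these functions would typically be registered as its own task with its own resource requests, so the database stage, the alignment stage, and the prediction stage can scale independently.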
Enhance Productivity
Use enterprise-grade features like checkpointing, versioning, caching, and dependency management to accelerate research and enable production-grade protein structure prediction.
Stateful Containers to Reduce Cost
Bioinformatics workflows need to be optimized to run efficiently, or else you risk runaway compute costs. Union Actors are stateful containers that can be reused across several tasks while only having to start up once, dramatically reducing the startup cost of repeated tasks. In the context of folding, this means that the databases can be downloaded to the Actor environment, potentially even loaded into memory, allowing multiple tasks to be submitted to this environment indefinitely.
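The cost argument can be illustrated with a small plain-Python sketch. It is not the Union Actors API: the `FoldingEnvironment` class, the `cold_start_submit` function, and the toy database are hypothetical, and the "setup" here only increments a counter instead of downloading anything. It contrasts an environment that pays the setup cost once with one that pays it on every submission.

```python
from typing import Dict, List, Optional


def load_database() -> Dict[str, str]:
    """Stand-in for an expensive download-and-load step."""
    return {"ref1": "MKTAYIAKQR", "ref2": "MKTAYLAKQR"}


class FoldingEnvironment:
    """Hypothetical long-lived environment: setup happens at most once."""

    def __init__(self) -> None:
        self.setup_count = 0
        self._db: Optional[Dict[str, str]] = None

    def submit(self, query: str) -> bool:
        # Load the database only on the first submission, then reuse it.
        if self._db is None:
            self._db = load_database()
            self.setup_count += 1
        return query in self._db.values()


def cold_start_submit(query: str, setup_log: List[str]) -> bool:
    """Stateless alternative: every submission repeats the setup."""
    db = load_database()
    setup_log.append(query)
    return query in db.values()


if __name__ == "__main__":
    queries = ["MKTAYIAKQR", "MKTAYLAKQR", "AAAAAAAAAA"]

    env = FoldingEnvironment()
    warm_results = [env.submit(q) for q in queries]

    setup_log: List[str] = []
    cold_results = [cold_start_submit(q, setup_log) for q in queries]

    print("warm setups:", env.setup_count, "cold setups:", len(setup_log))
    print("same answers:", warm_results == cold_results)
```

Both approaches return the same answers; the difference is how many times the setup runs, which is where the savings come from when the setup is a multi-gigabyte database download.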
The Future of Bioinformatics and Union
The possibilities for advancing bioinformatics with Union are vast. Researchers can scale workflows for high-throughput studies, integrate proprietary data, or schedule automated runs to align with research needs. By bridging the gap between accessibility and advanced capabilities, Union is empowering scientists to push the boundaries of bioinformatics and AI. You can also check out our ever-growing UnionBio repo.
Access Union’s ColabFold White Paper
For a much more detailed view into how Union powers bioinformatics, check out our protein folding white paper. The white paper includes a tutorial and instructions for:
- Setting up Union Actors
- Creating tasks and workflows
- Downloading and preparing databases
- Generating multiple sequence alignments locally (via MMseqs2)
- Protein structure prediction (via AlphaFold2)
- Visualizing the predicted structure