Slurm agent
This guide provides a comprehensive overview of setting up an environment to test the Slurm agent locally and enabling the agent in your Flyte deployment. Before proceeding, the first and foremost step is to spin up your own Slurm cluster, as it serves as the foundation for the setup.
Spin up a Slurm cluster
Setting up a Slurm cluster can be challenging due to the limited detail in the official instructions. This tutorial simplifies the process, focusing on configuring a single-host Slurm cluster with slurmctld
(central management daemon) and slurmd
(compute node daemon).
Install MUNGE
MUNGE is an authentication service, allowing a process to authenticate the UID and GID of another local or remote process within a group of hosts having common users and groups.
1. Install necessary packages
sudo apt install munge libmunge2 libmunge-dev
2. Generate and verify a MUNGE credential
munge -n | unmunge | grep STATUS
A status of
STATUS: Success(0)
is expected and the MUNGE key is stored at/etc/munge/munge.key
. If the key is absent, run the following:
sudo /usr/sbin/create-munge-key
3. Change ownership and permissions of MUNGE directories
sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo chmod 0700 /etc/munge/munge.key
sudo chown -R munge: /etc/munge/munge.key
4. Start MUNGE
sudo systemctl enable munge
sudo systemctl restart munge
Check the status with
systemctl status munge
or inspect the log at/var/log/munge
.
Create a dedicated Slurm user
The SlurmUser must be created as needed prior to starting Slurm and must exist on all nodes in your cluster.
sudo adduser --system --uid <uid> --group --home /var/lib/slurm slurm
A system user usually has a
uid
in the range of 0-999. Refer to Add a system user.
cat /etc/passwd | grep <uid>
sudo mkdir -p /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
sudo chown -R slurm: /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
Run the Slurm cluster
1. Install Slurm packages
mkdir <your-clean-dir> && cd <your-clean-dir>
wget https://download.schedmd.com/slurm/slurm-24.05.5.tar.bz2
sudo apt-get update
sudo apt-get install -y build-essential fakeroot devscripts equivs
sudo apt install -y \
libncurses-dev libgtk2.0-dev libpam0g-dev libperl-dev liblua5.3-dev \
libhwloc-dev dh-exec librrd-dev libipmimonitoring-dev hdf5-helpers \
libfreeipmi-dev libhdf5-dev man2html-base libcurl4-openssl-dev \
libpmix-dev libhttp-parser-dev libyaml-dev libjson-c-dev \
libjwt-dev liblz4-dev libmariadb-dev libdbus-1-dev librdkafka-dev
tar -xaf slurm-24.05.5.tar.bz2
cd slurm-24.05.5
sudo sed -i 's/^Types: deb$/Types: deb deb-src/' /etc/apt/sources.list.d/ubuntu.sources
sudo apt update
sudo mk-build-deps -i debian/control
debuild -b -uc -us
Install the packages:
cd ..
sudo dpkg -i slurm-smd_24.05.5-1_amd64.deb
sudo dpkg -i slurm-smd-client_24.05.5-1_amd64.deb
sudo dpkg -i slurm-smd-slurmctld_24.05.5-1_amd64.deb
sudo dpkg -i slurm-smd-slurmd_24.05.5-1_amd64.deb
2. Generate a Slurm configuration file
Generate slurm.conf
using the official configurator.
Example:
ClusterName=localcluster
SlurmctldHost=localhost
ProctrackType=proctrack/linuxproc
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=localhost CPUs=<cpus> RealMemory=<available-mem> Sockets=<sockets> CoresPerSocket=<cores-per-socket> ThreadsPerCore=<threads-per-core> State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
For GPU support, edit
/etc/slurm/slurm.conf
and/etc/slurm/gres.conf
:
# slurm.conf
GresTypes=gpu
NodeName=localhost Gres=gpu:1 CPUs=<cpus> RealMemory=<available-mem> ...
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
# gres.conf
AutoDetect=nvml
NodeName=localhost Name=gpu Type=tesla File=/dev/nvidia0
3. Start daemons
sudo systemctl enable slurmctld
sudo systemctl restart slurmctld
sudo systemctl enable slurmd
sudo systemctl restart slurmd
4. Try some Slurm commands
sinfo
srun -N 1 hostname
If the state is
drain
, run:
scontrol update nodename=<your-nodename> state=idle
Test your Slurm agent locally
This section describes how to test the Slurm agent locally without running the backend gRPC server.
Overview
The Slurm agent has 3 core methods:
create
: Runsrun
orsbatch
get
: Query job status usingscontrol
delete
: Cancel job usingscancel
For Python function tasks:
Set up a local test environment
You need: a client (localhost), a Slurm cluster, and S3-compatible object storage.
1. Install the Slurm agent on your local machine
pip install flytekitplugins-slurm
2. Install the Slurm agent on the cluster
pip install flytekitplugins-slurm
3. Set up SSH configuration
ssh-keygen -t rsa -b 4096
ssh-copy-id <username>@<fqdn-or-ip>
~/.ssh/config
:
Host <host-alias>
HostName <fqdn-or-ip>
Port <ssh-port>
User <username>
IdentityFile <path-to-private-key>
Sanity check:
ssh <host-alias>
4. Set up Amazon S3 bucket
Configure AWS credentials on both client and server.
~/.aws/config
:
[default]
region = <your-region>
~/.aws/credentials
:
[default]
aws_access_key_id = <aws-access-key-id>
aws_secret_access_key = <aws-secret-access-key>
On EC2, assign an IAM role with S3 permissions.
Check access:
aws s3 ls
Specify agent configuration
Flyte binary
Demo cluster
kubectl edit configmap flyte-sandbox-config -n flyte
tasks:
task-plugins:
default-for-task-types:
container: container
container_array: k8s-array
sidecar: sidecar
slurm_fn: agent-service
slurm: agent-service
enabled-plugins:
- container
- sidecar
- k8s-array
- agent-service
Helm chart
tasks:
task-plugins:
enabled-plugins:
- container
- sidecar
- k8s-array
- agent-service
default-for-task-types:
- container: container
- container_array: k8s-array
- slurm_fn: agent-service
- slurm: agent-service
Flyte core
values-override.yaml
:
enabled_plugins:
tasks:
task-plugins:
enabled-plugins:
- container
- sidecar
- k8s-array
- agent-service
default-for-task-types:
container: container
sidecar: sidecar
container_array: k8s-array
slurm_fn: agent-service
slurm: agent-service
Add the Slurm Private Key
1. Install flyteagent pod
helm repo add flyteorg https://flyteorg.github.io/flyte
helm install flyteagent flyteorg/flyteagent --namespace flyte
2. Set Private Key as a Secret
SECRET_VALUE=$(base64 < your_slurm_private_key_path) && \
kubectl patch secret flyteagent -n flyte --patch "{\"data\":{\"flyte_slurm_private_key\":\"$SECRET_VALUE\"}}"
3. Restart development
kubectl rollout restart deployment flyteagent -n flyte