Using Argo on the OFO cluster
This guide describes how to submit workflows to Argo and monitor them. This is a generic guide for any Argo workflow.
For specific instructions on running the photogrammetry (automate-metashape) workflow, see the Photogrammetry workflow guide.
Prerequisites
This guide assumes you already:
- Have access to the `ofocluster` kubernetes cluster
- Have resized the cluster to the appropriate size for your workflow
- Have `kubectl` configured to connect to the cluster
For guidance on all of this setup, see Cluster access and resizing.
Install the Argo CLI locally (one-time)
The Argo CLI is a wrapper around kubectl that simplifies communication with Argo on the cluster.
You should install it on your local machine since that is where you will have set up and
authenticated kubectl following the Cluster access guide.
```bash
# Specify the CLI version to install
ARGO_WORKFLOWS_VERSION="v3.7.2"
ARGO_OS="linux"

# Download the binary
wget "https://github.com/argoproj/argo-workflows/releases/download/${ARGO_WORKFLOWS_VERSION}/argo-${ARGO_OS}-amd64.gz"

# Unzip
gunzip "argo-${ARGO_OS}-amd64.gz"

# Make binary executable
chmod +x "argo-${ARGO_OS}-amd64"

# Move binary to path
sudo mv "./argo-${ARGO_OS}-amd64" /usr/local/bin/argo

# Test installation
argo version
```
Authenticate with the cluster
After every reboot, you will need to re-set your KUBECONFIG environment variable so that the argo (and the underlying kubectl) CLI tools can authenticate with the cluster.
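A minimal sketch, assuming your kubeconfig was saved to `~/.kube/ofocluster.yaml` (a hypothetical path; use the location you chose when following the Cluster access guide):

```bash
# Hypothetical kubeconfig path; substitute the file from your Cluster access setup
export KUBECONFIG="$HOME/.kube/ofocluster.yaml"

# Confirm both CLIs can reach the cluster
kubectl get nodes
argo list -n argo
```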
Submit a workflow
Run `argo submit -n argo /path/to/your/workflow.yaml`, optionally adding parameters by appending arguments of the form `-p PARAMETER_NAME=parameter_value`. The parameter value can be an environment variable. For example:
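The workflow file and parameter names below are hypothetical placeholders; your workflow defines its own parameters.

```bash
# Hypothetical workflow file and parameter names, for illustration only
argo submit -n argo ./my-workflow.yaml \
  -p run_name=test-run-01 \
  -p input_path="$INPUT_PATH"   # value taken from an environment variable
```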
Observe and manage workflows
Argo web UI
Access the Argo UI at argo.focal-lab.org. When prompted to log in, supply the client authentication token. You can find the token string in Vaultwarden under the record "Argo UI token".
Command line
Argo provides a CLI to inspect and monitor workflows, but sometimes plain kubectl works well too. For example, here is a snippet that shows all nodes and which Argo pods are running on each.
```bash
# Get all nodes first
kubectl get nodes -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep -v '^$' > /tmp/nodes.txt

# Get all non-DaemonSet pods in the argo namespace
kubectl get pods -n argo -o custom-columns=NODE:.spec.nodeName,NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,OWNER:.metadata.ownerReferences[0].kind --no-headers 2>/dev/null | \
  grep -v DaemonSet > /tmp/pods.txt

# Display results, grouped by node
while read node; do
  echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
  echo "NODE: $node"
  echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
  # Print this node's pods: NAME, NAMESPACE, STATUS
  grep "^$node " /tmp/pods.txt 2>/dev/null | awk '{printf "  %-40s %-20s %s\n", $3, $2, $4}'
  # Flag nodes with no argo pods
  if ! grep -q "^$node " /tmp/pods.txt 2>/dev/null; then
    echo "  (no argo pods)"
  fi
  echo ""
done < /tmp/nodes.txt

# Cleanup
rm /tmp/nodes.txt /tmp/pods.txt
```
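On the Argo-native side, a few commonly used CLI commands (the workflow name below is a hypothetical placeholder):

```bash
# List workflows in the argo namespace
argo list -n argo

# Watch a workflow's live status (replace with your workflow's name)
argo watch -n argo my-workflow-abc12

# Stream logs from all of a workflow's pods
argo logs -n argo my-workflow-abc12 --follow
```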
Autoscaler considerations
Because our cluster is set up to autoscale (adding and removing nodes to match demand), workflow pods running on underutilized nodes could be evicted (killed) and rescheduled on a different node as the autoscaler works to free up and remove the underutilized node. To minimize the probability of this, we have implemented several measures:
- **Pod packing via affinity rules:** Workflow pods prefer to schedule on nodes that already have other running workflow pods. This keeps pods consolidated on fewer nodes, reducing the chance that a node becomes "underutilized" while still running a workflow pod. This is configured via `podAffinity` rules that prefer nodes with pods labeled `workflows.argoproj.io/completed: "false"`.
- **Automatic pod cleanup:** Completed pods are automatically deleted via `podGC: OnPodCompletion`. This prevents finished pods from lingering and confusing the scheduler's affinity decisions.
- **S3 log archiving:** Workflow logs are archived to S3, so we don't need completed pods to stick around for log access.
- **Eviction protection:** All workflow pods are annotated with `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`. This tells the cluster autoscaler to never evict running workflow pods, even if the node appears underutilized. Once pods complete, they are deleted by podGC, which allows the node to scale down normally.
These defaults are configured in the workflow controller configmap (see Argo installation).
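As a rough sketch of how such defaults are expressed (the `workflowDefaults` mechanism is standard Argo; the exact values live in the cluster's actual configmap):

```yaml
# Illustrative excerpt of a workflow-controller-configmap; not the
# cluster's authoritative configuration
data:
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnPodCompletion
      podMetadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    workflows.argoproj.io/completed: "false"
```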
GPU node scheduling
To prevent expensive GPU resources from being consumed by CPU-only workloads, the cluster uses explicit scheduling rules:
How it works:
- CPU nodes are labeled with `workload-type: cpu` based on their nodegroup naming pattern (any node with `cpu` in the name)
- CPU pods use `nodeSelector: workload-type: cpu` to explicitly target CPU nodes
- GPU pods request GPU resources (e.g., `nvidia.com/gpu` or MIG resources), which naturally constrains them to nodes advertising those resources
- All pods still inherit `podAffinity` from the workflow controller configmap to prefer scheduling on nodes with other running pods (to support the autoscaler in removing empty nodes)
This approach ensures CPU pods cannot schedule on GPU nodes, even during the brief period when new GPU nodes join the cluster before NFD labels them.
For admin setup of CPU node labeling, see CPU node labeling.
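For illustration, a hypothetical CPU step template showing where the `nodeSelector` goes (template name and image are placeholders):

```yaml
templates:
  - name: cpu-step                 # hypothetical template name
    nodeSelector:
      workload-type: cpu           # pins this step to CPU-labeled nodes
    container:
      image: ubuntu:22.04          # placeholder image
      command: [bash, -c, "echo running on a CPU node"]
```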
MIG (Multi-Instance GPU)
For workloads with low GPU utilization, MIG nodegroups partition each A100 into multiple isolated slices. To use MIG, create a MIG nodegroup (see MIG nodegroups).
For custom workflows, update the workflow GPU template to request MIG resources:
```yaml
resources:
  requests:
    nvidia.com/mig-2g.10gb: 1   # Instead of nvidia.com/gpu: 1
    cpu: "10"
    memory: "38Gi"
```
For the photogrammetry workflow, configure MIG resources (along with CPU/memory) in your config file's argo section instead of editing the workflow YAML. See Step-based workflow resource configuration for details.
Available MIG resource types:

- `nvidia.com/mig-1g.5gb` - 1/7 compute, 5GB VRAM (7 per GPU)
- `nvidia.com/mig-2g.10gb` - 2/7 compute, 10GB VRAM (3 per GPU)
- `nvidia.com/mig-3g.20gb` - 3/7 compute, 20GB VRAM (2 per GPU)
GPU resource requests work for both MIG and non-MIG GPU nodes.
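To check which GPU or MIG resources the nodes currently advertise, one quick option:

```bash
# Print each node name followed by any nvidia.com/* resource lines
kubectl describe nodes | grep -E '^Name:|nvidia.com/'
```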