Integrate with InferenceService

This page shows how to leverage the Alauda Build of Kueue's scheduling and resource management capabilities when running an InferenceService in Alauda AI.

Prerequisites

  • You have installed Alauda AI.
  • You have installed the Alauda Build of Kueue.
  • You have installed the Alauda Build of HAMi (for demonstrating vGPU).
  • The Alauda Container Platform Web CLI can communicate with your cluster.
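
Optionally, verify that the Kueue CRDs are installed before you begin. The names below are the upstream Kueue CRD names; adjust them if your build differs:

    kubectl get crd clusterqueues.kueue.x-k8s.io localqueues.kueue.x-k8s.io resourceflavors.kueue.x-k8s.io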

Procedure

  1. Create a project and a namespace in Alauda Container Platform; for example, a project named test with a namespace named test-1.
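
    Optionally, confirm from the Web CLI that the namespace exists (a quick sanity check; the name matches the example above):

    kubectl get namespace test-1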

  2. Switch to Alauda AI, click Namespace Manage in Admin > Namespace Management, and select the previously created namespace to bring it under Alauda AI management.

  3. Create the ClusterQueue, ResourceFlavor, and LocalQueue resources by running the following command:

    cat <<EOF | kubectl create -f -
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cluster-queue
    spec:
      namespaceSelector: {}
      resourceGroups:
      - coveredResources: ["cpu", "memory", "pods", "ephemeral-storage", "nvidia.com/gpualloc", "nvidia.com/total-gpucores", "nvidia.com/total-gpumem"]
        flavors:
        - name: "default-flavor"
          resources:
          - name: "cpu"
            nominalQuota: 9
          - name: "memory"
            nominalQuota: 36Gi
          - name: "pods"
            nominalQuota: 5
          - name: "nvidia.com/gpualloc"
            nominalQuota: "2"
          - name: "nvidia.com/total-gpucores"
            nominalQuota: "50"
          - name: "nvidia.com/total-gpumem"
            nominalQuota: "20000"
          - name: "ephemeral-storage"
            nominalQuota: 100Gi
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: default-flavor
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: test-1
      name: test
    spec:
      clusterQueue: cluster-queue
    EOF
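
    Optionally, verify that the resources were created and that the LocalQueue is bound to the ClusterQueue:

    kubectl get clusterqueue cluster-queue
    kubectl -n test-1 get localqueue test
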
  4. Create an InferenceService resource in the Alauda AI UI with the label kueue.x-k8s.io/queue-name: test:

    kind: InferenceService
    apiVersion: serving.kserve.io/v1beta1
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: test
      name: test
      namespace: test-1
    # ...
    spec:
      predictor:
        model:
          resources:
            limits:
              cpu: '1'
              ephemeral-storage: 10Gi
              memory: 2Gi
              nvidia.com/gpualloc: '1'
              nvidia.com/gpucores: '80'
              nvidia.com/gpumem: 8k
    # ...
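
    Kueue represents the pods it manages as Workload objects. Assuming the pod integration is enabled in your Kueue build, you can list the Workload created for the predictor:

    kubectl -n test-1 get workloads
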
  5. Observe the pods of the InferenceService:

    kubectl -n test-1 get pod | grep test-predictor

    You will see that the pod is in a SchedulingGated state, because the requested nvidia.com/gpucores (80) exceeds the ClusterQueue's nvidia.com/total-gpucores quota (50) and Kueue therefore does not admit the workload:

    test-predictor-8475554f4d-zw7lp   0/1     SchedulingGated   0          13s
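
    To confirm that the pod is held back by quota rather than by node capacity, you can describe the Workload that Kueue created for it. The workload name below is a placeholder; take the real one from the earlier get workloads output:

    kubectl -n test-1 describe workload <workload-name>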
    
  6. Update the nvidia.com/total-gpucores quota from 50 to 100 so it covers the request:

    cat <<EOF | kubectl apply -f -
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cluster-queue
    spec:
      namespaceSelector: {}
      resourceGroups:
      - coveredResources: ["cpu", "memory", "pods", "ephemeral-storage", "nvidia.com/gpualloc", "nvidia.com/total-gpucores", "nvidia.com/total-gpumem"]
        flavors:
        - name: "default-flavor"
          resources:
          - name: "cpu"
            nominalQuota: 9
          - name: "memory"
            nominalQuota: 36Gi
          - name: "pods"
            nominalQuota: 5
          - name: "nvidia.com/gpualloc"
            nominalQuota: "2"
          - name: "nvidia.com/total-gpucores"
            nominalQuota: "100"
          - name: "nvidia.com/total-gpumem"
            nominalQuota: "20000"
          - name: "ephemeral-storage"
            nominalQuota: 100Gi
    EOF
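
    Optionally, confirm that the updated quota took effect:

    kubectl get clusterqueue cluster-queue -o yaml | grep -A1 total-gpucores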

    Once the quota covers the request, Kueue admits the workload and removes the scheduling gate. You will see that the pod transitions to a Running state:

    test-predictor-8475554f4d-zw7lp   1/1     Running   0          13s
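
    You can also check the ClusterQueue status; with the upstream Kueue API, the admittedWorkloads counter should now include this workload:

    kubectl get clusterqueue cluster-queue -o jsonpath='{.status.admittedWorkloads}'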