You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:ai.
To automate repetitive administrative tasks and efficiently manage resources across multiple nodes, which of the following is essential when using the Run:ai Administrator CLI for environments where automation or scripting is required?
Which two (2) platforms should be used with Fabric Manager? (Choose two.)
A system administrator notices that jobs are failing intermittently on Base Command Manager due to incorrect GPU configurations in Slurm. The administrator needs to ensure that jobs utilize GPUs correctly.
How should they troubleshoot this issue?
A Slurm user needs to submit a batch job script for execution tomorrow.
Which command should be used to complete this task?
An administrator is troubleshooting a bottleneck in a deep learning runtime and needs consistent data feed rates to GPUs.
Which storage metric should be used?
You are a Solutions Architect designing a data center infrastructure for a cloud-based AI application that requires high-performance networking, storage, and security. You need to choose a software framework to program the NVIDIA BlueField DPUs that will be used in the infrastructure. The framework must support the development of custom applications and services, as well as enable tailored solutions for specific workloads. Additionally, the framework should allow for the integration of storage services such as NVMe over Fabrics (NVMe-oF) and elastic block storage.
Which framework should you choose?
You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?
A Slurm user needs to display real-time information about the running processes and resource usage of a Slurm job.
Which command should be used?
A Fleet Command system administrator wants to create an organization user that will have the following rights:
For Locations - read only
For Applications - read/write/admin
For Deployments - read/write/admin
For Dashboards - read only
What role should the system administrator assign to this user?
A new researcher needs access to GPU resources but should not have permission to modify cluster settings or manage other users.
What role should you assign them in Run:ai?