Page: 1 / 4
Total 36 questions
Exam Code: NCP-AII                Update: Jun 27, 2026
Exam Name: NVIDIA AI Infrastructure

NVIDIA AI Infrastructure (NCP-AII) Exam Dumps: Updated Questions & Answers (June 2026)

Question # 1

For an NVIDIA Enterprise AI Factory with 256 GPUs, which storage solution characteristic is most critical to validate during scaling tests?

A.

Consistent per-node throughput > 8 GiB/s.

B.

Single-node write performance during idle clusters.

C.

RAID rebuild times under disk failure.

D.

Maximum 4K random read IOPS exceeding 1 million.
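As a rough illustration of how a per-node throughput figure scales with cluster size, the sketch below computes the aggregate bandwidth the storage system would have to sustain if every node pulled 8 GiB/s at once. The value of 8 GPUs per node is an assumption (typical of DGX-class systems) and is not stated in the question; the function name is hypothetical.

```python
def aggregate_throughput_gib_s(total_gpus: int, gpus_per_node: int,
                               per_node_gib_s: float) -> float:
    """Return the aggregate storage throughput needed when every node
    reads at per_node_gib_s simultaneously."""
    if total_gpus % gpus_per_node != 0:
        raise ValueError("total_gpus must be a multiple of gpus_per_node")
    nodes = total_gpus // gpus_per_node
    return nodes * per_node_gib_s


if __name__ == "__main__":
    # Assumed layout: 256 GPUs spread over nodes with 8 GPUs each.
    need = aggregate_throughput_gib_s(256, 8, 8.0)
    print(f"Aggregate throughput to validate: {need} GiB/s")
```

A scaling test would compare the measured per-node figure at 1, 2, 4, ... nodes against this target to see whether throughput stays consistent as nodes are added.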

Question # 2

A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?

A.

Enable remote access to the BMC over the internet using the default admin credentials for initial troubleshooting.

B.

Connect the BMC port directly to the production network and retain default admin credentials for convenience.

C.

Leave the BMC port disconnected until after the operating system is fully configured and in production.

D.

Connect the BMC port to a dedicated and firewalled network and change the default admin credentials.

Question # 3

An administrator is configuring node categories in BCM for a DGX BasePOD cluster. They need to group all NVIDIA DGX H200 nodes under a dedicated category for GPU-accelerated workloads. Which approach aligns with NVIDIA's recommended BCM practices?

A.

Assign nodes to the "login" category to simplify Slurm integration.

B.

Create a new "dgx-h200" category and assign all DGX H200 nodes to it.

C.

Use the existing "dgxnodes" category without modification, as it is preconfigured for all DGX systems.

D.

Avoid categories and configure each DGX node individually via CLI.

Question # 4

An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?

A.

Disable Secure Boot permanently in BIOS/UEFI settings.

B.

Delete /etc/X11/xorg.conf to force driver reconfiguration.

C.

Enroll the Machine Owner Key (MOK) during system reboot and enter the recorded password.

D.

Reinstall drivers using apt-get install nvidia-driver-550 without rebooting.

Question # 5

After a firmware upgrade on a DGX H100, the administrator notices that one GPU is not detected by the system. Which troubleshooting step should be performed first to identify the root cause?

A.

Review firmware update logs and run nvsm show health to check for hardware or firmware errors on the affected GPU.

B.

Remove the GPU from the system and replace it with a new one before any diagnostics.

C.

Ignore the issue and proceed with production workloads if the other GPUs are operational.

D.

Immediately re-run the firmware upgrade on all system components.

Question # 6

During HPL execution on a DGX cluster, the benchmark fails with "not enough memory" errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?

A.

Reduce the problem size while maintaining the same block size.

B.

Set PMAP to 1 to enable process mapping.

C.

Increase block size to 6144 to maximize GPU utilization.

D.

Disable double-buffering via BCAST parameter.
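For context on how problem size relates to memory, HPL factors a dense N x N matrix of 8-byte doubles, so the matrix alone needs about 8·N² bytes. The sketch below estimates the largest N that fits in a given memory pool, rounded down to a multiple of the block size NB. The 80% usable fraction is a commonly quoted rule of thumb rather than an NVIDIA figure, and the 8 x 80 GiB memory pool in the example is an assumed configuration; the function name is hypothetical.

```python
import math


def max_hpl_problem_size(mem_bytes: int, block_size: int,
                         usable_percent: int = 80) -> int:
    """Largest N (a multiple of block_size) such that an N x N matrix
    of 8-byte doubles fits in usable_percent of mem_bytes."""
    usable = mem_bytes * usable_percent // 100
    n = math.isqrt(usable // 8)          # 8 bytes per double-precision element
    return (n // block_size) * block_size


if __name__ == "__main__":
    gib = 2 ** 30
    # Assumed pool: 8 GPUs with 80 GiB each; NB = 1024 is illustrative.
    n = max_hpl_problem_size(8 * 80 * gib, block_size=1024)
    print(f"Suggested upper bound for N: {n}")
```

If a run reports memory errors, lowering N below this bound while leaving NB unchanged shrinks the footprint quadratically.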

Question # 7

An infrastructure engineer runs an NCCL burn-in on an eight-node GPU cluster. Over a 12-hour period, all GPUs are tested with repeated all-reduce collectives. Monitoring tools show the following observations:

Aggregate bandwidth remains within 5% of documented reference for the hardware on every run.

No errors or timeouts are reported in NCCL logs.

On three occasions, one GPU logged single-run bandwidth dips of 15–20% compared to its normal performance, but performance recovered on the next run and stayed stable afterward. System logs show no hardware or driver errors.

Two minor NCCL WARN-level messages about “unexpected latency spike” appear in system logs for separate nodes, but could not be reproduced.

Which course of action is best before releasing the cluster to production?

A.

Proceed, since all bandwidth targets are met, issues were transient and self-resolved, and there are no persistent errors or timeouts across repeated burn-ins.

B.

Recommend proactive maintenance, because any bandwidth drop, even if transient and unreproducible, shows the burn-in failed; clusters must not show performance variance above 10% for any GPU even once.

C.

Approve for AI workload use, but flag affected nodes for manual exclusion from distributed training jobs, as nodes showing any anomaly should be isolated whenever possible.
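The percentages in this scenario can be checked mechanically against burn-in logs. The sketch below flags runs whose bandwidth falls more than a chosen percentage below a baseline and reports whether each dip recovered on the following run. The sample values, the baseline, and the 10% threshold are hypothetical illustration data, not measurements from the question.

```python
def find_dips(samples, baseline, threshold_pct):
    """Return (run_index, drop_percent, recovered) for every run whose
    bandwidth is more than threshold_pct below baseline."""
    dips = []
    for i, bw in enumerate(samples):
        drop = (baseline - bw) / baseline * 100
        if drop > threshold_pct:
            # A dip counts as recovered if the next run is back within threshold.
            nxt = samples[i + 1] if i + 1 < len(samples) else None
            recovered = (nxt is not None
                         and (baseline - nxt) / baseline * 100 <= threshold_pct)
            dips.append((i, round(drop, 1), recovered))
    return dips


if __name__ == "__main__":
    # Hypothetical per-run all-reduce bandwidth for one GPU, in GB/s.
    runs = [398.0, 402.0, 332.0, 399.0, 401.0]
    for index, drop, recovered in find_dips(runs, baseline=400.0, threshold_pct=10):
        state = "recovered" if recovered else "persistent"
        print(f"run {index}: {drop}% below baseline ({state})")
```

Separating transient dips from persistent ones in this way gives the release decision something concrete to reference.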

Question # 8

A DGX H100 system shows intermittent “Link Down” errors on a 200G DAC cable. CVT reports “No Signal” despite physical connection. What is the first hardware check?

A.

Replace the switch’s optical transceiver with a higher-wattage model.

B.

Reconfigure the port for 100G speeds via NVIDIA MST.

C.

Upgrade all leaf switches to support RS-FEC.

D.

Verify cable compatibility via the ConnectX-7 firmware validated adapters list and inspect connectors for damage.

Question # 9

A system administrator wants to configure MIG for seven slices on an H100 GPU in an NVIDIA HGX system. Which command should be used?

A.

mig-parted

B.

nvidia-smi

C.

nvcc

D.

nvlink-config

Question # 10

A user wants to restrict a Docker container to use only GPUs 0 and 2. Which command achieves this?

A.

docker run --gpus '"device=0,2"' nvidia/cuda:12.1-base nvidia-smi

B.

docker run -e NVIDIA_VISIBLE_DEVICES=0,2 nvidia/cuda:12.1-base nvidia-smi

C.

docker run --gpus all nvidia/cuda:12.1-base nvidia-smi -id=0,2

D.

docker run --device /dev/nvidia0,/dev/nvidia2 nvidia/cuda:12.1-base nvidia-smi
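The nested quoting in the `--gpus` form is easy to get wrong. When it is typed in a shell, the outer single quotes keep the inner double quotes intact, so Docker receives the value `"device=0,2"` with the double quotes included. The sketch below builds the argument list in Python and renders the equivalent shell line with `shlex.join`. The helper name is hypothetical, and the image tag is copied from the question rather than verified against a registry.

```python
import shlex


def docker_gpu_command(image, gpu_ids, command):
    """Build a `docker run` argv that exposes only the listed GPU indices."""
    devices = ",".join(str(i) for i in gpu_ids)
    # The double quotes are part of the value passed to --gpus.
    return ["docker", "run", "--gpus", f'"device={devices}"', image, *command]


if __name__ == "__main__":
    argv = docker_gpu_command("nvidia/cuda:12.1-base", [0, 2], ["nvidia-smi"])
    # shlex.join adds the outer single quotes needed for a shell.
    print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` would skip the shell entirely, which is why the inner quotes are stored in the argument itself.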
