
Exam Code: Databricks-Certified-Professional-Data-Engineer                Update: Oct 31, 2025
Exam Name: Databricks Certified Data Engineer Professional Exam

Databricks Certified Data Engineer Professional (Databricks-Certified-Professional-Data-Engineer) Exam Dumps: Updated Questions & Answers (October 2025)

Question # 1

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

A.

Regex

B.

Julia

C.

pyspark.ml.feature

D.

Scala Datasets

E.

C++
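
As an illustration of option A, a regular expression can pull key fields out of a driver log4j line. The snippet below is a minimal sketch; the sample line and pattern are assumptions about a typical log4j layout, not an exact Databricks format.

    import re

    # Hypothetical driver log4j line; the exact layout varies by logging configuration.
    log_line = "24/10/31 12:05:17 ERROR TaskSchedulerImpl: Lost executor 3 on 10.0.0.12: worker lost"

    # Capture timestamp, level, component, and message with a regular expression.
    pattern = re.compile(r"^(\S+ \S+) (\w+) ([\w.$]+): (.*)$")

    match = pattern.match(log_line)
    if match:
        timestamp, level, component, message = match.groups()
        print(level, component, message)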

Question # 2

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

A.

Can Manage

B.

Can Edit

C.

Can Run

D.

Can Read

Question # 3

Given an error traceback raised by display(df.select(3*"heartrate")) and showing AnalysisException: cannot resolve 'heartrateheartrateheartrate', which statement describes the error being raised?

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.
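
For reference, a short sketch of why this traceback appears: in Python, multiplying a string repeats it, so 3*"heartrate" produces the literal column name heartrateheartrateheartrate, which does not exist in the table. The DataFrame below is a stand-in created only for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(60,), (72,)], ["heartrate"])  # illustrative data

    # In Python, 3 * "heartrate" is string repetition, not column arithmetic:
    print(3 * "heartrate")  # heartrateheartrateheartrate

    # So df.select(3 * "heartrate") looks up a column with that literal name and the
    # analyzer raises AnalysisException: cannot resolve 'heartrateheartrateheartrate'.

    # Multiplying the Column object is what was presumably intended:
    df.select((col("heartrate") * 3).alias("heartrate_x3")).show()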

Question # 4

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

A.

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.

B.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.

C.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

D.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

E.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.
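
As a sketch of the benchmarking concern raised in option D, the snippet below times full evaluation of a transformation chain by writing to Spark's noop sink, which forces the job to run without retaining or displaying results; the transformation itself is a placeholder for the real pipeline logic.

    import time
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Placeholder transformation chain standing in for the pipeline logic under test.
    df = spark.range(10_000_000).withColumn("squared", col("id") * col("id"))

    # The noop sink forces full evaluation of the plan but discards the output,
    # avoiding the partial-result and caching effects of repeated display() calls.
    start = time.time()
    df.write.format("noop").mode("overwrite").save()
    print(f"Full evaluation took {time.time() - start:.2f}s")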

Question # 5

A Delta Lake table with Change Data Feed (CDF) enabled in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

A.

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

B.

Modify the overwrite logic to include a field populated by calling current_timestamp() as data are being written; use this field to identify records written on a particular date.

C.

Replace the current overwrite logic with a MERGE statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the Change Data Feed.

D.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
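
To make the trade-off in these options concrete, here is a minimal sketch of the MERGE-plus-Change-Data-Feed pattern described in option C; the table names, join key, staging source, and timestamp cutoff are illustrative assumptions.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical staging DataFrame holding the latest values from upstream sources.
    updates_df = spark.table("staging.customer_churn_updates")

    # Upsert changed records instead of overwriting the whole table nightly.
    target = DeltaTable.forName(spark, "customer_churn_params")
    (target.alias("t")
           .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

    # With CDF enabled, only the rows changed since a given point need to be scored.
    changes = (spark.read.format("delta")
                    .option("readChangeFeed", "true")
                    .option("startingTimestamp", "2025-10-30 00:00:00")  # illustrative 24-hour cutoff
                    .table("customer_churn_params"))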

Question # 6

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

A.

Workspace audit logs

B.

Driver's log file

C.

Ganglia

D.

Cluster Event Log

E.

Executor's log file

Question # 7

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each micro-batch of data is processed in less than 3 seconds; at least 12 times per minute, a micro-batch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

A.

Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.

B.

Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.

C.

Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to the maximum allowable threshold should minimize this cost.

D.

Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.
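
For reference, the snippet below contrasts the default trigger with a trigger-once / available-now configuration, which is the behavior several of these options hinge on; the rate source and paths are placeholders used only so the example is self-contained.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder streaming source; the rate source simply generates rows for demonstration.
    df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Default trigger: each micro-batch starts as soon as the previous one finishes, so the
    # source and checkpoint storage are polled many times per minute even when no data arrives.
    always_on = (df.writeStream
                   .format("delta")
                   .option("checkpointLocation", "/tmp/checkpoints/always_on")  # placeholder path
                   .start("/tmp/output/always_on"))                             # placeholder path

    # Trigger once / available now: process whatever has arrived, then stop; scheduling this as
    # a Databricks job every 10 minutes bounds storage API calls while meeting the latency target.
    scheduled = (df.writeStream
                   .format("delta")
                   .option("checkpointLocation", "/tmp/checkpoints/scheduled")
                   .trigger(availableNow=True)
                   .start("/tmp/output/scheduled"))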

Question # 8

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.

Which consideration will impact the decisions made by the engineer while migrating this workload?

A.

All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.

B.

Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.

C.

Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.

D.

Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.
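
As a small illustration of the constraint behavior this question turns on: Delta Lake enforces NOT NULL and CHECK constraints on individual tables at write time, but referential (foreign-key) integrity across tables is not enforced, so it must be handled by the pipeline logic. The table and constraint names below are purely illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Single-table constraints that Delta Lake does enforce on write.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_fact (
            sale_id BIGINT NOT NULL,
            customer_id BIGINT
        ) USING DELTA
    """)
    spark.sql("ALTER TABLE sales_fact ADD CONSTRAINT has_customer CHECK (customer_id IS NOT NULL)")

    # There is no enforced foreign key, so validating that customer_id exists in the
    # customers dimension table is the responsibility of the write logic itself.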

Question # 9

Review the following error traceback, raised by calling display(df.select(3*"heartrate")):

AnalysisException: cannot resolve 'heartrateheartrateheartrate'

Which statement describes the error being raised?

A.

The code executed was PySpark but was executed in a Scala notebook.

B.

There is no column in the table named heartrateheartrateheartrate

C.

There is a type error because a column object cannot be multiplied.

D.

There is a type error because a DataFrame object cannot be multiplied.

E.

There is a syntax error because the heartrate column is not correctly identified as a column.

Question # 10

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already-configured cluster?

A.

"Can Manage" privileges on the required cluster

B.

Workspace Admin privileges, cluster creation allowed, and "Can Attach To" privileges on the required cluster

C.

Cluster creation allowed and "Can Attach To" privileges on the required cluster

D.

"Can Restart" privileges on the required cluster

E.

Cluster creation allowed and "Can Restart" privileges on the required cluster

