Eliminate Risk of Failure with Databricks Certified Associate Developer for Apache Spark 3.0 Exam Dumps

Schedule your time wisely to provide yourself sufficient time each day to prepare for the Databricks Certified Associate Developer for Apache Spark 3.0 exam. Make time each day to study in a quiet place, as you'll need to thoroughly cover the material for the Databricks Certified Associate Developer for Apache Spark 3.0 exam. Our actual Apache Spark Associate Developer exam dumps help you in your preparation. Prepare for the Databricks Certified Associate Developer for Apache Spark 3.0 exam with our Databricks Certified Associate Developer for Apache Spark 3.0 dumps every day if you want to succeed on your first try.

GET UNLIMITED ACCESS

All Study Materials

Instant Downloads

24/7 costomer support

Satisfaction Guaranteed

Q1.

Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?

AtransactionsDf.select('storeId').dropDuplicates().count()

BtransactionsDf.select(count('storeId')).dropDuplicates()

CtransactionsDf.select(distinct('storeId')).count()

DtransactionsDf.dropDuplicates().agg(count('storeId'))

EtransactionsDf.distinct().select('storeId').count()

Answer: A

See the explanation below.

transactionsDf.select('storeId').dropDuplicates().count()

Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.

transactionsDf.select(count('storeId')).dropDuplicates()

No. transactionsDf.select(count('storeId')) just returns a single-row DataFrame showing the number of non-null rows. dropDuplicates() does not have any effect in this context.

transactionsDf.dropDuplicates().agg(count('storeId'))

Incorrect. While transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, it does not do so taking only column storeId into consideration, but eliminates full row duplicates

instead.

transactionsDf.distinct().select('storeId').count()

Wrong. transactionsDf.distinct() identifies unique rows across all columns, but not only unique rows with respect to column storeId. This may leave duplicate values in the column, making the count

not represent the number of unique values in that column.

transactionsDf.select(distinct('storeId')).count()

False. There is no distinct method in pyspark.sql.functions.

Q2.

The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before

2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.

Schema:

1. root

2. |-- itemId: integer (nullable = true)

3. |-- attributes: array (nullable = true)

4. | |-- element: string (containsNull = true)

5. |-- supplier: string (nullable = true)

Code block:

1. schema = StructType([

2. StructType("itemId", IntegerType(), True),

3. StructType("attributes", ArrayType(StringType(), True), True),

4. StructType("supplier", StringType(), True)

5. ])

7. spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

AThe attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.

BColumns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.

CThe data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.

DColumns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.

EColumns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.

Answer: D

See the explanation below.

Correct code block:

schema = StructType([

StructField('itemId', IntegerType(), True),

StructField('attributes', ArrayType(StringType(), True), True),

StructField('supplier', StringType(), True)

])

spark.read.options(modifiedBefore='2029-03-20T05:44:46').schema(schema).parquet(filePath)

This Question: is more difficult than what you would encounter in the exam. In the exam, for this Question: type, only one error needs to be identified and not 'one or multiple' as in the

question.

Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.

Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways

of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below). So, nesting StructType and StructType as shown in the Question: is

wrong.

The modification date threshold should be specified by a keyword argument like options(modifiedBefore='2029-03-20T05:44:46') and not two consecutive non-keyword arguments as in the original

code block (see documentation linked below).

Spark cannot identify the file format correctly, because either it has to be specified by using the DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling, for

example, DataFrameReader.parquet().

Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.

No. If StructField would be used for the columns instead of StructType (see above), the third argument specified whether the column is nullable. The original schema shows that columns should be

nullable and this is specified correctly by the third argument being True in the schema in the code block.

It is correct, however, that the modification date threshold is specified incorrectly (see above).

The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.

Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). That Spark cannot identify the file format is correct, see correct answer

above. In addition, the DataFrameReader is called correctly through the SparkSession spark.

Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.

Incorrect, the object types in the schema definition are correct and syntax of the call to Spark's DataFrameReader is correct.

The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.

False. The data type of the schema is StructType and an accepted data type for the DataFrameReader.schema() method. It is correct however that the modification date threshold is specified

incorrectly (see correct answer above).

Q3.

The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame

itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to

accomplish this.

__1__.__2__(__3__, __4__, __5__)

A1. transactionsDf
2. join
3. broadcast(itemsDf)
4. transactionsDf.transactionId==itemsDf.transactionId
5. 'outer'

B1. transactionsDf
2. join
3. itemsDf
4. transactionsDf.transactionId==itemsDf.transactionId
5. 'anti'

C1. transactionsDf
2. join
3. broadcast(itemsDf)
4. 'transactionId'
5. 'left_semi'

D1. itemsDf
2. broadcast
3. transactionsDf
4. 'transactionId'
5. 'left_semi'

E1. itemsDf
2. join
3. broadcast(transactionsDf)
4. 'transactionId'
5. 'left_semi'

Answer: C

See the explanation below.

Correct code block:

transactionsDf.join(broadcast(itemsDf), 'transactionId', 'left_semi')

This Question: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that 'the query should be executed in an optimized way'. You also have qualitative information about the size of itemsDf and

transactionsDf. Given that itemsDf is 'very small' and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the 'very small'

DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard

it. Another answer option wraps the broadcast() operator around transactionsDf - the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can

likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([...]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An

outer join would include columns from both DataFrames, where a left semi join only includes columns from the 'left' table, here transactionsDf, just as asked for by the question. So, the correct

answer is the one that uses the left_semi join.

Q4.

The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code

block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__

A1. select
2. 'storeId'
3. print_schema()

B1. limit
2. 1
3. columns

C1. select
2. 'storeId'
3. printSchema()

D1. limit
2. 'storeId'
3. printSchema()

E1. select
2. storeId
3. dtypes

Answer: B

See the explanation below.

Correct code block:

transactionsDf.select('storeId').printSchema()

The difficulty of this Question: is that it is hard to solve with the stepwise first-to-last-gap approach that has worked well for similar questions, since the answer options are so different from

one

another. Instead, you might want to eliminate answers by looking for patterns of frequently wrong answers.

A first pattern that you may recognize by now is that column names are not expressed in quotes. For this reason, the answer that includes storeId should be eliminated.

By now, you may have understood that the DataFrame.limit() is useful for returning a specified amount of rows. It has nothing to do with specific columns. For this reason, the answer that resolves to

limit('storeId') can be eliminated.

Given that we are interested in information about the data type, you should Question: whether the answer that resolves to limit(1).columns provides you with this information. While

DataFrame.columns is a valid call, it will only report back column names, but not column types. So, you can eliminate this option.

The two remaining options either use the printSchema() or print_schema() command. You may remember that DataFrame.printSchema() is the only valid command of the two. The select('storeId')

part just returns the storeId column of transactionsDf - this works here, since we are only interested in that column's type anyways.

More info: pyspark.sql.DataFrame.printSchema --- PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, Question: 57 (Databricks import instructions)

Q5.

Which of the following code blocks writes DataFrame itemsDf to disk at storage location filePath, making sure to substitute any existing data at that location?

AitemsDf.write.mode('overwrite').parquet(filePath)

BitemsDf.write.option('parquet').mode('overwrite').path(filePath)

CitemsDf.write(filePath, mode='overwrite')

DitemsDf.write.mode('overwrite').path(filePath)

EitemsDf.write().parquet(filePath, mode='overwrite')

Answer: A

See the explanation below.

itemsDf.write.mode('overwrite').parquet(filePath)

Correct! itemsDf.write returns a pyspark.sql.DataFrameWriter instance whose overwriting behavior can be modified via the mode setting or by passing mode='overwrite' to the parquet() command.

Although the parquet format is not prescribed for solving this question, parquet() is a valid operator to initiate Spark to write the data to disk.

itemsDf.write.mode('overwrite').path(filePath)

No. A pyspark.sql.DataFrameWriter instance does not have a path() method.

itemsDf.write.option('parquet').mode('overwrite').path(filePath)

Incorrect, see above. In addition, a file format cannot be passed via the option() method.

itemsDf.write(filePath, mode='overwrite')

Wrong. Unfortunately, this is too simple. You need to obtain access to a DataFrameWriter for the DataFrame through calling itemsDf.write upon which you can apply further methods to control how

Spark data should be written to disk. You cannot, however, pass arguments to itemsDf.write directly.

itemsDf.write().parquet(filePath, mode='overwrite')

False. See above.

More info: pyspark.sql.DataFrameWriter.parquet --- PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, Question: 56 (Databricks import instructions)

Are You Looking for More Updated and Actual Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions?

If you want a more premium set of actual Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions then you can get them at the most affordable price. Premium Apache Spark Associate Developer exam questions are based on the official syllabus of the Databricks Certified Associate Developer for Apache Spark 3.0 exam. They also have a high probability of coming up in the actual Databricks Certified Associate Developer for Apache Spark 3.0 exam.
You will also get free updates for 90 days with our premium Databricks Certified Associate Developer for Apache Spark 3.0 exam. If there is a change in the syllabus of Databricks Certified Associate Developer for Apache Spark 3.0 exam our subject matter experts always update it accordingly.

GET Databricks Certified Associate Developer for Apache Spark 3.0 EXAM PREMIUM ACCESS