Spark Tutorial - SQL over dataframes

Welcome back to Apache Spark tutorials at Learning Journal.

In the earlier videos, we started our discussion on Spark Data frames.

By now, you must have realized that

all that you need to learn is to model your business requirements using Spark Transformations.

Once you learn to write the transformation that meets your business requirement,

you have almost completed the Apache Spark foundation course.

You can continue learning Spark internals, tuning, optimizations,

and other things like streaming and machine learning.

However, modeling your business requirement into a series of transformations

is the most critical part of Spark development.

It is like learning SQL.

Once you know SQL, you can claim to be a database developer.

Similarly, once you master the transformations, you can claim to be a Spark Developer.

Moreover, if you know SQL, you are already a good Spark developer.

That is the topic of this video.

In this video, we will augment our Dataframe knowledge with our SQL skills.

So, Let's start.

You have already seen some transformation code earlier.

We used the read API to load the data from a CSV file.

Then we applied a select transformation and a filter condition.
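
Just to recap, the code looked roughly like this. This is only a sketch; the file path and the column names are placeholders for the survey data set we used.

    val df = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/path/to/survey.csv")

    // Select a few columns and filter the rows (placeholder column names)
    val usDF = df.select("Age", "Gender", "Country")
      .filter("Country = 'United States'")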

If you are a database developer, you will see the above transformation as an SQL expression.

To make this SQL work, all you need is a table and an SQL execution engine.

The good news is that Spark offers you both of these things.

How? We will see that in a minute.

Before that, let's look at the other things that we did in the earlier video.

We wanted to create a bar chart as shown here,

and for that purpose, we did several things.

We created a user-defined function.

Then we performed three different transformations.

We applied the UDF in a select transformation.

Then we applied a filter on the gender values.

And finally, we applied a groupBy and calculated the count for each gender.

Assuming you are good at writing SQL,

if I allowed you to do all of that using a SQL statement,

you would be able to do it as a single SQL expression.
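
Just as a rough sketch, assuming we had a view named survey_tbl over the same data (we will see how to create such a view in a minute), it could look like this. The CASE logic here is only a simplified stand-in for the actual UDF logic.

    val genderCountDF = spark.sql("""
      SELECT parsed_gender, count(*) AS total
      FROM (
        SELECT CASE
                 WHEN lower(Gender) LIKE 'm%' THEN 'Male'
                 WHEN lower(Gender) LIKE 'f%' THEN 'Female'
                 ELSE 'Other'
               END AS parsed_gender
        FROM survey_tbl
      ) AS t
      GROUP BY parsed_gender
      """)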

I hope you already get a sense of the point that I want to make.

SQL is an excellent tool for a number of transformation requirements.

Most of us are already skilled and comfortable with SQL.

And for that reason, Apache Spark allows us to use SQL over a Dataframe.

Before we execute this SQL in Spark, let's talk a little about the schema.

A schema is nothing more than a definition for the column names and their data types. Right?

In our earlier example, we allowed the API to infer the schema.

However, there are two approaches to handling a schema.

One, let the data source define the schema and infer it from the source.

Two, define a schema explicitly in your program and read the data using your schema definition.

When your source system offers a well-defined schema,

schema inference is a reasonable choice.

However, it is a good idea to define your schema manually while working with untyped sources

like CSV and JSON.

In our current example, we are loading data from a CSV file.

So the recommendation is to define the schema explicitly instead of relying on inferSchema.

In my earlier video, I said that Spark is a programming language in itself.

The Spark type system is the main reason behind that statement.

Apache Spark maintains its own type information,

and they designed data frames to use Spark types.

What does that mean?

That means data frames do not use Scala types or Python types.

No matter which language you are using for your code,

the Spark data frame API always uses Spark types.

And I believe that was a design decision to bring SQL over data frames uniformly across languages.

You can get the list of Spark types from the org.apache.spark.sql.types package.
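
For example, here are a few of the names you will find in that package.

    import org.apache.spark.sql.types._

    // Some of the commonly used types in this package:
    // StringType, IntegerType, LongType, DoubleType, BooleanType,
    // DateType, TimestampType, ArrayType, MapType, StructField, StructType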

Look at the names. Do they look difficult to remember?

No, right?

Great. So we are all set with the theoretical fundamentals.

Let's do something practical.

Let me load the data with inferSchema as true.

The Spark API must have inferred the schema. Right?

Let us see what it inferred.
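
Here is a rough sketch of those two steps; the file path is a placeholder.

    val df = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/path/to/survey.csv")

    // Ask the data frame for the schema that was inferred
    df.printSchema()
    df.schema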

Excellent. So, Spark data frame schema is a StructType

that contains a set of StructFields.

Each StructField defines a column.

Let me quickly show you the documentation.

So, the StructField is a serializable class under Scala AnyRef.

The constructor can take four values:

The name of the column.

The data type of the column.

A boolean that tells whether the field is nullable. This parameter defaults to true.

And finally, some optional metadata for the column. The metadata is nothing but a map of key-value pairs, and it defaults to an empty map.

The StructType is also a class that holds an array of StructFields.

Both of these classes exist in Python as well.

However, the Python StructType is a list of StructFields,

whereas the Scala StructType is an array of StructFields.
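
Here is a small illustration in Scala; the column names and the metadata are placeholders.

    import org.apache.spark.sql.types._

    // name, data type, nullable (defaults to true), metadata (defaults to empty)
    val ageField = StructField("age", IntegerType)
    val genderField = StructField("gender", StringType, nullable = true)

    // A column with some explicit metadata attached
    val countryField = StructField("country", StringType, true,
      new MetadataBuilder().putString("desc", "country of the respondent").build())

    // A StructType holds an array of StructFields
    val miniSchema = StructType(Array(ageField, genderField, countryField))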

Great. Now we know that the load method has inferred this schema.

Everything looks perfectly fine. I just want to change these capital letters to lowercase.

Just to keep it consistent.

I need to place these field names in double quotes to make them strings.

I should also make sure that they go in as an array of fields.

Good. My schema definition is ready.

You can use this syntax to create a schema and load your data using an explicit schema.

Import all the Spark types.

Create your schema.
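
Something along these lines should work. The column names and the file path below are only placeholders for the survey data.

    import org.apache.spark.sql.types._

    // Placeholder schema definition for the survey data
    val surveySchema = StructType(Array(
      StructField("timestamp", StringType),
      StructField("age", IntegerType),
      StructField("gender", StringType),
      StructField("country", StringType)
    ))

    val surveyDF = spark.read
      .format("csv")
      .option("header", "true")
      .schema(surveySchema)
      .load("/path/to/survey.csv")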

The above code is almost the same as earlier.

I just added a schema method call to the API chain and removed the inferSchema option.

Now we have a data frame. How do we execute SQL on this data frame?

Well, we have to convert this data frame into a table or a view.

Apache Spark allows you to create a temporary view using a data frame.

It is just like a view in a database.

Once you have a view, you can execute SQL on that view.

Spark offers four data frame methods to create a view.
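
Here they are. The view name below is just a placeholder, and surveyDF is the data frame that we created above.

    surveyDF.createTempView("view_name")                 // throws an error if the view already exists
    surveyDF.createOrReplaceTempView("view_name")
    surveyDF.createGlobalTempView("view_name")           // throws an error if the view already exists
    surveyDF.createOrReplaceGlobalTempView("view_name")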

As you can guess by just looking at the method names, there are two types of temporary views.

A temporary view and a global temporary view.

We can also refer to them as a local temporary view and a global temporary view.

Let's try to understand the difference.

A local temporary view is only visible to the current Spark session.

However, a global temporary view is visible across all the sessions of the current Spark application.

Wait a minute. Do you mean a SparkSession and a Spark Application are two different things?

Yes. We normally start a Spark Application by creating a Spark session.

To a beginner, it appears that a Spark application can have only a single session.

However, that is not true.

You can have multiple sessions in a single Spark application.

The Spark session internally creates a Spark context.

A SparkContext represents the connection to a Spark cluster.

It also keeps track of all the RDDs, cached data as well as the configurations.

You cannot have more than one Spark Context in a single JVM.

That means one instance of an application can have only one connection to the cluster,

and hence a single Spark context.

However, your application can create multiple Spark Sessions.

All of those sessions will point to the same context, but you can have multiple sessions.
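
For example, something like this works in a Spark shell.

    // Create a second session within the same application
    val spark2 = spark.newSession()

    // Both sessions point to the same Spark context
    spark.sparkContext eq spark2.sparkContext    // true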

In your standard applications, you may not need to create multiple spark sessions.

However, if you are developing an application that needs to support multiple interactive users,

you might want to create one Spark Session for each user session.

Ideally, we should be able to create a separate connection to the Spark cluster for each user in the above use case,

but creating multiple contexts is not yet supported by Spark.

The documentation claims that this restriction will be removed in a future release.

So, coming back to local temporary views,

they are only visible to the current session.

However, global temporary views are visible across the spark sessions within the same application.

In all this discussion, one thing is crystal clear.

None of them are visible to other applications.

So, whether you create a global temporary view or a local temporary view,

they are always local to your application,

and they live only as long as your application is alive.

Great. Since I am not going to create multiple sessions, let me create a local temporary view.

The method takes the name of the view as an argument.
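
This is a minimal sketch, assuming surveyDF is the data frame we created earlier; I will name the view survey_tbl.

    surveyDF.createOrReplaceTempView("survey_tbl")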

Good. This statement must have created a temporary table or a view. Right?

Where can you find it?

Well, a temporary view is maintained by the Spark session. Right?

So, let's check the Spark session.

Spark session offers you a catalog.

A catalog is an interface that allows you to create, drop, alter or query underlying databases, tables, and functions.

I recommend that you at least go through the documentation for the catalog interface.
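
For now, let's simply list the tables and views known to this session.

    spark.catalog.listTables().show()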

The above statement is using the listTables method of the catalog.

You can see that the view that we created is a temporary table

that doesn't belong to any database.

Let's create a global temporary table and see if we can list that as well.

We used the appropriate method to create a Global temporary view on our data frame.

I named the view survey_gtbl.

Let's call the catalog listTables method once again.
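
Something like this, again assuming surveyDF is our data frame.

    surveyDF.createGlobalTempView("survey_gtbl")

    spark.catalog.listTables().show()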

Oops!

We do not see that global table there. There is a reason for that.

A Global temp table belongs to a system database called global_temp.

So, if you want to access the global temp table,

you must look into the global_temp database.

So, the correct method call should also specify the database name.
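
Something like this should do it.

    spark.catalog.listTables("global_temp").show()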

Now you can see your global temp table.

Once you register the temp table, executing your SQL statement is a simple thing.

Execute an SQL on the spark session, and you get a data frame in return.
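
For example, something like this; survey_tbl and the column names are the placeholder names that we used earlier.

    val resultDF = spark.sql(
      "SELECT gender, country FROM survey_tbl WHERE country = 'United States'")

    resultDF.show()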

So, if you think that SQL is simpler for solving your problems

than lengthy data frame API chains,

you are free to use SQL. And surprisingly, there is no performance penalty.

The SQL works as fast as a Data frame transformation.

So, instead of using a UDF and then a confusing chain of APIs,

we can use an SQL statement to achieve whatever we did in the previous video.
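
And if you still want to reuse your UDF inside the SQL, you can register it for SQL as well. This is only a sketch, and parseGender here is a simplified stand-in for the actual function from the previous video.

    // Register a Scala function so that it can be used inside SQL
    spark.udf.register("parseGender", (g: String) =>
      if (g == null) "Unknown"
      else if (g.trim.toLowerCase.startsWith("m")) "Male"
      else if (g.trim.toLowerCase.startsWith("f")) "Female"
      else "Other")

    val genderCounts = spark.sql(
      "SELECT parseGender(gender) AS gender, count(*) AS total " +
      "FROM survey_tbl GROUP BY parseGender(gender)")

    genderCounts.show()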

You can use API chains or SQL, and both of them deliver the same performance.

The choice is yours.

Great. I talked about Spark data frame schemas,

and you learned to execute a SQL statement on a Spark data frame.

Thank you for watching Learning Journal.

See you again in the next video.

Keep learning and keep growing.
