Explore these PySpark Column class examples to learn how to manipulate data efficiently. In PySpark, a tourism-themed DataFrame offers a fascinating journey through the versatile Column class. This exploration guides you through creating Column objects, using operators, and applying Column functions tailored to tourism data.
Key Highlights:
- The PySpark Column class represents the various tourism attributes within a DataFrame.
- It empowers data wrangling with functions designed for tourism-specific columns and rows.
- Several Column class functions evaluate Boolean expressions for efficient filtering of tourism-related DataFrame rows.
- Features include accessing information from location-based columns, mapping tourism experiences, and handling nested tourism structures.
- PySpark augments these capabilities with additional functions from pyspark.sql.functions, enhancing the toolkit for seamless tourism data processing.
PySpark Column Class Examples using Tourism Adventure
Initiating a Column object is akin to setting sail on a tourism adventure. Using PySpark's lit() SQL function, we embark on the journey by encapsulating a descriptive literal value.
from pyspark.sql.functions import lit
tourismObj = lit("discoverYourDestination.com")
Moreover, the captivating essence of tourism unfolds as we access Column class elements through various avenues.
locations = [("Paris", "Eiffel Tower"), ("Tokyo", "Mount Fuji")]
tourism_df = spark.createDataFrame(locations).toDF("city.attraction", "activity")
tourism_df.printSchema()
#root
# |-- city.attraction: string (nullable = true)
# |-- activity: string (nullable = true)
# Leveraging DataFrame object (tourism_df)
tourism_df.select(tourism_df.activity).show()
tourism_df.select(tourism_df["activity"]).show()
# Accessing column names with a touch of uniqueness
tourism_df.select(tourism_df["`city.attraction`"]).show()
Diving deeper into the tourism experience, we delve into nested tourism structures with a PySpark Row class, creating a structured voyage.
# Creating DataFrame with nested structures using Row class
from pyspark.sql import Row
experiences = [Row(city="Paris", details=Row(landmark="Eiffel Tower", vibe="Cultural")),
Row(city="Tokyo", details=Row(landmark="Mount Fuji", vibe="Scenic"))]
tourism_df_nested = spark.createDataFrame(experiences)
tourism_df_nested.printSchema()
#root
# |-- city: string (nullable = true)
# |-- details: struct (nullable = true)
# | |-- landmark: string (nullable = true)
# | |-- vibe: string (nullable = true)
# Accessing nested tourism columns
tourism_df_nested.select(tourism_df_nested.details.landmark).show()
tourism_df_nested.select(tourism_df_nested["details.landmark"]).show()
from pyspark.sql.functions import col
tourism_df_nested.select(col("details.landmark")).show()
tourism_df_nested.select(col("details.*")).show()
PySpark Column Class Examples using Operators
Just as a tourism experience encompasses diverse activities, the Column class provides operators for arithmetic and comparison operations on tourism columns.
tourism_data = [("Paris", 100, 2, 1), ("Tokyo", 200, 3, 4), ("London", 150, 4, 4)]
# Four values per row so the city name does not land in a numeric column
tourism_df_operations = spark.createDataFrame(tourism_data).toDF("city", "visitors", "activities", "satisfaction")
# Arithmetic tourism operations
tourism_df_operations.select(tourism_df_operations.visitors + tourism_df_operations.activities).show()
tourism_df_operations.select(tourism_df_operations.visitors - tourism_df_operations.activities).show()
tourism_df_operations.select(tourism_df_operations.visitors * tourism_df_operations.activities).show()
tourism_df_operations.select(tourism_df_operations.visitors / tourism_df_operations.activities).show()
tourism_df_operations.select(tourism_df_operations.visitors % tourism_df_operations.activities).show()
tourism_df_operations.select(tourism_df_operations.activities > tourism_df_operations.satisfaction).show()
tourism_df_operations.select(tourism_df_operations.activities < tourism_df_operations.satisfaction).show()
tourism_df_operations.select(tourism_df_operations.activities == tourism_df_operations.satisfaction).show()
PySpark Column Class Examples using Functions Lists
Now, let’s explore functions tailored for tourism scenarios. The table below presents a curated list of tourism-centric functions for enhanced tourism data processing.
| TOURISM FUNCTION | FUNCTION DESCRIPTION |
|---|---|
| alias(*alias, **kwargs) | Provides an alias for the tourism column or expression |
| name(*alias, **kwargs) | Same as alias() |
| asc() | Returns ascending order of the tourism column |
| asc_nulls_first() | Ascending order, null values first, then non-null values |
| asc_nulls_last() | Ascending order, null values after non-null values |
| astype(dataType) | Casts the column to another data type |
| cast(dataType) | Same as astype() |
| between(lowerBound, upperBound) | Checks whether the tourism column values lie between the lower and upper bound; returns a Boolean expression |
| contains(other) | Checks whether the tourism column value contains another value; returns a Boolean expression |
| startswith(other) | Checks whether the tourism column starts with a value; returns a Boolean expression |
| endswith(other) | Checks whether the tourism column ends with a value; returns a Boolean expression |
| like(other) | Similar to the SQL LIKE expression |
| rlike(other) | Similar to the SQL RLIKE expression (LIKE with regex) |
| substr(startPos, length) | Returns a Column that is a substring of the tourism column |
| when(condition, value) | Similar to SQL CASE WHEN; evaluates a list of conditions and returns one of multiple possible result expressions |
| otherwise(value) | Used with when(); returns the default value when none of the when() conditions match |
| dropFields(*fieldNames) | Drops fields in a tourism StructType by name |
| withField(fieldName, col) | An expression that adds/replaces a field in a tourism StructType by name |
PySpark Column Class Functions Examples
To exemplify the usage of Tourism Class Functions, let’s create a simplified tourism DataFrame.
tourism_spots = [("Paris", "Eiffel Tower", 100, "Cultural"),
("Tokyo", "Mount Fuji", 200, "Scenic"),
("London", "Big Ben", 150, "Historic")]
tourism_columns = ["city", "landmark", "visitors", "vibe"]
tourism_df_example = spark.createDataFrame(tourism_spots, tourism_columns)
1 alias() – Setting a Name to Tourism Column
Aliasing the landmark column as "tourist_attraction."
# alias()
tourism_df_example.select(tourism_df_example.landmark.alias("tourist_attraction")).show()
2 asc() & desc() – Sorting Tourism DataFrame Columns
Sorting tourism DataFrame columns in ascending and descending order.
# asc, desc to sort ascending and descending order respectively.
tourism_df_example.sort(tourism_df_example.visitors.asc()).show()
tourism_df_example.sort(tourism_df_example.visitors.desc()).show()
3 cast() & astype() – Converting Data Type in Tourism DataFrame
Converting data types of tourism DataFrame columns.
# cast
tourism_df_example.select(tourism_df_example.landmark, tourism_df_example.visitors.cast("string")).printSchema()
4 between() – Checking if Values are Within Bounds in Tourism DataFrame
Using the between() function to filter rows with visitors between 100 and 200.
# between
tourism_df_example.filter(tourism_df_example.visitors.between(100, 200)).show()
5 contains() – Checking if a Value is Contained in Tourism DataFrame
Checking if a specific vibe is contained in the tourism DataFrame.
# contains
tourism_df_example.filter(tourism_df_example.vibe.contains("Cultural")).show()
6 startswith() & endswith() – Checking Prefix and Suffix in Tourism DataFrame
Filtering rows where the city name starts with “L” and the landmark ends with “Fuji.”
# startswith, endswith
tourism_df_example.filter(tourism_df_example.city.startswith("L")).show()
tourism_df_example.filter(tourism_df_example.landmark.endswith("Fuji")).show()
7 eqNullSafe() – Checking Equality Safely for Null Values in Tourism DataFrame
Let's utilize the eqNullSafe() function to safely check for equality, considering the presence of null values.
# Using eqNullSafe() for Safe Equality Checks with Null Values
from pyspark.sql.functions import col
# A small DataFrame with a null visitor count to demonstrate the behaviour
tourism_nulls = [("Paris", 5500), ("Tokyo", None)]
tourism_df_nulls = spark.createDataFrame(tourism_nulls, ["city", "visitors"])
# Checking equality safely for null values in the "visitors" column
tourism_df_equal_safely = tourism_df_nulls.filter(
    col("visitors").eqNullSafe(5500)
)
# Displaying the DataFrame after safe equality check
tourism_df_equal_safely.show()
8 isNull & isNotNull() – Checking for Null Values in Tourism DataFrame
Filtering rows based on the presence of null values in the vibe column.
# isNull & isNotNull
tourism_df_example.filter(tourism_df_example.vibe.isNull()).show()
tourism_df_example.filter(tourism_df_example.vibe.isNotNull()).show()
9 like() & rlike() – Similar to SQL LIKE Expressions in Tourism DataFrame
Applying the like() and rlike() functions to filter rows where landmark ends with "en", first with a LIKE pattern and then with a regular expression.
# like, rlike
tourism_df_example.select(tourism_df_example.landmark, tourism_df_example.vibe)\
    .filter(tourism_df_example.landmark.like("%en")).show()
tourism_df_example.select(tourism_df_example.landmark, tourism_df_example.vibe)\
    .filter(tourism_df_example.landmark.rlike("en$")).show()
10 substr() – Extracting Substring in Tourism DataFrame
Extracting a substring from the landmark column.
# substr
tourism_df_example.select(tourism_df_example.landmark.substr(1, 3).alias("substring")).show()
11 when() & otherwise() – Conditional Expression in Tourism DataFrame
Applying the when() and otherwise() functions to create a new column attraction_type based on conditions.
# when & otherwise
from pyspark.sql.functions import when
tourism_df_example.select(tourism_df_example.landmark, tourism_df_example.vibe,
when(tourism_df_example.vibe == "Cultural", "Historical")
.when(tourism_df_example.vibe == "Scenic", "Natural")
.otherwise("Unknown").alias("attraction_type")
).show()
12 isin() – Checking if Value is in a List in Tourism DataFrame
Using the isin() function to filter rows where the city is present in a predefined list.
# isin
cities_list = ["Paris", "Tokyo"]
tourism_df_example.select(tourism_df_example.city, tourism_df_example.landmark)\
.filter(tourism_df_example.city.isin(cities_list))\
.show()
13 getField() – Accessing Fields from MapType and StructType Columns in Tourism DataFrame
Accessing values by key in MapType columns and by struct child name in StructType columns. The tourism_df_example DataFrame has no map or struct columns, so we first build a small tour-guide DataFrame (illustrative data) with a name struct, a languages array, and a properties map.
# DataFrame with struct, array, and map columns for getField()/getItem()
from pyspark.sql import Row
guides = [Row(name=Row(fname="Amelie", lname="Laurent"),
              languages=["French", "English"],
              properties={"hair": "brown", "eye": "blue"})]
guides_df = spark.createDataFrame(guides)
# getField from MapType
guides_df.select(guides_df.properties.getField("hair")).show()
# getField from StructType
guides_df.select(guides_df.name.getField("fname")).show()
14 getItem() – Accessing Values by Index in MapType and ArrayType Columns in Tourism DataFrame
Accessing values by index in ArrayType columns and by key in MapType columns.
# getItem used with ArrayType
guides_df.select(guides_df.languages.getItem(1)).show()
# getItem used with MapType
guides_df.select(guides_df.properties.getItem("hair")).show()
15 dropFields – Dropping Fields in StructType in Tourism DataFrame
The dropFields function provides a mechanism to eliminate specific fields within StructType columns. Let's consider a tourism DataFrame with detailed information about various attractions.
# Creating a Tourism DataFrame with StructType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
tourism_data_struct = [("Paris", "Eiffel Tower", {"height": "300m", "visitors": 7000}),
                       ("Tokyo", "Mount Fuji", {"elevation": "3776m", "visitors": 5000})]
tourism_schema_struct = StructType([
    StructField("city", StringType(), True),
    StructField("landmark", StringType(), True),
    StructField("details", StructType([
        StructField("height", StringType(), True),
        StructField("elevation", StringType(), True),
        StructField("visitors", IntegerType(), True)
    ]), True)
])
tourism_df_struct = spark.createDataFrame(tourism_data_struct, schema=tourism_schema_struct)
# Displaying the StructType DataFrame
tourism_df_struct.show()
# Dropping the "elevation" field from the StructType column "details"
tourism_df_struct_dropped = tourism_df_struct.select(
tourism_df_struct.city,
tourism_df_struct.landmark,
tourism_df_struct.details.dropFields("elevation").alias("updated_details")
)
# Displaying the DataFrame after dropping the field
tourism_df_struct_dropped.show(truncate=False)
In this example, the dropFields function is applied to remove the "elevation" field from the "details" StructType column, resulting in an updated DataFrame.
16 withField() – Adding/Replacing Fields in StructType in Tourism DataFrame
The withField function facilitates the addition or replacement of fields within StructType columns. Let's expand our tourism DataFrame by incorporating information about the type of attraction.
# Adding/Replacing Fields in StructType Column using withField
from pyspark.sql.functions import col, lit
# Adding a new field "attraction_type" to the "details" StructType column
# (the new value must be a Column, hence lit())
tourism_df_struct_with_field = tourism_df_struct.withColumn(
    "details",
    col("details").withField("attraction_type", lit("Cultural"))
)
)
# Displaying the DataFrame after adding/replacing the field
tourism_df_struct_with_field.show(truncate=False)
In this scenario, the withField function is applied to introduce a new field, "attraction_type," to the "details" StructType column. The DataFrame is subsequently updated to reflect this addition.
17 over() – Used with Window Functions in Tourism DataFrame
The over() function comes into play when employing window functions, offering insights into aggregated tourism data over specified windows. Let's consider a use case where we want to calculate the cumulative number of visitors across different tourism spots.
# Using over() with Window Functions for Cumulative Visitors
from pyspark.sql.window import Window
from pyspark.sql.functions import sum
# Defining a Window specification based on the city column
window_spec = Window.partitionBy("city").orderBy("landmark")
# Adding a new column "cumulative_visitors" using over() with sum() window function
tourism_df_cumulative_visitors = tourism_df_struct.withColumn(
"cumulative_visitors",
sum("details.visitors").over(window_spec)
)
# Displaying the DataFrame with cumulative visitors
tourism_df_cumulative_visitors.show(truncate=False)