PySpark Column Class Examples

Explore these PySpark Column class examples to learn how to manipulate data efficiently. In this guide, a tourism-themed DataFrame takes us on a journey through PySpark's versatile Column class: creating Column objects, using operators, and applying the Column functions most useful for tourism data.

Key Highlights:

  • The PySpark Column class represents a single column of a DataFrame, illustrated here with tourism attributes.
  • It powers data wrangling with functions that operate on tourism-related columns and rows.
  • Several Column class functions evaluate Boolean expressions for efficient filtering of tourism-related DataFrame rows (see the quick sketch below).
  • Features include accessing values in location-based columns, transforming tourism records, and handling nested tourism structures.
  • PySpark augments these capabilities with additional functions from pyspark.sql.functions, enhancing the toolkit for seamless tourism data processing.
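
As a quick taste of the Boolean filtering mentioned above, here is a minimal sketch (assuming an active SparkSession named spark; the data is illustrative):

# Minimal sketch: filtering DataFrame rows with a Column Boolean expression
quick_df = spark.createDataFrame([("Paris", 100), ("Tokyo", 200)], ["city", "visitors"])
quick_df.filter(quick_df.visitors > 150).show()  # keeps only the Tokyo row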

PySpark Column Class Examples using Tourism Data

Creating a Column object is akin to setting sail on a tourism adventure. Using PySpark's lit() SQL function, we begin the journey by wrapping a descriptive literal value in a Column.

from pyspark.sql.functions import lit
tourismObj = lit("discoverYourDestination.com")
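
A Column created with lit() is not bound to any data by itself; to see its values, attach it to a DataFrame. A minimal sketch (assuming an active SparkSession named spark; demo_df is illustrative):

# Attaching the literal Column to a DataFrame with withColumn()
demo_df = spark.createDataFrame([("Paris",), ("Tokyo",)], ["city"])
demo_df.withColumn("website", tourismObj).show(truncate=False)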

Once a DataFrame exists, its columns can be accessed through several avenues, as the next examples show.

locations = [("Paris", "Eiffel Tower"), ("Tokyo", "Mount Fuji")]
tourism_df = spark.createDataFrame(locations).toDF("city.name", "attraction")
tourism_df.printSchema()
#root
# |-- city.name: string (nullable = true)
# |-- attraction: string (nullable = true)

# Accessing columns through the DataFrame object (tourism_df)
tourism_df.select(tourism_df.attraction).show()
tourism_df.select(tourism_df["attraction"]).show()

# Column names containing a dot must be escaped with backticks
tourism_df.select(tourism_df["`city.name`"]).show()

Going a step further into the tourism experience, we can build nested structures with the PySpark Row class, creating a structured voyage.

# Creating DataFrame with nested structures using Row class
from pyspark.sql import Row
experiences = [Row(city="Paris", details=Row(landmark="Eiffel Tower", vibe="Cultural")),
               Row(city="Tokyo", details=Row(landmark="Mount Fuji", vibe="Scenic"))]
tourism_df_nested = spark.createDataFrame(experiences)
tourism_df_nested.printSchema()
#root
# |-- city: string (nullable = true)
# |-- details: struct (nullable = true)
# |    |-- landmark: string (nullable = true)
# |    |-- vibe: string (nullable = true)

# Accessing nested tourism columns
from pyspark.sql.functions import col
tourism_df_nested.select(tourism_df_nested.details.landmark).show()
tourism_df_nested.select(tourism_df_nested["details.landmark"]).show()
tourism_df_nested.select(col("details.landmark")).show()
tourism_df_nested.select(col("details.*")).show()

PySpark Column Class Examples using Operators

Just as a tourism experience encompasses diverse activities, the Column class supports arithmetic and comparison operators on numeric tourism columns.

tourism_data = [("Paris", 2, 1), ("Tokyo", 3, 4), ("London", 4, 4)]
tourism_df_operations = spark.createDataFrame(tourism_data).toDF("city", "activities", "satisfaction")

# Arithmetic operations on tourism columns
tourism_df_operations.select(tourism_df_operations.activities + tourism_df_operations.satisfaction).show()
tourism_df_operations.select(tourism_df_operations.activities - tourism_df_operations.satisfaction).show()
tourism_df_operations.select(tourism_df_operations.activities * tourism_df_operations.satisfaction).show()
tourism_df_operations.select(tourism_df_operations.activities / tourism_df_operations.satisfaction).show()
tourism_df_operations.select(tourism_df_operations.activities % tourism_df_operations.satisfaction).show()

# Comparison operations on tourism columns
tourism_df_operations.select(tourism_df_operations.activities > tourism_df_operations.satisfaction).show()
tourism_df_operations.select(tourism_df_operations.activities < tourism_df_operations.satisfaction).show()
tourism_df_operations.select(tourism_df_operations.activities == tourism_df_operations.satisfaction).show()
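
The auto-generated names for such expressions, e.g. (activities + satisfaction), are unwieldy; a common pattern, sketched here on the same DataFrame, is to alias the expression:

# Giving an operator expression a readable name with alias()
tourism_df_operations.select(
    (tourism_df_operations.activities + tourism_df_operations.satisfaction).alias("total_score")
).show()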

PySpark Column Class Functions List

Now, let’s explore functions tailored for tourism scenarios. The table below presents a curated list of commonly used Column class functions for tourism data processing.

FUNCTION | DESCRIPTION
alias(*alias, **kwargs) | Provides an alias for the tourism column or expression
name(*alias, **kwargs) | Same as alias()
asc() | Sorts the tourism column in ascending order
asc_nulls_first() | Ascending order, null values first
asc_nulls_last() | Ascending order, null values after non-null values
astype(dataType) | Casts the column to another data type
cast(dataType) | Same as astype()
between(lowerBound, upperBound) | Checks whether the tourism column values fall between the lower and upper bound (inclusive); returns a Boolean expression
contains(other) | Checks whether the tourism column value contains another value; returns a Boolean expression
startswith(other) | Checks whether the tourism column value starts with the given value; returns a Boolean expression
endswith(other) | Checks whether the tourism column value ends with the given value; returns a Boolean expression
like(other) | Similar to the SQL LIKE expression
rlike(other) | Similar to the SQL RLIKE expression (LIKE with regex)
substr(startPos, length) | Returns a Column that is a substring of the tourism column
when(condition, value) | Similar to SQL CASE WHEN; evaluates a list of conditions and returns one of several possible result expressions
otherwise(value) | Similar to SQL ELSE; returns the value when none of the when() conditions match
dropFields(*fieldNames) | Drops fields in a tourism StructType column by name
withField(fieldName, col) | Adds/replaces a field in a tourism StructType column by name

PySpark Column Class Functions Examples

To exemplify the usage of the Column class functions, let’s create a simplified tourism DataFrame.

tourism_spots = [("Paris", "Eiffel Tower", 100, "Cultural"),
                 ("Tokyo", "Mount Fuji", 200, "Scenic"),
                 ("London", "Big Ben", 150, "Historic")]
tourism_columns = ["city", "landmark", "visitors", "vibe"]
tourism_df_example = spark.createDataFrame(tourism_spots, tourism_columns)

1 alias() – Setting a Name for a Tourism Column

Aliasing the landmark column as “tourist_attraction.”

# alias()
tourism_df_example.select(tourism_df_example.landmark.alias("tourist_attraction")).show()

2 asc() & desc() – Sorting Tourism DataFrame Columns

Sorting tourism DataFrame columns in ascending and descending order.

# asc, desc to sort ascending and descending order respectively.
tourism_df_example.sort(tourism_df_example.visitors.asc()).show()
tourism_df_example.sort(tourism_df_example.visitors.desc()).show()
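
The related asc_nulls_first() and asc_nulls_last() variants from the table above control where null values land in the sort; here is a minimal sketch on made-up data containing a null visitor count:

# Sorting with explicit null placement (illustrative data with a null)
spots_with_null = spark.createDataFrame([("Paris", 100), ("Tokyo", None), ("London", 150)],
                                        ["city", "visitors"])
spots_with_null.sort(spots_with_null.visitors.asc_nulls_last()).show()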

3 cast() & astype() – Converting Data Type in Tourism DataFrame

Converting the data type of a tourism DataFrame column; here the numeric visitors column is cast to a string.

# cast
tourism_df_example.select(tourism_df_example.city, tourism_df_example.visitors.cast("string")).printSchema()
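
Per the function list above, astype() is an alias for cast(), so this sketch prints the same schema:

# astype() behaves identically to cast()
tourism_df_example.select(tourism_df_example.visitors.astype("string")).printSchema()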

4 between() – Checking if Values are Within Bounds in Tourism DataFrame

Using the between() function to filter rows with visitors between 100 and 200.

# between
tourism_df_example.filter(tourism_df_example.visitors.between(100, 200)).show()
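
Note that between() includes both bounds. It can also be projected directly to inspect the Boolean result per row, as in this sketch:

# between() as a Boolean column (both bounds are inclusive)
tourism_df_example.select(tourism_df_example.city,
                          tourism_df_example.visitors.between(100, 200).alias("in_range")).show()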

5 contains() – Checking if a Value is Contained in Tourism DataFrame

Checking if a specific vibe is contained in the tourism DataFrame.

# contains
tourism_df_example.filter(tourism_df_example.vibe.contains("Cultural")).show()

6 startswith() & endswith() – Checking Prefix and Suffix in Tourism DataFrame

Filtering rows where the city name starts with “L” and the landmark ends with “Fuji.”

# startswith, endswith
tourism_df_example.filter(tourism_df_example.city.startswith("L")).show()
tourism_df_example.filter(tourism_df_example.landmark.endswith("Fuji")).show()

7 eqNullSafe() – Checking Equality Safely for Null Values in Tourism DataFrame

Let’s utilize the eqNullSafe() function to safely check for equality when null values are present. The tourism_df_nulls DataFrame is created below with a null visitors value for demonstration.

# Using eqNullSafe() for Safe Equality Checks with Null Values
from pyspark.sql.functions import col

# Sample tourism data containing a null in the "visitors" column
nulls_data = [("Paris", 100), ("Tokyo", None), ("London", 150)]
tourism_df_nulls = spark.createDataFrame(nulls_data, ["city", "visitors"])

# A regular equality check drops rows where either side is null;
# eqNullSafe() treats two nulls as equal and never returns null
tourism_df_nulls.filter(col("visitors").eqNullSafe(150)).show()
tourism_df_nulls.filter(col("visitors").eqNullSafe(None)).show()  # matches the null row

8 isNull() & isNotNull() – Checking for Null Values in Tourism DataFrame

Filtering rows based on the presence of null values, reusing the tourism_df_nulls DataFrame (whose visitors column contains a null) from the previous example.

# isNull & isNotNull
tourism_df_nulls.filter(tourism_df_nulls.visitors.isNull()).show()
tourism_df_nulls.filter(tourism_df_nulls.visitors.isNotNull()).show()

9 like() & rlike() – Similar to SQL LIKE Expressions in Tourism DataFrame

Applying the like() and rlike() functions: like("%en") matches landmarks ending in “en” (e.g. “Big Ben”), while rlike() accepts a regular expression.

# like, rlike
tourism_df_example.select(tourism_df_example.landmark, tourism_df_example.vibe)\
  .filter(tourism_df_example.landmark.like("%en")).show()
tourism_df_example.select(tourism_df_example.landmark, tourism_df_example.vibe)\
  .filter(tourism_df_example.landmark.rlike("(?i)fuji")).show()

10 substr() – Extracting Substring in Tourism DataFrame

Extracting a substring from the landmark column.

# substr
tourism_df_example.select(tourism_df_example.landmark.substr(1, 3).alias("substring")).show()
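
Note that substr() positions are 1-based, so substr(1, 3) takes the first three characters. The same call applied to the city column, as a sketch:

# substr() positions are 1-based: start at character 1, take 3 characters
tourism_df_example.select(tourism_df_example.city.substr(1, 3).alias("city_prefix")).show()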

11 when() & otherwise() – Conditional Expression in Tourism DataFrame

Applying when() and otherwise() functions to create a new column attraction_type based on conditions.

# when & otherwise
from pyspark.sql.functions import when
tourism_df_example.select(tourism_df_example.landmark, tourism_df_example.vibe,
                          when(tourism_df_example.vibe == "Cultural", "Historical")
                          .when(tourism_df_example.vibe == "Scenic", "Natural")
                          .otherwise("Unknown").alias("attraction_type")
).show()
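
To keep the result as a DataFrame column rather than a one-off projection, the same when()/otherwise() chain can be passed to withColumn(), as sketched here:

# Persisting the conditional column with withColumn()
from pyspark.sql.functions import when
tourism_df_typed = tourism_df_example.withColumn(
    "attraction_type",
    when(tourism_df_example.vibe == "Cultural", "Historical")
    .when(tourism_df_example.vibe == "Scenic", "Natural")
    .otherwise("Unknown")
)
tourism_df_typed.show()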

12 isin() – Checking if Value is in a List in Tourism DataFrame

Using the isin() function to filter rows where the city is present in a predefined list.

# isin
cities_list = ["Paris", "Tokyo"]
tourism_df_example.select(tourism_df_example.city, tourism_df_example.landmark)\
  .filter(tourism_df_example.city.isin(cities_list))\
  .show()
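
The check can be negated with the ~ operator to exclude the listed cities, as in this short sketch:

# Excluding cities using the negation (~) of isin()
tourism_df_example.filter(~tourism_df_example.city.isin(cities_list)).show()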

13 getField() – Accessing Fields from MapType and StructType Columns in Tourism DataFrame

Accessing values by key in MapType columns and by child name in StructType columns. Since tourism_df_example has neither column type, we create a small map-typed DataFrame and reuse tourism_df_nested from earlier.

# Creating a small DataFrame with a MapType column for demonstration
map_data = [("Paris", {"season": "summer"}), ("Tokyo", {"season": "spring"})]
tourism_df_map = spark.createDataFrame(map_data, ["city", "properties"])

# getField by key from a MapType column
tourism_df_map.select(tourism_df_map.properties.getField("season")).show()

# getField by child name from a StructType column (tourism_df_nested from earlier)
tourism_df_nested.select(tourism_df_nested.details.getField("landmark")).show()

14 getItem() – Accessing Values by Index in MapType and ArrayType Columns in Tourism DataFrame

Accessing values by index in ArrayType columns and by key in MapType columns, again using small illustrative DataFrames.

# Creating a small DataFrame with an ArrayType column for demonstration
array_data = [("Paris", ["French", "English"]), ("Tokyo", ["Japanese", "English"])]
tourism_df_array = spark.createDataFrame(array_data, ["city", "languages"])

# getItem by index (0-based) from an ArrayType column
tourism_df_array.select(tourism_df_array.languages.getItem(1)).show()

# getItem by key from a MapType column (tourism_df_map from above)
tourism_df_map.select(tourism_df_map.properties.getItem("season")).show()

15 dropFields() – Dropping Fields in StructType in Tourism DataFrame

The dropFields function provides a mechanism to eliminate specific fields within StructType columns. Let’s consider a tourism DataFrame with detailed information about various attractions.

# Creating a Tourism DataFrame with StructType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

tourism_data_struct = [("Paris", "Eiffel Tower", {"height": "300m", "visitors": 7000}),
                       ("Tokyo", "Mount Fuji", {"elevation": "3776m", "visitors": 5000})]

tourism_schema_struct = StructType([
    StructField("city", StringType(), True),
    StructField("landmark", StringType(), True),
    StructField("details", StructType([
        StructField("height", StringType(), True),
        StructField("elevation", StringType(), True),
        StructField("visitors", StringType(), True)
    ]), True)
])

tourism_df_struct = spark.createDataFrame(tourism_data_struct, schema=tourism_schema_struct)

# Displaying the StructType DataFrame
tourism_df_struct.show()

# Dropping the "elevation" field from the StructType column "details"
tourism_df_struct_dropped = tourism_df_struct.select(
    tourism_df_struct.city,
    tourism_df_struct.landmark,
    tourism_df_struct.details.dropFields("elevation").alias("updated_details")
)

# Displaying the DataFrame after dropping the field
tourism_df_struct_dropped.show(truncate=False)

In this example, the dropFields function is applied to remove the “elevation” field from the “details” StructType column, resulting in an updated DataFrame.
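
dropFields() also accepts several field names in one call; here is a sketch dropping both optional measurement fields at once:

# Dropping multiple fields from the StructType column in a single call
tourism_df_struct.select(
    tourism_df_struct.details.dropFields("height", "elevation").alias("visitors_only")
).show(truncate=False)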

16 withField() – Adding/Replacing Fields in StructType in Tourism DataFrame

The withField function facilitates the addition or replacement of fields within StructType columns. Let’s expand our tourism DataFrame by incorporating information about the type of attraction.

# Adding/Replacing Fields in StructType Column using withField
from pyspark.sql.functions import col, lit

# Adding a new field "attraction_type" to the "details" StructType column
tourism_df_struct_with_field = tourism_df_struct.withColumn(
    "details",
    col("details").withField("attraction_type", "Cultural")
)

# Displaying the DataFrame after adding/replacing the field
tourism_df_struct_with_field.show(truncate=False)

In this scenario, the withField function is applied to introduce a new field, “attraction_type,” to the “details” StructType column. The DataFrame is subsequently updated to reflect this addition.
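
Because withField() replaces a field whose name already exists, the same call can update values in place; a sketch doubling the visitors count inside the struct:

# Replacing the existing "visitors" field with a computed value
from pyspark.sql.functions import col
tourism_df_struct_updated = tourism_df_struct.withColumn(
    "details",
    col("details").withField("visitors", col("details.visitors") * 2)
)
tourism_df_struct_updated.show(truncate=False)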

17 over() – Used with Window Functions in Tourism DataFrame

The over() function comes into play when employing window functions, offering insights into aggregated tourism data over specified windows. Let’s consider a use case where we want to calculate the cumulative number of visitors across different tourism spots.

# Using over() with Window Functions for Cumulative Visitors
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# Defining a Window specification ordered by city
# (no partitionBy, so the running total spans all tourism spots;
# Spark warns that this moves all data to a single partition)
window_spec = Window.orderBy("city")

# Adding a new column "cumulative_visitors" using over() with sum() window function
tourism_df_cumulative_visitors = tourism_df_struct.withColumn(
    "cumulative_visitors",
    sum("details.visitors").over(window_spec)
)

# Displaying the DataFrame with cumulative visitors
tourism_df_cumulative_visitors.show(truncate=False)
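
Other window functions plug into over() the same way; for example, this sketch ranks the tourism spots by visitor count using row_number():

# Ranking tourism spots by visitor count with row_number()
from pyspark.sql.functions import row_number, col

rank_window = Window.orderBy(col("details.visitors").desc())
tourism_df_struct.withColumn("visitor_rank", row_number().over(rank_window)).show(truncate=False)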
