Given an input DataFrame, find the desired output in PySpark
Input:
Output:
Solution:
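# Imports used below (note: this sum is Spark's aggregate and shadows Python's built-in sum)
from pyspark.sql.functions import expr, lead, sum
from pyspark.sql.window import Window
# Load the data into a DataFrame (df); every column is read as a string because no schema is inferred
df = spark.read.format("csv").option("header","True").load("file:///home/cloudera/apple.txt")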
df.show()
# Create two new columns, "prev" and "diff", on the DataFrame (df) using the "withColumn" method.
# "prev" is built with the "lead" function from pyspark.sql.functions, applied through the column's "over" method to a window defined with pyspark.sql.window.Window. The window partitions the rows by "sale_date" and orders them by "fruit", so "lead" returns the "num_sold" value of the next row in the same partition (despite the column name "prev"), or null on the last row of the partition.
# "diff" is built with the "expr" function from pyspark.sql.functions, which subtracts "prev" from "num_sold". It is null wherever "prev" is null.
dfo = df.withColumn("prev",lead("num_sold").over(Window.partitionBy("sale_date").orderBy("fruit"))).withColumn("diff",expr("num_sold - prev"))
dfo.show()
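To make the window step easier to follow, here is a minimal plain-Python sketch of what "lead" over that window computes. It does not use Spark; the sample rows and the helper name add_prev_and_diff are hypothetical and are not taken from the original input file.

```python
from collections import defaultdict

# Hypothetical sample rows (sale_date, fruit, num_sold); not the original input.
rows = [
    ("2020-05-01", "apples", 10),
    ("2020-05-01", "oranges", 8),
    ("2020-05-02", "oranges", 15),
    ("2020-05-02", "apples", 15),
]

def add_prev_and_diff(rows):
    """Emulate lead("num_sold") over (partition by sale_date, order by fruit)."""
    # Partition the rows by sale_date.
    partitions = defaultdict(list)
    for sale_date, fruit, num_sold in rows:
        partitions[sale_date].append((fruit, num_sold))

    result = []
    for sale_date in sorted(partitions):
        # Order the rows inside each partition by fruit.
        ordered = sorted(partitions[sale_date], key=lambda r: r[0])
        for i, (fruit, num_sold) in enumerate(ordered):
            # lead: value from the next row, or None on the partition's last row.
            prev = ordered[i + 1][1] if i + 1 < len(ordered) else None
            # Like Spark, a null operand makes the subtraction null.
            diff = num_sold - prev if prev is not None else None
            result.append((sale_date, fruit, num_sold, prev, diff))
    return result

for row in add_prev_and_diff(rows):
    print(row)
```

With these sample rows, only the first row of each date (the alphabetically first fruit) gets a non-null "diff"; the last row of each partition has no next row, so both "prev" and "diff" stay empty.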
# Create a new DataFrame (dfg) by grouping the rows of dfo by "sale_date" and aggregating the "diff" column with the "sum" function from pyspark.sql.functions. The result has one row per unique "sale_date" and a single "diff" column holding the sum of that date's "diff" values; "sum" skips nulls, so the rows where "diff" is null do not affect the total.
# The "alias" method renames the aggregated column to "diff".
# Finally, the "orderBy" method sorts the rows in ascending order of "sale_date".
dfg = dfo.groupBy("sale_date").agg(sum("diff").alias("diff")).orderBy("sale_date")
dfg.show()
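The grouping step can be sketched the same way in plain Python. This is only an illustration of how a null-skipping sum per "sale_date" behaves; the rows below and the helper name sum_diff_by_date are hypothetical, shaped like the columns of dfo.

```python
# Hypothetical rows shaped like dfo: (sale_date, fruit, num_sold, prev, diff).
dfo_rows = [
    ("2020-05-01", "apples", 10, 8, 2),
    ("2020-05-01", "oranges", 8, None, None),
    ("2020-05-02", "apples", 15, 15, 0),
    ("2020-05-02", "oranges", 15, None, None),
]

def sum_diff_by_date(rows):
    """Emulate groupBy("sale_date").agg(sum("diff")).orderBy("sale_date")."""
    totals = {}
    for sale_date, _fruit, _num_sold, _prev, diff in rows:
        # Every date gets a row; the total stays None if all diffs are null.
        totals.setdefault(sale_date, None)
        if diff is not None:
            # Null values are skipped, as Spark's sum does.
            totals[sale_date] = (totals[sale_date] or 0) + diff
    # Sort ascending by sale_date.
    return sorted(totals.items())

print(sum_diff_by_date(dfo_rows))
```

Because the null "diff" rows are skipped, each date's total is just the difference computed on the first row of its partition.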