Solve within pyspark below questions

Solve within pySpark below questions


Input:

id,dept,date,sales

1,electronics,01-09-22,2000

2,dresses,03-09-22,10000

1,fitness,07-09-2022,23000

2, groceries,10-09-2022,14000

3,sweets,03-09-2022,5000

3,restaurant,04-09-2022,6000

1,shoes,12-09-2022,8000

2,computers,15-09-2022,6700

Solution:

q1)create dataframe using the CSV data given above. assume it is in a CSV file

 data = spark.read.format("csv").option("header","True").load("file:////home/cloudera/lastid.txt")

 data.show()

 


q2)find the details of customer id 1

findid = data.filter("id==1")

findid.show()

q3)find which customer gives maximum sales

data =  data.withColumn("sales",df["sales"].cast(IntegerType()))

maxsale = data.agg(max("sales"))

maxsale.show()


q4)find on which day each customer gets the highest sales

highsale = data.withColumn("high_sale",row_number().over(Window.partitionBy("id").orderBy("sales"))).where("high_sale=1")

highsale.show()


q5)for question 3 , i want the result dataframe to have 4 columns .i.e customer_id,start_date,end_date,total_sales

std = data.groupby("id").agg(first(col("date")).alias("start_date"),last(col("date")).alias("last_date"),sum("sales").alias("total_sales")).orderBy("id")

std.show()



Comments