Solve within pySpark below questions
id,dept,date,sales
1,electronics,01-09-22,2000
2,dresses,03-09-22,10000
1,fitness,07-09-2022,23000
2, groceries,10-09-2022,14000
3,sweets,03-09-2022,5000
3,restaurant,04-09-2022,6000
1,shoes,12-09-2022,8000
2,computers,15-09-2022,6700
q1)create dataframe using the CSV data given above. assume it is in a CSV file
data = spark.read.format("csv").option("header","True").load("file:////home/cloudera/lastid.txt")
data.show()
q2)find the details of customer id 1
findid = data.filter("id==1")
findid.show()
q3)find which customer gives maximum sales
data = data.withColumn("sales",df["sales"].cast(IntegerType()))
maxsale = data.agg(max("sales"))
maxsale.show()
q4)find on which day each customer gets the highest sales
highsale = data.withColumn("high_sale",row_number().over(Window.partitionBy("id").orderBy("sales"))).where("high_sale=1")
highsale.show()
q5)for question 3 , i want the result dataframe to have 4 columns .i.e customer_id,start_date,end_date,total_sales
std = data.groupby("id").agg(first(col("date")).alias("start_date"),last(col("date")).alias("last_date"),sum("sales").alias("total_sales")).orderBy("id")
std.show()
Comments
Post a Comment