How to get the desired output with PySpark
Input:
Output:
Solution:
# Load the data into a DataFrame (df)
df = spark.read.format("csv").option("header","True").load("file:///home/cloudera/acbalance.txt")
# Group the rows of df by "account_number" and aggregate each group:
#   - take the maximum "transaction_id" and keep the name "transaction_id"
#   - take the last "balance" and keep the name "balance"
# orderBy then sorts the result by "balance" in ascending order.
# Note: last() after a groupBy depends on the order in which Spark sees the rows,
# so it only returns the most recent balance if that order happens to be preserved.
from pyspark.sql import functions as F  # provides the max and last aggregate functions
dfg = df.groupby("account_number").agg(F.max("transaction_id").alias("transaction_id"), F.last("balance").alias("balance")).orderBy("balance")
dfg.show()