How to convert complex string data into a DataFrame via PySpark
Input:
Jack xxxx ["BL", "ND"]
John yyyy ["CH", "MB"]
Output:
+----+----+---+
| _1| _2| _3|
+----+----+---+
|Jack|xxxx| BL|
|Jack|xxxx| ND|
|John|yyyy| CH|
|John|yyyy| MB|
+----+----+---+
Solution: read the file as an RDD of lines, clean up the bracketed list, split each line into fields, convert to a DataFrame, then explode the last column into one row per code.
# Read the raw text file as an RDD of lines
dfjack = sc.textFile("file:///home/cloudera/jack.txt")
dfjack.foreach(print)
John yyyy ["CH", "MB"]
Jack xxxx ["BL", "ND"]
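Note that foreach(print) runs on the executors, so the lines only appear on the console in local mode (such as the pyspark shell), and the output order is not guaranteed, which is why John prints before Jack here. A safer way to inspect a small RDD on a cluster is to collect it to the driver first:

for line in dfjack.collect():
    print(line)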
task1 = (dfjack.map(lambda x: x.replace('["', ''))
               .map(lambda x: x.replace('"]', ''))
               .map(lambda x: x.replace('", "', '-'))
               .map(lambda x: x.replace(' ', ',')))
task1.foreach(print)
Jack,xxxx,BL-ND
John,yyyy,CH-MB
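The four chained replace calls can also be collapsed with a single regex pass. A minimal sketch of that variant (my own, not from the original post), using re.sub to strip the [, ], and " characters in one step:

import re

task1_alt = (dfjack.map(lambda x: re.sub(r'[\[\]"]', '', x))   # Jack xxxx BL, ND
                   .map(lambda x: x.replace(', ', '-'))        # Jack xxxx BL-ND
                   .map(lambda x: x.replace(' ', ',')))        # Jack,xxxx,BL-ND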
# Split each cleaned line into a list of fields
splitdata = task1.map(lambda x: x.split(","))
splitdata.foreach(print)
['Jack', 'xxxx', 'BL-ND']
['John', 'yyyy', 'CH-MB']
# Convert the RDD of lists to a DataFrame (default column names _1, _2, _3)
df = splitdata.toDF()
df.show()
+----+----+-----+
| _1| _2| _3|
+----+----+-----+
|Jack|xxxx|BL-ND|
|John|yyyy|CH-MB|
+----+----+-----+
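toDF() with no arguments assigns the default names _1, _2, _3. If you prefer meaningful headers, you can pass a list of names instead (the names below are my own choice; the rest of this walkthrough keeps the defaults):

dfnamed = splitdata.toDF(["name", "id", "codes"])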
from pyspark.sql.functions import explode, split

# split('_3', '-') turns "BL-ND" into the array ["BL", "ND"];
# explode then produces one output row per array element
dfr = df.withColumn("_3", explode(split('_3', '-')))
dfr.show()
+----+----+---+
| _1| _2| _3|
+----+----+---+
|Jack|xxxx| BL|
|Jack|xxxx| ND|
|John|yyyy| CH|
|John|yyyy| MB|
+----+----+---+
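Since the bracketed part of each line is valid JSON, the whole job can also be done with the DataFrame API alone. A sketch assuming Spark 2.3+ (for from_json with a DDL schema string); the regex and column handling are my own, not from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, from_json, explode

spark = SparkSession.builder.getOrCreate()

raw = spark.read.text("file:///home/cloudera/jack.txt")
pattern = r'^(\S+) (\S+) (\[.*\])$'   # name, id, JSON array of codes
parsed = raw.select(
    regexp_extract('value', pattern, 1).alias('_1'),
    regexp_extract('value', pattern, 2).alias('_2'),
    regexp_extract('value', pattern, 3).alias('_3'),
)
# Parse the JSON array and explode it into one row per code
result = parsed.withColumn('_3', explode(from_json('_3', 'array<string>')))
result.show()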