How to convert complex string data into a DataFrame with PySpark

Input:

Jack xxxx ["BL", "ND"]

John yyyy ["CH", "MB"]



Output:

+----+----+---+

|  _1|  _2| _3|

+----+----+---+

|Jack|xxxx| BL|

|Jack|xxxx| ND|

|John|yyyy| CH|

|John|yyyy| MB|

+----+----+---+






Solution:

dfjack = sc.textFile("file:////home/cloudera/jack.txt")

dfjack.foreach(print)

John yyyy ["CH", "MB"]

Jack xxxx ["BL", "ND"]

task1 = dfjack.map(lambda x : x.replace('["','')).map(lambda x:x.replace('"]','')).map(lambda x:x.replace('", "','-')).map(lambda x : x.replace(" ",","))

task1.foreach(print)

Jack,xxxx,BL-ND

John,yyyy,CH-MB

splitdata = task1.map(lambda x:x.split(","))

splitdata.foreach(print)

['Jack', 'xxxx', 'BL-ND']

['John', 'yyyy', 'CH-MB']

df = splitdata.toDF()

df.show()

+----+----+-----+

|  _1|  _2|   _3|

+----+----+-----+

|Jack|xxxx|BL-ND|

|John|yyyy|CH-MB|

+----+----+-----+


from pyspark.sql.functions import explode, split

dfr = df.withColumn("_3",explode(split('_3','-')))

dfr.show()

+----+----+---+

|  _1|  _2| _3|

+----+----+---+

|Jack|xxxx| BL|

|Jack|xxxx| ND|

|John|yyyy| CH|

|John|yyyy| MB|

+----+----+---+
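
The chain of replace calls above depends on the exact spacing and quoting of the bracketed list. As a hedged alternative, the sketch below parses each line in plain Python: it splits off the first two fields and reads the rest as a JSON array. The helper name parse_line is hypothetical (not part of the original solution), and the sketch assumes every line has exactly two space-separated fields followed by a valid JSON list.

```python
import json

def parse_line(line):
    """Turn 'Jack xxxx ["BL", "ND"]' into one (name, code, item) tuple per list item."""
    # Assumption: the first two fields are separated by single spaces,
    # and everything after them is a JSON-style array of strings.
    name, code, raw_items = line.split(" ", 2)
    return [(name, code, item) for item in json.loads(raw_items)]

# Sample lines copied from the input shown at the top of the post.
lines = ['Jack xxxx ["BL", "ND"]', 'John yyyy ["CH", "MB"]']

# Flatten the per-line lists into a single list of rows.
rows = [row for line in lines for row in parse_line(line)]
for row in rows:
    print(row)
```

In a Spark session the same function could probably be applied with dfjack.flatMap(parse_line).toDF(), which would yield the exploded rows directly without the intermediate "BL-ND" column; that usage is untested here and is offered only as a suggestion.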

