Creating a DataFrame from a Complex RDD via PySpark
Input Data :
Raj India 1000 ~$| John Canada 2000 ~$| Steve USA 3000 ~$| Jason USA 4000
Expected Output:
Name Country Salary
Raj India 1000
John Canada 2000
Steve USA 3000
Jason USA 4000
Solve:
Here, I write the code step by step so it is easy to understand, showing the output of each step along the way.
>>> from pyspark.sql import Row
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> rdd_data = sc.textFile("file:///home/cloudera/task1.txt")
>>> rdd_data.foreach(print)
Raj India 1000 ~$| John Canada 2000 ~$| Steve USA 3000 ~$| Jason USA 4000
>>> repdata = rdd_data.map(lambda x:x.replace(" ~$| ",","))
The line above replaces the " ~$| " delimiter with a comma. Next, we use flatMap with split(',') to flatten the records so each one becomes a separate element.
>>> flatdata = repdata.flatMap(lambda x:x.split(','))
>>> flatdata.foreach(print)
Raj India 1000
John Canada 2000
Steve USA 3000
Jason USA 4000
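To see why flatMap is used here rather than map, the difference can be sketched in plain Python (the sample string below mirrors the input after the delimiter has been replaced with a comma):

```python
from itertools import chain

line = "Raj India 1000,John Canada 2000"  # delimiter already replaced with ","

# map-style: one result per input element, so split() produces a nested list
mapped = [x.split(',') for x in [line]]
print(mapped)      # [['Raj India 1000', 'John Canada 2000']]

# flatMap-style: split, then flatten one level, like RDD.flatMap
flattened = list(chain.from_iterable(x.split(',') for x in [line]))
print(flattened)   # ['Raj India 1000', 'John Canada 2000']
```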
>>> divdata = flatdata.map(lambda x : x.replace(" ",","))
>>> divdata.foreach(print)
Raj,India,1000
John,Canada,2000
Steve,USA,3000
Jason,USA,4000
>>> splitdata = divdata.map(lambda x:x.split(','))
>>> splitdata.foreach(print)
['Raj', 'India', '1000']
['John', 'Canada', '2000']
['Steve', 'USA', '3000']
['Jason', 'USA', '4000']
>>> rowrdd = splitdata.map(lambda x:Row(x[0],x[1],x[2]))
>>> rowrdd.foreach(print)
<Row(Raj, India, 1000)>
<Row(John, Canada, 2000)>
<Row(Steve, USA, 3000)>
<Row(Jason, USA, 4000)>
>>> schema = StructType([StructField("name",StringType(),True),
...                      StructField("country",StringType(),True),
...                      StructField("salary",StringType(),True)])
>>> df = spark.createDataFrame(rowrdd,schema)
>>> df.show()
+-----+-------+------+
| name|country|salary|
+-----+-------+------+
| Raj| India| 1000|
| John| Canada| 2000|
|Steve| USA| 3000|
|Jason| USA| 4000|
+-----+-------+------+
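The whole transformation chain can be checked without a cluster. The plain-Python sketch below applies the same replace/split steps to the sample input line; in PySpark these would compose into a single map over the RDD:

```python
line = "Raj India 1000 ~$| John Canada 2000 ~$| Steve USA 3000 ~$| Jason USA 4000"

# Same steps as the RDD chain: replace the delimiter, flatten into records,
# then split each record on spaces into its three fields
records = line.replace(" ~$| ", ",").split(",")
rows = [rec.split(" ") for rec in records]
print(rows[0])   # ['Raj', 'India', '1000']
```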