Complex RDD to make DataFrame via PySpark


HCL
Input Data :

Raj India 1000 ~$| John Canada 2000 ~$| Steve USA 3000 ~$| Jason USA 4000

 

Expected Output:

Name    Country    Salary

Raj India 1000

John Canada 2000

Steve USA 3000

Jason USA 4000


Solve:

Here I build the solution step by step, showing the output of each step so that the code is easy to follow.

>>> rdd_data = sc.textFile("file:///home/cloudera/task1.txt")

>>> rdd_data.foreach(print)

Raj India 1000 ~$| John Canada 2000 ~$| Steve USA 3000 ~$| Jason USA 4000

>>> repdata = rdd_data.map(lambda x:x.replace(" ~$| ",","))

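The line above is spelled out only to make the step explicit; the same result can be obtained by splitting on the delimiter directly inside flatMap. Next, flatMap with split flattens the single input line into one element per record.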

>>> flatdata = repdata.flatMap(lambda x:x.split(','))

>>> flatdata.foreach(print)

Raj India 1000

John Canada 2000

Steve USA 3000

Jason USA 4000
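The replace step can be skipped by splitting on the delimiter itself. The following is a minimal plain-Python sketch of that idea, with no Spark session needed. The names `DELIMITER` and `split_records` are hypothetical and not part of the original code; the function does the same per-line work as the lambda passed to flatMap.

```python
# Plain-Python stand-in for the function passed to flatMap (no Spark required).
# DELIMITER and split_records are illustrative names, not from the original post.
DELIMITER = " ~$| "  # record separator as it appears in the input line


def split_records(line):
    """Return one string per record by splitting on the delimiter directly."""
    return [record.strip() for record in line.split(DELIMITER)]


line = "Raj India 1000 ~$| John Canada 2000 ~$| Steve USA 3000 ~$| Jason USA 4000"
records = split_records(line)
for record in records:
    print(record)
```

In the shell this would be used as `rdd_data.flatMap(split_records)`, which should give the same elements as the replace-then-split version above, assuming the delimiter always appears with the surrounding spaces.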

>>> divdata = flatdata.map(lambda x : x.replace(" ",","))

>>> divdata.foreach(print)

Raj,India,1000

John,Canada,2000

Steve,USA,3000

Jason,USA,4000

>>> splitdata = divdata.map(lambda x:x.split(','))

>>> splitdata.foreach(print)

['Raj', 'India', '1000']

['John', 'Canada', '2000']

['Steve', 'USA', '3000']

['Jason', 'USA', '4000']
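The two steps above (replace spaces with commas, then split on commas) can likely be collapsed into one, since `str.split()` with no argument splits on any run of whitespace. A small sketch, assuming no field itself contains a space (a country such as "New Zealand" would break both approaches); `to_fields` is a hypothetical helper name:

```python
# Illustrative helper: split a record on whitespace in one step.
# Assumes every record has exactly three space-separated fields.
def to_fields(record):
    """Split 'Raj India 1000' into ['Raj', 'India', '1000']."""
    return record.split()


records = ["Raj India 1000", "John Canada 2000", "Steve USA 3000", "Jason USA 4000"]
fields = [to_fields(record) for record in records]
for row in fields:
    print(row)
```

With an RDD this would read `flatdata.map(to_fields)`. Unlike the replace-based version, it also tolerates repeated or leading spaces.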

>>> from pyspark.sql import Row

>>> rowrdd = splitdata.map(lambda x:Row(x[0],x[1],x[2]))

>>> rowrdd.foreach(print)

<Row(Raj, India, 1000)>

<Row(John, Canada, 2000)>

<Row(Steve, USA, 3000)>

<Row(Jason, USA, 4000)>

>>> from pyspark.sql.types import StructType, StructField, StringType

>>> schema = StructType([StructField("name",StringType(),True),StructField("country",StringType(),True),StructField("salary",StringType(),True)])

>>> df = spark.createDataFrame(rowrdd,schema)

>>> df.show()

+-----+-------+------+

| name|country|salary|

+-----+-------+------+

|  Raj|  India|  1000|

| John| Canada|  2000|

|Steve|    USA|  3000|

|Jason|    USA|  4000|

+-----+-------+------+
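For reference, the whole walkthrough can be condensed into one parsing function plus a small wrapper. This is a sketch under stated assumptions, not the original method: `parse_line` and `build_dataframe` are hypothetical names, and storing salary as an integer (IntegerType) is a variation on the all-string schema used above. It assumes every salary is a valid integer.

```python
# Condensed sketch of the steps above. parse_line and build_dataframe are
# hypothetical names; the integer salary column is a variation, not the
# original all-string schema.
DELIMITER = " ~$| "  # record separator as it appears in the input


def parse_line(line):
    """Turn one raw input line into a list of (name, country, salary) tuples."""
    rows = []
    for record in line.split(DELIMITER):
        fields = record.split()
        if len(fields) != 3:
            continue  # skip blank or malformed records
        name, country, salary = fields
        rows.append((name, country, int(salary)))  # assumes numeric salary
    return rows


def build_dataframe(spark, path):
    """Read the file at `path` and return a DataFrame with a typed salary column."""
    # Imported here so parse_line can be tried without PySpark installed.
    from pyspark.sql.types import (IntegerType, StringType, StructField,
                                   StructType)

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("country", StringType(), True),
        StructField("salary", IntegerType(), True),
    ])
    rdd = spark.sparkContext.textFile(path).flatMap(parse_line)
    return spark.createDataFrame(rdd, schema)


sample = "Raj India 1000 ~$| John Canada 2000 ~$| Steve USA 3000 ~$| Jason USA 4000"
parsed = parse_line(sample)
print(parsed)
```

In the shell, `df = build_dataframe(spark, "file:///home/cloudera/task1.txt")` followed by `df.show()` should display the same four rows, with salary stored as a number so it can be summed or sorted numerically.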
