Writing Files to Persisted Storage from PySpark

Hello World!

So here is the big ticket item: how in the world do I write files to persisted storage from PySpark?  There are tons of docs on RDD.saveAsTextFile() and things of that nature, but those only help if you are dealing with RDDs or .csv files.  What if you have a different set of needs?  In this case, I wanted to visualize a decision forest I had built, but I could not find any good bindings between PySpark’s MLlib and Matplotlib (or similar) to visualize the forest.
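(For context, the forest came out of something along the lines of the sketch below. This is an assumption about the training call, not the exact pipeline I used: training_data stands in for an RDD of LabeledPoint rows you have already prepared.)

from pyspark.mllib.tree import RandomForest

# Minimal sketch: train a random forest with MLlib.
# training_data is assumed to be an RDD[LabeledPoint] you have already built.
RF_model = RandomForest.trainClassifier(training_data,
                                        numClasses=2,
                                        categoricalFeaturesInfo={},
                                        numTrees=10,
                                        featureSubsetStrategy='auto',
                                        impurity='gini',
                                        maxDepth=5)

# toDebugString() dumps every tree in the forest as text -- that string is what
# gets written to storage in the next section.
print(RF_model.toDebugString()[:500])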

Saving Files from Spark

So here is the skinny of it…

# Pull the Hadoop filesystem classes (and java.io.PrintWriter) out of the JVM via the Py4J gateway
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
PrintWriter = sc._gateway.jvm.java.io.PrintWriter

# Get a handle to the cluster's default filesystem and open an output stream for the new file
fs = FileSystem.get(Configuration())
output = fs.create(Path('/data/open_data/RF_Model.txt'))

# Write the forest's debug string and close the writer so the stream is flushed
writer = PrintWriter(output)
writer.write(RF_model.toDebugString())
writer.close()

This is working on HDInsight 3.5 with Spark 2.0 and Azure Data Lake Storage as the underlying storage system.  What is nice about this is that my cluster only has access to its own section of the folder structure.  I have the structure root/clusters/dasciencecluster.  This particular cluster starts at dasciencecluster, while other clusters may start somewhere else.  Therefore my data is saved to root/clusters/dasciencecluster/data/open_data/RF_Model.txt

[Image: filesave]
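One extra wrinkle: the visualization step below runs on a local machine, so the model text has to come off cluster storage at some point. Here is a hedged sketch reusing the same gateway objects from above; the destination path is just a placeholder, so adjust it to however you normally move files off your cluster.

# Sketch: copy the saved model text from cluster storage to the driver's local disk.
# Both paths are assumptions -- adjust them for your own layout.
fs.copyToLocalFile(Path('/data/open_data/RF_Model.txt'), Path('file:///tmp/trees.txt'))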

Visualizing the Output

So I found this: https://github.com/ChuckWoodraska/EurekaTrees, which needed only a single modification.  Go ahead and git clone it, and inside eurekatrees.py (in the root folder) you simply need to modify line 77, because the initial node on every tree does not contain node.data and you will get an exception…

[Image: pyspark_files_modifyeurekatrees]

So I simply replaced that part of the dictionary with the string ‘root_node’.  Easy fix, and now it works…
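I won’t reproduce the repo’s exact line here, so treat the snippet below as an illustration of the shape of the change rather than a literal diff: the root node has no node.data, so the dictionary entry that would normally be built from it just becomes a fixed label.

# Illustrative only -- not the literal line 77 from eurekatrees.py.
# Where the tree dictionary would normally read something from node.data,
# hard-code a label for the root node instead:
node_dict = {'name': 'root_node'}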

Simply execute the following line of code:

python eurekatrees.py --trees C:/user/youruser/downloads/trees.txt

And you will get all of your trees.  Start by navigating from the home.html page in the folder.

[Image: pyspark_files_forestview]

And here is one of the trees:

[Image: pyspark_files_treeview]

Voilà: now you can save arbitrary files for visualization when you can’t get at the storage directly, or when you need to integrate some other tool into your pipeline.
