So here is the big-ticket item: how in the world do I write files to persisted storage from PySpark? There are tons of docs on RDD.saveAsTextFile() and things of that nature, but that only helps if you are dealing with RDDs or .csv files. What if you have a different set of needs? In this case, I wanted to visualize a decision forest I had built, but I could not find any good bindings between PySpark's MLlib and Matplotlib (or similar) to visualize it.
Saving Files from Spark
So here is the skinny of it…
    # Grab the Hadoop and Java classes through the SparkContext's Py4J gateway
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
    PrintWriter = sc._gateway.jvm.java.io.PrintWriter

    # Open the cluster's default file system and create the output file
    fs = FileSystem.get(Configuration())
    output = fs.create(Path('/data/open_data/RF_Model.txt'))

    # Write the model's debug string and close the stream
    writer = PrintWriter(output)
    writer.write(RF_model.toDebugString())
    writer.close()
This works on HDInsight v3.5 with Spark 2.0 and Azure Data Lake Storage as the underlying storage system. What is nice about this is that my cluster only has access to its own section of the folder structure. I have the structure root/clusters/dasciencecluster. This particular cluster starts at dasciencecluster, while other clusters may start somewhere else. Therefore my data is saved to root/clusters/dasciencecluster/data/open_data/RF_Model.txt
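If you need to do this more than once, the same steps can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name save_text_to_hadoop_fs and its parameters are mine, not part of PySpark, and it relies on the same private sc._gateway attribute as the snippet above, so it may behave differently on other Spark versions. The only real change is the try/finally, which closes the writer even if the write fails.

```python
def save_text_to_hadoop_fs(sc, path, text):
    """Write a string to the cluster's default Hadoop file system.

    Hypothetical helper: `sc` is an existing SparkContext, `path` is a
    path on the default file system, and `text` is the string to write.
    """
    # Same Py4J lookups as in the snippet above
    jvm = sc._gateway.jvm
    Path = jvm.org.apache.hadoop.fs.Path
    FileSystem = jvm.org.apache.hadoop.fs.FileSystem
    Configuration = jvm.org.apache.hadoop.conf.Configuration
    PrintWriter = jvm.java.io.PrintWriter

    fs = FileSystem.get(Configuration())
    output = fs.create(Path(path))
    writer = PrintWriter(output)
    try:
        writer.write(text)
    finally:
        # Close the writer even if the write raises
        writer.close()
```

With that in place, saving the forest would look something like save_text_to_hadoop_fs(sc, '/data/open_data/RF_Model.txt', RF_model.toDebugString()).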
Visualizing the Output
So I found this: https://github.com/ChuckWoodraska/EurekaTrees, which needed only a single modification. Go ahead and git clone it. Inside eurekatrees.py (in the root folder), you simply need to modify line 77, because the initial node on every tree does not contain node.data and you will get an exception…
So I simply replaced that part of the dictionary with the string 'root_node'. Easy fix, and now it works…
Then simply execute the following line of code:
python eurekatrees.py --trees C:/user/youruser/downloads/trees.txt
And you will get all of your trees. Start by navigating from the home.html page in the folder.
And here is one of the trees:
Voilà! Now you can save arbitrary files from Spark for visualization when you can't get to the data directly, or when you need to integrate some other tool into your pipeline.