Prerequisites
If you plan to read from and write to HDFS using DC/OS Data Science Engine, there are two Hadoop configuration files that you should include in the classpath:

- `hdfs-site.xml`, which provides default behaviors for the HDFS client.
- `core-site.xml`, which sets the default file system name.
You can specify the location of these files at install time or for each DC/OS Data Science Engine instance.
Configuring DC/OS Data Science Engine to work with HDFS
In the DC/OS Data Science Engine service configuration, set `service.jupyter_conf_urls` to a list of URLs that serve your `hdfs-site.xml` and `core-site.xml`. The following example uses the http://mydomain.com/hdfs-config/hdfs-site.xml and http://mydomain.com/hdfs-config/core-site.xml URLs:
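As a sketch, assuming `service.jupyter_conf_urls` accepts a JSON array of URLs as described above, an `options.json` might look like this (check the package's configuration schema for the exact shape):

```json
{
  "service": {
    "jupyter_conf_urls": [
      "http://mydomain.com/hdfs-config/hdfs-site.xml",
      "http://mydomain.com/hdfs-config/core-site.xml"
    ]
  }
}
```

At install time, an options file like this can be passed to the DC/OS CLI, for example with `dcos package install data-science-engine --options=options.json` (the package name `data-science-engine` is assumed here).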
You can also specify the URLs through the UI. If you are using the default installation of HDFS from Mesosphere, installed under the service name `hdfs`, the URL would be http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints.
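For that default HDFS case, the same assumed option shape would simply point at the endpoints URL:

```json
{
  "service": {
    "jupyter_conf_urls": [
      "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints"
    ]
  }
}
```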
Example of Using HDFS with Spark
Here is an example of running a Spark job that uses HDFS as the storage backend.
- Launch a Python 3 notebook from the Notebook UI and put the code from the first sketch after this list into a code cell. When the cell runs, the expected output is the rows printed back after being read from HDFS.
- Verify that the file has been saved: open the Terminal from the Notebook UI and run the command from the second sketch after this list. The expected output is a directory listing that includes the files written by the notebook cell.
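A minimal PySpark sketch for the first step: it writes a small DataFrame to HDFS and reads it back. The application name, column names, and the `/tmp/hdfs-example` path are illustrative assumptions; with `core-site.xml` on the classpath, schemeless paths resolve against the HDFS default file system.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession inside the notebook.
spark = SparkSession.builder.appName("hdfs-example").getOrCreate()

# A small, made-up DataFrame to exercise HDFS reads and writes.
df = spark.createDataFrame(
    [(1, "spark"), (2, "hdfs"), (3, "dcos")],
    ["id", "name"],
)

# Write to HDFS as CSV; /tmp/hdfs-example is an assumed, writable path.
df.write.mode("overwrite").csv("/tmp/hdfs-example", header=True)

# Read the data back from HDFS and print it; the output should show the
# same three rows that were written above.
spark.read.csv("/tmp/hdfs-example", header=True).show()
```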
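For the second step, a sketch of the verification command, assuming the `hdfs` client is available on the terminal's PATH and the same illustrative path is used:

```bash
# List the HDFS directory written by the notebook cell; expect to see
# part-*.csv files (and typically a _SUCCESS marker).
hdfs dfs -ls /tmp/hdfs-example
```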