Faculty of Informatics / Mathematics

Step-by-step instructions

Here are some step-by-step instructions on various topics and use cases related to machine learning on the HPC cluster of the ZIH. All instructions require a ZIH login and a successful connection to the HPC cluster.

Hello World CNN

This example demonstrates the workflow for recognizing handwritten digits using a CNN. The example can be easily adapted to your own use cases. It is based on the MNIST dataset, which is commonly used for benchmarking and teaching. The example is written in a Jupyter Notebook.

       1. Download the Jupyter Notebook file

       2. Establish a VPN connection to TU Dresden and log in to JupyterHub

       3. Start a new session as shown in the picture

 

       4. Upload the notebook file using the upload button

       5. Open the notebook and execute it step by step

Create your own Python environment with Anaconda

The following example creates a custom Anaconda environment and installs the fast.ai library. Afterward, a semantic segmentation notebook can be run in this environment via JupyterHub.

       1. Establish a VPN connection to TU Dresden and log in to the ZIH login node using a shell.

       2. Start an interactive session with the Alpha Centauri cluster:

              srun -p alpha-interactive -N 1 -n 1 --mem-per-cpu 11000 -c 2 --time=02:00:00 --pty bash

       3. Load the Python package manager Anaconda

              module load Anaconda3

       4. Create a new Anaconda environment

              conda create --name myEnv python=3.7

       5. Activate the environment

              conda activate myEnv

       6. Install the fast.ai library

              conda install -c fastai -c pytorch fastai cudatoolkit=11.0.221 ipykernel ipywidgets

       7. Register the environment as a Jupyter kernel so that it can be selected in JupyterHub

              python -m ipykernel install --user --name myEnv --display-name="myEnv"

       8. Log in to JupyterHub

       9. Start a session with the Alpha Centauri cluster

       10. Upload the notebook file to JupyterHub and start it

 

       11. Select the previously created environment from the environment selection.

 

       12. Go through the notebook step by step
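
Before switching to JupyterHub in step 8, it can help to confirm that step 7 actually wrote a kernel spec. The following is a minimal sketch, not an official ZIH tool. It assumes the default per-user Jupyter data directory on Linux (~/.local/share/jupyter, overridable through the JUPYTER_DATA_DIR variable), and the helper name kernel_registered is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: check whether a kernel spec for an environment exists in the
# per-user Jupyter data directory.
# Assumption: default location ~/.local/share/jupyter unless JUPYTER_DATA_DIR is set.

kernel_registered() {
    local name="$1"
    local base="${JUPYTER_DATA_DIR:-$HOME/.local/share/jupyter}/kernels"
    local lower
    lower="$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')"
    # Jupyter usually stores the kernel directory under a lowercased name,
    # so both spellings are checked.
    [ -f "$base/$name/kernel.json" ] || [ -f "$base/$lower/kernel.json" ]
}

if kernel_registered myEnv; then
    echo "Kernel myEnv is registered"
else
    echo "Kernel myEnv not found; repeat step 7 inside the activated environment"
fi
```

If the kernel is reported as missing, the most likely cause is that step 7 was run without first activating the environment in step 5.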

 

Create a workspace

By creating a workspace, you request access to one of the ZIH storage systems, which differ in capacity, streaming bandwidth, IOPS rate, and so on. No single system excels in every respect, so different storage systems suit different use cases. An overview of the various systems can be found here. The following guide shows how to list, create, and delete workspaces on these storage systems. A comprehensive guide can be found here.

  1. Show available storage systems:
    • "ws_find -l"
  2. Show currently used workspaces:
    • "ws_list"
  3. Create a new workspace:
    • "ws_allocate -F beegfs myWorkspace 30"  Creates a new workspace named "myWorkspace" on the "beegfs" storage system with a duration of 30 days.
  4. Delete a workspace:
    • "ws_release -F beegfs myWorkspace"  Deletes the workspace "myWorkspace".
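
Because a workspace is allocated for a fixed number of days, it can be useful to know the approximate calendar date on which it will expire. The following sketch only does the date arithmetic and does not call any workspace tool. It assumes GNU date (as found on typical Linux systems), and the function name ws_expiry_date is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compute the calendar date that lies a given number of days after
# a start date (default: today). Assumes GNU date for the -d option.

ws_expiry_date() {
    local days="$1"
    local start="${2:-today}"
    date -d "${start} +${days} days" +%Y-%m-%d
}

echo "A 30-day workspace allocated today expires around $(ws_expiry_date 30)"
```

The authoritative remaining time is whatever "ws_list" reports; the computed date is only meant as a reminder.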

Run Singularity container

Containerization packages software code together with all its dependencies so that it runs uniformly and consistently on any infrastructure. On ZIH systems, Singularity is the standard container solution and gives users full control over their environment. The following example demonstrates how to import a Docker container from the Nvidia NGC Catalog and run an example inside it. The example is a question-answering model based on the BERT architecture, pre-trained with the SQuAD dataset. More information on Singularity containers on ZIH systems can be found here.

      1. Establish a VPN connection to TU Dresden and log in to the ZIH login node using a shell.

      2. Start an interactive session with the Alpha Centauri cluster:

              srun -p alpha-interactive -N 1 -n 1 --gres=gpu:1 --mem-per-cpu 11000 -c 6 --time=08:00:00 --pty bash

      3. Create a workspace for the container

              ws_allocate -F scratch ContainerTest 30

      4. Navigate to the workspace directory

              cd /scratch/ws/1/[your scratch]/

      5. Import the TensorFlow container from Nvidia NGC

              singularity build tensorflow.sif docker://nvcr.io/nvidia/tensorflow:22.01-tf1-py3

      6. Start the container with a shell (--nv starts the container with Nvidia GPU support)

              singularity shell --nv tensorflow.sif

      7. Display the current hostname of the ZIH node for later port forwarding

              hostname

       8. Start a JupyterLab server and note the port and the token shown in its output. The port is needed for port forwarding in step 9, and the token is needed for the browser login in step 11.

              jupyter lab

       9. To access the Jupyter server, set up port forwarding from your local PC to the ZIH compute node. Use the hostname from step 7 as <zih_node> and the port from step 8 as <remote_port>.

              ssh -fNL <local_port>:<zih_node>:<remote_port> <zih_user>@taurus.hrsk.tu-dresden.de

       10. Open a browser and access the Jupyter server using the following address:

              localhost:<local_port>

       11. Enter the token from step 8

       12. Download the Jupyter Notebook file and upload it to Jupyter using the upload button

       13. Execute the notebook step by step and be amazed!
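
Steps 7 to 11 pass three values along by hand: the node hostname, the Jupyter port, and the login token. The following sketch shows one way to pull the port and token out of the URL that JupyterLab prints and to assemble the ssh command from step 9. The sample URL, the node name taurusi8001, the user name marie, and all function names are illustrative assumptions, not values from the cluster.

```shell
#!/usr/bin/env bash
# Sketch: derive the port forwarding command from the values of steps 7 and 8.
# All concrete values below (node, user, token) are placeholders.

# Extract the port from a URL such as http://host:8888/lab?token=abc
jupyter_port() {
    local url="$1"
    local hostport="${url#*://}"   # drop the scheme  -> host:8888/lab?token=abc
    hostport="${hostport%%/*}"     # drop the path    -> host:8888
    printf '%s\n' "${hostport##*:}"
}

# Extract the token query parameter from the same URL
# (the URL is expected to contain "token=").
jupyter_token() {
    local url="$1"
    local token="${url#*token=}"   # drop everything up to "token="
    printf '%s\n' "${token%%&*}"   # drop any further query parameters
}

# Build the ssh command from step 9; it is printed, not executed.
tunnel_command() {
    local local_port="$1" node="$2" remote_port="$3" user="$4"
    printf 'ssh -fNL %s:%s:%s %s@taurus.hrsk.tu-dresden.de\n' \
        "$local_port" "$node" "$remote_port" "$user"
}

url='http://taurusi8001:8888/lab?token=0123abcd'
tunnel_command 8888 taurusi8001 "$(jupyter_port "$url")" marie
echo "Token for the browser login: $(jupyter_token "$url")"
```

The printed ssh command is meant to be run on the local PC, not on the cluster, and the local port can be any free port on that machine.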