Center for Information Services and High-Performance Computing
The Center for Information Services and High-Performance Computing (ZIH) is not only responsible for the general IT services at TU Dresden but also serves as the Saxon High-Performance Computing Center for all academic institutions. Since January 2021, it has been one of the 8 centers in the NHR network (National High-Performance Computing). The ZIH is also one of the two main partners of the Competence Center for Big Data and Machine Learning, ScaDS.AI Dresden/Leipzig.
The ZIH has been operating high-performance computers for over two decades; they are made available to scientific users free of charge for their research tasks via a computing-time application.
Tasks of the ZIH in the KiWi project
The ZIH takes on the following tasks within the framework of the KiWi project:
- Advising HTW Dresden on the planned extension of the HPC cluster at ZIH with an additional Multi-GPU cluster.
- Integrating the Multi-GPU cluster into the existing ZIH infrastructure and operating it there.
- Connecting the Multi-GPU cluster to the existing HPC software stack so that researchers at HTW Dresden benefit from the latest AI tools on the Multi-GPU cluster.
- The ZIH and HTW will jointly develop a variant of the HPC workflow through which HTW and its research partners from industry and academia can efficiently access the resources. The allocation of R&D projects to the resources of the Multi-GPU cluster is the responsibility of HTW Dresden.
- For regular exchange of experiences on AI technologies and their use in applied research, HTW will become an associate partner in the Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig (ScaDS.AI).
HPC Cluster
With the High-Performance Computing/Storage Complex (HRSK-II) and the High Performance Computing – Data Analytics (HPC-DA) extension, scientists have access to a supercomputer with approximately 60,000 CPU cores:
High-Performance Computing/Storage Complex (HRSK-II)
- 40,000 Intel Haswell Cores
- 256 GPUs (Nvidia K80)
- For highly parallel, data- and compute-intensive HPC applications
- FEM simulations, CFD calculations, molecular dynamics, calculations with Matlab or R
High Performance Computing – Data Analytics (HPC-DA)
- 1,536 AMD Rome Cores, 256 Nvidia A100 GPUs (AlphaCentauri)
- 1,408 IBM Power9 Cores, 192 Nvidia V100 GPUs (IBM Power9)
- 24,576 AMD Rome Cores (Romeo)
- 2 PB flash storage
- 10 PB archive
- For applications in machine learning and deep learning
- Neural network training, data analysis with big data frameworks
- Processing of particularly large datasets
Shared-Memory System:
- 896 Intel CascadeLake Cores
- 48 TB of main memory in a shared address space
- 400 TB NVMe storage cards as very fast local storage
Access to HPC resources
Steps required to gain access to the HPC resources of ZIH:
- ZIH Login:
- Necessary to access HPC services.
- For non-TU users, there is a special HPC-specific registration form that does not require a TU sponsor.
- Login is usually activated within 1-2 days.
- Logins are tied to individuals, meaning that a general login cannot be used for courses.
- However, temporary logins can be created for course participants.
- HPC Project Application:
- Required to obtain computing resources.
- The project application is scientifically reviewed and typically approved within a few days.
- A short description of the project is sufficient for up to 3,500 CPU hours and 250 GPU hours per month.
- A more detailed project application is required for additional resources.
- Course-related projects are specially marked and are usually approved without issues.
- Assign ZIH Logins to the Requested HPC Project:
- These ZIH logins can then use the allocated resources on the HPC cluster to perform calculations.
Example 1:
For a course, a project application with the appropriate quotas is needed. Each student then has an individual login (which can also be temporary) and is added to the course project. They can now use the project's quotas for computations.
24 students × 10 hours of GPU access per student = quota of 240 GPU hours in the project
Example 2:
A research project corresponds to one project. Each staff member needs an individual login through the registration form and is added to the project. They can now use the project's quotas for computations.
Login to HPC cluster
To gain access to the HPC resources, a VPN connection to the TU Dresden network is always necessary.
- The ZIH can, upon agreement, whitelist specific IP addresses (ranges), so that a VPN connection is no longer required from these computers, for example, in the laboratories of HTW Dresden.
In the future, the ZIH plans to implement a login system covering all university members in order to make access easier for users from outside TU Dresden.
There are two ways to log in to HPC resources:
- Login to JupyterHub via the browser: The easiest and fastest way to use the resources. A comprehensive guide can be found here.
- Login via shell on the login node: ssh [zih-login]@taurus.hrsk.tu-dresden.de.
Login via shell
Log in to the login nodes via ssh [zih-login]@taurus.hrsk.tu-dresden.de. Jobs are set up and managed here: from the login nodes you interact with the batch system, e.g. to submit and monitor jobs. The login nodes themselves are not suitable for compute-intensive tasks.
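ZIH systems use Slurm as the batch system (the srun command below is a Slurm command), so the usual Slurm commands are available on the login nodes. A minimal sketch of the typical interaction; job.sh and the job id are placeholders:

sbatch job.sh               # submit a job file to the batch system
squeue -u $USER             # list your own pending and running jobs
scontrol show job <jobid>   # show details of a specific job
scancel <jobid>             # cancel a job that is no longer needed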
There are two types of jobs:
1. Interactive Jobs
- Good for testing, setup, or compiling.
- Example: srun --pty -p ml --ntasks=1 --cpus-per-task=4 --gres=gpu:1 --time=1:00:00 --mem-per-cpu=1700 bash -l
2. Batch Jobs
- Once the testing phase is over, batch jobs are strongly recommended.
- Job files are submitted to the batch system for later execution.
A job file is a script that contains the resource requirements, environment settings, and commands to execute the application.
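A minimal sketch of such a job file, reusing the resource values from the interactive example above; the job name, the loaded module, and the script train.py are only placeholders:

#!/bin/bash
#SBATCH --partition=ml          # same partition as in the interactive example
#SBATCH --ntasks=1              # one task
#SBATCH --cpus-per-task=4       # four CPU cores for this task
#SBATCH --gres=gpu:1            # one GPU
#SBATCH --time=01:00:00         # maximum runtime (hh:mm:ss)
#SBATCH --mem-per-cpu=1700      # memory per CPU core in MB
#SBATCH --job-name=my_job       # job name shown in the queue
#SBATCH --output=my_job-%j.out  # output file, %j is replaced by the job id

# set up the environment, e.g. by loading a module (see "Environment and Software")
module load Python

# run the application; train.py stands for your own program
srun python train.py

The job file is then submitted with sbatch, e.g. sbatch my_job.sh.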
For more information, please click here.
Environment and Software
There are various ways to work with software on ZIH systems:
1. Modules
- A module is a user interface that provides utilities for modifying the user's environment, e.g. to make a specific software package available.
- Example: Matlab can be loaded with the following command: module load matlab (a sketch of typical module commands follows after this list).
- For more information, click here.
2. Jupyter Notebook
- Jupyter Notebook is a web application that allows for easy creation of code, equations, and visualizations.
- There is a JupyterHub service on ZIH systems where notebooks can easily be run on the compute nodes using modules, preinstalled environments, or custom virtual environments (guide).
- Additionally, manually installed JupyterServers can be operated for more specific use cases.
3. Containers
- Some tasks require the use of containers. This can be done on ZIH systems using Singularity.
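A minimal sketch of typical module commands and a Singularity call; the container image my_image.sif is only a placeholder:

module avail          # list the software available as modules
module load matlab    # load a module, as in the Matlab example above
module list           # show the currently loaded modules
module purge          # unload all modules again

singularity exec my_image.sif python --version   # run a command inside a container image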
Storage systems
Storage systems differ in terms of capacity, streaming bandwidth, IOPS rate, etc. Price and efficiency do not allow for having everything in one system, so different storage systems are available for different use cases. Apart from home, project, and ext4, all storage systems must be requested through workspaces (a short example follows after the table). Below, the various storage systems are presented; more information can be found here.
File system | Directory | Remarks | Size | Backup
home | /home | Read/write access from any location, for source code and personal data. | 50 GB | Yes
project | /projects | For global data in the project. All members have read/write access on the login nodes and read-only access on the compute nodes. | Determined upon application | Yes
scratch | /scratch/ | High streaming bandwidth, e.g. for snapshots | 4 PB | No
ssd | /lustre/ssd/ | High I/O rate, for many small I/O operations | 40 TB | No
beegfs | /beegfs/ | Fastest system available, for parallel applications with millions of small I/O operations | 232 TB | No
ext4 | /tmp | For temporary data, e.g. compilations; automatically deleted after the job is completed. | 95 GB | No
warm_archive | /warm_archive | For temporary storage in the project; deleted after the end of the project | 10 PB | Yes
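Workspaces are created and managed with dedicated command-line tools on the ZIH systems. A minimal sketch, assuming a workspace named my_data on the scratch file system; name, file system, and durations are placeholders, and the exact options are described in the ZIH documentation:

ws_allocate -F scratch my_data 30   # allocate a workspace on scratch for 30 days
ws_list                             # list your workspaces and their expiration dates
ws_extend -F scratch my_data 30     # extend the lifetime before it expires
ws_release -F scratch my_data       # release the workspace when it is no longer needed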