Center for Information Services and High-Performance Computing
The Center for Information Services and High-Performance Computing (ZIH) is not only responsible for the general IT services at TU Dresden but also serves as the Saxon High-Performance Computing Center for all academic institutions. Since January 2021, it has been one of the 8 centers in the NHR network (National High-Performance Computing). The ZIH is also one of the two main partners of the Competence Center for Big Data and Machine Learning, ScaDS.AI Dresden/Leipzig.
The ZIH has been operating high-performance computers for over two decades; they are made available to scientific users free of charge for their research tasks via a computing-time application.
Tasks of the ZIH in the KiWi project
The ZIH takes on the following tasks within the framework of the KiWi project:
- Advising HTW Dresden on the planned extension of the HPC cluster at ZIH with an additional Multi-GPU cluster.
- Integrating the Multi-GPU cluster into the existing ZIH infrastructure and operating it there.
- Connecting the Multi-GPU cluster to the existing HPC software stack so that researchers at HTW Dresden benefit from the latest AI tools on the Multi-GPU cluster.
- The ZIH and HTW will jointly develop a variant of the HPC workflow through which HTW and its research partners from industry and academia can efficiently access the resources. The allocation of R&D projects to the resources of the Multi-GPU cluster is the responsibility of HTW Dresden.
- For regular exchange of experiences on AI technologies and their use in applied research, HTW will become an associate partner in the Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig (ScaDS.AI).
HPC Cluster
With the High-Performance Computing/Storage Complex (HRSK-II) and the High Performance Computing – Data Analytics (HPC-DA) extension, scientists have access to a supercomputer with approximately 60,000 CPU cores:
High-Performance Computing/Storage Complex (HRSK-II)
- 40,000 Intel Haswell Cores
- 256 GPUs (Nvidia K80)
- For highly parallel, data- and compute-intensive HPC applications
- FEM simulations, CFD calculations, molecular dynamics, calculations with Matlab or R
High Performance Computing – Data Analytics (HPC-DA)
- 1,536 AMD Rome Cores, 256 Nvidia A100 GPUs (AlphaCentauri)
- 1,408 IBM Power9 Cores, 192 Nvidia V100 GPUs (IBM Power9)
- 24,576 AMD Rome Cores (Romeo)
- 2 PB flash storage
- 10 PB archive
- For applications in machine learning and deep learning
- Neural network training, data analysis with big data frameworks
- Processing of particularly large datasets
Shared-Memory System:
- 896 Intel CascadeLake Cores
- 48 TB of main memory in a shared address space
- 400 TB NVMe storage cards as very fast local storage
Access to HPC resources
Steps required to gain access to the HPC resources of ZIH:
- ZIH Login:
- Necessary to access HPC services.
- For non-TU users, there is a special HPC-specific registration form that does not require a TU sponsor.
- Login is usually activated within 1-2 days.
- Logins are tied to individuals, meaning that a general login cannot be used for courses.
- However, temporary logins can be created for course participants.
- HPC Project Application:
- Required to obtain computing resources.
- The project application is scientifically reviewed and typically approved within a few days.
- A short description of the project is sufficient for up to 3,500 CPU hours and 250 GPU hours per month.
- A more detailed project application is required for additional resources.
- Course-related projects are specially marked and are usually approved without issues.
- Assign ZIH Logins to the Requested HPC Project:
- These ZIH logins can then use the allocated resources on the HPC cluster to perform calculations.
Example 1:
For a course, a project application with the appropriate quotas is needed. Each student then has an individual login (which can also be temporary) and is added to the course project. They can now use the project's quotas for computations.
24 students × 10 hours of GPU access per student = quota of 240 GPU hours in the project
Example 2:
A research project corresponds to one project. Each staff member needs an individual login through the registration form and is added to the project. They can now use the project's quotas for computations.
Login to HPC cluster
To gain access to the HPC resources, a VPN connection to the TU Dresden network is always necessary.
- The ZIH can, upon agreement, whitelist specific IP addresses (ranges), so that a VPN connection is no longer required from these computers, for example, in the laboratories of HTW Dresden.
In the future, the ZIH plans to implement a login system covering all university members in order to make access easier for users from outside TU Dresden.
There are two ways to log in to HPC resources:
- Login to JupyterHub via the browser: The easiest and fastest way to use the resources. A comprehensive guide can be found here.
- Login via shell on the login node: ssh [zih-login]@taurus.hrsk.tu-dresden.de.
Login via shell
Log in to the login nodes via ssh [zih-login]@taurus.hrsk.tu-dresden.de. Jobs are set up and managed here: from the login nodes you interact with the batch system, e.g. to submit and monitor jobs. The login nodes themselves are not suitable for compute-intensive tasks.
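ZIH systems use Slurm as the batch system (the srun command below is a Slurm command), so the usual Slurm commands are available on the login nodes. A minimal sketch of the typical interaction; job.sh and the job id are placeholders:

sbatch job.sh               # submit a job file to the batch system
squeue -u $USER             # list your own pending and running jobs
scontrol show job <jobid>   # show details of a specific job
scancel <jobid>             # cancel a job that is no longer needed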
There are two types of jobs:
1. Interactive Jobs
- Good for testing, setup, or compiling.
- Example: srun --pty -p ml --ntasks=1 --cpus-per-task=4 --gres=gpu:1 --time=1:00:00 --mem-per-cpu=1700 bash -l
2. Batch Jobs
- Once the testing phase is over, batch jobs are strongly recommended.
- Job files are submitted to the batch system for later execution.
A job file is a script that contains the resource requirements, environment settings, and commands to execute the application.
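A minimal sketch of such a job file, reusing the resource values from the interactive example above; the job name, the loaded module, and the script train.py are only placeholders:

#!/bin/bash
#SBATCH --partition=ml          # same partition as in the interactive example
#SBATCH --ntasks=1              # one task
#SBATCH --cpus-per-task=4       # four CPU cores for this task
#SBATCH --gres=gpu:1            # one GPU
#SBATCH --time=01:00:00         # maximum runtime (hh:mm:ss)
#SBATCH --mem-per-cpu=1700      # memory per CPU core in MB
#SBATCH --job-name=my_job       # job name shown in the queue
#SBATCH --output=my_job-%j.out  # output file, %j is replaced by the job id

# set up the environment, e.g. by loading a module (see "Environment and Software")
module load Python

# run the application; train.py stands for your own program
srun python train.py

The job file is then submitted with sbatch, e.g. sbatch my_job.sh.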
For more information, please click here.
Environment and Software
There are various ways to work with software on ZIH systems:
1. Modules
- A module is a user interface that provides utilities for modifying the user's environment, e.g. to make a specific software package available.
- Example: Matlab can be loaded with the following command: module load matlab (a sketch of typical module commands follows after this list).
- For more information, click here.
2. Jupyter Notebook
- Jupyter Notebook is a web application that allows for easy creation of code, equations, and visualizations.
- There is a JupyterHub service on ZIH systems where notebooks can easily be run on the compute nodes using modules, preinstalled environments, or custom virtual environments (guide).
- Additionally, manually installed JupyterServers can be operated for more specific use cases.
3. Containers
- Some tasks require the use of containers. This can be done on ZIH systems using Singularity.
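A minimal sketch of typical module commands and a Singularity call; the container image my_image.sif is only a placeholder:

module avail          # list the software available as modules
module load matlab    # load a module, as in the Matlab example above
module list           # show the currently loaded modules
module purge          # unload all modules again

singularity exec my_image.sif python --version   # run a command inside a container image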
Storage systems
Storage systems differ in terms of capacity, streaming bandwidth, IOPS rate, etc. Price and efficiency do not allow for having everything in one system, so different storage systems are available for different use cases. Apart from home, project, and ext4, all storage systems must be requested through workspaces (a short example follows after the table). Below, the various storage systems are presented; more information can be found here.
File system | Directory | Remarks | Size | Backup
home | /home | Read/write access from any location, for source code and personal data. | 50 GB | Yes
project | /projects | For global data in the project. All members have read/write access on the login nodes and read-only access on the compute nodes. | Determined upon application | Yes
scratch | /scratch/ | High streaming bandwidth, e.g. for snapshots | 4 PB | No
ssd | /lustre/ssd/ | High I/O rate, for many small I/O operations | 40 TB | No
beegfs | /beegfs/ | Fastest system available, for parallel applications with millions of small I/O operations | 232 TB | No
ext4 | /tmp | For temporary data, e.g. compilations; automatically deleted after the job is completed. | 95 GB | No
warm_archive | /warm_archive | For temporary storage in the project; deleted after the end of the project | 10 PB | Yes
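Workspaces are created and managed with dedicated command-line tools on the ZIH systems. A minimal sketch, assuming a workspace named my_data on the scratch file system; name, file system, and durations are placeholders, and the exact options are described in the ZIH documentation:

ws_allocate -F scratch my_data 30   # allocate a workspace on scratch for 30 days
ws_list                             # list your workspaces and their expiration dates
ws_extend -F scratch my_data 30     # extend the lifetime before it expires
ws_release -F scratch my_data       # release the workspace when it is no longer needed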