00 - How to login to Maxwell

DESY has a powerful compute cluster called the Maxwell cluster. The official documentation can be found here: [[https:~~/~~/confluence.desy.de/display/MXW/Maxwell+Cluster>>doc:MXW.Maxwell Cluster.WebHome||shape="rect"]]. Because it can be confusing at times, this page condenses it into a step-by-step manual.

= {{id name="00-HowtologintoMaxwell-GettingaDESYAccount"/}}Getting a DESY Account =

During your beamtime you will encounter multiple systems, for which you need two different types of accounts:

== {{id name="00-HowtologintoMaxwell-TheDOORAccount"/}}The DOOR Account ==

Before you arrive, you have to create a DOOR account and complete all safety trainings. This account is also used for the gamma-portal, where you can manage your beamtime data, grant access to other users and manage FTP access. However, this account does not work with other resources. For those you have to request a second account:

== {{id name="00-HowtologintoMaxwell-ThePSXAccount"/}}The PSX Account ==

If you decide during a beamtime that you want access to the cluster, tell your local contact, and they will request a PSX account for you. This account gives you access to the Kerberos, Windows and AFS resources at DESY, which include the cluster.

After you receive the account, you have to change the initial password within 6 days. To do so, go to [[https:~~/~~/passwd.desy.de/>>url:https://passwd.desy.de/||shape="rect"]] and log in with your user name and initial password (you do not need an OTP when you sign in for the first time). Then agree to the terms and change your password.

= {{id name="00-HowtologintoMaxwell-UsingtheCluster"/}}Using the Cluster =

== {{id name="00-HowtologintoMaxwell-StructureoftheCluster"/}}Structure of the Cluster ==

=== {{id name="00-HowtologintoMaxwell-Overview"/}}Overview ===

The Maxwell cluster has (as of 2021) more than 750 nodes. To keep this manageable, you cannot access a compute node directly; you first have to request compute resources. You can then connect from an entrance node to your compute node.

=== {{id name="00-HowtologintoMaxwell-EntranceNodes"/}}Entrance Nodes ===

If you have successfully obtained a PSX account, you can get started. The entrance nodes are:

[[https:~~/~~/max-display.desy.de:3443/auth/ssh>>url:https://max-display.desy.de:3443/auth/ssh||shape="rect"]] (in any case)

These nodes are **not** for processing, as you share them with many other users. Please do not run anything computationally intensive on them, such as reconstruction or visualization. Viewing images is OK.

=== {{id name="00-HowtologintoMaxwell-FastX2"/}}Fast X3 ===

The cluster uses the software FastX3 for connections and virtual desktops. To get the right version, open the web interface and log in; the bottom right corner has a download link for the desktop client. The client version has to match exactly to work properly.

To add a connection in the desktop client, click the plus, select web, enter the address above (including the port) and your username, and force SSH authentication. You can then choose between a virtual desktop (XFCE) and a terminal.

=== {{id name="00-HowtologintoMaxwell-Partitions"/}}Partitions ===

Starting from an entrance node, you can connect to a compute node. Because there are multiple priority levels, the nodes are organized in partitions, and you can only access some of them. To see which ones, open a terminal and use the command:

{{code}}
my-partitions
{{/code}}

Your result will look something like this:

[[image:attach:P5I.User Guide\: NanoCT.4\. Reconstruction Guide.00 - How to login to Maxwell.WebHome@image2021-5-4_10-28-14.png||queryString="version=1&modificationDate=1620116894626&api=v2"]]

== {{id name="00-HowtologintoMaxwell-SLURM"/}}SLURM ==

Access to the resources of the cluster is managed by a scheduler, SLURM.

SLURM schedules access to the nodes and can revoke access if higher-priority jobs arrive.

=== {{id name="00-HowtologintoMaxwell-PSXPartition"/}}PSX Partition ===

In this partition you cannot be kicked out of your allocation. However, it contains only a few nodes, and you can allocate only a few of them in parallel (2021: 5). Some of them have GPUs.

=== {{id name="00-HowtologintoMaxwell-AllPartition"/}}All Partition ===

A very large number of nodes is available, and you can allocate many in parallel (2021: 100). However, each allocation can be revoked without warning if someone with higher priority arrives, and this happens frequently. If you want to use this partition, design your job accordingly. This partition contains CPU nodes only.

=== {{id name="00-HowtologintoMaxwell-AllgpuPartition"/}}Allgpu Partition ===

Like the all partition, but with GPU nodes.

=== {{id name="00-HowtologintoMaxwell-JhubPartition"/}}Jhub Partition ===

This partition is used for JupyterHub.

== {{id name="00-HowtologintoMaxwell-ConnectingtotheCluster"/}}Connecting to the Cluster ==

Connect to an entrance node via FastX. When you start a session, a load balancer automatically assigns you to a node (max-display001-003, max-nova001-002).

[[image:attach:P5I.User Guide\: NanoCT.4\. Reconstruction Guide.00 - How to login to Maxwell.WebHome@image2021-4-27_13-55-52.png||queryString="version=1&modificationDate=1619524552546&api=v2"]]

Choose a graphical interface and look around.

== {{id name="00-HowtologintoMaxwell-DataStorage"/}}Data Storage ==

The Maxwell cluster provides many storage systems. The most important are:

* Your user folder: this has a hard limit of 30 GB. Be sure not to exceed it.
* The GPFS: all beamtime data are stored here.
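
To keep an eye on the user folder limit, a small shell check can help. The following is a minimal sketch, not an official Maxwell tool: the function name `check_folder_limit` is made up for illustration, the limit is assumed to be counted in units of 1024^3 bytes, and `du` may count differently from the quota system.

```shell
#!/bin/sh
# check_folder_limit DIR LIMIT_GB (hypothetical helper, not a Maxwell command)
# Prints "ok" if DIR uses at most LIMIT_GB, otherwise "over".
check_folder_limit() {
    dir="$1"
    limit_gb="$2"
    # du -sk reports the total size of DIR in kilobytes
    used_kb=$(du -sk "$dir" | cut -f1)
    limit_kb=$((limit_gb * 1024 * 1024))
    if [ "$used_kb" -gt "$limit_kb" ]; then
        echo "over"
    else
        echo "ok"
    fi
}

# Example call for the home directory and the 30 GB limit:
# check_folder_limit "$HOME" 30
```

The example call is left commented out because scanning a large home directory can take a while.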

=== {{id name="00-HowtologintoMaxwell-GPFS"/}}GPFS ===

Usually you can find your data at: /asap3/petra3/gpfs/<beamline>/<year>/data/<beamtime_id>

This directory contains the following subfolders:

* raw: raw measurement data. Only the applicant and the beamtime leader can write or delete there.
* processed: for all processed data
* scratch_cc: scratch folder without backup
* shared: for everything else

The GPFS has regular snapshots. Its total capacity is huge (several PB).
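
As an illustration of the path layout described in this section, the sketch below assembles the beamtime directory from its three parts and prints the standard subfolders. The values for `beamline`, `year` and `beamtime_id` are placeholders chosen for this example only; replace them with your own.

```shell
#!/bin/sh
# Placeholder values (assumptions for illustration); replace with your own.
beamline="p05"
year="2021"
beamtime_id="11001234"

beamtime_dir="/asap3/petra3/gpfs/${beamline}/${year}/data/${beamtime_id}"

# Print the four standard subfolders of the beamtime directory
for sub in raw processed scratch_cc shared; do
    echo "${beamtime_dir}/${sub}"
done
```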

== {{id name="00-HowtologintoMaxwell-HowtoGetaComputeNode"/}}How to Get a Compute Node ==

If you want to do some processing, there are two ways to start a job in SLURM:

1. Interactive
1. Batch

In both cases you are the only person working on the node, so you can use its resources freely.

=== {{id name="00-HowtologintoMaxwell-StartinganInteractiveJob"/}}Starting an Interactive Job ===

To get a node, you have to allocate one via SLURM, for example:

{{code}}
salloc -N 1 -p psx -t 1-05:00:00
{{/code}}

Looking at the individual options:

* salloc: requests an interactive (live) allocation
* -N 1: one node
* -p psx: on the psx partition. You can also list several partitions separated by commas: -p psx,all
* -t 1-05:00:00: for a duration of 1 day and 5 hours
* ~-~-mem=500GB: optional, requests a node with at least 500 GB of memory
* ~-~-constraint=P100: optional, use this if you need a GPU
* ... see the SLURM documentation for more options
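
The options listed above can be combined. As a sketch, the hypothetical helper below (the function `build_salloc` is not part of SLURM or Maxwell) only assembles and prints the command line, so you can check it before running it yourself.

```shell
#!/bin/sh
# build_salloc PARTITION NODES TIME [EXTRA OPTIONS...]
# Hypothetical helper: prints a salloc command line without executing it.
build_salloc() {
    partition="$1"
    nodes="$2"
    time_limit="$3"
    shift 3
    cmd="salloc -N ${nodes} -p ${partition} -t ${time_limit}"
    # Append any extra options, such as --mem=500GB or --constraint=P100
    for opt in "$@"; do
        cmd="${cmd} ${opt}"
    done
    echo "$cmd"
}

# One node on psx for 1 day and 5 hours
build_salloc psx 1 1-05:00:00
# One node on psx or all with at least 500 GB of memory
build_salloc psx,all 1 0-02:00:00 --mem=500GB
# One GPU node on psx
build_salloc psx 1 0-02:00:00 --constraint=P100
```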

Once your job is scheduled, you see the assigned node and can connect to it via ssh. (In the rare case that nothing is shown, use my-jobs to find the host name.)
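
SLURM's salloc also exports the allocated node list in the environment variable `SLURM_JOB_NODELIST`. Assuming you are in the shell that salloc opened and requested a single node, the sketch below prints the matching ssh command. The function name `node_ssh_command` is hypothetical, and whether the variable is visible depends on how salloc is configured on the cluster.

```shell
#!/bin/sh
# Hypothetical helper: prints the ssh command for a single-node allocation.
node_ssh_command() {
    if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
        echo "ssh ${SLURM_JOB_NODELIST}"
    else
        echo "no allocation in this shell; use my-jobs to find the host name"
    fi
}

node_ssh_command
```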

=== {{id name="00-HowtologintoMaxwell-Startingabatchjob"/}}Starting a batch job ===

For a batch job you need a small shell script describing what you want to do. You do not see the job directly, but its output is written to a log file (and results can be stored on disk).

With a batch job, you can also start an array job, where the same task is executed on multiple nodes in parallel.

An example of such a script:

{{code}}
#!/bin/bash
#SBATCH --time 0-01:00:00
#SBATCH --nodes 1
#SBATCH --partition all,ps
#SBATCH --array 1-80
#SBATCH --mem 250GB
#SBATCH --job-name ExampleScript

# Make the "module" command available in the batch shell
source /etc/profile.d/modules.sh

# Log the job and array identifiers for later debugging
echo "SLURM_JOB_ID $SLURM_JOB_ID"
echo "SLURM_ARRAY_JOB_ID $SLURM_ARRAY_JOB_ID"
echo "SLURM_ARRAY_TASK_ID $SLURM_ARRAY_TASK_ID"
echo "SLURM_ARRAY_TASK_COUNT $SLURM_ARRAY_TASK_COUNT"
echo "SLURM_ARRAY_TASK_MAX $SLURM_ARRAY_TASK_MAX"
echo "SLURM_ARRAY_TASK_MIN $SLURM_ARRAY_TASK_MIN"

module load maxwell gcc/8.2

# Each array task runs the script once and receives its own task ID (1-80).
# The home-relative path avoids depending on the submission directory.
~/.local/bin/ipython3 --pylab=qt5 PathToYourScript/Script.py "$SLURM_ARRAY_TASK_ID"

exit
{{/code}}

To submit the script, use:

{{code}}
sbatch ./your_script.sh
{{/code}}
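
In an array job, each task typically uses `SLURM_ARRAY_TASK_ID` to select its own share of the work. One common pattern, shown here as a sketch, is a text file with one input per line, from which each task picks the line matching its ID. The helper name `pick_input` and the entries `scan_001` to `scan_003` are made up for illustration.

```shell
#!/bin/sh
# pick_input LIST_FILE TASK_ID (hypothetical helper)
# Prints line number TASK_ID (counting from 1) of LIST_FILE.
pick_input() {
    sed -n "${2}p" "$1"
}

# Demo with a temporary list; on the cluster the ID would come from
# $SLURM_ARRAY_TASK_ID and the list would name your real input files.
list_file=$(mktemp)
printf '%s\n' scan_001 scan_002 scan_003 > "$list_file"
pick_input "$list_file" 2
rm -f "$list_file"
```

With this pattern the array range (for example 1-80) has to match the number of lines in the list.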

=== {{id name="00-HowtologintoMaxwell-Viewingyouallocations"/}}Viewing your allocations ===

To view your pending or running allocations, you can use:

{{code}}
squeue -u <username>
{{/code}}

or

{{code}}
my-jobs
{{/code}}
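
When many jobs are queued, a short summary per state is easier to read than the full list. The sketch below counts job states from `squeue` output; the function `summarize_states` is a hypothetical helper, while `-h` (no header) and `-o "%T"` (print only the state) are standard squeue options.

```shell
#!/bin/sh
# summarize_states (hypothetical helper): reads one job state per line on
# stdin and prints "STATE=count" for each state, sorted by state name.
summarize_states() {
    sort | uniq -c | awk '{print $2 "=" $1}'
}

# Only query the scheduler where squeue is actually installed
if command -v squeue >/dev/null 2>&1; then
    # -h omits the header, -o "%T" prints only the job state
    squeue -u "$USER" -h -o "%T" | summarize_states
fi
```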

=== {{id name="00-HowtologintoMaxwell-Whatisrealisticintermsofresources"/}}What is realistic in terms of resources ===

To be fair, you will not get 100 nodes every time you want them. Especially during a user run, the machines are often quite busy. But if you design your scripts to tolerate sudden cancellation, it is still worth trying if you profit from massive parallelization.

For small processing tasks, use one of the psx nodes. This should work most of the time.
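
One simple way to make a script tolerant to sudden cancellation is to skip work that has already been finished, so that a resubmitted job only repeats what was interrupted. The following is a minimal sketch of that idea, not a complete solution; `process_once` and the `.result` suffix are hypothetical, and the byte count stands in for the real processing.

```shell
#!/bin/sh
# process_once INPUT OUTDIR (hypothetical helper)
# Skips inputs whose result already exists, so a restarted job only
# repeats the work that was interrupted.
process_once() {
    input="$1"
    result="$2/$(basename "$input").result"
    if [ -e "$result" ]; then
        echo "skip $input"
        return 0
    fi
    # Placeholder for the real processing: count the bytes of the input.
    # Write to a temporary name first so a cancelled run never leaves a
    # half-written file under the final name.
    wc -c < "$input" > "${result}.tmp"
    mv "${result}.tmp" "$result"
    echo "done $input"
}
```

Calling `process_once` twice with the same arguments does the work only the first time; the second call reports that the input was skipped.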

== {{id name="00-HowtologintoMaxwell-GrantingDataAccesstootherBeamtimes"/}}Granting Data Access to other Beamtimes ==

If you have to add other users to a past beamtime, this can be done via the gamma-portal (by the PI, the beamtime leader or the beamline scientist). After the accounts have been added, these users have to log off from **all** FastX sessions etc. so that the updated permissions take effect.
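
After logging in again, a newly added user can verify from a terminal that the permissions have arrived. The sketch below uses only standard shell tests; the function `check_access` is a hypothetical helper, and the path in the example call still contains the placeholders that have to be filled in.

```shell
#!/bin/sh
# check_access DIR (hypothetical helper): reports whether the current
# session can list and enter DIR.
check_access() {
    if [ -r "$1" ] && [ -x "$1" ]; then
        echo "access ok: $1"
    else
        echo "no access: $1"
    fi
}

# Example call; replace the parts in <> with your own values:
# check_access "/asap3/petra3/gpfs/<beamline>/<year>/data/<beamtime_id>"
```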
