__Ganglia__ (https://uhhpc.herts.ac.uk/ganglia/) can be useful to see the state of nodes.
If a node goes down while a user’s job is running on it, the job will not terminate properly
and may flood the user’s inbox with notifications. If `Ganglia` or `showstate` reports that a node
is down, consider rebooting it with
`sudo rebootnode.pl nodexxx`
This will prompt you for the IDRAC password, which is `rianhs4b`. Once a node has been rebooted,
wait a few minutes, then check that you can ssh into it as a normal user and view your home
directory and `/beegfs`. If so, bring it back online with
`sudo pbsnodes -c nodexxx`
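
The post-reboot check described above could be scripted roughly as follows. This is only a sketch: `check_node` and `node_ready_message` are hypothetical helper names, not part of the cluster tooling, and the script assumes you run it as a normal user with key-based ssh access to the node. The `SSH_CMD` variable is an illustrative hook that lets you substitute another command for a dry run.

```shell
#!/bin/sh
# Sketch: confirm a rebooted node is usable before bringing it back online.
# SSH_CMD is an assumed override hook; it defaults to the real ssh client.
SSH_CMD="${SSH_CMD:-ssh}"

# Succeeds only if we can log in and both the home directory and /beegfs exist.
# BatchMode makes ssh fail rather than prompt; ConnectTimeout bounds the wait.
# The single quotes ensure $HOME is expanded on the remote node, not locally.
check_node() {
    $SSH_CMD -o BatchMode=yes -o ConnectTimeout=10 "$1" \
        'ls -d "$HOME" /beegfs' >/dev/null 2>&1
}

# Prints the suggested next step and returns non-zero if the check failed.
node_ready_message() {
    if check_node "$1"; then
        echo "$1 OK: run 'sudo pbsnodes -c $1' to bring it back online"
    else
        echo "$1 FAILED: home directory or /beegfs not reachable; leave it offline"
        return 1
    fi
}
```

After sourcing the file, call `node_ready_message nodexxx` a few minutes after the reboot. The script deliberately only prints the `pbsnodes` command rather than running it, so the decision to return the node to service stays with the administrator.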
If a node is misbehaving and you cannot or do not want to reboot it, you can temporarily remove it
from the pool used by the job control system with
`pbsnodes -o nodexxx`
This is also reversed by
`pbsnodes -c nodexxx`