Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#162 closed enhancement (fixed)

SLURM support

Reported by: ramonb
Owned by: ramonb
Priority: normal
Milestone: 1.1
Component: jobmond
Version: 1.0
Keywords:
Cc:
Estimated Number of Hours:

Description

It would be nice to have SLURM support in jobmond.

Change History (14)

comment:1 Changed 11 years ago by ramonb

  • Owner changed from somebody to ramonb
  • Status changed from new to assigned

comment:2 Changed 11 years ago by ramonb

In 837:

  • first attempt at SLURM support (see the sketch below)
  • needs more testing: not sure yet whether node reporting works for multi-node jobs
  • see #162
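
A minimal sketch of what querying SLURM jobs through pyslurm can look like. The pyslurm.job().get() call and the field names (job_state, name, partition, nodes) follow the dictionary dump in comment:3 below; the surrounding helper is illustrative, not the actual jobmond code:

import pyslurm

# Sketch only: collect running SLURM jobs through pyslurm. Field names match
# the job dictionary shown in comment:3; the helper itself is illustrative.
def get_slurm_jobs():
    jobs = {}
    for job_id, job in pyslurm.job().get().items():
        state = job['job_state']
        if isinstance(state, tuple):          # e.g. (1, 'RUNNING')
            state = state[1]
        if state != 'RUNNING':
            continue
        jobs[str(job_id)] = {
            'name':  job['name'],
            'queue': job['partition'],
            'nodes': job['nodes'],            # None here is the pyslurm bug discussed below
        }
    return jobs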

comment:3 Changed 11 years ago by ramonb

It seems pyslurm no longer works correctly with the latest SLURM. squeue still shows the job's node list:

ramonb@r7n17:~$ squeue -o '%N' -j 37
NODELIST
r7n[18,20]
ramonb@r7n17:~$ 

The nodes property, however, is not set when the job is queried through pyslurm:

>>> j.get()[37]
{u'comment': None, u'time_limit': 120L, u'cnode_cnt': None, u'alloc_node': u'r7n17', u'features': [], u'eligible_time': 1366992590, u'contiguous': False, u'resv_id': None, u'ramdisk_image': None, u'block_id': None, u'sockets_per_node': 65534, u'req_switch': 0L, u'resv_name': None, u'licenses': {}, u'qos': None, u'submit_time': 1366992590, u'mloader_image': None, u'num_cpus': 2L, u'conn_type': (None, 'None'), u'show_flags': 0, u'user_id': 31005L, u'network': None, u'restart_cnt': 0, u'work_dir': u'/home/ramonb', u'pn_min_tmp_disk': 0L, u'max_nodes': 0L, u'job_state': (1, 'RUNNING'), u'assoc_id': 0L, u'exit_code': 0L, u'num_nodes': 2L, u'priority': 4294901747L, u'batch_script': None, u'boards_per_node': 0, u'ntasks_per_socket': 65535, u'batch_flag': 1, u'derived_ec': 0L, u'nodes': None, u'preempt_time': 0, u'pn_min_cpus': 1, u'nice': 10000, u'ntasks_per_node': 0, u'linux_image': None, u'altered': None, u'sockets_per_board': 0, u'alloc_sid': 7956L, u'start_time': 1366992590, u'pre_sus_time': 0, u'ionodes': None, u'state_reason': (0, 'None'), u'pn_min_memory': 0L, u'rotate': False, u'reboot': None, u'blrts_image': None, u'shared': 2, u'time_min': 0L, u'wait4switch': 0L, u'ntasks_per_core': 65535, u'wckey': None, u'account': None, u'requeue': True, u'name': u'test.slurm', u'req_nodes': [], u'gres': [], u'suspend_time': 0, u'partition': 'batch', u'cores_per_socket': 65534, u'batch_host': u'r7n18', u'dependency': None, u'max_cpus': 0L, u'state_desc': None, u'command': u'/home/ramonb/test.slurm', u'end_time': 1366999790, u'cpus_per_task': 1, u'resize_time': 0, u'group_id': 31016L, u'exc_nodes': [], u'threads_per_core': 65534}
>>> type( j.get()[37]['nodes'] )
<type 'NoneType'>
>>> 

I can't seem to find a working property or method that returns the job's node list.

According to http://slurm.schedmd.com/slurm_ug_2012/pyslurm.pdf, jobDict[id]['nodes'] should contain the node list, but it does not for me with SLURM 2.5.5.
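
Until the pyslurm fix lands, one possible workaround is to shell out to squeue for the node list, since the command above does return it. A hedged sketch; the helper name is made up and it assumes squeue is on the PATH:

import subprocess

# Hypothetical fallback: ask squeue for a job's node list, since
# `squeue -h -o '%N' -j <jobid>` still works when pyslurm returns None.
def nodelist_from_squeue(job_id):
    out = subprocess.check_output(
        [ 'squeue', '-h', '-o', '%N', '-j', str(job_id) ])
    out = out.strip()
    return out or None                        # e.g. 'r7n[18,20]'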

comment:4 Changed 11 years ago by ramonb

Opened a pull request to fix the bug: the nodes value was being set twice and then overridden.

comment:5 Changed 11 years ago by ramonb

I fixed the bug and the pull request was approved. Now I can continue implementing SLURM support.

comment:6 Changed 11 years ago by ramonb

In 851:

jobmond.py:

  • implemented running-node detection for SLURM jobs (see the sketch below)
  • fixed a bug where an incorrect command-line option triggered a traceback in usage()
  • see #162
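
The node detection has to turn SLURM's compact node notation (e.g. r7n[18,20] from the squeue output above) into individual hostnames. Below is a rough sketch of that expansion; it is not the changeset's actual code and only handles the simple bracket form:

import re

# Sketch only: expand a simple compact nodelist such as 'r7n[18,20]' or
# 'r7n[18-20]' into individual hostnames. Real SLURM nodelists can be more
# complex (multiple prefixes, zero padding); scontrol or pyslurm's hostlist
# handling would be more robust.
def expand_nodelist(nodelist):
    m = re.match(r'^(\S+?)\[([\d,\-]+)\]$', nodelist)
    if not m:
        return [ nodelist ]                   # single node, e.g. 'r7n18'
    prefix, ranges = m.groups()
    nodes = []
    for part in ranges.split(','):
        if '-' in part:
            start, end = part.split('-')
            for i in range(int(start), int(end) + 1):
                nodes.append('%s%d' % (prefix, i))
        else:
            nodes.append(prefix + part)
    return nodes

# expand_nodelist('r7n[18,20]') -> ['r7n18', 'r7n20']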

comment:7 Changed 11 years ago by ramonb

One caveat: pyslurm cannot connect to a remote SLURM server, since it uses the local SLURM C API.

TODO:

  • print error if BATCH_API == 'slurm' and BATCH_SERVER != 'localhost'
  • test/fix job's ppn detection for SLURM
  • test/fix job's requested_memory detection for SLURM

comment:8 Changed 11 years ago by ramonb

In 852:

jobmond.py:

  • requesting memory in SLURM is done by specifying the minimum real memory required
  • if no minimum memory is requested, SLURM reports 0
  • fixed: leave requested_memory empty when it is 0, so it won't be set in gmetric (see the sketch below)
  • see #162
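
Roughly what the fix amounts to, using the pn_min_memory field visible in the job dictionary in comment:3 (the helper and its return convention are placeholders, not the actual jobmond code):

# Sketch: only report requested_memory when the job actually asked for a
# minimum amount of real memory; SLURM reports 0 when nothing was requested,
# and an empty value keeps the metric out of gmetric.
def requested_memory_metric(job):
    req_mem = job.get('pn_min_memory', 0)
    if req_mem:
        return str(req_mem)
    return ''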

comment:9 Changed 11 years ago by ramonb

PPN is going to be tricky due to the way SLURM CPU allocation works, as described here:

comment:10 Changed 11 years ago by ramonb

Going to handle PPN the same way as with Torque, i.e. the number of CPUs requested per node by the submission in the batch script.
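
One plausible way to derive a Torque-style PPN from the job dictionary fields shown in comment:3. Whether jobmond computes it exactly like this is not stated here, so treat the field choice as an assumption:

# Assumption: approximate "processors per node" the Torque way, i.e. what the
# submission asked for per node. ntasks_per_node is 0 when it was not set, so
# fall back to spreading the total CPU count over the allocated nodes.
def job_ppn(job):
    ntasks_per_node = job.get('ntasks_per_node') or 0
    cpus_per_task   = job.get('cpus_per_task') or 1
    if ntasks_per_node:
        return ntasks_per_node * cpus_per_task
    num_nodes = job.get('num_nodes') or 1
    return max(1, int(job.get('num_cpus', 1)) // num_nodes)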

comment:11 Changed 11 years ago by ramonb

In 853:

jobmond.py:

comment:12 Changed 11 years ago by ramonb

In 854:

jobmond.py:

  • print a warning if BATCH_SERVER != localhost and BATCH_API does not support connecting to a remote BATCH_SERVER (see the sketch below)
  • fun fact: discovered that the developers of the SGE/LSF implementations also completely ignore the BATCH_SERVER setting
  • see #162
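
The warning could look roughly like this; the config values here are placeholders and the message wording is illustrative, not the exact jobmond output:

# Sketch of the remote-server check: pyslurm uses the local SLURM C API, so a
# non-local BATCH_SERVER cannot be honoured by the slurm backend.
BATCH_API    = 'slurm'                        # placeholder config values
BATCH_SERVER = 'headnode01'

if BATCH_API == 'slurm' and BATCH_SERVER not in ( None, '', 'localhost' ):
    print( 'WARNING: BATCH_API %s does not support a remote BATCH_SERVER (%s); '
           'using the local SLURM instance instead' % ( BATCH_API, BATCH_SERVER ) )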

comment:13 Changed 11 years ago by ramonb

  • Resolution set to fixed
  • Status changed from assigned to closed

SLURM support is now complete!

Will update docs/wiki upon release.

comment:14 Changed 11 years ago by ramonb

In 866:

jobmond.py:

  • added down/offline node detection for SLURM (see the sketch below)
  • state mapping: SLURM down = down, drain = offline
  • see #162
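
A hedged sketch of that state mapping. It assumes pyslurm's node interface exposes a textual state per node (e.g. 'DOWN', 'IDLE+DRAIN'), which may differ between pyslurm versions:

import pyslurm

# Sketch only: classify SLURM nodes following the mapping in this changeset
# (down -> down, drain -> offline). The 'state' key is an assumption about
# the pyslurm node dictionary.
def down_offline_nodes():
    down, offline = [], []
    for name, node in pyslurm.node().get().items():
        state = str( node.get('state', '') ).upper()
        if 'DOWN' in state:
            down.append( name )
        elif 'DRAIN' in state:
            offline.append( name )
    return down, offline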