#162 closed enhancement (fixed)
SLURM support
| Reported by: | ramonb | Owned by: | ramonb |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.1 |
| Component: | jobmond | Version: | 1.0 |
| Keywords: | | Cc: | |
| Estimated Number of Hours: | | | |
Description
It would be nice to have SLURM support in jobmond.
Change History (14)
comment:1 Changed 10 years ago by ramonb
- Owner changed from somebody to ramonb
- Status changed from new to assigned
comment:2 Changed 10 years ago by ramonb
comment:3 Changed 10 years ago by ramonb
It seems pyslurm no longer works correctly with the latest SLURM. squeue does show the job's allocated nodes:

    ramonb@r7n17:~$ squeue -o '%N' -j 37
    NODELIST
    r7n[18,20]
    ramonb@r7n17:~$
But the nodes property is not set when querying the job through pyslurm:
    >>> j.get()[37]
    {u'comment': None, u'time_limit': 120L, u'cnode_cnt': None, u'alloc_node': u'r7n17', u'features': [], u'eligible_time': 1366992590, u'contiguous': False, u'resv_id': None, u'ramdisk_image': None, u'block_id': None, u'sockets_per_node': 65534, u'req_switch': 0L, u'resv_name': None, u'licenses': {}, u'qos': None, u'submit_time': 1366992590, u'mloader_image': None, u'num_cpus': 2L, u'conn_type': (None, 'None'), u'show_flags': 0, u'user_id': 31005L, u'network': None, u'restart_cnt': 0, u'work_dir': u'/home/ramonb', u'pn_min_tmp_disk': 0L, u'max_nodes': 0L, u'job_state': (1, 'RUNNING'), u'assoc_id': 0L, u'exit_code': 0L, u'num_nodes': 2L, u'priority': 4294901747L, u'batch_script': None, u'boards_per_node': 0, u'ntasks_per_socket': 65535, u'batch_flag': 1, u'derived_ec': 0L, u'nodes': None, u'preempt_time': 0, u'pn_min_cpus': 1, u'nice': 10000, u'ntasks_per_node': 0, u'linux_image': None, u'altered': None, u'sockets_per_board': 0, u'alloc_sid': 7956L, u'start_time': 1366992590, u'pre_sus_time': 0, u'ionodes': None, u'state_reason': (0, 'None'), u'pn_min_memory': 0L, u'rotate': False, u'reboot': None, u'blrts_image': None, u'shared': 2, u'time_min': 0L, u'wait4switch': 0L, u'ntasks_per_core': 65535, u'wckey': None, u'account': None, u'requeue': True, u'name': u'test.slurm', u'req_nodes': [], u'gres': [], u'suspend_time': 0, u'partition': 'batch', u'cores_per_socket': 65534, u'batch_host': u'r7n18', u'dependency': None, u'max_cpus': 0L, u'state_desc': None, u'command': u'/home/ramonb/test.slurm', u'end_time': 1366999790, u'cpus_per_task': 1, u'resize_time': 0, u'group_id': 31016L, u'exc_nodes': [], u'threads_per_core': 65534}
    >>> type( j.get()[37]['nodes'] )
    <type 'NoneType'>
    >>>
I can't seem to find a working property or method that returns the job's node list.
According to http://slurm.schedmd.com/slurm_ug_2012/pyslurm.pdf, jobDict[id]['nodes'] should contain the node list, but it does not for me with SLURM 2.5.5.
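For reference, a minimal sketch of the kind of lookup jobmond needs here, assuming j above is a pyslurm.job() object; the helper name and the squeue fallback are illustrative, not the actual jobmond code:

```python
import subprocess

import pyslurm


def get_job_nodes(job_id):
    """Return the allocated node list for a job, or None if unknown.

    Illustrative sketch: works around the pyslurm issue above by falling
    back to squeue when the job dict's 'nodes' field comes back as None.
    """
    job_info = pyslurm.job().get().get(job_id, {})
    nodes = job_info.get('nodes')

    if nodes is None:
        # squeue -h suppresses the NODELIST header; %N prints the node list.
        output = subprocess.check_output(
            ['squeue', '-h', '-o', '%N', '-j', str(job_id)])
        nodes = output.strip() or None

    return nodes
```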
comment:4 Changed 10 years ago by ramonb
Opened a pull request to fix the bug:
The nodes value was being set twice, and the second assignment overrode the first.
comment:5 Changed 10 years ago by ramonb
I fixed the bug and the pull request was approved. Now I can continue implementing SLURM support.
comment:6 Changed 10 years ago by ramonb
In 851:
comment:7 Changed 10 years ago by ramonb
One caveat: pyslurm cannot connect to a remote SLURM server, since it uses the local SLURM C API.
TODO:
- print an error if BATCH_API == 'slurm' and BATCH_SERVER != 'localhost' (see the sketch after this list)
- test/fix job's ppn detection for SLURM
- test/fix job's requested_memory detection for SLURM
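A minimal sketch of the first TODO item, assuming the BATCH_API and BATCH_SERVER settings are available as plain strings inside jobmond; the function name and error text are illustrative:

```python
import sys


def check_batch_server(batch_api, batch_server):
    """Refuse to start when BATCH_API is 'slurm' but BATCH_SERVER is remote,
    since pyslurm only talks to the local SLURM C API.

    Illustrative sketch; the real jobmond option handling may differ.
    """
    if batch_api == 'slurm' and batch_server != 'localhost':
        sys.stderr.write(
            "FATAL: BATCH_API 'slurm' requires BATCH_SERVER 'localhost' "
            "(pyslurm uses the local SLURM C API), got '%s'\n" % batch_server)
        sys.exit(1)
```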
comment:8 Changed 10 years ago by ramonb
In 852:
comment:9 Changed 10 years ago by ramonb
PPN is going to be tricky due to the way SLURM CPU allocation works as described here:
comment:10 Changed 10 years ago by ramonb
Going to handle PPN the same as with Torque, meaning the CPUs requested per node at submission in the batch script (see the sketch below).
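A rough sketch of what that could look like, based on the pyslurm job dict shown in comment 3; which fields to prefer (ntasks_per_node, cpus_per_task, num_cpus, num_nodes) is an assumption, not necessarily the logic that was committed:

```python
def ppn_from_job(job_info):
    """Approximate a Torque-style PPN (CPUs requested per node) from a
    pyslurm job dict like the one in comment 3.

    Field selection is an illustrative guess, not the committed jobmond code.
    """
    ntasks_per_node = job_info.get('ntasks_per_node') or 0
    cpus_per_task = job_info.get('cpus_per_task') or 1

    if ntasks_per_node:
        # The batch script asked for an explicit number of tasks per node.
        return ntasks_per_node * cpus_per_task

    num_cpus = job_info.get('num_cpus') or 0
    num_nodes = job_info.get('num_nodes') or 1
    # Otherwise spread the total CPU request evenly across the nodes.
    return max(1, int(num_cpus) // int(num_nodes))
```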
comment:11 Changed 10 years ago by ramonb
In 853:
comment:12 Changed 10 years ago by ramonb
In 854:
comment:13 Changed 10 years ago by ramonb
- Resolution set to fixed
- Status changed from assigned to closed
SLURM support is now complete!
Will update docs/wiki upon release.
comment:14 Changed 10 years ago by ramonb
In 866:
In 837: