Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#162 closed enhancement (fixed)

SLURM support

Reported by: ramonb Owned by: ramonb
Priority: normal Milestone: 1.1
Component: jobmond Version: 1.0
Keywords: Cc:
Estimated Number of Hours:

Description

Would be nice to have SLURM support in jobmond

Change History (14)

comment:1 Changed 7 years ago by ramonb

  • Owner changed from somebody to ramonb
  • Status changed from new to assigned

comment:2 Changed 7 years ago by ramonb

In 837:

  • first attempt at SLURM support
  • need to test more: not sure if node reporting works for multinode jobs
  • see #162

comment:3 Changed 7 years ago by ramonb

It seems pyslurm no longer works correctly with the latest SLURM:

ramonb@r7n17:~$ squeue -o '%N' -j 37
NODELIST
r7n[18,20]
ramonb@r7n17:~$ 

The nodes property is not set when querying the job:

>>> j.get()[37]
{u'comment': None, u'time_limit': 120L, u'cnode_cnt': None, u'alloc_node': u'r7n17', u'features': [], u'eligible_time': 1366992590, u'contiguous': False, u'resv_id': None, u'ramdisk_image': None, u'block_id': None, u'sockets_per_node': 65534, u'req_switch': 0L, u'resv_name': None, u'licenses': {}, u'qos': None, u'submit_time': 1366992590, u'mloader_image': None, u'num_cpus': 2L, u'conn_type': (None, 'None'), u'show_flags': 0, u'user_id': 31005L, u'network': None, u'restart_cnt': 0, u'work_dir': u'/home/ramonb', u'pn_min_tmp_disk': 0L, u'max_nodes': 0L, u'job_state': (1, 'RUNNING'), u'assoc_id': 0L, u'exit_code': 0L, u'num_nodes': 2L, u'priority': 4294901747L, u'batch_script': None, u'boards_per_node': 0, u'ntasks_per_socket': 65535, u'batch_flag': 1, u'derived_ec': 0L, u'nodes': None, u'preempt_time': 0, u'pn_min_cpus': 1, u'nice': 10000, u'ntasks_per_node': 0, u'linux_image': None, u'altered': None, u'sockets_per_board': 0, u'alloc_sid': 7956L, u'start_time': 1366992590, u'pre_sus_time': 0, u'ionodes': None, u'state_reason': (0, 'None'), u'pn_min_memory': 0L, u'rotate': False, u'reboot': None, u'blrts_image': None, u'shared': 2, u'time_min': 0L, u'wait4switch': 0L, u'ntasks_per_core': 65535, u'wckey': None, u'account': None, u'requeue': True, u'name': u'test.slurm', u'req_nodes': [], u'gres': [], u'suspend_time': 0, u'partition': 'batch', u'cores_per_socket': 65534, u'batch_host': u'r7n18', u'dependency': None, u'max_cpus': 0L, u'state_desc': None, u'command': u'/home/ramonb/test.slurm', u'end_time': 1366999790, u'cpus_per_task': 1, u'resize_time': 0, u'group_id': 31016L, u'exc_nodes': [], u'threads_per_core': 65534}
>>> type( j.get()[37]['nodes'] )
<type 'NoneType'>
>>> 

I can't seem to find a working property or method that returns the job's node list.

According to this: http://slurm.schedmd.com/slurm_ug_2012/pyslurm.pdf

jobDict[id]['nodes'] should contain the node list, but it does not for me with SLURM 2.5.5.

comment:4 Changed 7 years ago by ramonb

Opened a pull request to fix the bug:

The nodes value was being set twice, so the second assignment overrode the first.

comment:5 Changed 7 years ago by ramonb

I fixed the bug and the pull request was approved. Now I can continue implementing SLURM support.

comment:6 Changed 7 years ago by ramonb

In 851:

jobmond.py:

  • implemented SLURM job's running node detection
  • fixed bug where incorrect commandline option would trigger traceback in usage()
  • see #162
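
For illustration only: squeue reports a job's nodes in compressed form (r7n[18,20] in comment:3), so running-node detection needs one hostname per node at some point. The sketch below shows one way to expand such a string in plain Python. It is not the code committed in 851; the function name expand_nodelist is hypothetical, and it only handles a name prefix followed by at most one bracket group.

```python
import re

# Hypothetical helper, not taken from jobmond.py.
# Assumption: each entry is "<prefix>" or "<prefix>[a,b,c-d]" with no suffix
# after the closing bracket and no nested or multiple bracket groups.
_ENTRY = re.compile(r'([^,\[\]]+)(?:\[([^\]]+)\])?')


def expand_nodelist(nodelist):
    """Expand a compressed node list such as 'r7n[18,20]' into hostnames."""
    hosts = []
    for match in _ENTRY.finditer(nodelist):
        prefix, group = match.group(1), match.group(2)
        if group is None:
            # Plain hostname without a bracket group.
            hosts.append(prefix)
            continue
        for item in group.split(','):
            if '-' in item:
                # Range such as 01-03; keep the zero padding of the lower bound.
                low, high = item.split('-')
                for number in range(int(low), int(high) + 1):
                    hosts.append(prefix + str(number).zfill(len(low)))
            else:
                hosts.append(prefix + item)
    return hosts
```

With the node list from comment:3, expand_nodelist('r7n[18,20]') yields the two hostnames r7n18 and r7n20.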

comment:7 Changed 7 years ago by ramonb

One caveat: pyslurm cannot connect remotely to a SLURM server, since it uses the local SLURM C API.

TODO:

  • print error if BATCH_API == 'slurm' and BATCH_SERVER != 'localhost'
  • test/fix job's ppn detection for SLURM
  • test/fix job's requested_memory detection for SLURM

comment:8 Changed 7 years ago by ramonb

In 852:

jobmond.py:

  • requesting memory in SLURM is done by specifying minimum real memory required
  • if no minimum memory is requested it returns: 0
  • fixed: leave requested_memory empty if 0, so it won't be set in gmetric
  • see #162
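
A minimal sketch of the rule described in this changeset, assuming the minimum memory comes from the pn_min_memory field seen in the job dictionary in comment:3. Whether jobmond reads exactly that field is an assumption, and the function name requested_memory_value is hypothetical.

```python
# Hypothetical helper, not taken from jobmond.py.
# Assumption: the value passed in is the job's pn_min_memory, which is 0
# (or missing) when the job did not request a minimum amount of memory.
def requested_memory_value(pn_min_memory):
    """Return the requested memory as a string, or '' if none was requested."""
    if not pn_min_memory:
        # Covers both 0 and None: leave the field empty so that the caller
        # can skip reporting it as a metric.
        return ''
    return str(pn_min_memory)
```

A caller would then only report requested_memory when the returned string is non-empty.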

comment:9 Changed 7 years ago by ramonb

PPN is going to be tricky due to the way SLURM CPU allocation works as described here:

comment:10 Changed 7 years ago by ramonb

Going to handle PPN the same way as with Torque: the number of CPUs requested per node in the submitted batch script.

comment:11 Changed 7 years ago by ramonb

In 853:

jobmond.py:

comment:12 Changed 7 years ago by ramonb

In 854:

jobmond.py:

  • print warning if BATCH_SERVER != localhost and BATCH_API does not support connecting to remote BATCH_SERVER
  • fun fact: discovered that developers of the SGE/LSF implementation also completely ignore the BATCH_SERVER setting
  • see #162
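
The check described above could look roughly like the sketch below. It is not the code from 854. The set of local-only APIs follows this ticket (slurm, plus SGE/LSF ignoring BATCH_SERVER), but the exact identifier spellings, the function name and the warning text are assumptions.

```python
# Hypothetical sketch, not taken from jobmond.py.
# Assumption: BATCH_API values are lowercase strings, and these are the
# APIs that can only talk to a batch server on the local machine.
LOCAL_ONLY_APIS = ('slurm', 'sge', 'lsf')


def remote_server_warning(batch_api, batch_server):
    """Return a warning string, or None when the configuration is fine."""
    if batch_server in ('localhost', '127.0.0.1'):
        return None
    if batch_api not in LOCAL_ONLY_APIS:
        return None
    return ("WARNING: BATCH_API '%s' cannot connect to remote "
            "BATCH_SERVER '%s'; using the local server instead"
            % (batch_api, batch_server))
```

Returning the message instead of printing it keeps the check easy to test; the caller decides whether to print or log it.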

comment:13 Changed 7 years ago by ramonb

  • Resolution set to fixed
  • Status changed from assigned to closed

SLURM support is now complete!

Will update docs/wiki upon release.

comment:14 Changed 7 years ago by ramonb

In 866:

jobmond.py:

  • added down/offline node detection for SLURM
  • state: down = down, drain = offline
  • see #162
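
The state mapping above, sketched as a small function. This is an illustration and not the code from 866: the function name node_status is hypothetical, and it assumes the node state is available as a string such as 'DOWN' or 'DRAINED'. Checking for drain before down is a choice made for this sketch.

```python
# Hypothetical sketch, not taken from jobmond.py.
# Assumption: slurm_state is a state string; matching is done on substrings
# so that variants such as 'DRAINING' or 'DOWN*' are covered as well.
def node_status(slurm_state):
    """Map a SLURM node state to 'down', 'offline', or None."""
    state = slurm_state.lower()
    if 'drain' in state:
        return 'offline'
    if 'down' in state:
        return 'down'
    # Any other state is neither down nor offline.
    return None
```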