Opened 7 years ago

Last modified 6 years ago

#62 assigned defect

job submission fails in pbs_python but works fine with qsub

Reported by: Elena Vataga <e.vataga@…> Owned by: bas
Priority: major Milestone:
Component: pbs Version: 4.6.0
Keywords: pbs_submit Cc: e.vataga@…

Description

Hello,

I am trying to set up a Galaxy server on our compute cluster and have run into a problem: job submission to PBS is very unreliable. In a few cases it works, but more often it fails. Restarting pbs_server fixes the problem for a while, but then it comes back. We run a production system with many users, of which Galaxy is a tiny fraction, so restarting pbs_server cannot be a solution. The errors Galaxy reports are completely irrelevant, but running a simple Python script that calls pbs.pbs_submit(...) from the command line gives:

15044 Resources temporarily unavailable

In the PBS logs I see the corresponding entry:

PBS_Server.20463;Svr;PBS_Server;LOG_ERROR::Unauthorized Request (15007) in req_jobscript, cannot authorize request (0-Success)
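
To see how often this error occurs between pbs_server restarts, entries like the one above can be picked out of the server log. The following is only a sketch: the regular expression is derived from the single entry quoted in this ticket, and the name parse_log_error is hypothetical. Other Torque versions may format log entries differently.

```python
import re

# Pattern inferred from the one log entry quoted in this ticket (an assumption,
# not a documented Torque log grammar).
LOG_ERROR_RE = re.compile(
    r"LOG_ERROR::(?P<message>.+?) \((?P<code>\d+)\) in (?P<function>\w+)"
)


def parse_log_error(line):
    """Return (code, message, function) for a LOG_ERROR entry, else None."""
    match = LOG_ERROR_RE.search(line)
    if match is None:
        return None
    return (int(match.group("code")),
            match.group("message"),
            match.group("function"))


sample = ("PBS_Server.20463;Svr;PBS_Server;LOG_ERROR::Unauthorized Request "
          "(15007) in req_jobscript, cannot authorize request (0-Success)")
print(parse_log_error(sample))
```

Applying parse_log_error to every line of a log file and counting the non-None results would give a rough failure count per day.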

Other example scripts, such as ha_server.py or pbsnodes-a.py, work fine. Google turned up this report, which sounds similar to our problem: http://www.supercluster.org/pipermail/torqueusers/2014-January/016735.html

Is it indeed the case that pbs_python uses pbs_submit and not pbs_submit_hash? Are there any plans to move to pbs_submit_hash (assuming that it would fix this problem)?

We use Moab with PBS/Torque version 4.2.9 and Python 2.6.6.

Let me know if you need any additional information.
Thank you
Kind regards

Elena

Attachments (1)

config.log (4.9 KB) - added by e.vataga@… 7 years ago.
Added by email2trac

Change History (10)

comment:1 Changed 7 years ago by bas

  • Status changed from new to assigned

Can you reply on this thread:

I asked for more info, but to this day there has been no answer :-(.

comment:2 Changed 7 years ago by bas

At our work we are using Torque 5.X and 2.5.X. I will test it with that pbs_python version. That is a different one than for Torque 4.X. I assume you are using Torque 4.X.

comment:3 Changed 7 years ago by e.vataga@…

Thank you for the prompt reply.
Our Torque version is 4.2.9.
Our university has a support agreement with Adaptive,
so I have opened a ticket asking about the
difference between pbs_submit and pbs_submit_hash.
I hope that will help to move this forward.

This is unrelated to this ticket, but maybe you could help.
We have another small cluster which runs Torque 2.5.9.
I tried to install pbs_python there; the installation runs fine,
but importing the module fails with:

>>> import pbs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python2.6/site-packages/pbs/pbs.py", line 25, in <module>
    _pbs = swig_import_helper()
  File "/usr/local/lib64/python2.6/site-packages/pbs/pbs.py", line 21, in swig_import_helper
    _mod = imp.load_module('_pbs', fp, pathname, description)
ImportError: libtorque.so.2: cannot open shared object file: No such file or directory


There were other tickets with a similar problem, but not for Torque 2.5.9.
I am attaching config.log.
Getting Galaxy working on at least one cluster would be a great help.

Kind regards
Elena


On 17/09/2015 18:26, pbs_python wrote:
> #62: job submission fails in pbs_python but works fine with qsub
>
> Comment (by bas):
>
>   At our work we are using Torque 5.X and 2.5.X. I will test it with that
>   pbs_python version. That is a different one than for Torque 4.X. I assume
>   you are using Torque 4.X.
>
> --
> Ticket URL: <https://oss.trac.surfsara.nl/pbs_python/ticket/62#comment:2>
> pbs_python <https://oss.trac.surfsara.nl/pbs_python>
> The pbs_python package is a wrapper class for the Torque C library.
> With this package you now can write utilities/extensions in Python instead
> of C.

config.log

Changed 7 years ago by e.vataga@…

Added by email2trac

comment:4 follow-up: Changed 7 years ago by bas

I have a question about how you submit your jobs. Do you have an example script:

  1. Do you open/close the connection for each submit
  2. Or open, submit several jobs and then close

Elena, about the other problem: do you have libtorque.so.2, where is it installed, and what does echo $LD_LIBRARY_PATH print?
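
One way to check this is to look in each LD_LIBRARY_PATH directory for the file named in the ImportError from comment:3. The following is a minimal sketch; the helper name find_library_in_path is hypothetical, and it only inspects LD_LIBRARY_PATH. The dynamic loader also consults its cache and default directories, so an empty result here does not by itself prove the library is missing from the system.

```python
import os


def find_library_in_path(lib_name, search_path):
    """Return every existing file named lib_name in the directories of search_path."""
    matches = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries, e.g. from a trailing ':'
        candidate = os.path.join(directory, lib_name)
        if os.path.isfile(candidate):
            matches.append(candidate)
    return matches


if __name__ == "__main__":
    # libtorque.so.2 is the file name reported by the ImportError in comment:3.
    path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = find_library_in_path("libtorque.so.2", path)
    if hits:
        print("found: %s" % ", ".join(hits))
    else:
        print("libtorque.so.2 not found in LD_LIBRARY_PATH=%r" % path)
```

If nothing is found, adding the directory that contains the library to LD_LIBRARY_PATH before starting Python is the usual next step.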

comment:5 Changed 7 years ago by anonymous

There is a solution for the pbs_submit problem, see #54

comment:7 Changed 7 years ago by e.vataga@…

Oops, I replied to this ticket yesterday using the web interface but cannot
see my answer; I probably did not press the right button.
First of all, sorry for the silence: I was on leave last week.
We got a patch from Adaptive last Friday and applied it yesterday.
So far the problem has not reappeared, but it usually takes some time to
show up, and we needed to restart pbs_server several times for other reasons.
Our symptoms are a bit different from what is described on the Torque
mailing list: there, jobs start but cannot find the script to execute;
in our case the job is not submitted at all.

Anyhow, I will update you on this patch in a few days.

P.S. Thank you for the hint about libtorque.so.2 and LD_LIBRARY_PATH;
that fixed the problem.

comment:8 in reply to: ↑ 4 Changed 7 years ago by Elena Vataga <e.vataga@…>

  • Cc e.vataga@… added

Replying to bas:

I have a question about how you submit your jobs. Do you have an example script:

  1. Do you open/close the connection for each submit
  2. Or open, submit several jobs and then close

It must be the first (open/close the connection for each submit):

$ cat test3.py 
import pbs

server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)
print "Found server:     " + server_name
print "Connection value: " + str(c)

job_id = pbs.pbs_submit(c, 'NULL', "run_simple.pbs", 'batch', 'NULL')

e, e_txt = pbs.error()
if e:
        print e,e_txt

print job_id
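
For comparison, the two connection patterns asked about in comment:4 can be sketched side by side. This is an illustration only: the function names are hypothetical, the pbs module is passed in as a parameter so the logic can be exercised without a Torque server, and the pbs_submit arguments simply mirror test3.py. pbs_disconnect is assumed to be exposed by pbs_python in the same way as the Torque C call of that name.

```python
def submit_each_with_own_connection(pbs, scripts, queue="batch"):
    """Pattern 1: connect, submit one job, disconnect, repeated per script."""
    job_ids = []
    for script in scripts:
        conn = pbs.pbs_connect(pbs.pbs_default())
        try:
            # Same argument shape as in test3.py above.
            job_ids.append(pbs.pbs_submit(conn, 'NULL', script, queue, 'NULL'))
        finally:
            pbs.pbs_disconnect(conn)
    return job_ids


def submit_all_on_one_connection(pbs, scripts, queue="batch"):
    """Pattern 2: connect once, submit every script, then disconnect."""
    conn = pbs.pbs_connect(pbs.pbs_default())
    try:
        return [pbs.pbs_submit(conn, 'NULL', script, queue, 'NULL')
                for script in scripts]
    finally:
        pbs.pbs_disconnect(conn)
```

With the real module this would be called as submit_each_with_own_connection(pbs, ["run_simple.pbs"]) after import pbs. Neither function checks pbs.error(), so a failed submission still has to be detected by the caller as test3.py does.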

comment:9 follow-up: Changed 6 years ago by anonymous

This was due to a bug in Torque, and it has been fixed.

comment:10 in reply to: ↑ 9 Changed 6 years ago by bas

Replying to anonymous:

This was due to a bug in Torque, and it has been fixed.

Thanks for the update. In which version of Torque has this been fixed?
