Opened 9 years ago
Last modified 7 years ago
#62 assigned defect
job submission fails in pbs_python but works fine with qsub
Reported by: | Elena Vataga <e.vataga@…> | Owned by: | bas |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | pbs | Version: | 4.6.0 |
Keywords: | pbs_submit | Cc: | e.vataga@… |
Description
Hello,
I am trying to set up a Galaxy server on our compute cluster and have run into a problem: job submission to PBS is very unreliable. In a few cases it works, but more often it fails. Restarting pbs_server fixes the problem for some time, but later it starts again. We have a production system with many users, and Galaxy is only a tiny fraction of them, so restarting pbs_server cannot be a solution. The errors Galaxy reports are not informative, but running a simple Python script that calls pbs.pbs_submit(...) from the command line gives:
15044 Resources temporarily unavailable
In pbs logs I see corresponding entry:
PBS_Server.20463;Svr;PBS_Server;LOG_ERROR::Unauthorized Request (15007) in req_jobscript, cannot authorize request (0-Success)
Other commands from the examples, like ha_server.py or pbsnodes-a.py, work fine. Google turned up this report, which sounds similar to our problem: http://www.supercluster.org/pipermail/torqueusers/2014-January/016735.html
Is it indeed the case that pbs_python uses pbs_submit and not pbs_submit_hash? Are there any plans to move to pbs_submit_hash (assuming that would fix this problem)?
We use Moab/PBS, pbs/torque version 4.2.9, Python 2.6.6.
Let me know if you need any additional information.
Thank you
Kind regards
Elena
Attachments (1)
Change History (10)
comment:1 Changed 9 years ago by bas
- Status changed from new to assigned
comment:2 Changed 9 years ago by bas
At our work we are using torque 5.X and 2.5.X. I will test it with that pbs_python version; that is a different one than the one for torque 4.X. I assume you are using torque 4.X.
comment:3 Changed 9 years ago by e.vataga@…
Thank you for the prompt reply. Our torque version is 4.2.9. Our University has a support agreement with Adaptive, so I have opened a ticket asking about the difference between pbs_submit and pbs_submit_hash. I hope that will help move this ahead.

It is unrelated to this ticket, but maybe you could help. We have another small cluster which runs torque 2.5.9. I tried to install pbs_python there; the installation runs fine, but importing the module fails with:

>>> import pbs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python2.6/site-packages/pbs/pbs.py", line 25, in <module>
    _pbs = swig_import_helper()
  File "/usr/local/lib64/python2.6/site-packages/pbs/pbs.py", line 21, in swig_import_helper
    _mod = imp.load_module('_pbs', fp, pathname, description)
ImportError: libtorque.so.2: cannot open shared object file: No such file or directory

There were other tickets with a similar problem, but not for torque 2.5.9. I am attaching config.log. Making Galaxy work on at least one cluster would be of great help.

Kind regards
Elena
comment:4 follow-up: ↓ 8 Changed 9 years ago by bas
I have a question about how you submit your jobs. Do you have an example script:
- Do you open/close the connection for each submit
- Or open, submit several jobs and then close
Elena, on the other problem: do you have libtorque.so.2, where is it installed, and what does echo $LD_LIBRARY_PATH show?
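For anyone reading along, the two submission patterns asked about above can be sketched as follows. The real calls (pbs.pbs_connect, pbs.pbs_submit, pbs.pbs_disconnect) need a live Torque server, so a stub class stands in for the pbs module here purely to illustrate the difference in connection usage; StubPBS and all its behavior are invented for this sketch.

```python
class StubPBS:
    """Stand-in for the pbs module; counts connections opened."""
    def __init__(self):
        self.connects = 0

    def pbs_connect(self, server):
        self.connects += 1
        return self.connects              # fake connection handle

    def pbs_submit(self, conn, attrib, script, queue, extend):
        return "%d.fakeserver" % conn     # fake job id

    def pbs_disconnect(self, conn):
        pass

scripts = ["job1.pbs", "job2.pbs", "job3.pbs"]

# Pattern 1: open/close the connection for each submit.
pbs = StubPBS()
for s in scripts:
    c = pbs.pbs_connect("server")
    pbs.pbs_submit(c, 'NULL', s, 'batch', 'NULL')
    pbs.pbs_disconnect(c)
per_submit_connects = pbs.connects        # one connection per job

# Pattern 2: open once, submit several jobs, then close.
pbs = StubPBS()
c = pbs.pbs_connect("server")
for s in scripts:
    pbs.pbs_submit(c, 'NULL', s, 'batch', 'NULL')
pbs.pbs_disconnect(c)
batched_connects = pbs.connects           # a single connection for all jobs
```

Pattern 1 opens three connections for three jobs; pattern 2 opens one. The distinction matters because pbs_server limits concurrent authorized connections, which is one way to run into "Resources temporarily unavailable".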
comment:5 Changed 9 years ago by anonymous
There is a solution for the pbs_submit problem, see #54
comment:7 Changed 9 years ago by e.vataga@…
Oops, I replied to this ticket yesterday using the web interface but cannot see my answer; I probably did not press the right button. First of all, sorry for the silence: I was on leave last week. We got a patch from Adaptive last Friday and applied it yesterday. So far the problem has not reappeared, but it usually takes some time, and we needed to restart pbs_server several times for other reasons. Our symptoms are a bit different from what is described on the torque mailing list: there, jobs start but cannot find the script to execute; in our case the job is not submitted at all. Anyhow, I will update you on this patch in a few days.
P.S. Thank you for the hint about libtorque.so.2 and LD_LIBRARY_PATH; that fixed the problem.
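For readers hitting the same ImportError: a minimal sketch of the fix described above. The directories below are assumptions taken from the traceback in comment:3; substitute whatever directory the find actually reports on your system.

```shell
# Locate the Torque runtime library (search paths are examples; yours may differ).
find /usr/local/lib64 /usr/lib64 -name 'libtorque.so.2*' 2>/dev/null

# Point the runtime linker at that directory before starting Python.
# /usr/local/lib64 is an assumed location; use the directory found above.
export LD_LIBRARY_PATH=/usr/local/lib64:$LD_LIBRARY_PATH
```

Setting this in the shell (or in the init script that starts Galaxy) lets the SWIG-generated _pbs module resolve libtorque.so.2 at import time.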
comment:8 in reply to: ↑ 4 Changed 9 years ago by Elena Vataga <e.vataga@…>
- Cc e.vataga@… added
Replying to bas:
I have a question about how you submit your jobs. Do you have an example script:
- Do you open/close the connection for each submit
- Or open, submit several jobs and then close
It must be the first (open/close the connection for each submit):
$ cat test3.py
import pbs

server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)
print "Found server: " + server_name
print "Connection value: " + str(c)

job_id = pbs.pbs_submit(c, 'NULL', "run_simple.pbs", 'batch', 'NULL')
e, e_txt = pbs.error()
if e:
    print e, e_txt
print job_id
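A tidied variant of that script, Python 2/3 compatible, which also disconnects afterwards (the script above never calls pbs_disconnect, so each run leaves the server-side connection to time out). This is a sketch: it needs a reachable Torque server, so the import is guarded and the function simply returns None where the pbs module is unavailable. The function name submit_one is mine, not part of pbs_python.

```python
# Guarded import: pbs_python is only present on hosts with Torque installed.
try:
    import pbs
    if not hasattr(pbs, "pbs_connect"):
        pbs = None  # some unrelated module named "pbs" was found
except ImportError:
    pbs = None

def submit_one(script="run_simple.pbs", queue="batch"):
    """Open a connection, submit a single job, and always disconnect."""
    if pbs is None:
        return None
    server_name = pbs.pbs_default()
    c = pbs.pbs_connect(server_name)
    if c < 0:
        raise RuntimeError("pbs_connect to %s failed" % server_name)
    try:
        job_id = pbs.pbs_submit(c, 'NULL', script, queue, 'NULL')
        e, e_txt = pbs.error()
        if e:
            raise RuntimeError("%s %s" % (e, e_txt))
        return job_id
    finally:
        pbs.pbs_disconnect(c)  # release the connection even on error
```

The try/finally guarantees the disconnect happens whether or not the submit succeeds, which keeps repeated test runs from accumulating open connections on pbs_server.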
comment:9 follow-up: ↓ 10 Changed 7 years ago by anonymous
This was due to a bug in Torque, and it has been fixed.
comment:10 in reply to: ↑ 9 Changed 7 years ago by bas
Replying to anonymous:
This was due to a bug in Torque, and it has been fixed.
Thanks for the update. In which version of torque has this been fixed?
Can you reply on this thread?
I asked for more info, but to this day have had no answer :-(.