
Opened 8 years ago

Last modified 6 years ago

#54 assigned enhancement

pbs_submit_hash()

Reported by: glen.beane@…
Owned by: bas
Priority: major
Milestone:
Component: pbs
Version: 4.6.0
Keywords:
Cc: nate@…

Description

We've found that pbs_submit() has not been reliable. Occasionally pbs_server gets into a state that causes all jobs submitted via pbs_submit() to fail (this is not a problem with pbs_python -- the problem exists even with the C API directly). However, pbs_submit_hash() continues to work in this case. Since Torque moved to pbs_submit_hash(), I don't feel that pbs_submit() is as well tested.

We'd like to move our applications from pbs_submit() to pbs_submit_hash(), but I don't think all the functionality we need is in pbs_python.

Here is a short snippet of C code using pbs_submit_hash:

int fd = pbs_connect(0);
char *new_jobid;
memmgr* mm;
job_data* job_attrs = 0;

memmgr_init(&mm, 0);

/* pass empty ATTR_v, just to show use of hash_add_or_exit */
hash_add_or_exit(&mm, &job_attrs, ATTR_v, "", ENV_DATA);

pbs_submit_hash(fd, &mm, job_attrs, NULL, "/tmp/test.sh", NULL, NULL, &new_jobid, NULL);


Change History (9)

comment:1 Changed 8 years ago by bas

  • Status changed from new to assigned

Glen, which version of Torque do you use?

For Torque 4.x (which I do not have installed and cannot test):

  • pbs_submit_hash

For Torque 5.x I see this function:

  • pbs_submit_hash_ext

I do not see any function named hash_add_or_exit.

comment:2 Changed 8 years ago by glen.beane@…

We are using 4.2, but we are considering an upgrade to Torque 5 (first we need to fully test with our pipeline framework, which uses pbs_python).

I just looked at Torque on GitHub. In the 5.1.0 branch, pbs_submit_hash_ext() is declared in include/pbs_ifl.h, pbs_submit_hash() is declared in lib/Libifl/lib_ifl.h, and hash_add_or_exit() is in u_hash_map_structs.h.

qsub still calls pbs_submit_hash() directly.

All pbs_submit_hash_ext() does is call pbs_submit_hash(), but it takes void* instead of job_data_container*:

int pbs_submit_hash_ext(

  int    socket,
  void  *job_attr,
  void  *res_attr,
  char  *script,
  char  *destination,
  char  *extend,        /* (optional) */
  char **return_jobid,
  char **msg)

  {
  return pbs_submit_hash(socket,
           (job_data_container *)job_attr,
           (job_data_container *)res_attr,
           script, destination, extend, return_jobid, msg);
  }

/* END pbsD_submit.c */
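The wrapper pattern described above (a public entry point that accepts void* and casts to an internal type before forwarding) can be sketched in isolation. This is only an illustration under stated assumptions, not Torque code: demo_container, demo_submit_typed and demo_submit_ext are hypothetical names invented here, and the job id is written into a caller-supplied buffer rather than returned through a char** as in the real function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an internal container type. The real
 * job_data_container is defined in Torque's internal headers and is
 * not reproduced here. */
typedef struct demo_container {
    const char *name;
    const char *value;
} demo_container;

/* Typed function, analogous in shape to pbs_submit_hash(): it needs
 * the full type definition, so callers must see the internal header. */
static int demo_submit_typed(int sock, demo_container *job_attr,
                             const char *script,
                             char *jobid_out, size_t jobid_len)
{
    if (sock < 0 || job_attr == NULL || script == NULL || jobid_out == NULL)
        return -1;

    /* Fabricate an id purely for the demo; a real server assigns it. */
    snprintf(jobid_out, jobid_len, "%d.%s", sock, job_attr->value);
    return 0;
}

/* Opaque wrapper, analogous in shape to pbs_submit_hash_ext(): the
 * prototype mentions only void*, so it can be declared in a public
 * header without exposing demo_container. */
int demo_submit_ext(int sock, void *job_attr, const char *script,
                    char *jobid_out, size_t jobid_len)
{
    assert(jobid_len > 0);
    return demo_submit_typed(sock, (demo_container *)job_attr,
                             script, jobid_out, jobid_len);
}
```

A caller would pass a pointer to a demo_container and a buffer to demo_submit_ext(). The cast is unchecked, so handing it a pointer to anything else is undefined behaviour; that is the trade-off of hiding the type behind void*.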

comment:3 Changed 8 years ago by bas

Glen, thanks for sorting this out. I am only using pbs_ifl.h, which is the publicly available function API for libtorque. In Torque 5.x there is only a definition for:

  • pbs_submit_hash_ext

So I have to test on our 5.x test cluster. I do not want to include more and more header files like lib_ifl.h; maybe we can use the new function pbs_submit_hash_ext. As I said, I cannot test Torque 4.x. Adaptive changes the interface (API) a lot between versions, which is hard to keep up with ;-(

comment:4 Changed 7 years ago by anonymous

The guys from Adaptive Computing found the problem in the pbs_submit function and came up with a patch, thanks to David Beer:

diff --git a/src/lib/Libifl/pbsD_submit.c b/src/lib/Libifl/pbsD_submit.c
index e096aa8..ca41cca 100644
--- a/src/lib/Libifl/pbsD_submit.c
+++ b/src/lib/Libifl/pbsD_submit.c
@@ -131,7 +131,7 @@ char *pbs_submit_err(
 
   if ((script != NULL) && (*script != '\0'))
     {
-    if (PBSD_jscript(c, script, NULL) != 0)
+    if (PBSD_jscript(c, script, return_jobid) != 0)
       {
       *local_errno = PBSE_BADSCRIPT;

comment:5 Changed 7 years ago by anonymous

see also #62

comment:6 Changed 6 years ago by nate@…

  • Cc nate@… added

Users of pbs_python in Galaxy have reported this issue as well and the fix in comment:4 is working for them. If this could be included in a pbs_python release that'd be great. Here's the issue thread from Galaxy:

https://github.com/galaxyproject/galaxy/issues/2500

comment:7 Changed 6 years ago by glen.beane@…

We also fixed this by patching libtorque on a system that was still running Torque 4. The fix was described on the Torque mailing list and is the same one in the Galaxy issue linked in comment #6.

Note that this is fixed in more recent versions of Torque, so if you have an up to date Torque, then this isn't an issue.

Nate: I don't think you can include this fix with pbs_python because it involves patching libtorque, which is not provided by pbs_python.

comment:8 Changed 6 years ago by nate@…

Hey Glen, thanks for the hint, I totally missed that point.

comment:9 Changed 6 years ago by bas

It is interesting to read all the comments and see where pbs_python is used.
