Opened 13 years ago

Last modified 10 years ago

#29 assigned defect

Connection expiry and maximum connection number issues

Reported by: sniffer Owned by: bas
Priority: major Milestone:
Component: pbs Version: 4.3.0
Keywords: Cc: bas@…

Description

Hi again,

We seem to be having trouble maintaining open connections to the Torque server from within the pbs bindings in Python.

  • if an invalid connection id is passed (for a connection that is probably no longer active), methods like pbs_statjob/statserver/statnode return an empty list instead of raising an exception or returning an error code
  • the error raised after that is 15022 - "No access permission for queue"
  • the most common error code we get otherwise is 15033 - "No free connections"
  • once the connection id is invalid and pbs.pbs_connect(pbs.pbs_default()) is run again, pbs.get_error() still reports error 15033, yet a query to pbs.pbs_statjob(c, None, [], None) returns all jobs in batch_result, after which pbs.get_error() returns 0. So either pbs_error is not cleared at pbs.pbs_connect(), or a connection is successfully established but an error is still reported

The best way to diagnose it is to run Python in a console, connect to the Torque server and query it from time to time. You should notice that an empty list is returned from pbs.pbs_statnode() once the connection is invalid (we have queued jobs, so I know for sure that the result is wrong).
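The silent empty result described above can be turned into an explicit failure with a small wrapper. The following is a minimal sketch, not part of pbs_python: `checked_statjob` and `PBSQueryError` are hypothetical names, the `pbs` module is passed in as an argument so the helper can be tried with a stand-in object, and the call forms `pbs_statjob(c, None, [], None)` and `get_error()` are taken as written in this ticket.

```python
class PBSQueryError(Exception):
    """Raised when a status query is empty and an error code is set."""


def checked_statjob(pbs, conn):
    """Run pbs_statjob and raise instead of silently returning an empty list.

    pbs  -- the imported pbs module (or any object with the same calls)
    conn -- the connection id returned by pbs.pbs_connect()
    """
    # Call form as used in this ticket
    jobs = pbs.pbs_statjob(conn, None, [], None)
    # Error accessor name as used in this ticket
    errno = pbs.get_error()
    # An empty result together with a non-zero error code is treated as a failure
    if not jobs and errno != 0:
        raise PBSQueryError("pbs_statjob failed with error %d" % errno)
    return jobs
```

Because the ticket also reports that a stale error code can survive a reconnect, this check can raise a false alarm when the queue is genuinely empty and the previous error has not been cleared.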

We use Torque 2.5.8 (2.5.5 before, no change). The connection limit was raised from 5 to 10 in 2.5.6.

It seems that once the connection limit is reached in the interpreter there is no way of connecting again (I tried a reconnection decorator to reconnect automatically, no luck there either).

Limits are defined in src/include/server_limits.h in Torque

CCd myself this time, any suggestions?

Thanks in advance,

Łukasz Czuja

Attachments (0)

Change History (7)

comment:1 Changed 13 years ago by bas

  • Owner changed from somebody to bas
  • Status changed from new to assigned

Thanks for the detailed report. We also have several daemon programs that use pbs_python to connect to the Torque server, and we always open/close the connection to the pbs_server. We noticed the exact same behaviour you described: the pbs_server closes the connection after a certain amount of time. With the open/close behaviour we solved the issue. The connection limit was also raised in the 2.4.X series.

We are still using the 2.4 version. I have to port pbs_python to the new version of Torque. For now I will skip the 2.5 version and go for the 3.0 version.

I shall try to reset the pbs_error when a new connection is made.
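The open/close approach described in this comment could be sketched as a context manager. This is an illustration under assumptions, not the code used in the daemons mentioned above: `pbs_connection` is a hypothetical helper, and the `pbs` module is passed in as an argument so the sketch does not depend on a working Torque installation. Only `pbs_default`, `pbs_connect` and `pbs_disconnect` are called on it.

```python
from contextlib import contextmanager


@contextmanager
def pbs_connection(pbs, server=None):
    """Open a connection for one unit of work and always close it again.

    pbs    -- the imported pbs module (or any object with the same calls)
    server -- server name; falls back to pbs.pbs_default() when omitted
    """
    if server is None:
        server = pbs.pbs_default()
    conn = pbs.pbs_connect(server)
    # A negative connection id signals that the connect attempt failed
    if conn < 0:
        raise RuntimeError("Could not make a connection with %s" % server)
    try:
        yield conn
    finally:
        # Release the server-side slot even if the query raised
        pbs.pbs_disconnect(conn)
```

A caller would wrap each query in `with pbs_connection(pbs) as conn:` and run its status calls inside the block, so no connection id outlives the query that needed it.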

comment:2 Changed 13 years ago by l.czuja@…

I don't believe there were many changes between 2.4.x and 2.5.x API-wise. We use pbs_python with 2.5, and the methods we tested (most of them, but not pbs_manager or resource reservation) work, apart from what is described above of course. An update for 2.5.x should be cosmetic.

I'm going to change the code to connect and disconnect every time it is necessary. I'm awaiting any changes in this area, so please keep this bug report updated.

comment:3 Changed 13 years ago by bas

I will update it and close it when it is fixed ;-)

comment:4 follow-up: Changed 11 years ago by cjfields@…

Have there been any updates to this applied to newer versions of pbs_python? I have run into a very similar problem with this when using the Galaxy framework, which uses pbs_python 4.1 (works for a period of time, then reaches this limit). We're using Torque 3.0.5.

comment:5 in reply to: ↑ 4 Changed 11 years ago by bas

  • Cc cjfields@… added

Replying to cjfields@…:

Have there been any updates to this applied to newer versions of pbs_python? I have run into a very similar problem with this when using the Galaxy framework, which uses pbs_python 4.1 (works for a period of time, then reaches this limit). We're using Torque 3.0.5.

Which framework do you use, PBSQuery? Then there will be an open/close with every query. If you use pbs_python directly, you have to close the connection and open it again yourself. The number of connections is handled by the pbs_server, as is the time that a connection can stay open. This has nothing to do with the pbs_python code; maybe I can improve the error codes.

comment:6 follow-up: Changed 11 years ago by cjfields@…

  • Cc bas@… added; l.czuja@… cjfields@… removed
On May 20, 2013, at 8:25 AM, pbs_python <pbs_python@surfsara.nl> wrote:

> #29: Connection expiry and maximum connection number issues
> --------------------+-----------------------
> Reporter:  sniffer  |       Owner:  bas
>    Type:  defect   |      Status:  assigned
> Priority:  major    |   Component:  pbs
> Version:  4.3.0    |  Resolution:
> Keywords:           |
> --------------------+-----------------------
> Changes (by bas):
>
> * cc: cjfields@… (added)
>
>
> Comment:
>
> Replying to [comment:4 cjfields@…]:
>> Have there been any updates to this applied to newer versions of
> pbs_python?  I have run into a very similar problem with this when using
> the Galaxy framework, which uses pbs_python 4.1 (works for a period of
> time, then reaches this limit).  We're using Torque 3.0.5.
>
> Which framework do you use, PBSQuery? Then there will be open/close with
> every query. If you use pbs_python you have to close the connection and
> open it again. The number of connections is handled by the pbs_server and
> also the time that a connection can be open. This has nothing to do with
> pbs python code, maybe i can improve the error codes.

This is within the Galaxy framework (http://wiki.galaxyproject.org/), so my
guess is the specific workers involved are caching or hoarding (e.g. not
closing) the connections over time; odd, b/c this is pretty commonly used
code and the problem only recently popped up locally for us.  My guess is
that we were close to the NCONNECTS cutoff and a small change (changes
cluster-side for instance) may have bumped us over the edge.

chris

comment:7 in reply to: ↑ 6 Changed 10 years ago by depasse@…

The following hack seems to work for me. In pbs.PBSQuery, I edited the _connect method to:

class PBSQuery:
    def _connect(self):
        """Connect to the PBS/Torque server"""
        # Reuse the existing connection id if one was already obtained
        if hasattr(self, 'con') and self.con >= 0:
            return
        self.con = pbs.pbs_connect(self.server)
        # A negative connection id means the connect attempt failed
        if self.con < 0:
            msg = "Could not make a connection with %s\n" % (self.server)
            raise PBSError(msg)

I've also tried pbs_disconnect(self.con) if con is present. This seems to work too.
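A sketch of that second variant, which disconnects any existing connection before opening a new one, might look like the following. It is not the code that was tested for this comment: `ReconnectingQuery` is a hypothetical, reduced stand-in for PBSQuery that keeps only the connection handling, and the `pbs` module is passed to the constructor so the class can be tried without a Torque server.

```python
class ReconnectingQuery(object):
    """Hypothetical stand-in for PBSQuery, reduced to connection handling."""

    def __init__(self, pbs, server):
        self.pbs = pbs        # the imported pbs module, or a stand-in
        self.server = server
        self.con = -1         # negative id means "not connected"

    def _disconnect(self):
        """Close the current connection, if there is one."""
        if self.con >= 0:
            self.pbs.pbs_disconnect(self.con)
            self.con = -1

    def _connect(self):
        """Drop any previous connection, then open a fresh one."""
        self._disconnect()
        self.con = self.pbs.pbs_connect(self.server)
        if self.con < 0:
            raise RuntimeError(
                "Could not make a connection with %s" % self.server)
```

Compared with reusing a cached id, this costs one extra connect per query but never hands a query an id that the server may already have expired.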

Any feedback on this much appreciated

Thanks,

-- Jay [ depasse -at- psc.edu ]
