Opened 16 years ago

Closed 16 years ago

Last modified 16 years ago

#22 closed defect (fixed)

jobmond.py consumes too much cpu system time when there are no jobs

Reported by: gastineau@…
Owned by: bastiaans
Priority: major
Milestone: 0.3
Component: jobmond
Version: 0.2
Keywords:
Cc:
Estimated Number of Hours:

Description

Hi,

I am running the latest stable version of jobmonarch on an IA64 server with Red Hat Linux AS 4 and Torque 2.1.8. I installed it yesterday and it worked fine until a long job (which had been running for about 12 hours) finished last night. Since then, the jobmond.py process has been consuming about 20% of "cpu system time". With the ps command, I can see that jobmond.py runs pbs_iff very frequently.

root 1772 4.8 0.1 67280 8096 ? S Apr25 60:21 /usr/bin/python -v /usr/local/sbin/jobmond.py -c /etc/jobmond.conf
root 4146 0.0 0.0 752 320 ? R 09:58 0:00 /usr/local/torque.2.1.8.fPIC/sbin/pbs_iff localhost 15001 4

How can I correct this problem?

Thanks,

Mickael,

I attach the content of my file /etc/jobmond.conf

Attachments (1)

jobmond.conf (1.1 KB) - added by gastineau@… 16 years ago.
configuration files


Change History (12)

Changed 16 years ago by gastineau@…

configuration files

comment:1 Changed 16 years ago by bastiaans

  • Owner changed from somebody to bastiaans
  • Status changed from new to assigned

comment:2 Changed 16 years ago by bastiaans

Possibly a bug in the internal handling of the joblist compilation after a change in the number of jobs? Will investigate.

comment:3 Changed 16 years ago by bastiaans

  • Milestone set to 0.2.1
  • Priority changed from normal to major
Ramon Bastiaans wrote:
> Hi Mickael,
>
> I have some additional questions to pinpoint the origin of the issue.
>
> Does jobmond.py still report the job information properly when you notice the 20% cpu time consumption, or does it stop functioning?
It stops functioning.
> I.e.: do you still see the jobs on the web front end after this occurs?
>

New jobs are not visible on the web page, and there is no archive.

> And does restarting jobmond temporarily fix the problem / the CPU usage?
>
No. If I start a new job, jobmond (or a child process created by jobmond) starts consuming CPU again.

> Do you happen to know how many jobs there were in your batch system at the time of the cpu time increase?
>
I don't know; I have only tried with a few jobs (one or two jobs at the same time).
But the pbs_server log file has grown very large (more than 5 GB in 6 hours) since the problem started; I can see that many requests are being made.


Mickael,

comment:4 Changed 16 years ago by anonymous

I too have seen a similar issue with the system making lots of entries in the logs, and a higher CPU load on the jobmond.py reporting node.

I have been able to detect that this happens when there are no jobs in the queue. Whilst this doesn't mean it doesn't happen at other times, this is the only time I have noticed it.

The requests for updates from the pbs server appear to be continuous when there are no jobs in the queue.

Further to this, it actually filled the log file (2.0 GB file limit) for pbs_server yesterday and caused pbs_server to crash. I am running the latest stable version (0.2).

My BATCH_POLL_INTERVAL is set to 30.
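For context, BATCH_POLL_INTERVAL is the number of seconds jobmond should wait between batch-server queries, so a value of 30 should mean at most one pbs_server query every 30 seconds, even when the queue is empty. A minimal sketch of that intended loop, written against hypothetical get_job_list() and publish() helpers rather than the actual jobmond internals:

import time

BATCH_POLL_INTERVAL = 30   # seconds, matching the setting mentioned above

def poll_forever(get_job_list, publish):
    # One batch-server query per cycle; an empty job list is a normal
    # result and must not trigger an immediate re-query.
    while True:
        jobs = get_job_list()            # a single pbs_server request
        publish(jobs)                    # report the (possibly empty) list
        time.sleep(BATCH_POLL_INTERVAL)  # always wait out the interval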

comment:5 Changed 16 years ago by bastiaans

  • Summary changed from jobmond.py consumes too much cpu system time to jobmond.py consumes too much cpu system time when there are no jobs

comment:6 Changed 16 years ago by bastiaans

It seems more people are experiencing this:

Hi Ramon,

Firstly, I'd like to say thanks for jobmond. It's a great little utility that I have installed to monitor the cluster queue status. I've been using ganglia for a long time now, and this was an addon that made it all that much better.

I thought I'd just bring it to your attention that there appears to be a bug in jobmond.py.
When I have no jobs in the queue (although that is not all that often), I get many requests every second from the jobmond.py host - that is, the machine running jobmond.py (which is a compute node). The load on the node also increases during this time.
Further to that, because there are so many requests, it pollutes the server_log for pbs_server.
This caused my pbs_server to crash over the weekend when the file reached 2.0 GB in size.

If at all possible, I was wondering if you could take a look into what is causing this. I'm happy to provide more information and output from the programs if that helps, as well as to do some testing for you.

Cheers
Craig...

-- 
Craig West
HPC Administrator
Astronomy Department
University of Massachusetts

At least it is now certain that this bug occurs whenever there are no jobs in the batch system. However, the origin of the bug might reside in the Torque library itself, which would be unfortunate. I'm going to run some test cases to confirm.

comment:7 Changed 16 years ago by bastiaans

Found the origin of the bug in an unnecessary while statement.

Should be fixed now in changeset r359.
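The snippet below is only a hedged reconstruction of the kind of busy loop an unnecessary while statement can produce, using a hypothetical get_job_list() in place of the real jobmond/Torque calls; it is not the r359 diff itself. With an empty queue the inner while never exits, so pbs_server is queried via pbs_iff as fast as the loop can run and BATCH_POLL_INTERVAL never gets a chance to apply.

# Buggy pattern: spins while the queue is empty, hammering pbs_server.
def gather_jobs_buggy(get_job_list):
    jobs = get_job_list()
    while not jobs:
        jobs = get_job_list()
    return jobs

# Fixed pattern: query once per poll cycle and let the caller sleep
# BATCH_POLL_INTERVAL before the next cycle, whether or not jobs exist.
def gather_jobs_fixed(get_job_list):
    return get_job_list()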

comment:8 Changed 16 years ago by bastiaans

Thanks to Bas van der Vlies @ SARA for discovering the bug that I had overlooked.

comment:9 Changed 16 years ago by bastiaans

Craig West confirms this fixes the bug:

Ramon,

Must have been your lucky day. I had a shutdown of the cluster due to aircon issues. The cluster is up and running again, and I've been able to test the jobmond.py script. It no longer appears to be probing the queue when there are no jobs.

Thanks for the fix.

Craig. 

So this can go into 0.2.1, which is almost ready for release.

comment:10 Changed 16 years ago by bastiaans

  • Resolution set to fixed
  • Status changed from assigned to closed