#22 closed defect (fixed)
jobmond.py consumes too much cpu system time when there are no jobs
Reported by: | gastineau@… | Owned by: | bastiaans |
---|---|---|---|
Priority: | major | Milestone: | 0.3 |
Component: | jobmond | Version: | 0.2 |
Keywords: | | Cc: |
Estimated Number of Hours: | | |
Description
Hi,
I am running the latest stable version of Job Monarch on an IA64 server with Red Hat Linux AS 4 and Torque 2.1.8. I installed Job Monarch yesterday and it worked fine until a long job (which had been running for about 12 hours) finished last night. Since then, the jobmond.py process consumes about 20% of CPU system time. With the ps command, I can see that jobmond.py runs pbs_iff very frequently:
root  1772  4.8  0.1  67280  8096 ?  S  Apr25  60:21  /usr/bin/python -v /usr/local/sbin/jobmond.py -c /etc/jobmond.conf
root  4146  0.0  0.0    752   320 ?  R  09:58   0:00  /usr/local/torque.2.1.8.fPIC/sbin/pbs_iff localhost 15001 4
How can I correct this problem?
Thanks,
Mickael,
I attach the content of my file /etc/jobmond.conf
Attachments (1)
Change History (12)
Changed 16 years ago by gastineau@…
comment:1 Changed 16 years ago by bastiaans
- Owner changed from somebody to bastiaans
- Status changed from new to assigned
comment:2 Changed 16 years ago by bastiaans
Possibly a bug in the internal handling of the job-list compilation after a change in the number of jobs? Will investigate.
comment:3 Changed 16 years ago by bastiaans
- Milestone set to 0.2.1
- Priority changed from normal to major
Ramon Bastiaans wrote:
> Hi Mickael,
>
> I have some additional questions to pinpoint the origin of the issue.
>
> Does jobmond.py still report the job information properly when you notice the 20% CPU time consumption, or does it stop functioning?

It stops functioning.

> I.e.: do you still see the jobs on the web front end when this occurs?

New jobs are not visible on the web page and there is no archive.

> And does restarting jobmond temporarily fix the problem / CPU usage?

No. If I start a new job, jobmond (or a child process created by jobmond) starts to consume CPU again.

> Do you happen to know how many jobs there were in your batch system at the time of the CPU time increase?

I don't know, I only tried with a few jobs (one or two jobs at the same time). But the pbs_server log file has grown very large (more than 5 GB in 6 hours) since the problem started. I can see that many requests have been made since the problem started.

Mickael
comment:4 Changed 16 years ago by anonymous
I too have seen a similar issue with the system making lots of entries in the logs, and a higher CPU load on the jobmond.py reporting node.
I have been able to detect that this happens when there are no jobs in the queue. Whilst this doesn't mean it doesn't happen at other times, this is the only time I have noticed it.
The requests for updates from the pbs server appear to be continuous when there are no jobs in the queue.
Further to this, it actually filled the log file (2.0 GB file limit) for pbs_server yesterday and caused pbs_server to crash. I am running the latest stable version (0.2).
My BATCH_POLL_INTERVAL is set to 30.
comment:5 Changed 16 years ago by bastiaans
- Summary changed from jobmond.py consumes too much cpu system time to jobmond.py consumes too much cpu system time when there are no jobs
comment:6 Changed 16 years ago by bastiaans
It seems more people are experiencing this:

Hi Ramon,

Firstly, I'd like to say thanks for jobmond. It's a great little utility that I have installed to monitor the cluster queue status. I've been using Ganglia for a long time now and this was an addon that made it all that much better.

I thought I'd just bring it to your attention that there appears to be a bug in jobmond.py. When I have no jobs in the queue (although that is not all that often), I get many requests every second from the jobmond.py host, that is, the machine running jobmond.py (which is a compute node). The load on that node also increases during this time.

Further to that, as there are so many requests, it pollutes the server_log for pbs_server. This caused my pbs_server to crash over the weekend when the file got to 2.0 GB in size.

If at all possible, I was wondering if you could take a look into what is causing this. I'm happy to provide more information and outputs from the programs if that helps, as well as testing for you.

Cheers,
Craig

--
Craig West
HPC Administrator
Astronomy Department
University of Massachusetts
At least it is now certain that this bug occurs whenever there are no jobs in the batch system. However, the origin of the bug might reside in the Torque library itself, which would be unfortunate. I'm going to run some test cases to confirm.
comment:7 Changed 16 years ago by bastiaans
Found the origin of the bug in an unnecessary while statement.
Should be fixed now in changeset r359.
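The ticket does not show the offending code, but as a rough illustration: below is a minimal Python sketch, assuming the problem was an extra while loop around the batch-server query, of how such a loop turns an empty queue into continuous polling, with each query spawning a new pbs_iff authentication process (as in the ps output above). The function names and the exact shape of the fix are assumptions for illustration, not the actual jobmond.py change in r359.

```python
import time

# Poll interval from jobmond.conf; the reporters above use BATCH_POLL_INTERVAL = 30.
BATCH_POLL_INTERVAL = 30


def get_joblist():
    """Placeholder for the real batch-server query (e.g. via the pbs_python
    bindings); each call opens a new pbs_server connection, which in turn
    spawns a pbs_iff process for authentication."""
    return []  # simulate an empty queue


def report(jobs):
    """Placeholder for publishing the job metrics to Ganglia / the web frontend."""
    print('%d job(s) in queue' % len(jobs))


def poll_loop_buggy():
    while True:
        jobs = get_joblist()
        # BUG: an extra while loop waiting for jobs to appear queries the
        # server in a tight loop (no sleep) whenever the queue is empty.
        while not jobs:
            jobs = get_joblist()
        report(jobs)
        time.sleep(BATCH_POLL_INTERVAL)


def poll_loop_fixed():
    while True:
        # An empty job list is a valid result: report it and go back to sleep.
        report(get_joblist())
        time.sleep(BATCH_POLL_INTERVAL)
```

With the extra loop removed, an empty queue simply produces an empty report followed by a BATCH_POLL_INTERVAL sleep, which matches the behaviour confirmed in comment:9.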
comment:8 Changed 16 years ago by bastiaans
Thanks to Bas van der Vlies @ SARA for discovering the bug where I overlooked it.
comment:9 Changed 16 years ago by bastiaans
Craig West confirms this fixes the bug:
Ramon,

Must have been your lucky day. I had a shutdown of the cluster due to aircon issues. The cluster is up and running again and I've been able to test the jobmond.py script. It appears to no longer be probing the queue when there are no jobs.

Thanks for the fix.

Craig.
So this can go into 0.2.1, which is now almost ready for release.
comment:10 Changed 16 years ago by bastiaans
- Resolution set to fixed
- Status changed from assigned to closed