#26 closed defect (fixed)
stop_timestamp is not correct when only one job runs.
Reported by: | alexis.michon@… | Owned by: | bastiaans |
---|---|---|---|
Priority: | normal | Milestone: | 0.3 |
Component: | jobarchived | Version: | 0.2 |
Keywords: | Cc: | ||
Estimated Number of Hours: | |||
Description
When a job runs alone on the cluster, the stop_timestamp saved in the database isn't correct.
The same job on the same cluster has a correct stop_timestamp when at least one other job is running at the same time.
Change History (17)
comment:1 Changed 16 years ago by bastiaans
- Component changed from general to jobarchived
- Owner changed from somebody to bastiaans
- Status changed from new to assigned
comment:2 in reply to: ↑ description Changed 16 years ago by bastiaans
- Cc alexis.michon@… added
comment:3 Changed 16 years ago by bastiaans
- Cc alexis.michon@… removed
comment:4 Changed 16 years ago by alexis.michon@…
Yes, jobarchived and jobmond have been running from job start until job end. But it is possible that the XML stream or PBS lags. It's not the first time this phenomenon has occurred. When the cluster is empty, I will run some tests. Can you leave this ticket open until I have made my tests?
A question: in jobarchived.py (class TorqueXMLHandler), when there are jobs on the cluster, self.heartbeat = 1, and job information is stored (in endDocument) only when self.heartbeat == 1. But when the last job has finished, self.heartbeat changes from 1 to 0, and the job is only stored when the next job runs. Is that right, or is there a mistake in my reasoning?
comment:5 Changed 16 years ago by alexis.michon@…
Oops, I made a mistake.
self.heartbeat is always equal to 1.
comment:6 follow-up: ↓ 7 Changed 16 years ago by alexis.michon@…
Oops, I made a mistake.
self.heartbeat is always equal to 1 until jobmond runs.
comment:7 in reply to: ↑ 6 Changed 16 years ago by bastiaans
heartbeat is set to 0 because in Python this also means false, whereas non-zero means true.
So jobs are only stored if an appropriate heartbeat from jobmond was found, to prevent the storage of ghost jobs/metrics.
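The heartbeat-gating idea can be sketched in a few lines of Python. This is an illustrative sketch only: the class and method names mirror the ticket's description (TorqueXMLHandler, endDocument), but handle_heartbeat and the return value are assumptions, not the actual jobarchived code.

```python
# Hypothetical sketch of heartbeat-gated job storage.
# In Python, 0 is falsy and any non-zero value is truthy, which is
# why self.heartbeat = 0 suffices to suppress storage.

class TorqueXMLHandler:
    def __init__(self):
        self.heartbeat = 0        # no heartbeat from jobmond seen yet
        self.jobs_to_store = []   # job ids waiting to be archived

    def handle_heartbeat(self):
        # jobmond reported in: the parsed job info can be trusted
        self.heartbeat = 1

    def endDocument(self):
        # Only store jobs if a heartbeat was found, to prevent
        # archiving ghost jobs/metrics from a stale XML stream.
        if self.heartbeat:
            stored = list(self.jobs_to_store)
            self.jobs_to_store = []
            return stored
        return []

h = TorqueXMLHandler()
h.jobs_to_store.append('123.cluster')
print(h.endDocument())   # no heartbeat yet -> []
h.handle_heartbeat()
print(h.endDocument())   # heartbeat seen -> ['123.cluster']
```

Without the heartbeat gate, a lagging XML stream could cause jobs to be stored from stale data, which is exactly the "ghost jobs" scenario the comment describes.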
Replying to alexis.michon@ibcp.fr:
Oops, I made a mistake.
self.heartbeat is always equal to 1 until jobmond runs.
comment:8 Changed 16 years ago by alexis.michon@…
Okay, after some investigation: all jobs are affected, and the recorded running time is always 60 seconds, plus or minus a few seconds.
That's crazy: in your configuration everything is OK, and in mine it's buggy. I have made a patch which seems to work. Any ideas?
patch_jobarchive.py

--- jobarchived.py.old 2007-05-03 23:01:24.465386350 +0200
+++ jobarchived.py 2007-05-03 22:52:33.897825334 +0200
@@ -623,6 +623,8 @@
 self.jobs_to_store.append( job_id )
 debug_msg( 6, 'jobinfo for job %s has changed' %job_id )
+else:
+self.jobAttrs[ job_id ]['reported'] = jobinfo['reported']
 else:
 self.jobAttrs[ job_id ] = jobinfo
comment:9 Changed 16 years ago by anonymous
patch_jobarchive.py, with proper formatting:

--- jobarchived.py.old 2007-05-03 23:01:24.465386350 +0200
+++ jobarchived.py 2007-05-03 22:52:33.897825334 +0200
@@ -623,6 +623,8 @@
 self.jobs_to_store.append( job_id )
 debug_msg( 6, 'jobinfo for job %s has changed' %job_id )
+else:
+self.jobAttrs[ job_id ]['reported'] = jobinfo['reported']
 else:
 self.jobAttrs[ job_id ] = jobinfo
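The intent of the patch can be shown in a simplified sketch. This is illustrative only: the comparison logic and function signature below are assumptions, not the actual jobarchived update path. The key idea is that when a job's attributes are unchanged, the original code left self.jobAttrs untouched, so its 'reported' timestamp went stale and the job soon looked like it had stopped reporting; the patch copies the fresh 'reported' value over even when nothing else changed.

```python
# Simplified, hypothetical sketch of the patched update logic.
# Comparing everything except 'reported' is an assumption made here
# to keep the example self-contained.

def update_job(jobAttrs, job_id, jobinfo, jobs_to_store):
    if job_id in jobAttrs:
        old = {k: v for k, v in jobAttrs[job_id].items() if k != 'reported'}
        new = {k: v for k, v in jobinfo.items() if k != 'reported'}
        if old != new:
            # attributes changed: remember the job for storage
            jobs_to_store.append(job_id)
            jobAttrs[job_id] = jobinfo
        else:
            # unchanged job: still refresh the 'reported' timestamp,
            # otherwise the job looks like it vanished from the report
            jobAttrs[job_id]['reported'] = jobinfo['reported']
    else:
        # first sighting of this job
        jobAttrs[job_id] = jobinfo

attrs = {'1': {'state': 'R', 'reported': 100}}
update_job(attrs, '1', {'state': 'R', 'reported': 160}, [])
print(attrs['1']['reported'])  # 160: timestamp refreshed despite no change
```

This matches the symptom reported earlier: without the refresh, every job's apparent lifetime collapses to roughly one reporting interval (about 60 seconds).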
comment:10 Changed 16 years ago by a.michon@…
After some days of testing, my patch makes it possible to record jobs with correct start and stop timestamps, but the last job isn't archived until another job is submitted on the cluster. When it is archived, it has the correct timestamps.
comment:11 Changed 16 years ago by bastiaans
- Cc alexis.michon@… added
Thanks a lot for your testing/debugging and patch! I will try to check it out soon and incorporate it into the source tree. Sorry for the late response.
comment:12 Changed 16 years ago by bastiaans
- Cc alexis.michon@… removed
comment:13 Changed 16 years ago by bastiaans
I think I have found a bug whose fix may resolve this; it's in changeset r360.
comment:14 Changed 16 years ago by bastiaans
- Milestone set to 0.2.1
comment:15 Changed 16 years ago by bastiaans
- Milestone 0.2.1 deleted
This may very well be related to the bug in jobmond where it hangs/blocks reporting and sends no heartbeat when there are 0 jobs in the cluster (i.e., the cluster is empty).
Removing it from milestone 0.2.1 for now.
comment:16 Changed 16 years ago by bastiaans
- Milestone set to 0.2.1
- Resolution set to fixed
- Status changed from assigned to closed
Yes, this is due to the jobmond bug, so it's fixed in 0.2.1.
Replying to alexis.michon@ibcp.fr:
Even after the job has finished, with jobarchived running the entire time, from job start until job end?
jobarchived updates the stop_timestamp whenever it sees a job no longer being reported by jobmond.
So if jobmond is not running, the XML stream is lagging behind in time, or a PBS server query fails, stop_timestamps may sometimes be set incorrectly.
These should, however, be updated and corrected whenever a job reappears and finishes for real.
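The stop_timestamp behaviour described above can be sketched as follows. This is a hedged, illustrative sketch: the reconcile function, its signature, and the stop_timestamp field layout are assumptions for the example, not the actual jobarchived code.

```python
# Hypothetical sketch: mark stop_timestamp when a job disappears from
# jobmond's report, and correct it if the job reappears later.
import time

def reconcile(known_jobs, reported_ids, now=None):
    now = now if now is not None else int(time.time())
    for job_id, job in known_jobs.items():
        if job_id not in reported_ids and 'stop_timestamp' not in job:
            # job vanished from the report: assume it stopped now
            job['stop_timestamp'] = now
        elif job_id in reported_ids and 'stop_timestamp' in job:
            # job reappeared: the earlier stop was premature, clear it
            # so it can be set again when the job really finishes
            del job['stop_timestamp']

jobs = {'1': {}, '2': {}}
reconcile(jobs, {'1'}, now=1000)
print(jobs['2'])   # {'stop_timestamp': 1000}
reconcile(jobs, {'1', '2'}, now=1060)
print(jobs['2'])   # {} -- corrected once the job reappeared
```

This also illustrates why a lagging XML stream or a failed PBS query produces a wrong stop_timestamp: the job merely looks absent for one cycle, so it is stamped as stopped too early.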