Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#26 closed defect (fixed)

stop_timestamp is not correct when only one job runs.

Reported by: alexis.michon@… Owned by: bastiaans
Priority: normal Milestone: 0.3
Component: jobarchived Version: 0.2
Keywords: Cc:
Estimated Number of Hours:

Description

When a job runs alone on the cluster, the stop_timestamp saved in the database isn't correct.

The same job on the same cluster gets a correct stop_timestamp when at least one other job is running at the same time.

Change History (17)

comment:1 Changed 14 years ago by bastiaans

  • Component changed from general to jobarchived
  • Owner changed from somebody to bastiaans
  • Status changed from new to assigned

comment:2 in reply to: ↑ description Changed 14 years ago by bastiaans

  • Cc alexis.michon@… added

Even after the job has finished, with jobarchived running the entire time, from job start until job end?

jobarchived updates the stop_timestamp whenever it sees a job no longer being reported by jobmond.

So if jobmond is not running, the XML stream is lagging behind in time, or a PBS server query fails, stop_timestamps may sometimes be set incorrectly.

These should however be updated and corrected whenever a job reappears and finishes for real.

  • Ramon.

Replying to alexis.michon@ibcp.fr:

When a job runs alone on the cluster, the stop_timestamp saved in the database isn't correct.

The same job on the same cluster gets a correct stop_timestamp when at least one other job is running at the same time.
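The update behaviour Ramon describes above can be sketched roughly as follows. All names here (`update_jobs`, the attribute layout) are illustrative, not the actual jobarchived API; the point is only how a lagging report can first set, then correct, a stop_timestamp:

```python
# Sketch of the stop_timestamp behaviour described above: jobarchived marks a
# job as finished (setting stop_timestamp) as soon as jobmond stops reporting
# it, and corrects the timestamp if the job later reappears. Illustrative
# code, not the real jobarchived source.

def update_jobs(known_jobs, reported_ids, now):
    """known_jobs: {job_id: {'stop_timestamp': int or None}}
    reported_ids: set of job ids currently reported by jobmond."""
    for job_id, attrs in known_jobs.items():
        if job_id not in reported_ids:
            # Job vanished from the report: assume it finished now. If the
            # report was merely lagging, this timestamp is wrong...
            attrs['stop_timestamp'] = now
        else:
            # ...but it is corrected when the job reappears.
            attrs['stop_timestamp'] = None
    return known_jobs

jobs = {'42.pbs': {'stop_timestamp': None}}
update_jobs(jobs, set(), 1000)       # report lagged: job wrongly closed
assert jobs['42.pbs']['stop_timestamp'] == 1000
update_jobs(jobs, {'42.pbs'}, 1060)  # job reappears: timestamp reset
assert jobs['42.pbs']['stop_timestamp'] is None
```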

comment:3 Changed 14 years ago by bastiaans

  • Cc alexis.michon@… removed

comment:4 Changed 14 years ago by alexis.michon@…

Yes, jobarchived and jobmond were running from job start until job end. But it is possible that the XML stream or PBS lagged; it's not the first time this phenomenon has occurred. When the cluster is empty, I will run some tests. Can you leave this ticket open until I have made my tests?

A question: in the file jobarchived.py (class TorqueXMLHandler), when there are jobs on the cluster, self.heartbeat = 1, and job information is stored (in endDocument) only when self.heartbeat == 1. But when the last job has finished, self.heartbeat changes from 1 to 0, and the job is only stored when the next job runs. Is that right, or have I made a mistake in my reasoning?

comment:5 Changed 14 years ago by alexis.michon@…

Oops, I made a mistake.

self.heartbeat is always equal to 1.

comment:6 follow-up: Changed 14 years ago by alexis.michon@…

Oops, I made a mistake.

self.heartbeat is always equal to 1 as long as jobmond is running.

comment:7 in reply to: ↑ 6 Changed 14 years ago by bastiaans

heartbeat is set to 0 because in Python this also means false, whereas non-zero means true.

So jobs are only stored if an appropriate heartbeat from jobmond was found, to prevent the storage of ghost jobs/metrics.

Replying to alexis.michon@ibcp.fr:

Oops, I made a mistake.

self.heartbeat is always equal to 1 as long as jobmond is running.
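The heartbeat gating described above can be illustrated with a minimal sketch (illustrative names only, not the actual TorqueXMLHandler code):

```python
# Minimal illustration of the heartbeat check: in Python, 0 is false and any
# non-zero integer is true, so a plain "if self.heartbeat:" only stores jobs
# after a heartbeat from jobmond has been seen. Names are illustrative.

class HandlerSketch:
    def __init__(self):
        self.heartbeat = 0            # no heartbeat seen yet -> false
        self.stored = []

    def on_heartbeat(self):
        self.heartbeat = 1            # non-zero -> true

    def end_document(self, jobs):
        if self.heartbeat:            # skip storage without a heartbeat,
            self.stored.extend(jobs)  # preventing ghost jobs/metrics

h = HandlerSketch()
h.end_document(['ghost.pbs'])
assert h.stored == []                 # nothing stored without a heartbeat
h.on_heartbeat()
h.end_document(['42.pbs'])
assert h.stored == ['42.pbs']
```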

comment:8 Changed 14 years ago by alexis.michon@…

Okay, after some investigation: all jobs are affected; the recorded running time is always 60 seconds, plus or minus a few seconds.

That's strange: in your configuration everything is OK, and in mine it's buggy. I have made a patch which seems to work. Any ideas?

patch_jobarchive.py:

--- jobarchived.py.old  2007-05-03 23:01:24.465386350 +0200
+++ jobarchived.py      2007-05-03 22:52:33.897825334 +0200
@@ -623,6 +623,8 @@
                                                        self.jobs_to_store.append( job_id )

                                                debug_msg( 6, 'jobinfo for job %s has changed' %job_id )
+                                       else:
+                                               self.jobAttrs[ job_id ]['reported'] = jobinfo['reported']
                                else:
                                        self.jobAttrs[ job_id ] = jobinfo

comment:9 Changed 14 years ago by anonymous

patch_jobarchive.py with correct formatting

--- jobarchived.py.old  2007-05-03 23:01:24.465386350 +0200
+++ jobarchived.py      2007-05-03 22:52:33.897825334 +0200
@@ -623,6 +623,8 @@
                                                        self.jobs_to_store.append( job_id )

                                                debug_msg( 6, 'jobinfo for job %s has changed' %job_id )
+                                       else:
+                                               self.jobAttrs[ job_id ]['reported'] = jobinfo['reported']
                                else:
                                        self.jobAttrs[ job_id ] = jobinfo
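What the patch does, roughly: when a job's attributes are otherwise unchanged, it still copies the fresh 'reported' (last-seen) timestamp into the stored attributes, so a lone job's last-seen time keeps advancing instead of freezing near its first report. A standalone sketch (illustrative names; it assumes jobinfo carries a 'reported' field, and the change-comparison shown here is a simplification):

```python
# Sketch of the patched update logic: even when nothing else about a job has
# changed, refresh its 'reported' (last-seen) timestamp. Without the "else"
# branch, a lone job's last-seen time never advances, so its runtime appears
# frozen. Illustrative code, not jobarchived's actual source.

def update_job(job_attrs, job_id, jobinfo):
    def without_reported(d):
        return {k: v for k, v in d.items() if k != 'reported'}
    if job_id in job_attrs:
        if without_reported(job_attrs[job_id]) != without_reported(jobinfo):
            job_attrs[job_id] = jobinfo       # real change: store everything
        else:
            # the patched branch: refresh only the last-seen timestamp
            job_attrs[job_id]['reported'] = jobinfo['reported']
    else:
        job_attrs[job_id] = jobinfo           # first sighting of this job

attrs = {}
update_job(attrs, '42.pbs', {'nodes': 1, 'reported': 1000})
update_job(attrs, '42.pbs', {'nodes': 1, 'reported': 1060})
assert attrs['42.pbs']['reported'] == 1060    # timestamp kept advancing
```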

comment:10 Changed 14 years ago by a.michon@…

After some days of testing: my patch makes it possible to record jobs with correct start and stop timestamps, but the last job isn't archived until another job is submitted on the cluster. When it is archived, it has the correct timestamps.

comment:11 Changed 14 years ago by bastiaans

  • Cc alexis.michon@… added

Thanks a lot for your testing/debugging and patch! I will try to check it out soon and incorporate it into the source tree. Sorry for the late response.

comment:12 Changed 14 years ago by bastiaans

  • Cc alexis.michon@… removed

comment:13 Changed 14 years ago by bastiaans

I think I have found and fixed a bug that may have caused this; the fix is in changeset r360.

comment:14 Changed 14 years ago by bastiaans

  • Milestone set to 0.2.1

comment:15 Changed 14 years ago by bastiaans

  • Milestone 0.2.1 deleted

This may very well be related to the bug in jobmond, where it hangs/blocks reporting and sends no heartbeat when there are 0 jobs in the cluster (i.e. the cluster is empty).

Removing it from milestone 0.2.1 for now.

comment:16 Changed 14 years ago by bastiaans

  • Milestone set to 0.2.1
  • Resolution set to fixed
  • Status changed from assigned to closed

Yes, this is due to the jobmond bug, so it's fixed in 0.2.1.

Note: See TracTickets for help on using tickets.