Opened 14 years ago
Closed 11 years ago
#76 closed defect (worksforme)
jobarchived does not change status to "F"
| Reported by: | j.kasiak@… | Owned by: | ramonb |
|---|---|---|---|
| Priority: | major | Milestone: | 1.0 |
| Component: | jobarchived | Version: | 0.3.1 |
| Keywords: | | Cc: | |
| Estimated Number of Hours: | | | |
Description
Jobarchived does not update a job's status to "F" once it finishes. Jobmond runs on the head node; gmetad runs on a separate box. I've narrowed down the problem. When I run this on my gmetad box:
telnet -l ganglia localhost 8651 | grep -i monarch | grep -i 23055
<METRIC NAME="MONARCH-JOB-23055-0" VAL="status=R start_timestamp=1269222985 name=STDIN poll_interval=30 queue=batch reported=1269223164 requested_time=100:00:00 queued_timestamp=1269222984 owner=user1 nodes=p340050" TYPE="string" UNITS="" TN="442" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmond">
Connection closed by foreign host.
The job is still there!!! Only a restart of gmetad clears it. This is a problem, since jobarchived parses this XML, puts this node in an array of active nodes, and never gets to set the job_status to "F".
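To check what jobarchived would see without grepping telnet output by hand, the MONARCH-JOB metric can be pulled apart into its fields. The following is a minimal sketch, not jobarchived's actual parser: the function names and the wrapping HOST element in the sample are made up for illustration, and it assumes no field value contains a space.

```python
import xml.etree.ElementTree as ET


def parse_job_val(val):
    """Split a 'key=value key=value ...' metric string into a dict."""
    fields = {}
    for token in val.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that carry no '='
            fields[key] = value
    return fields


def find_job_metrics(xml_text, job_id):
    """Return {metric name: parsed fields} for one job's MONARCH-JOB metrics."""
    prefix = "MONARCH-JOB-%s-" % job_id
    jobs = {}
    for metric in ET.fromstring(xml_text).iter("METRIC"):
        name = metric.get("NAME", "")
        if name.startswith(prefix):
            jobs[name] = parse_job_val(metric.get("VAL", ""))
    return jobs


# The METRIC attributes are copied from the output above; the HOST wrapper
# is a hypothetical stand-in for the rest of the gmetad XML tree.
SAMPLE = (
    '<HOST NAME="headnode">'
    '<METRIC NAME="MONARCH-JOB-23055-0" VAL="status=R start_timestamp=1269222985 '
    'name=STDIN poll_interval=30 queue=batch reported=1269223164 '
    'requested_time=100:00:00 queued_timestamp=1269222984 owner=user1 '
    'nodes=p340050" TYPE="string" UNITS="" TN="442" TMAX="60" DMAX="0" '
    'SLOPE="both" SOURCE="gmond"/>'
    '</HOST>'
)

job = find_job_metrics(SAMPLE, "23055")["MONARCH-JOB-23055-0"]
print(job["status"], job["reported"], job["poll_interval"])
```

As long as a metric like this keeps appearing with status=R, anything reading the XML will keep treating the job as running.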
How can I fix this? Thanks, Jan
Attachments (1)
Change History (11)
comment:1 Changed 14 years ago by ramonb
comment:2 Changed 14 years ago by ramonb
- Cc j.kasiak@… added
Have you tried waiting a minute or so?
It will only get the status F once the current timestamp is higher than reported_timestamp + poll_interval.
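The timing rule can be written out as a small check. This is only a sketch of the rule as read here, namely that a job counts as finished once the current time has passed its last reported time plus the poll interval; the function name is hypothetical and this is not taken from the jobarchived source.

```python
import time


def job_timed_out(reported, poll_interval, now=None):
    """True once 'now' is later than reported + poll_interval (assumed rule)."""
    if now is None:
        now = time.time()
    return now > reported + poll_interval


# Values from the metric in the description: reported=1269223164, poll_interval=30.
print(job_timed_out(1269223164, 30, now=1269223180))  # 16 s after the last report
print(job_timed_out(1269223164, 30, now=1269223606))  # 442 s after the last report
```

Under this reading, a job whose metric stopped updating should be eligible for "F" about one poll interval after its last report.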
comment:3 Changed 14 years ago by j.kasiak@…
Yes I've waited. It's been running for over a week now and no jobs have changed to 'F'.
comment:4 Changed 14 years ago by ramonb
Can you check whether the "start_timestamp" value is set for that job in the SQL database?
If you run jobarchived at debug level 1, does it log any "Found xx timed out jobs in database" messages or anything else out of the ordinary?
comment:5 Changed 14 years ago by ramonb
- Owner changed from somebody to ramonb
- Status changed from new to assigned
When I test it here locally it seems to work fine. The fact that the job stays in the XML in R state is the problem. Since that metric has a TMAX of 60 seconds, it should disappear if the job has finished and the XML is not updated (job no longer found in batch, or jobmond down).
This seems to indicate that either:
- jobmond does not properly update from the batch/Torque
- the job is still running
What does it say when you do:
qstat -f 23055
?
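For reference, the TN, TMAX and DMAX attributes of the metric quoted in the description can be compared directly. The sketch below assumes the usual Ganglia meaning of these attributes (TN: seconds since the last update, TMAX: expected maximum interval between updates, DMAX: seconds after which the metric is deleted, with 0 meaning never); the function and its state labels are made up for illustration.

```python
def metric_state(tn, tmax, dmax):
    """Classify a metric from its TN/TMAX/DMAX attributes (assumed semantics)."""
    if dmax > 0 and tn > dmax:
        return "expired"  # old enough to be deleted
    if tn > tmax:
        return "stale"    # overdue for an update, but still kept
    return "fresh"


# Attributes of MONARCH-JOB-23055-0 from the description:
# TN="442" TMAX="60" DMAX="0"
print(metric_state(442, 60, 0))
```

With DMAX at 0 the metric is overdue but, under these assumptions, never deleted, which would match a job lingering in the XML until gmetad is restarted.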
comment:6 Changed 14 years ago by j.kasiak@…
When I do qstat -f while the job is running I get:

Job Id: 32113.headnode
    Job_Name = STDIN
    Job_Owner = janek
    job_state = R
    queue = batch
    server = headnode
    Checkpoint = u
    ctime = Thu Apr 15 13:17:30 2010
    Error_Path = /dev/pts/1
    exec_host = node4_51/1
    Hold_Types = n
    interactive = True
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Apr 15 13:17:31 2010
    Output_Path = /dev/pts/1
    Priority = 0
    qtime = Thu Apr 15 13:17:30 2010
    Rerunable = False
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 100:00:00
    session_id = 19903
    Variable_List = PBS_O_HOME=/nfs/admin/janek,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=janek,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games,
        PBS_O_MAIL=/var/mail/janek,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=sirius,PBS_O_HOST=sirius,
        PBS_O_WORKDIR=/nfs/admin/janek,PBS_O_QUEUE=batch
    etime = Thu Apr 15 13:17:30 2010
    submit_args = -I

and when the job is done:

qstat -f 32113
qstat: Unknown Job Id 32113.headnode
comment:7 Changed 14 years ago by ramonb
Can you try removing this from your gmond.conf:
host_dmax = 0 /*secs */
and then restarting gmond, jobmond and gmetad, in that order?
comment:8 Changed 14 years ago by ramonb
Also remove
cleanup_threshold = 300 /*secs */
please.
comment:9 Changed 14 years ago by j.kasiak@…
I removed both lines from the configuration files for gmond and jobmonarch (jobmonarch has a separate configuration file because I don't want it to broadcast on the same channel as the head node).
It didn't work. I also tried setting the values to 60 seconds.
comment:10 Changed 11 years ago by ramonb
- Cc j.kasiak@… removed
- Milestone set to 1.0
- Resolution set to worksforme
- Status changed from assigned to closed
Fixed in 1.0.