Opened 11 years ago

Closed 8 years ago

#76 closed defect (worksforme)

jobarchived does not change status to "F"

Reported by: j.kasiak@…
Owned by: ramonb
Priority: major
Milestone: 1.0
Component: jobarchived
Version: 0.3.1
Keywords:
Cc:
Estimated Number of Hours:

Description

Jobarchived does not update a job's status to "F" once it finishes. Jobmond runs on the head node; gmetad runs on a separate box. I've narrowed down the problem. When I run this on my gmetad box:

telnet -l ganglia localhost 8651 | grep -i monarch | grep -i 23055

<METRIC NAME="MONARCH-JOB-23055-0" VAL="status=R start_timestamp=1269222985 name=STDIN poll_interval=30 queue=batch reported=1269223164 requested_time=100:00:00 queued_timestamp=1269222984 owner=user1 nodes=p340050" TYPE="string" UNITS="" TN="442" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmond">
Connection closed by foreign host.

The job is still there!!! Only a restart of gmetad clears this. This is a problem, since jobarchived parses this xml file and puts this node in an array of active nodes, and never gets to set the job_status to "F".
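
To make the parsing step concrete, here is a minimal Python sketch of how the MONARCH-JOB metrics could be pulled out of the gmetad XML and split into fields. It is not taken from the jobarchived source: the function name `parse_job_metrics`, the wrapping `HOST` element and its name are assumptions for illustration, and the `METRIC` tag is self-closed so the sample is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the metric quoted above. The HOST wrapper and its NAME are
# hypothetical, and the METRIC tag is self-closed to keep the XML well-formed.
SAMPLE = (
    '<HOST NAME="headnode">'
    '<METRIC NAME="MONARCH-JOB-23055-0" '
    'VAL="status=R start_timestamp=1269222985 name=STDIN poll_interval=30 '
    'queue=batch reported=1269223164 requested_time=100:00:00 '
    'queued_timestamp=1269222984 owner=user1 nodes=p340050" '
    'TYPE="string" UNITS="" TN="442" TMAX="60" DMAX="0" '
    'SLOPE="both" SOURCE="gmond"/>'
    '</HOST>'
)


def parse_job_metrics(xml_text):
    """Return {metric name: {field: value}} for every MONARCH-JOB metric."""
    jobs = {}
    root = ET.fromstring(xml_text)
    for metric in root.iter("METRIC"):
        name = metric.get("NAME", "")
        if not name.startswith("MONARCH-JOB-"):
            continue
        # VAL is a space-separated list of key=value pairs.
        pairs = metric.get("VAL", "").split()
        jobs[name] = dict(p.split("=", 1) for p in pairs if "=" in p)
    return jobs


jobs = parse_job_metrics(SAMPLE)
print(jobs["MONARCH-JOB-23055-0"]["status"])  # prints "R"
```

As long as a metric like this stays in the XML with status=R, any consumer that treats "present in the XML" as "active" will keep seeing the job as running.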

How can I fix this? Thanks, Jan

Attachments (1)

initial_conf.tar (20.0 KB) - added by j.kasiak@… 11 years ago.
My initial jobmonarch conf files

Change History (11)

comment:1 Changed 11 years ago by ramonb

Have you tried waiting a minute or so?

It will only get the status F once reported_timestamp + poll_interval is lower than the current timestamp, i.e. once the job has gone unreported for longer than one poll interval.
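
As a rough illustration of that timing rule, here is a small Python sketch. It assumes the rule is "a job counts as finished once it has gone unreported for longer than one poll interval"; the function name `is_timed_out` is hypothetical and this is not the jobarchived implementation. The `reported` and `poll_interval` values are the ones from the metric in the description.

```python
def is_timed_out(reported, poll_interval, now):
    """True once the job has not been reported for more than one poll interval."""
    return now > reported + poll_interval


reported = 1269223164   # "reported" field from the metric in the description
poll_interval = 30      # "poll_interval" field from the same metric

# 10 seconds after the last report: still inside the poll interval.
print(is_timed_out(reported, poll_interval, now=reported + 10))   # prints False

# A minute after the last report: the job should be eligible for status F.
print(is_timed_out(reported, poll_interval, now=reported + 60))   # prints True
```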

comment:2 Changed 11 years ago by ramonb

  • Cc j.kasiak@… added

comment:3 Changed 11 years ago by j.kasiak@…

Yes I've waited. It's been running for over a week now and no jobs have changed to 'F'.

comment:4 Changed 11 years ago by ramonb

Can you check whether the "start_timestamp" value is set for that job in the SQL database?

If you run jobarchived in debug level 1, does it log any "Found xx timed out jobs in database" messages or anything out of the ordinary?

comment:5 Changed 11 years ago by ramonb

  • Owner changed from somebody to ramonb
  • Status changed from new to assigned

When I test it here locally it seems to work fine. The fact that the job stays in the XML in R state is the problem. Since that metric has a TMAX of 60 seconds, it should disappear once the job has finished and the XML is no longer updated (job no longer found in batch, or jobmond down).

This seems to indicate that either:

  • jobmond does not properly update from the batch/Torque
  • the job is still running

What does it say when you do:

qstat -f 23055

?
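
For reading the age attributes on the metric from the description (TN="442" TMAX="60" DMAX="0"), here is a small Python sketch. It relies on the usual Ganglia meanings: TN is the number of seconds since the metric was last updated, TMAX is the maximum expected time between updates, and DMAX is the lifetime after which the metric is deleted, with 0 meaning it is never deleted. The function name `describe_metric_age` is hypothetical.

```python
def describe_metric_age(tn, tmax, dmax):
    """Return (overdue, expiry) for a metric's TN/TMAX/DMAX attributes."""
    # Overdue: no update has arrived within the expected interval.
    overdue = tn > tmax
    if dmax == 0:
        expiry = "never expires (DMAX=0)"
    elif tn > dmax:
        expiry = "past DMAX, should be dropped"
    else:
        expiry = "expires in %d s" % (dmax - tn)
    return overdue, expiry


# Values taken from the metric quoted in the description.
overdue, expiry = describe_metric_age(tn=442, tmax=60, dmax=0)
print(overdue, expiry)  # prints: True never expires (DMAX=0)
```

Read this way, the quoted metric had not been updated for 442 seconds, well past its TMAX, which suggests jobmond had stopped reporting the job; with DMAX at 0 the stale entry may simply never be expired.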

Changed 11 years ago by j.kasiak@…

My initial jobmonarch conf files

comment:6 Changed 11 years ago by j.kasiak@…

When I do qstat -f while the job is running I get:

Job Id: 32113.headnode
    Job_Name = STDIN
    Job_Owner = janek
    job_state = R
    queue = batch
    server = headnode
    Checkpoint = u
    ctime = Thu Apr 15 13:17:30 2010
    Error_Path = /dev/pts/1
    exec_host = node4_51/1
    Hold_Types = n
    interactive = True
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Apr 15 13:17:31 2010
    Output_Path = /dev/pts/1
    Priority = 0
    qtime = Thu Apr 15 13:17:30 2010
    Rerunable = False
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 100:00:00
    session_id = 19903
    Variable_List = PBS_O_HOME=/nfs/admin/janek,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=janek,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games,
        PBS_O_MAIL=/var/mail/janek,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=sirius,PBS_O_HOST=sirius,
        PBS_O_WORKDIR=/nfs/admin/janek,PBS_O_QUEUE=batch
    etime = Thu Apr 15 13:17:30 2010
    submit_args = -I

and when the job is done:

qstat -f 32113
qstat: Unknown Job Id 32113.headnode

comment:7 Changed 11 years ago by ramonb

Can you try removing this from your gmond.conf:

  host_dmax = 0 /*secs */

and restarting gmond/jobmond/gmetad, in that order

comment:8 Changed 11 years ago by ramonb

also remove

  cleanup_threshold = 300 /*secs */

please.

comment:9 Changed 11 years ago by j.kasiak@…

I removed both of the lines from the configuration files for gmond and jobmonarch (jobmonarch has a separate configuration file because I don't want it to broadcast on the same channel as the headnode).

It didn't work. I also tried setting the values to 60 seconds.
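
Since two separate configuration files are involved, a quick way to confirm which expiry settings are still active in each one is to scan the file contents. This is a minimal Python sketch under stated assumptions: the function name `expiry_settings` and the sample text are hypothetical, and only uncommented `host_dmax` and `cleanup_threshold` lines of the form shown in the comments above are matched.

```python
import re

# Matches uncommented lines such as "host_dmax = 0 /*secs */".
SETTING_RE = re.compile(
    r"^\s*(host_dmax|cleanup_threshold)\s*=\s*(\d+)", re.MULTILINE
)


def expiry_settings(conf_text):
    """Return {setting name: integer value} for the settings found in conf_text."""
    return {key: int(value) for key, value in SETTING_RE.findall(conf_text)}


# Hypothetical excerpt; in practice pass in the text of each config file.
SAMPLE_CONF = """
globals {
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
}
"""

settings = expiry_settings(SAMPLE_CONF)
print(settings)  # prints: {'host_dmax': 0, 'cleanup_threshold': 300}
```

An empty result for a file means neither setting is present there, so the defaults of that gmond instance apply.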

comment:10 Changed 8 years ago by ramonb

  • Cc j.kasiak@… removed
  • Milestone set to 1.0
  • Resolution set to worksforme
  • Status changed from assigned to closed

fixed in 1.0
