Opened 11 years ago

Closed 11 years ago

#163 closed defect (fixed)

Exception in thread store_metric_thread

Reported by: vitt@…
Owned by: ramonb
Priority: minor
Milestone: 1.1
Component: jobarchived
Version: 1.0
Keywords:
Cc:
Estimated Number of Hours:

Description

I have tried version 1.0. The job archive is available in the web interface and shows the list of archived jobs, but no metrics are being stored.

[root@master ~]# service jobarchived start
Starting Job Archiving Daemon: Sun 05 May 2013 23:53:00 - XML: Handler created
Sun 05 May 2013 23:53:00 - Checking database..
Sun 05 May 2013 23:53:00 - Check done.
Sun 05 May 2013 23:53:00 - Checking rrd archive..
Sun 05 May 2013 23:53:00 - Check done.
Sun 05 May 2013 23:53:00 - job_xml_thread(): started.
Sun 05 May 2013 23:53:00 - job_xml_thread(): Retrieving XML data..
Sun 05 May 2013 23:53:00 - job_xml_thread(): Done retrieving: data size 37183
Sun 05 May 2013 23:53:00 - job_xml_thread(): Parsing XML..
Sun 05 May 2013 23:53:00 - main threading started.
Sun 05 May 2013 23:53:00 - ganglia_xml_thread(): started.
Sun 05 May 2013 23:53:00 - ganglia_xml_thread(): Sleeping.. (15s)
Sun 05 May 2013 23:53:00 - ganglia_parse_thread(): started.
Sun 05 May 2013 23:53:00 - ganglia_parse_thread(): Retrieving XML data..
Sun 05 May 2013 23:53:00 - ganglia_parse_thread(): Done retrieving: data size 37183
Sun 05 May 2013 23:53:00 - ganglia_parse_thread(): Parsing XML..
Sun 05 May 2013 23:53:00 - ganglia_store_metric_thread(): started.
Sun 05 May 2013 23:53:00 - ganglia_store_metric_thread(): Storing data..
Sun 05 May 2013 23:53:00 - ganglia_store_thread(): started.
Sun 05 May 2013 23:53:00 - Entering storeMetrics()
Sun 05 May 2013 23:53:00 - size of cluster 'Test Cluster': 0 hosts 0 metrics 0 values 0 bits 0 bytes 
Sun 05 May 2013 23:53:00 - ganglia_store_thread(): Sleeping.. (60s)
Sun 05 May 2013 23:53:00 - Leaving storeMetrics()
Sun 05 May 2013 23:53:00 - ganglia_store_metric_thread(): Done storing.
Sun 05 May 2013 23:53:00 - ganglia_store_metric_thread(): finished.
Sun 05 May 2013 23:53:00 - XML: Start document
Sun 05 May 2013 23:53:00 - XML: Processed 518 elements - found 0 jobs
Sun 05 May 2013 23:53:00 - job_xml_thread(): Found 0 updated jobs.
Sun 05 May 2013 23:53:00 - job_xml_thread(): No jobs to store.
Sun 05 May 2013 23:53:00 - job_xml_thread(): Done parsing.
Sun 05 May 2013 23:53:00 - job_xml_thread(): Sleeping.. (15s)
Sun 05 May 2013 23:53:00 - ganglia_parse_thread(): Done parsing.
Sun 05 May 2013 23:53:00 - ganglia_parse_thread(): finished.
Sun 05 May 2013 23:53:15 - ganglia_xml_thread(): Done sleeping.
Sun 05 May 2013 23:53:15 - ganglia_xml_thread(): finished.
Sun 05 May 2013 23:53:15 - ganglia_parse_thread(): started.
Sun 05 May 2013 23:53:15 - ganglia_parse_thread(): Retrieving XML data..
Sun 05 May 2013 23:53:15 - ganglia_xml_thread(): started.
Sun 05 May 2013 23:53:15 - ganglia_xml_thread(): Sleeping.. (15s)
Sun 05 May 2013 23:53:15 - ganglia_parse_thread(): Done retrieving: data size 37196
Sun 05 May 2013 23:53:15 - ganglia_parse_thread(): Parsing XML..
Sun 05 May 2013 23:53:15 - job_xml_thread(): Retrieving XML data..
Sun 05 May 2013 23:53:15 - job_xml_thread(): Done retrieving: data size 37196
Sun 05 May 2013 23:53:15 - job_xml_thread(): Parsing XML..
Sun 05 May 2013 23:53:15 - XML: Start document
Sun 05 May 2013 23:53:15 - XML: Processed 518 elements - found 0 jobs
Sun 05 May 2013 23:53:15 - job_xml_thread(): Found 0 updated jobs.
Sun 05 May 2013 23:53:15 - job_xml_thread(): No jobs to store.
Sun 05 May 2013 23:53:15 - job_xml_thread(): Done parsing.
Sun 05 May 2013 23:53:15 - job_xml_thread(): Sleeping.. (15s)
Sun 05 May 2013 23:53:15 - ganglia_parse_thread(): Done parsing.
Sun 05 May 2013 23:53:15 - ganglia_parse_thread(): finished.
Sun 05 May 2013 23:53:30 - ganglia_xml_thread(): Done sleeping.
Sun 05 May 2013 23:53:30 - ganglia_xml_thread(): finished.
Sun 05 May 2013 23:53:30 - ganglia_xml_thread(): started.
Sun 05 May 2013 23:53:30 - ganglia_xml_thread(): Sleeping.. (15s)
Sun 05 May 2013 23:53:30 - ganglia_parse_thread(): started.
Sun 05 May 2013 23:53:30 - ganglia_parse_thread(): Retrieving XML data..
Sun 05 May 2013 23:53:30 - ganglia_parse_thread(): Done retrieving: data size 37194
Sun 05 May 2013 23:53:30 - ganglia_parse_thread(): Parsing XML..
Sun 05 May 2013 23:53:30 - job_xml_thread(): Retrieving XML data..
Sun 05 May 2013 23:53:30 - job_xml_thread(): Done retrieving: data size 37194
Sun 05 May 2013 23:53:30 - job_xml_thread(): Parsing XML..
Sun 05 May 2013 23:53:30 - XML: Start document
Sun 05 May 2013 23:53:30 - XML: Processed 518 elements - found 0 jobs
Sun 05 May 2013 23:53:30 - job_xml_thread(): Found 0 updated jobs.
Sun 05 May 2013 23:53:30 - job_xml_thread(): No jobs to store.
Sun 05 May 2013 23:53:30 - job_xml_thread(): Done parsing.
Sun 05 May 2013 23:53:30 - job_xml_thread(): Sleeping.. (15s)
Sun 05 May 2013 23:53:30 - ganglia_parse_thread(): Done parsing.
Sun 05 May 2013 23:53:30 - ganglia_parse_thread(): finished.
Sun 05 May 2013 23:53:45 - ganglia_xml_thread(): Done sleeping.
Sun 05 May 2013 23:53:45 - ganglia_xml_thread(): finished.
Sun 05 May 2013 23:53:45 - ganglia_parse_thread(): started.
Sun 05 May 2013 23:53:45 - ganglia_parse_thread(): Retrieving XML data..
Sun 05 May 2013 23:53:45 - ganglia_xml_thread(): started.
Sun 05 May 2013 23:53:45 - ganglia_xml_thread(): Sleeping.. (15s)
Sun 05 May 2013 23:53:45 - ganglia_parse_thread(): Done retrieving: data size 37162
Sun 05 May 2013 23:53:45 - ganglia_parse_thread(): Parsing XML..
Sun 05 May 2013 23:53:45 - ganglia_parse_thread(): Done parsing.
Sun 05 May 2013 23:53:45 - ganglia_parse_thread(): finished.
Sun 05 May 2013 23:53:45 - job_xml_thread(): Retrieving XML data..
Sun 05 May 2013 23:53:45 - job_xml_thread(): Done retrieving: data size 37162
Sun 05 May 2013 23:53:45 - job_xml_thread(): Parsing XML..
Sun 05 May 2013 23:53:45 - XML: Start document
Sun 05 May 2013 23:53:45 - XML: Processed 518 elements - found 0 jobs
Sun 05 May 2013 23:53:45 - job_xml_thread(): Found 0 updated jobs.
Sun 05 May 2013 23:53:45 - job_xml_thread(): No jobs to store.
Sun 05 May 2013 23:53:45 - job_xml_thread(): Done parsing.
Sun 05 May 2013 23:53:45 - job_xml_thread(): Sleeping.. (15s)
Sun 05 May 2013 23:54:00 - ganglia_store_thread(): Done sleeping.
Sun 05 May 2013 23:54:00 - ganglia_store_thread(): finished.
Sun 05 May 2013 23:54:00 - ganglia_store_metric_thread(): started.
Sun 05 May 2013 23:54:00 - ganglia_store_metric_thread(): Storing data..
Sun 05 May 2013 23:54:00 - Entering storeMetrics()
Sun 05 May 2013 23:54:00 - size of cluster 'Test Cluster': 1 hosts 97 metrics 388 values 6172 bits 771 bytes 
Sun 05 May 2013 23:54:00 - ganglia_store_thread(): started.
Sun 05 May 2013 23:54:00 - ganglia_store_thread(): Sleeping.. (60s)
Exception in thread store_metric_thread:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/sbin/jobarchived", line 1464, in storeThread
    ret = self.myXMLHandler.storeMetrics()
  File "/usr/sbin/jobarchived", line 1188, in storeMetrics
    ret = rrdh.storeMetrics()
  File "/usr/sbin/jobarchived", line 1843, in storeMetrics
    create_ret = self.createCheck( hostname, metricname, period )
  File "/usr/sbin/jobarchived", line 1982, in createCheck
    heartbeat    = 8 * int( interval )
TypeError: int() argument must be a string or a number

Sun 05 May 2013 23:54:00 - ganglia_xml_thread(): Done sleeping.
Sun 05 May 2013 23:54:00 - ganglia_xml_thread(): finished.
Sun 05 May 2013 23:54:00 - job_xml_thread(): Retrieving XML data..
Sun 05 May 2013 23:54:00 - job_xml_thread(): Done retrieving: data size 37162
Sun 05 May 2013 23:54:00 - job_xml_thread(): Parsing XML..
Sun 05 May 2013 23:54:00 - XML: Start document
Sun 05 May 2013 23:54:00 - XML: Processed 518 elements - found 0 jobs
Sun 05 May 2013 23:54:00 - job_xml_thread(): Found 0 updated jobs.
Sun 05 May 2013 23:54:00 - job_xml_thread(): No jobs to store.
Sun 05 May 2013 23:54:00 - job_xml_thread(): Done parsing.
Sun 05 May 2013 23:54:00 - job_xml_thread(): Sleeping.. (15s)

Change History (5)

comment:1 Changed 11 years ago by ramonb

  • Owner changed from somebody to ramonb
  • Status changed from new to assigned

So there are jobs stored in the database, but no RRD graphs stored?

Will investigate

comment:2 Changed 11 years ago by ramonb

  • Milestone set to 1.1
  • Priority changed from normal to minor

This exception is probably triggered when the jobmond metrics are not completely/correctly reported, as caused by the network issue described below.

It is not a big bug; nevertheless, jobarchived should catch this and continue along, perhaps only issuing a warning that jobmond is not running (correctly).
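
A minimal sketch of the catch-and-warn idea, not the actual jobarchived code. Only the `8 * int( interval )` expression comes from the traceback above; the function name `safe_heartbeat`, the `default_interval` parameter and the fallback value of 15 seconds are assumptions made for illustration.

```python
import logging

log = logging.getLogger("jobarchived-sketch")


def safe_heartbeat(interval, default_interval=15):
    """Return 8 * interval, tolerating a missing or malformed interval.

    Hypothetical helper: instead of letting int() raise inside the
    store thread, log a warning and fall back to an assumed default.
    """
    try:
        seconds = int(interval)
    except (TypeError, ValueError):
        # TypeError covers None, ValueError covers strings such as "abc"
        log.warning(
            "polling interval %r is not a number; is jobmond running "
            "(correctly)? Falling back to %ds",
            interval, default_interval,
        )
        seconds = default_interval
    return 8 * seconds
```

With a guard like this, the store thread would keep running when the interval cannot be determined, at the cost of possibly using a heartbeat that does not match the real polling interval.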

Hi,
 
I had a wrong Ganglia configuration.
Torque on my test server reports the following host names: master.cluster (head node) and node01.cluster (compute node).
But gmond reports different host names (from /etc/hosts), which correspond to the configuration of my second network interface (Internet) on those hosts.
I added a static route to my first (cluster) interface and the problem was resolved (route add -host 239.2.11.71 dev eth0).
Currently the jobs and RRD graphs are stored in the database.
....
Tue 07 May 2013 00:20:42 - ganglia_store_thread(): started.
Tue 07 May 2013 00:20:42 - ganglia_store_thread(): Sleeping.. (60s)
Tue 07 May 2013 00:20:42 - size of cluster 'TestCluster': 2 hosts 192 metrics 768 values 12153 bits 1519 bytes 
Tue 07 May 2013 00:20:42 - Leaving storeMetrics()
Tue 07 May 2013 00:20:42 - Entering storeMetrics()
Tue 07 May 2013 00:20:42 - size of cluster 'Test Cluster': 2 hosts 192 metrics 0 values 0 bits 0 bytes 
Tue 07 May 2013 00:20:42 - Leaving storeMetrics()
Tue 07 May 2013 00:20:42 - ganglia_store_metric_thread(): Done storing.
Tue 07 May 2013 00:20:42 - ganglia_store_metric_thread(): finished.
Tue 07 May 2013 00:20:44 - job_xml_thread(): Retrieving XML data..
Tue 07 May 2013 00:20:44 - job_xml_thread(): Done retrieving: data size 70591
Tue 07 May 2013 00:20:44 - job_xml_thread(): Parsing XML..
....
Anyway, I am still working on the configuration of the network, Ganglia and jobmonarch, and I don't completely understand how I should set these up.
 
Best Regards

comment:3 Changed 11 years ago by ramonb

I'm having a hard time reproducing this, but it is related to Ganglia config parsing.

comment:4 Changed 11 years ago by ramonb

In 855:

jobarchived.py:

  • made data source polling interval parsing simpler
  • fixed: close() gmetad.conf after parsing is done
  • added: check if ARCHIVE_DATASOURCES is present in gmetad.conf: fatal error if not found
  • see #163
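
A rough sketch of what the interval parsing and the presence check could look like; this is not the code from changeset 855. It assumes the usual gmetad.conf form `data_source "name" [interval] host[:port] ...` with a default interval of 15 seconds, and it reads the third bullet as "every data source named in ARCHIVE_DATASOURCES must exist in gmetad.conf". The names `parse_data_sources` and `check_archive_sources` are hypothetical.

```python
import shlex

DEFAULT_INTERVAL = 15  # assumed gmetad default polling interval, in seconds


def parse_data_sources(conf_text):
    """Map each data_source name in gmetad.conf text to its polling interval."""
    sources = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line.startswith("data_source"):
            continue  # skips comments, blank lines and other directives
        # shlex keeps a quoted name such as "Test Cluster" as one token
        tokens = shlex.split(line, comments=True)
        if len(tokens) < 3 or tokens[0] != "data_source":
            continue
        name, rest = tokens[1], tokens[2:]
        # The interval is optional: a bare integer before the host list
        interval = int(rest[0]) if rest[0].isdigit() else DEFAULT_INTERVAL
        sources[name] = interval
    return sources


def check_archive_sources(archive_sources, conf_text):
    """Exit with a fatal error if a wanted data source is not configured."""
    sources = parse_data_sources(conf_text)
    missing = [name for name in archive_sources if name not in sources]
    if missing:
        raise SystemExit("fatal: not found in gmetad.conf: %s" % ", ".join(missing))
    return sources
```

Reading the file with `with open(path) as conf: text = conf.read()` and passing `text` to these functions would also take care of closing gmetad.conf once parsing is done.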

comment:5 Changed 11 years ago by ramonb

  • Resolution set to fixed
  • Status changed from assigned to closed

While I am unable to reproduce what exactly caused this Exception (regardless of (mis)configuration issues), I have now made the interval determination more robust. In addition, a check is now performed to prevent jobarchived.conf misconfiguration.

This Exception should no longer happen.
