Configuration

After installation each component requires additional configuration.

jobmond

Here is an example of the contents of a typical jobmond.conf file:

[DEFAULT]
# Specify debugging level here;
#
# 10 = gmetric commands
#
DEBUG_LEVEL     : 0

# Whether or not to run as a daemon in the background
#
DAEMONIZE       : 1

# Which batch system type is in use
# 
# Currently supported: pbs, slurm, sge (experimental), lsf (experimental)
#
BATCH_API       : pbs

# Which Batch server to monitor
#
BATCH_SERVER        : localhost

# Which queue(s) to report jobs of
# (optional)
#
#QUEUE          : long, short

# Polling interval for jobs, in seconds
#
# this directly affects how accurately the
# end time of a job can be determined
#
BATCH_POLL_INTERVAL : 30

# Location of gmond.conf
#
# Default: /etc/gmond.conf
#
# DEPRECATED!:      use GMETRIC_TARGET!
#
#GMOND_CONF     : /etc/gmond.conf

# Location of gmetric binary
#
# Default: /usr/bin/gmetric
#
# DEPRECATED!:      use GMETRIC_TARGET!
#
#GMETRIC_BINARY     : /usr/bin/gmetric

# Target of Gmetric's: where should we report to
# (usually: your udp_send_channel from gmond)
#
# Syntax: <ip>:<port>
#
GMETRIC_TARGET      : 239.2.11.71:8649

# Enable logging to syslog?
#
USE_SYSLOG                      : 1

# What level of messages should be logged to syslog?
#
# usually: lvl 0 (errors)
#
SYSLOG_LEVEL                    : 0

# Which facility to use in syslog
#
# Known:
#       KERN, USER, MAIL, DAEMON, AUTH, LPR,
#       NEWS, UUCP, CRON and LOCAL0 through LOCAL7
#
SYSLOG_FACILITY                 : DAEMON


# Whether or not to detect differences between
# the time on the Torque server and local time.
#
# Ideally both machines (if not the same)
# should have the same time (via ntp or whatever)
#
DETECT_TIME_DIFFS   : 1

# Regexp style hostname translation
#
# Useful if your Batch hostnames are not the same as your
# Ganglia hostnames (different network interfaces)
#
# Syntax: /orig/new/, /orig/new/
#
BATCH_HOST_TRANSLATE    :
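
The file above uses an INI-like `KEY : value` layout under a `[DEFAULT]` section, which Python's standard `configparser` module can read. The following is a minimal sketch of loading such a file, not jobmond's actual loader; the helper name `load_conf` and the shortened sample text are assumptions made for illustration.

```python
import configparser

# Shortened sample in the same layout as the jobmond.conf shown above.
SAMPLE = """\
[DEFAULT]
DEBUG_LEVEL     : 0
DAEMONIZE       : 1
BATCH_API       : pbs
BATCH_POLL_INTERVAL : 30
#QUEUE          : long, short
GMETRIC_TARGET      : 239.2.11.71:8649
BATCH_HOST_TRANSLATE    :
"""

def load_conf(text):
    """Parse jobmond.conf-style text and return the [DEFAULT] options (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    # Keep option names in upper case, exactly as written in the file.
    parser.optionxform = str
    parser.read_string(text)
    return parser.defaults()

conf = load_conf(SAMPLE)
print(conf["BATCH_API"])
print(int(conf["BATCH_POLL_INTERVAL"]))
# Commented-out options such as QUEUE are simply absent.
print("QUEUE" in conf)
```

Only the first `:` on a line separates key and value, so a value such as `239.2.11.71:8649` is kept intact.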

DEBUG_LEVEL

  • required
  • valid values: any number from 0 to 20

Sets which level of messages is logged to syslog (in daemon mode) or printed to stdout (in foreground mode).

DAEMONIZE

  • required
  • valid values: 0 or 1
    • 0 : Don't daemonize: run in the foreground : any DEBUG_LEVEL messages are sent to stdout
    • 1 : Daemonize: run in the background : any DEBUG_LEVEL messages are sent to syslog

Determines whether or not jobmond should run as a daemon in the background.

BATCH_API

  • valid values: pbs, slurm, sge (experimental), lsf (experimental)

Specifies which type of batch system (API) is used.

BATCH_SERVER

  • optional
  • valid values: any text string

Tells jobmond whether or not to connect to a remote batch server (of type BATCH_API).

  • If set: connect with BATCH_API to BATCH_SERVER
  • If not set: use BATCH_API on the local system where jobmond is running (this should be the batch server)

QUEUE

  • optional
  • valid values: any text string or comma separated list

Specifies which queues of the batch system to monitor. If you would like to limit job reporting to certain queues only, you can specify them here.

  • If set: only jobs that reside in QUEUE are reported
  • If not set: all jobs are reported
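
The set/not set behaviour can be sketched as follows. This is an illustrative sketch only, not jobmond's code: the function names and the shape of the job records (dictionaries with a `queue` key) are assumptions.

```python
def parse_queues(value):
    """Split a comma separated QUEUE value into a list of queue names."""
    return [q.strip() for q in value.split(",") if q.strip()]

def filter_jobs(jobs, queues):
    """Keep only jobs in one of `queues`; an empty list keeps every job."""
    if not queues:
        return list(jobs)
    return [job for job in jobs if job["queue"] in queues]

# Hypothetical job records, for illustration only.
jobs = [
    {"id": "101", "queue": "long"},
    {"id": "102", "queue": "short"},
    {"id": "103", "queue": "test"},
]

print(filter_jobs(jobs, parse_queues("long, short")))
print(filter_jobs(jobs, parse_queues("")))
```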

BATCH_POLL_INTERVAL

  • required
  • valid values: any number (of seconds)

Sets how often jobmond will poll the BATCH_API and how often this info will be reported.

This directly affects how accurately jobarchived can monitor for finished jobs. For example: if this is set to 180 seconds and a job has finished, it may take jobarchived up to 180 seconds to set a finished time in the job database.
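
The effect of the interval on the recorded end time can be illustrated with a small calculation. The sketch below uses a simplified model (an assumption, not a description of the real reporting chain): a job's end is only noticed at the first poll at or after the moment it actually finished. The function name `observed_finish` is hypothetical.

```python
import math

def observed_finish(actual_finish, first_poll, interval):
    """Return the time of the first poll at or after actual_finish (all in seconds)."""
    polls = max(0, math.ceil((actual_finish - first_poll) / interval))
    return first_poll + polls * interval

interval = 180
for finish in (901, 1000, 1080):
    seen = observed_finish(finish, 0, interval)
    # Print actual finish, observed finish and the resulting delay.
    print(finish, seen, seen - finish)
```

In this model the delay is always smaller than BATCH_POLL_INTERVAL, so a shorter interval gives more accurate end times at the cost of more frequent polling.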

GMOND_CONF

  • deprecated
  • optional
  • default: /etc/ganglia/gmond.conf
  • valid values: any text string

Specifies the location of Ganglia's gmond.conf:

  • If set: jobmond checks GMOND_CONF for the udp_send_channels to use for reporting job metrics
  • If not set: jobmond uses GMETRIC_TARGET for reporting job metrics

GMETRIC_BINARY

  • deprecated
  • optional
  • valid values: any text string

Specifies the location of Ganglia's gmetric binary. This forces jobmond to use Ganglia's gmetric binary to report jobs.

This should not be needed or used: jobmond uses its own internal gmetric handling, which is much faster.

  • If set: disables jobmond internal gmetric handling: submit gmetrics using GMETRIC_BINARY : requires GMOND_CONF to be set
  • If not set: jobmond internal gmetric handling is used

GMETRIC_TARGET

  • optional
  • valid values:
    • <host>:<port>

Specifies where to report job information to.

This can be a multicast or unicast address. A gmond must be running that has this address set as its udp_receive_channel, and proper network routes to this address have to be set up.

  • If set: report job information to GMETRIC_TARGET
  • If not set: report job information to the udp_send_channels found in GMOND_CONF : requires a valid GMOND_CONF
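
A `<host>:<port>` value can be split and classified as multicast or unicast with Python's standard `ipaddress` module. This is a minimal sketch under the assumption that the host part is either an IPv4 literal or a hostname; `parse_target` is a hypothetical helper, not part of jobmond.

```python
import ipaddress

def parse_target(value):
    """Split '<host>:<port>' into (host, port, is_multicast)."""
    host, _, port = value.strip().rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("expected <host>:<port>, got %r" % value)
    try:
        is_multicast = ipaddress.ip_address(host).is_multicast
    except ValueError:
        # The host is a name rather than an IP literal: treat it as unicast.
        is_multicast = False
    return host, int(port), is_multicast

print(parse_target("239.2.11.71:8649"))
print(parse_target("localhost:8649"))
```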

USE_SYSLOG

  • required
  • valid values: 0 or 1:
    • 0: Don't log messages
    • 1: Log any messages at DEBUG_LEVEL to syslog's SYSLOG_FACILITY

Specifies whether or not to use syslog for any messages.

SYSLOG_FACILITY

  • required
  • valid values: KERN, USER, MAIL, DAEMON, AUTH, LPR, NEWS, UUCP, CRON, LOCAL0, LOCAL1, LOCAL2, LOCAL3, LOCAL4, LOCAL5, LOCAL6, LOCAL7

Specifies to which syslog facility any syslog messages are sent.
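
The facility names listed above correspond to the standard syslog facilities. As an illustration, Python's `logging.handlers.SysLogHandler` exposes a `facility_names` mapping that can translate such a name to its numeric code. The helper `facility_code` below is hypothetical and only sketches the lookup; it is not how jobmond itself resolves the setting.

```python
from logging.handlers import SysLogHandler

def facility_code(name):
    """Translate a SYSLOG_FACILITY name such as 'DAEMON' to its numeric code."""
    try:
        return SysLogHandler.facility_names[name.strip().lower()]
    except KeyError:
        raise ValueError("unknown syslog facility: %r" % name)

print(facility_code("DAEMON"))
print(facility_code("LOCAL7"))
```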

DETECT_TIME_DIFFS

  • required
  • valid values: 0 or 1
    • 0: Don't detect time differences
    • 1: Detect time difference between BATCH_SERVER and localhost

When a remote BATCH_SERVER is used, this tells jobmond to detect and compensate for any time difference between localhost and the remote BATCH_SERVER.

Ideally both servers should utilize NTP to maintain the same date/time.
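
Compensating for a clock difference amounts to measuring the offset once and subtracting it from timestamps reported by the batch server. The sketch below shows that arithmetic only; how jobmond actually measures the offset is not described here, and both function names are assumptions.

```python
def time_offset(server_now, local_now):
    """Seconds by which the batch server clock is ahead of the local clock."""
    return server_now - local_now

def to_local(server_timestamp, offset):
    """Convert a timestamp taken on the batch server to local time."""
    return server_timestamp - offset

# Example: the server clock runs 120 seconds ahead of localhost.
offset = time_offset(server_now=1000120, local_now=1000000)
print(offset)
print(to_local(999920, offset))
```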

BATCH_HOST_TRANSLATE

  • required
  • default: (empty)
  • valid values: a comma separated list of: /<search pattern|regexp>/<replace pattern|regexp>/

Specifies a search and replace (regular expressions allowed) to apply to batch node hostnames before they are reported.

This is useful when your batch node hostnames and Ganglia hostnames are not the same.

For example: a job runs on a batch node with hostname infiniband-host1, but in Ganglia the node is named host1. You can then set BATCH_HOST_TRANSLATE: /infiniband-// to strip the infiniband- prefix from the hostname.

  • If not empty: all batch node names are passed through every specified regular expression search/replace statement before being reported
  • If empty: no search/replace is done
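
The search/replace behaviour can be sketched with Python's `re` module. This is a minimal illustration, assuming that patterns contain neither commas nor slashes; `parse_translate` and `translate_host` are hypothetical names, not jobmond's own functions. The pattern in the example includes the hyphen, so that infiniband-host1 becomes host1.

```python
import re

def parse_translate(value):
    """Parse '/orig/new/, /orig/new/' into a list of (compiled pattern, replacement)."""
    rules = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split("/")
        # '/orig/new/' splits into ['', 'orig', 'new', ''].
        if len(parts) != 4 or parts[0] or parts[3]:
            raise ValueError("expected /orig/new/, got %r" % item)
        rules.append((re.compile(parts[1]), parts[2]))
    return rules

def translate_host(hostname, rules):
    """Apply every search/replace rule to the hostname, in order."""
    for pattern, replacement in rules:
        hostname = pattern.sub(replacement, hostname)
    return hostname

rules = parse_translate("/infiniband-//")
print(translate_host("infiniband-host1", rules))
# An empty setting yields no rules, so hostnames pass through unchanged.
print(translate_host("infiniband-host1", parse_translate("")))
```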

jobarchived

Here is an example of the contents of a typical jobarchived.conf file:

[DEFAULT]
# Whether or not to run as a daemon in the background
#
DAEMONIZE           : 1

# Specify debugging level here (only when _not_ DAEMONIZE)
#
# 11 = XML: metrics
# 10 = XML: host, cluster, grid, ganglia
# 9  = RRD activity, gmetad config parsing
# 8  = RRD file activity
# 6  = SQL
# 1  = daemon threading
# 0  = errors
#
# default: 0
#
DEBUG_LEVEL         : 1

# Enable logging to syslog?
#
USE_SYSLOG          : 1

# What level of messages should be logged to syslog?
#
# usually: lvl 0 (errors)
#
SYSLOG_LEVEL            : 0

# Which facility to use in syslog
#
# Known:
#       KERN, USER, MAIL, DAEMON, AUTH, LPR,
#       NEWS, UUCP, CRON and LOCAL0 through LOCAL7
#
SYSLOG_FACILITY         : DAEMON

# Where is the gmetad.conf located
#
GMETAD_CONF         : /etc/ganglia/gmetad.conf

# Where to grab XML data from
# Usually: local gmetad (port 8651)
#
# Syntax: <hostname>:<port>
#
ARCHIVE_XMLSOURCE       : localhost:8651

# List of data_source names to archive for
#
# Syntax: [ "<clustername>", "<clustername>" ]
#
ARCHIVE_DATASOURCES     : [ "My Cluster" ]

# Number of hours to store in a single archived rrd
#
# If you would like fewer files you can set this higher,
# but that could degrade performance
#
# For now 12 hours seems to work: 2 periods per day
#
ARCHIVE_HOURS_PER_RRD       : 12

# Which metrics to exclude from archiving
# NOTE: This can be a regexp or a string
#
ARCHIVE_EXCLUDE_METRICS     : ".*Temp.*", ".*RPM.*", ".*Version.*", ".*Tag$", "boottime", "gexec", "os.*", "machine_type"

# Where to store the archived rrd's
#
ARCHIVE_PATH            : /usr/local/jobmonarch

# Archive's SQL dbase to use
#
# Syntax: <hostname>/<database>
#
JOB_SQL_DBASE           : localhost/jobarchive
JOB_SQL_USER                    : jobarchive

#JOB_SQL_PASSWORD        : 

# Timeout for jobs in archive
#
# After this number of hours, assume the job already finished while
# jobarchived was not running: it will then be marked finished in the database anyway
#
JOB_TIMEOUT         : 168

# Location of rrdtool binary
#
RRDTOOL             : /usr/bin/rrdtool

DEBUG_LEVEL

  • required
  • valid values: any number from 0 to 20

Sets which level of messages is logged to syslog (in daemon mode) or printed to stdout (in foreground mode).

DAEMONIZE

  • required
  • valid values: 0 or 1
    • 0 : Don't daemonize: run in the foreground : any DEBUG_LEVEL messages are sent to stdout
    • 1 : Daemonize: run in the background : any DEBUG_LEVEL messages are sent to syslog

Determines whether or not jobarchived should run as a daemon in the background.

USE_SYSLOG

  • required
  • valid values: 0 or 1:
    • 0: Don't log messages
    • 1: Log any messages at DEBUG_LEVEL to syslog's SYSLOG_FACILITY

Specifies whether or not to use syslog for any messages.

SYSLOG_FACILITY

  • required
  • valid values: KERN, USER, MAIL, DAEMON, AUTH, LPR, NEWS, UUCP, CRON, LOCAL0, LOCAL1, LOCAL2, LOCAL3, LOCAL4, LOCAL5, LOCAL6, LOCAL7

Specifies to which syslog facility any syslog messages are sent.

GMETAD_CONF

  • required
  • valid value: any text string

Specifies the location of Ganglia's gmetad.conf.

ARCHIVE_XMLSOURCE

  • required
  • valid values:
    • <host>:<port>

Specifies where to get the XML data from that is stored in the archive.

Normally this is the XML port of a gmetad daemon (usually port 8651).

ARCHIVE_DATASOURCES

  • required
  • valid values: comma separated list of (Ganglia) cluster names

Specifies which (Ganglia) clusters to store in the archive.

ARCHIVE_HOURS_PER_RRD

  • required
  • valid values: any number (of hours)

Sets how many hours of data each archive RRD holds. This determines how often new RRDs are created in the archive.
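
As a worked example of how the setting maps to archive periods: with 12 hours per RRD there are two periods per day. The sketch below assumes, purely for illustration, that periods are aligned to multiples of the period length since the Unix epoch; the alignment jobarchived really uses is not stated here, and both function names are hypothetical.

```python
def periods_per_day(hours_per_rrd):
    """Number of archive RRD periods that fit in one day."""
    return 24 / hours_per_rrd

def rrd_period_start(timestamp, hours_per_rrd):
    """Start (epoch seconds) of the period containing timestamp, assuming epoch alignment."""
    period = hours_per_rrd * 3600
    return timestamp - (timestamp % period)

print(periods_per_day(12))
print(rrd_period_start(100000, 12))
```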

ARCHIVE_EXCLUDE_METRICS

  • required
  • default: (empty)
  • valid values: a comma separated list of (regexp) patterns

Specifies any metrics to ignore and not store in the archive.
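
Matching metric names against such an exclude list can be sketched with Python's `re` module. The sketch assumes that each pattern is matched from the start of the metric name (as `re.match` does) and that patterns contain no commas; the real matching rules of jobarchived may differ. The names `parse_patterns` and `is_excluded` and the sample metric names are hypothetical.

```python
import re

def parse_patterns(value):
    """Split a comma separated list of quoted patterns into compiled regexps."""
    patterns = []
    for item in value.split(","):
        item = item.strip().strip('"')
        if item:
            patterns.append(re.compile(item))
    return patterns

def is_excluded(metric, patterns):
    """True if any pattern matches from the start of the metric name."""
    return any(p.match(metric) for p in patterns)

# The value from the example configuration above.
EXCLUDE = '".*Temp.*", ".*RPM.*", ".*Version.*", ".*Tag$", "boottime", "gexec", "os.*", "machine_type"'
patterns = parse_patterns(EXCLUDE)

for metric in ("CPU1 Temp", "os_name", "load_one"):
    print(metric, is_excluded(metric, patterns))
```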

ARCHIVE_PATH

  • required
  • valid value: any existing directory

Specifies where jobarchived should store archive RRDs.

jobarchived will create: ARCHIVE_PATH/<cluster name>/<hostname>/<timeperiod

JOB_SQL_DBASE

  • required
  • valid value: <hostname>/<database>

Specifies which (Postgres) database to use to store jobs in.
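
The `<hostname>/<database>` value can be split into its two parts and, together with JOB_SQL_USER and JOB_SQL_PASSWORD, turned into a PostgreSQL keyword/value connection string. This is only a sketch: `parse_dbase` and `build_dsn` are hypothetical helpers, the way jobarchived really connects is not shown here, and values containing spaces are not quoted.

```python
def parse_dbase(value):
    """Split '<hostname>/<database>' into (hostname, database)."""
    host, sep, database = value.strip().partition("/")
    if not (sep and host and database):
        raise ValueError("expected <hostname>/<database>, got %r" % value)
    return host, database

def build_dsn(dbase, user, password=None):
    """Build a keyword/value connection string from the JOB_SQL_* settings."""
    host, database = parse_dbase(dbase)
    parts = ["host=%s" % host, "dbname=%s" % database, "user=%s" % user]
    if password:
        parts.append("password=%s" % password)
    return " ".join(parts)

print(build_dsn("localhost/jobarchive", "jobarchive"))
```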

JOB_SQL_USER

  • required
  • valid value: a postgres user with the correct database permissions

Specifies which username to use when connecting to JOB_SQL_DBASE

JOB_SQL_PASSWORD

  • required
  • valid value: any text string

Specifies which password to use when connecting to JOB_SQL_DBASE

JOB_TIMEOUT

  • required
  • valid value: any number (of hours)

Specifies after how many hours jobarchived's housekeeping considers a job timed out.

When a running job in the database is no longer present when jobarchived is started: it will then be closed and considered finished.
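
The timeout check itself is simple arithmetic on hours. The sketch below assumes the timeout is measured from a reference timestamp such as the last time the job was seen; which timestamp jobarchived really uses is not stated here, and `is_timed_out` is a hypothetical name.

```python
def is_timed_out(reference_time, now, timeout_hours):
    """True if more than timeout_hours have passed since reference_time (epoch seconds)."""
    return (now - reference_time) > timeout_hours * 3600

JOB_TIMEOUT = 168  # hours, as in the example configuration (one week)

print(is_timed_out(0, 168 * 3600 + 1, JOB_TIMEOUT))
print(is_timed_out(0, 24 * 3600, JOB_TIMEOUT))
```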

RRDTOOL

  • deprecated
  • optional
  • valid value: path

Normally jobarchived uses py-rrdtool to handle RRD operations.

This option is only used as a fallback when py-rrdtool is not present.

NOTE: this can slow down jobarchived significantly!

web

  1. Change your Ganglia's web template to Job Monarch
    vi /var/www/ganglia/conf.php
    
    $template_name = "job_monarch";
    
  2. Change Job Monarch's config to reflect your settings:
    vi /var/www/ganglia/addons/job_monarch/conf.php
    

( see config comments for syntax and explanation )

Last modified on 05/24/13 13:00:13