DESCRIPTION
===========

  Job Monarch is a set of tools to monitor and optionally archive (batch) job
  information.

  It is an add-on for the Ganglia monitoring system and plugs into an existing
  Ganglia setup.

  To view an operational setup with Job Monarch, have a look here:
  http://ganglia.sara.nl/

  Job Monarch stands for 'Job Monitoring and Archiving' tool and consists of
  three (3) components:

  * jobmond

    The Job Monitoring Daemon.

    Gathers PBS/Torque batch statistics on jobs/nodes and submits them into
    Ganglia's XML stream.

    Through this daemon, users can view the PBS/Torque batch system and the
    jobs/nodes that are in it (whether running or queued).

  * jobarchived (optional)

    The Job Archiving Daemon.

    Listens to Ganglia's XML stream and archives the job and node statistics.
    It stores the job statistics in a PostgreSQL database and the node
    statistics in RRD files.

    Through this daemon, users can look up an old/finished job and view all
    its statistics.

    Optional: run this daemon only if your users have a use for it. It can be
    a heavy application to run, and not everyone needs it.

    - Multithreaded:       Will not miss any data regardless of (slow) storage
    - Staged writing:      Spreads load over longer time periods
    - High precision RRDs: Allow zooming in on old periods with high precision
    - Time period RRDs:    Allow for a smaller number of files while still
                           keeping the advantage of small disk usage

  * web

    The Job Monarch web interface.

    This interfaces with the jobmond data and (optionally) the jobarchived
    data, and presents the data and graphs.

    It uses a layout/setup similar to Ganglia itself, so navigation and usage
    are intuitive.

    - Graphical usage: Displays a graphical cluster overview so you can see
                       the cluster (job) state in one view/image, plus a pie
                       chart with information relevant to your current view

    - Filters:         Ability to filter output to limit the information
                       displayed (useful for clusters with 500+ jobs).
                       This also filters the graphical overview image and the
                       pie chart, so you only see the data relevant to the
                       filter

    - Archive:         When jobarchived is enabled, users can go back as far
                       as recorded in the database or archived RRDs to find
                       out what happened to a crashed or old job

    - Zoom ability:    Users can zoom into a time period as small as the
                       smallest grain of the RRDs (typically down to 10
                       seconds) when a jobarchived is present

REQUIREMENTS
============

  all:

    - Python v2.3 or higher

  jobmond:

    - pbs_python v2.8.2 or higher
      https://subtrac.sara.nl/oss/pbs_python/

    - gmond v3.0.1 or higher
      http://www.ganglia.info/

  jobarchived:

    - PostgreSQL v7.xx
      http://www.postgres.org/

    - rrdtool v1.xx
      http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/

    - py-rrdtool
      http://sourceforge.net/projects/py-rrdtool/

    - python-pgsql v4.x.x
      http://sourceforge.net/projects/pypgsql/

    - gmetad v3.x.x
      http://www.ganglia.info/

  web:

    - PHP v4.1 or higher
      http://www.php.net

    - php-pgsql v4.x.x
      (should come with PostgreSQL)

    - GD v2.x
      http://www.boutell.com/gd/

    - Ganglia web frontend v3.x.x
      http://www.ganglia.info

INSTALLATION
============

  Before installing the software, make sure you meet the requirements listed
  above.

  NOTE: You can install to other paths/directories if your setup is different.

  * jobmond

    1. Copy jobmond.py:

       > cp jobmond/jobmond.py /usr/local/sbin/jobmond.py

    2. Copy jobmond.conf:

       > cp jobmond/jobmond.conf /etc/jobmond.conf

  * jobarchived

    1. Create a PostgreSQL database for jobarchived:

       > createdb jobarchive

    2. Set up jobarchived's tables:

       > psql -f jobarchived/job_dbase.sql jobarchive

    3. Copy jobarchived.conf:

       > cp jobarchived/jobarchived.conf /etc/jobarchived.conf

    4. Copy jobarchived.py and DBClass.py:

       > cp jobarchived/jobarchived.py /usr/local/sbin/jobarchived.py
       > cp jobarchived/DBClass.py /usr/local/sbin/DBClass.py

  * web

    1. Copy the Job Monarch template to your Ganglia installation:

       > cp -a web/templates/job_monarch /var/www/ganglia/templates

    2. Copy the web interface files to the addon directory in Ganglia:

       > cp -a web/addons/job_monarch /var/www/ganglia/addons

CONFIGURATION
=============

  After installation, each component requires additional configuration.

  * jobmond

    1. Edit jobmond's config to reflect your settings:

       - In /etc/jobmond.conf
         ( see config comments for syntax and explanation )

  * jobarchived

    1. Edit jobarchived's config to reflect your settings:

       - In /etc/jobarchived.conf
         ( see config comments for syntax and explanation )

  * web

    1. Change Ganglia's web template to Job Monarch:

       - In /var/www/ganglia/conf.php:

         > $template_name = "job_monarch";

    2. Change Job Monarch's config to reflect your settings:

       - In /var/www/ganglia/addons/job_monarch/conf.php
         ( see config comments for syntax and explanation )

START
=====

  * jobmond

    The Job Monitor has to run on a machine that is allowed to query the
    PBS/Torque server. If 'acl_hosts' is enabled on your PBS/Torque server,
    make sure jobmond's machine is listed in it.

    1. Start the Job Monitor:

       > /usr/local/sbin/jobmond.py -c /etc/jobmond.conf

  * jobarchived

    1. Start the Job Archiver:

       > /usr/local/sbin/jobarchived.py -c /etc/jobarchived.conf

  * web

    Nothing needs to be (re)started.
    ( make sure the PostgreSQL database is running, though )

CONTACT
=======

  To contact the author for anything from bugfixes to flame/hate mail:

  * Ramon Bastiaans
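
INSTALLATION CHECK (EXAMPLE)
============================

  The copy steps under INSTALLATION can be verified with a small script. The
  following is only a sketch and is not part of Job Monarch: the helper name
  missing_files and the EXPECTED_FILES table are hypothetical, and the paths
  assume the default locations used in this README. Adjust them if you
  installed elsewhere.

```python
import os

# Default install locations, copied from the INSTALLATION section.
# Assumption: nothing was installed to a different path.
EXPECTED_FILES = {
    "jobmond": [
        "/usr/local/sbin/jobmond.py",
        "/etc/jobmond.conf",
    ],
    "jobarchived": [
        "/usr/local/sbin/jobarchived.py",
        "/usr/local/sbin/DBClass.py",
        "/etc/jobarchived.conf",
    ],
    "web": [
        "/var/www/ganglia/templates/job_monarch",
        "/var/www/ganglia/addons/job_monarch",
    ],
}


def missing_files(expected=EXPECTED_FILES, exists=os.path.exists):
    """Return {component: [absent paths]} for components with missing files.

    `exists` is injectable so the check can be exercised without touching
    the real filesystem.
    """
    missing = {}
    for component, paths in expected.items():
        absent = [path for path in paths if not exists(path)]
        if absent:
            missing[component] = absent
    return missing


if __name__ == "__main__":
    # Print one line per missing file; print nothing if all are present.
    for component, paths in sorted(missing_files().items()):
        for path in paths:
            print("%s: missing %s" % (component, path))
```

  Run it after the copy steps. No output means every expected file or
  directory was found. It only checks that the paths exist; it does not
  validate the configuration files or the database.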