Welcome to Job Monarch!

January 20th, 2014: Version 1.1.2 released

Download:

CHANGES:

RELEASE NOTES:

This release contains mainly bugfixes.

August 6th, 2013: Version 1.1.1 released

Download:

CHANGES:

RELEASE NOTES:

This release contains mainly bugfixes.

May 23rd, 2013: Version 1.1 released

Download:

CHANGES:

RELEASE NOTES:

This release adds support for the SLURM Workload Manager (requires the pyslurm module) and contains several bug fixes and improvements.

The packaging has been completely redone and rewritten thanks to Olivier Lahaye.

April 12th, 2013: Version 1.0 released

Download:

CHANGES:

RELEASE NOTES:

This is a new major version and as such is not compatible with older versions of Job Monarch, due to:

  • changes in the jobarchived database schema
  • changes in the jobmond protocol

If you have an extensive job database you might want to check the new database schema first before upgrading.

REQUIREMENTS:

The requirements have changed with the following differences:

  • web interface now requires ganglia-web2 version 3.5.0+
  • jobmond now requires Ganglia (gmond/gmetric) version 3.4.0+
  • jobarchived now requires the python module "psycopg2" and no longer uses PyPgSQL

Job Monarch is an add-on to the Ganglia Monitoring System that provides (batch) job monitoring and a graphical overview of clusters and assorted batch systems. Monarch is an abbreviation for Monitoring and Archiving, as Monarch also provides the ability to archive these job (monitoring) statistics so that your (batch) cluster users can look up information on old (and possibly failed) jobs to analyze possible problems.

Features

Job Monarch stands for 'Job Monitoring and Archiving' and consists of three components:

jobmond

The Job Monitoring Daemon.

Gathers batch statistics on jobs/nodes and submits them into Ganglia's XML stream.

Through this daemon, users are able to view the PBS/Torque batch system and the jobs/nodes that are in it (whether running or queued).
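As a rough illustration of what "submitting into Ganglia's XML stream" can look like, the sketch below builds a gmetric command line for one job record. It is not jobmond's actual code: the metric name prefix `EXAMPLE-JOB-`, the `key=value` layout of the metric value, and the helper `build_gmetric_command` are all assumptions made for this example. Only the standard gmetric options `--name`, `--value` and `--type` are used, and nothing is executed.

```python
import shlex


def build_gmetric_command(name, value, metric_type="string", gmetric_path="gmetric"):
    """Return the argv list for one gmetric call (nothing is executed here)."""
    return [
        gmetric_path,
        "--name", str(name),
        "--value", str(value),
        "--type", metric_type,
    ]


# Hypothetical job record; the real jobmond metric names and value layout may differ.
job = {"id": "1234.batch", "owner": "alice", "status": "R", "nodes": "node01;node02"}

# Flatten the record into one string value, with keys in a stable (sorted) order.
value = " ".join("%s=%s" % (key, job[key]) for key in sorted(job))

cmd = build_gmetric_command("EXAMPLE-JOB-" + job["id"], value)

# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

To actually publish the metric you could pass `cmd` to `subprocess.call`, assuming gmetric is installed and a gmond is reachable.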

Batch systems fully supported by Job Monarch:

Batch systems experimental support:

jobarchived (optional)

The Job Archiving Daemon.

Listens to Ganglia's XML stream and archives the job and node statistics. It stores the job statistics in a (Postgres) SQL database and the node statistics in RRD files.
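The split between job statistics and node statistics described above can be sketched as follows. This is a minimal illustration, not jobarchived's implementation: the sample XML is a heavily trimmed stand-in for what gmond emits, and the metric prefix `EXAMPLE-JOB-`, the `key=value` value layout and the function `split_metrics` are assumptions made for this example. A real archiver would write the first result to the SQL database and the second to RRD files.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written sample in the shape of Ganglia's XML output.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.4.0" SOURCE="gmond">
  <CLUSTER NAME="example" LOCALTIME="1390235651">
    <HOST NAME="node01" IP="10.0.0.1" REPORTED="1390235650">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="EXAMPLE-JOB-1234.batch" VAL="owner=alice status=R" TYPE="string" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def split_metrics(xml_text, job_prefix="EXAMPLE-JOB-"):
    """Separate job metrics (hypothetical prefix) from ordinary node metrics."""
    jobs = {}
    node_stats = {}
    root = ET.fromstring(xml_text)
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            name = metric.get("NAME")
            val = metric.get("VAL")
            if name.startswith(job_prefix):
                # Job metric: decode the "key=value key=value" string into a dict.
                job_id = name[len(job_prefix):]
                jobs[job_id] = dict(pair.split("=", 1) for pair in val.split())
            else:
                # Node metric: keep the raw value, grouped per host.
                node_stats.setdefault(host.get("NAME"), {})[name] = val
    return jobs, node_stats


jobs, node_stats = split_metrics(SAMPLE_XML)
print(jobs)
print(node_stats)
```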

Through this daemon, users are able to look up an old/finished job and view all its statistics.

Optional: You can choose to run this daemon if your users have a use for it. It can be a heavy application to run, and not everyone may need it.

  • Key features
    • Multithreaded
      Will not miss any data regardless of (slow) storage
    • Staged writing
      Spread load over bigger time periods
    • High precision RRDs
      Allow for zooming in on old periods with high precision
    • Timeperiod RRDs
      Allow for a smaller number of files while still keeping the advantage of small disk space
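The "staged writing" feature is only described in one line here, so the following is a guess at the general idea rather than a description of jobarchived's internals: incoming records are buffered in memory and written out in one batch once an interval has passed, so that slow storage is hit less often. The class `StagedWriter`, its parameters and the 60 second interval are hypothetical names and values chosen for this sketch.

```python
import time


class StagedWriter:
    """Buffer records and hand them to write_batch at most once per interval."""

    def __init__(self, write_batch, interval, clock=time.time):
        self.write_batch = write_batch  # callable that receives a list of records
        self.interval = interval        # minimum number of seconds between writes
        self.clock = clock              # injectable clock, handy for testing
        self.pending = []
        self.last_flush = clock()

    def add(self, record):
        """Queue one record; flush if the interval has elapsed."""
        self.pending.append(record)
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        """Write out everything that is pending and restart the interval."""
        if self.pending:
            self.write_batch(list(self.pending))
            self.pending = []
        self.last_flush = self.clock()


# Usage with a fake clock so the example runs instantly.
written = []
now = [0.0]
writer = StagedWriter(written.append, interval=60, clock=lambda: now[0])

writer.add(("node01", "load_one", 0.42))  # t=0: buffered
now[0] = 30.0
writer.add(("node01", "load_one", 0.40))  # t=30: still buffered
now[0] = 61.0
writer.add(("node01", "load_one", 0.38))  # t=61: interval elapsed, batch is written

print(len(written), "batch(es) written")
```

In a multithreaded daemon the buffer would also need a lock, and `flush` would typically be called once more at shutdown so that no pending records are lost.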

web

The Job Monarch web interface.

This interfaces with the jobmond data and (optionally) the jobarchived and presents the data and graphs.

It does this in a similar layout/setup as Ganglia itself, so the navigation and usage are intuitive.

  • Key features
    • Graphical usage
      Displays a graphical cluster overview so you can see the cluster (job) state in one view/image, plus an additional pie chart with relevant information on your current view
    • Filters
      Ability to filter output to limit the information displayed (useful for clusters with 500+ jobs). This also filters the graphical overview images and pie chart so you only see the data relevant to the filter
    • Archive
      When enabling jobarchived, users can go back as far as recorded in the database or archived RRDs to find out what happened to a crashed or old job
    • Zoom ability
      Users can zoom into a time period as small as the smallest grain of the RRDs (typically down to 10 seconds) when a jobarchived is present

Documentation

Visit our online documentation here:

Screenshots

You can have a look at a number of screenshots, displaying Job Monarch in action:

Working example preview

You can see a working preview/example here:

Download

You can grab the tarball from our ftp site:

There are also DEB and RPM packages available.

Source code

You can browse the current code here:

Or you can check out code (anonymous read-only) through subversion:

  • svn co https://oss.trac.surfsara.nl/jobmonarch/svn/ -- everything
  • svn co https://oss.trac.surfsara.nl/jobmonarch/svn/tags -- releases (stable)
  • svn co https://oss.trac.surfsara.nl/jobmonarch/svn/branches -- branches (stable)
  • svn co https://oss.trac.surfsara.nl/jobmonarch/svn/trunk -- current development (unstable)

Build packages

You can build the RPM and DEB packages or tarballs yourself from the SVN tree, through the Makefile.

make deb
make rpm
make tarball

This is useful if, for example, you want to change the web installdir of your packages, or simply want to test a development version.

Report bugs

You can create tickets and/or submit patches in our ticket system:

Links

  • pbs_python -- Homepage of pbs_python, this python module is used for gathering job statistics from PBS/Torque
  • Ganglia -- Homepage of The Ganglia Monitoring System

Contributing

If you have anything to contribute, you're always welcome. See the contact/mailing list details below on how to contact us.

Some examples:

  • Testing Job Monarch
  • Implementing/maintaining more batch systems support
    • LoadLeveler, SLURM support
    • LSF, SGE: regular maintainers, testers are needed
  • If you think you can contribute in any other way

Development team

Author: Ramon Bastiaans
Maintainer(s): Ramon Bastiaans, Sil Westerveld

Donations

While some time is spent on behalf of SURFsara, the project author also spends lots of his personal free time on Job Monarch.

If you would like to make a donation, that is possible through SourceForge:

Donate to Ramon Bastiaans

Current development

Look at the roadmap to see the current status of development:

Contact & Community

Two mailing lists have been set up to support the Job Monarch project.

For usage discussion and help:

For project development progress and discussion:

Last modified on 01/20/14 17:34:11