[221] | 1 | DESCRIPTION |
---|
| 2 | =========== |
---|
| 3 | |
---|
| 4 | Job Monarch is a set of tools to monitor and optionally archive (batch)job information. |
---|
| 5 | |
---|
| 6 | It is an addon for the Ganglia monitoring system and plugs into an existing Ganglia setup. |
---|
| 7 | |
---|
[222] | 8 | To view an operational setup with Job Monarch, have a look here: http://ganglia.sara.nl/ |
---|
[221] | 9 | |
---|
| 10 | |
---|
| 11 | Job Monarch stands for 'Job Monitoring and Archiving' tool and consists of three (3) components: |
---|
| 12 | |
---|
| 13 | * jobmond |
---|
| 14 | |
---|
| 15 | The Job Monitoring Daemon. |
---|
| 16 | |
---|
| 17 | Gathers PBS/Torque batch statistics on jobs/nodes and submits them into |
---|
| 18 | Ganglia's XML stream. |
---|
| 19 | |
---|
| 20 | Through this daemon, users are able to view the PBS/Torque batch system and the |
---|
| 21 | jobs/nodes that are in it (whether running or queued). |
---|
| 22 | |
---|
[232] | 23 | * jobarchived (optional) |
---|
[221] | 24 | |
---|
[232] | 25 | The Job Archiving Daemon. |
---|
[221] | 26 | |
---|
| 27 | Listens to Ganglia's XML stream and archives the job and node statistics. |
---|
| 28 | It stores the job statistics in a PostgreSQL database and the node statistics |
---|
| 29 | in RRD files. |
---|
| 30 | |
---|
| 31 | Through this daemon, users are able to look up an old/finished job |
---|
| 32 | and view all its statistics. |
---|
| 33 | |
---|
| 34 | Optional: only run this daemon if your users have a use for it, |
---|
[232] | 35 | as it can be a heavy application to run and not everyone needs it. |
---|
| 36 | |
---|
| 37 | - Multithreaded: Will not miss any data regardless of (slow) storage |
---|
| 38 | |
---|
| 39 | - Staged writing: Spread load over bigger time periods |
---|
| 40 | |
---|
| 41 | - High precision RRDs: Allow for zooming in on old periods with high precision |
---|
| 42 | |
---|
| 43 | - Timeperiod RRDs: Allow for a smaller number of files while still keeping the advantage |
---|
| 44 | of a small disk footprint |
---|
[221] | 45 | |
---|
| 46 | * web |
---|
| 47 | |
---|
| 48 | The Job Monarch web interface. |
---|
| 49 | |
---|
| 50 | This interfaces with the jobmond data and (optionally) the jobarchived and presents the |
---|
| 51 | data and graphs. |
---|
| 52 | |
---|
| 53 | It does this in a similar layout/setup as Ganglia itself, so the navigation and usage are intuitive. |
---|
| 54 | |
---|
[232] | 55 | - Graphical usage: Displays a graphical cluster overview so you can see the cluster (job) state |
---|
| 56 | in one view/image, plus an additional pie chart with relevant information on your |
---|
| 57 | current view |
---|
| 58 | |
---|
| 59 | - Filters: Ability to filter output to limit the information displayed (useful for those |
---|
| 60 | clusters with 500+ jobs). This also filters the graphical overview image |
---|
| 61 | and pie chart so you only see the data relevant to the filter |
---|
| 62 | |
---|
| 63 | - Archive: When jobarchived is enabled, users can go back as far as recorded in the database |
---|
| 64 | or archived RRDs to find out what happened to a crashed or old job |
---|
| 65 | |
---|
| 66 | - Zoom ability: Users can zoom into a time period as small as the smallest grain of the RRDs |
---|
| 67 | (typically down to 10 seconds) when a jobarchived is present |
---|
| 68 | |
---|
[221] | 69 | REQUIREMENTS |
---|
| 70 | ============ |
---|
| 71 | |
---|
[222] | 72 | all: |
---|
| 73 | |
---|
| 74 | - Python 2.3 or higher |
---|
| 75 | |
---|
[221] | 76 | jobmond: |
---|
| 77 | |
---|
[230] | 78 | - pbs_python v2.8.2 or higher |
---|
[366] | 79 | https://subtrac.sara.nl/oss/pbs_python/ |
---|
[221] | 80 | |
---|
[222] | 81 | - gmond v3.0.1 or higher |
---|
[366] | 82 | http://www.ganglia.info/ |
---|
[221] | 83 | |
---|
| 84 | jobarchived: |
---|
| 85 | |
---|
[223] | 86 | - PostgreSQL v7.xx |
---|
[366] | 87 | http://www.postgres.org/ |
---|
[221] | 88 | |
---|
| 89 | - rrdtool v1.xx |
---|
| 90 | http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/ |
---|
| 91 | |
---|
[366] | 92 | - py-rrdtool |
---|
| 93 | http://sourceforge.net/projects/py-rrdtool/ |
---|
| 94 | |
---|
[849] | 95 | - python-psycopg2 |
---|
[222] | 96 | http://sourceforge.net/projects/pypgsql/ |
---|
| 97 | |
---|
| 98 | - gmetad v3.x.x |
---|
[366] | 99 | http://www.ganglia.info/ |
---|
[221] | 100 | |
---|
| 101 | web: |
---|
| 102 | |
---|
[222] | 103 | - PHP v4.1 or higher |
---|
[221] | 104 | http://www.php.net |
---|
| 105 | |
---|
[222] | 106 | - php-pgsql v4.x.x |
---|
| 107 | (should come with Postgres) |
---|
[221] | 108 | |
---|
[843] | 109 | - php-mbstring |
---|
| 110 | |
---|
[222] | 111 | - GD v2.x |
---|
| 112 | http://www.boutell.com/gd/ |
---|
| 113 | |
---|
| 114 | - Ganglia web frontend v3.x.x |
---|
[223] | 115 | http://www.ganglia.info |
---|
[222] | 116 | |
---|
| 117 | |
---|
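Before installing, it can help to confirm that the external programs from the requirements listed above are on the PATH. The following is a minimal sketch, not part of Job Monarch: the helper name `missing_commands` is hypothetical, the binary names in the usage comments are assumptions that may differ per distribution, and the sketch does not check version numbers or Python/PHP modules.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Job Monarch.
# Prints the names of the given commands that are NOT found on the PATH.
missing_commands() {
    for cmd in "$@"; do
        # 'command -v' is the POSIX way to test whether a command exists.
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
        fi
    done
}

# Usage (binary names are assumptions; adjust to your distribution):
#   missing_commands python gmond            # for jobmond
#   missing_commands psql rrdtool gmetad     # for jobarchived
#   missing_commands php                     # for web
```

An empty result means every listed command was found; any printed name still needs to be installed or added to the PATH.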
[221] | 118 | INSTALLATION |
---|
| 119 | ============ |
---|
| 120 | |
---|
| 121 | Prior to installing the software make sure you meet the necessary requirements as |
---|
| 122 | mentioned above. |
---|
| 123 | |
---|
[222] | 124 | NOTE: You can choose to install to other paths/directories if your setup is different. |
---|
[221] | 125 | |
---|
[222] | 126 | * jobmond |
---|
[221] | 127 | |
---|
[222] | 128 | 1. Copy jobmond.py: |
---|
| 129 | |
---|
| 130 | > cp jobmond/jobmond.py /usr/local/sbin/jobmond.py |
---|
| 131 | |
---|
| 132 | 2. Copy jobmond.conf: |
---|
| 133 | |
---|
| 134 | > cp jobmond/jobmond.conf /etc/jobmond.conf |
---|
| 135 | |
---|
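If your setup uses other directories, the two jobmond copy steps above can be wrapped so the destinations become parameters. This is a sketch only: the function name `install_jobmond` and the alternative paths in the usage comment are hypothetical, and the defaults simply repeat the paths used in the steps above.

```shell
#!/bin/sh
# Hypothetical wrapper around the two copy steps above; not part of Job Monarch.
# Arguments: source tree, sbin directory, config directory.
install_jobmond() {
    src=$1
    sbin_dir=${2:-/usr/local/sbin}
    etc_dir=${3:-/etc}

    # Stop at the first failing copy instead of leaving a partial install.
    cp "$src/jobmond/jobmond.py" "$sbin_dir/jobmond.py" || return 1
    cp "$src/jobmond/jobmond.conf" "$etc_dir/jobmond.conf" || return 1
}

# Usage, from the unpacked Job Monarch source directory:
#   install_jobmond .                       # default paths from the steps above
#   install_jobmond . /opt/sbin /opt/etc    # alternative (example) paths
```

If you install to a different config directory, remember to pass the matching path with `-c` when starting the daemon.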
| 136 | * jobarchived |
---|
| 137 | |
---|
| 138 | 1. Create a PostgreSQL database for jobarchived: |
---|
| 139 | |
---|
| 140 | > createdb jobarchive |
---|
| 141 | |
---|
| 142 | 2. Set up jobarchived's tables: |
---|
| 143 | |
---|
| 144 | > psql -f jobarchived/job_dbase.sql jobarchive |
---|
| 145 | |
---|
| 146 | 3. Copy jobarchived/jobarchived.conf: |
---|
| 147 | |
---|
| 148 | > cp jobarchived/jobarchived.conf /etc/jobarchived.conf |
---|
| 149 | |
---|
[489] | 150 | 4. Copy jobarchived.py: |
---|
[222] | 151 | |
---|
| 152 | > cp jobarchived/jobarchived.py /usr/local/sbin/jobarchived.py |
---|
| 153 | |
---|
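The four jobarchived steps above depend on each other (the tables cannot be loaded before the database exists), so it may be useful to chain them and stop at the first failure. The sketch below assumes it is run from the unpacked source directory by a user who is allowed to create databases. The names `run`, `install_jobarchived`, and `DRY_RUN` are hypothetical and not part of Job Monarch; the commands themselves are the ones from the steps above.

```shell
#!/bin/sh
# Hypothetical wrapper; not part of Job Monarch.
# With DRY_RUN=1 the commands are only printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

install_jobarchived() {
    db=${1:-jobarchive}
    # '&&' chains the steps: each runs only if the previous one succeeded.
    run createdb "$db" &&
    run psql -f jobarchived/job_dbase.sql "$db" &&
    run cp jobarchived/jobarchived.conf /etc/jobarchived.conf &&
    run cp jobarchived/jobarchived.py /usr/local/sbin/jobarchived.py
}

# Usage:
#   DRY_RUN=1 install_jobarchived    # preview the commands
#   install_jobarchived              # run them
```

Previewing with `DRY_RUN=1` first shows exactly which commands would run before anything touches the database or the filesystem.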
| 154 | * web |
---|
| 155 | |
---|
| 156 | 1. Copy the Job Monarch Template to your Ganglia installation |
---|
| 157 | |
---|
| 158 | > cp -a web/templates/job_monarch /var/www/ganglia/templates |
---|
| 159 | |
---|
| 160 | 2. Copy the web interface files to the addon directory in Ganglia |
---|
| 161 | |
---|
[493] | 162 | > mkdir -p /var/www/ganglia/addons |
---|
[222] | 163 | > cp -a web/addons/job_monarch /var/www/ganglia/addons |
---|
| 164 | |
---|
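The web steps above assume Ganglia's web frontend lives in /var/www/ganglia. If yours is elsewhere, a small wrapper with the web root as a parameter avoids copying into the wrong place. This is a sketch under assumptions: the function name `install_web` is hypothetical, and the check for an existing `templates` directory is only a rough heuristic for "this looks like a Ganglia web root".

```shell
#!/bin/sh
# Hypothetical wrapper around the web install steps; not part of Job Monarch.
# Arguments: source tree, Ganglia web root (defaults to the path used above).
install_web() {
    src=$1
    ganglia_root=${2:-/var/www/ganglia}

    # Refuse to continue if the target does not look like a Ganglia web root;
    # otherwise 'cp -a' would create a directory literally named 'templates'.
    if [ ! -d "$ganglia_root/templates" ]; then
        echo "no templates directory in $ganglia_root" >&2
        return 1
    fi

    cp -a "$src/web/templates/job_monarch" "$ganglia_root/templates" || return 1
    mkdir -p "$ganglia_root/addons" || return 1
    cp -a "$src/web/addons/job_monarch" "$ganglia_root/addons" || return 1
}

# Usage, from the unpacked Job Monarch source directory:
#   install_web .                        # uses /var/www/ganglia
#   install_web . /srv/www/ganglia       # alternative (example) web root
```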
[221] | 165 | CONFIGURATION |
---|
| 166 | ============= |
---|
| 167 | |
---|
[222] | 168 | After installation each component requires additional configuration. |
---|
[221] | 169 | |
---|
[222] | 170 | * jobmond |
---|
| 171 | |
---|
| 172 | 1. Edit jobmond's config to reflect your settings: |
---|
| 173 | |
---|
| 174 | - In /etc/jobmond.conf |
---|
| 175 | |
---|
| 176 | ( see config comments for syntax and explanation ) |
---|
| 177 | |
---|
| 178 | * jobarchived |
---|
| 179 | |
---|
| 180 | 1. Edit jobarchived's config to reflect your settings: |
---|
| 181 | |
---|
| 182 | - In /etc/jobarchived.conf |
---|
| 183 | |
---|
| 184 | ( see config comments for syntax and explanation ) |
---|
| 185 | |
---|
| 186 | * web |
---|
| 187 | |
---|
| 188 | 1. Change Ganglia's web template to Job Monarch: |
---|
| 189 | |
---|
| 190 | - In /var/www/ganglia/conf.php: |
---|
| 191 | |
---|
| 192 | > $template_name = "job_monarch"; |
---|
| 193 | |
---|
| 194 | 2. Change Job Monarch's config to reflect your settings: |
---|
| 195 | |
---|
| 196 | - In /var/www/ganglia/addons/job_monarch/conf.php |
---|
| 197 | |
---|
| 198 | ( see config comments for syntax and explanation ) |
---|
| 199 | |
---|
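After editing Ganglia's conf.php, a quick check can confirm that the template setting from the web steps above is actually active and not commented out. This is a sketch only: the function name `template_is_job_monarch` is hypothetical, and the pattern only recognises the double-quoted form shown above.

```shell
#!/bin/sh
# Hypothetical check; not part of Job Monarch or Ganglia.
# Succeeds if the given conf.php assigns "job_monarch" to $template_name.
template_is_job_monarch() {
    conf=${1:-/var/www/ganglia/conf.php}
    # Match an uncommented assignment, allowing flexible whitespace.
    grep -Eq '^[[:space:]]*\$template_name[[:space:]]*=[[:space:]]*"job_monarch"[[:space:]]*;' "$conf"
}

# Usage:
#   template_is_job_monarch && echo "template set" || echo "template NOT set"
```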
[221] | 200 | START |
---|
| 201 | ===== |
---|
| 202 | |
---|
[222] | 203 | * jobmond |
---|
[221] | 204 | |
---|
[222] | 205 | The Job Monitor has to be run on a machine that is allowed to |
---|
| 206 | query the PBS/Torque server. |
---|
| 207 | If you have 'acl_hosts' enabled on your PBS/Torque server, |
---|
| 208 | make sure that jobmond's machine is listed in it. |
---|
[221] | 209 | |
---|
[222] | 210 | 1. Start the Job Monitor: |
---|
| 211 | |
---|
| 212 | > /usr/local/sbin/jobmond.py -c /etc/jobmond.conf |
---|
| 213 | |
---|
| 214 | * jobarchived |
---|
| 215 | |
---|
| 216 | 1. Start the Job Archiver: |
---|
| 217 | |
---|
| 218 | > /usr/local/sbin/jobarchived.py -c /etc/jobarchived.conf |
---|
| 219 | |
---|
| 220 | * web |
---|
| 221 | |
---|
| 222 | Doesn't require you to (re)start anything. |
---|
| 223 | ( make sure the Postgres database is running though ) |
---|
| 224 | |
---|
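The two daemons above are started by running the installed .py files directly, so each file must be executable and its config must be readable. A small pre-flight check, sketched here with a hypothetical function name `can_start`, can catch both problems before a start attempt. It is not part of Job Monarch and does not verify that the daemon keeps running afterwards.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of Job Monarch.
# Succeeds only if the daemon script is executable and the config is readable.
can_start() {
    script=$1
    conf=$2
    [ -x "$script" ] || { echo "not executable: $script" >&2; return 1; }
    [ -r "$conf" ] || { echo "not readable: $conf" >&2; return 1; }
}

# Usage:
#   can_start /usr/local/sbin/jobmond.py /etc/jobmond.conf &&
#       /usr/local/sbin/jobmond.py -c /etc/jobmond.conf
```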
[221] | 225 | CONTACT |
---|
| 226 | ======= |
---|
| 227 | |
---|
| 228 | To contact the author for anything from bugfixes to flame/hate mail: |
---|
| 229 | |
---|
[222] | 230 | * Ramon Bastiaans |
---|
| 231 | |
---|
[235] | 232 | <bastiaans ( a t ) sara ( d o t ) nl> |
---|