ALICE Grid monitoring with MonALISA

 

Grid monitoring requirements

In a worldwide distributed system such as the ALICE Grid, you have to take into account a variety of platforms and software and, consequently, a variety of error conditions. To quickly understand what is happening in a system of this scale, monitoring should provide a global view of the entire system.

It is important to be able to correlate the evolution of various monitored parameters, across different Grid sites or in relation to the parameters of the central services. In addition, the monitoring system must be non-intrusive and accurate, and it should provide both a historical and a near real-time picture of the Grid's status and performance.

Based on these requirements, the MonALISA framework was chosen to monitor the entire AliEn Grid system.

 

Monitored parameters

Currently almost all AliEn components are monitored:

  • Central Services
    • Task Queue, Information Service, Optimizers, API Service etc.
  • Site Services
    • Job Agents, Cluster Monitor, Computing and Storage Elements
    • LCG Services (on VO Boxes)
  • Job status and resource usage
  • Job network traffic, inter- and intra-site

Various host parameters are also monitored, both for head nodes (where central and site services run) and for worker nodes (where jobs run).

 

Monitoring architecture in AliEn

AliEn monitoring closely follows the MonALISA architecture. Each AliEn service, including the Job Agent, is instrumented with ApMon (the Perl and C++ versions) and regularly sends monitoring data to the local MonALISA service running on the site. There, data from all the services, jobs and nodes is aggregated, and the site profile is generated with a resolution of 2 minutes. The local on-site MonALISA services keep a short, in-memory-only history of each received or aggregated parameter; all of this can be queried with a MonALISA GUI Client. Only the aggregated data is collected by the MonALISA Repository for long-term histories.
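The site-level aggregation step can be illustrated with a minimal Python sketch. It is not MonALISA code: the sample layout, the parameter names and the choice of a plain average are all assumptions made for illustration, and only the 2-minute window comes from the description of the site profile.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 120  # the 2-minute resolution of the site profile


def aggregate(samples, window=WINDOW_SECONDS):
    """Average raw samples per parameter and per time window.

    `samples` is an iterable of (timestamp_seconds, parameter, value)
    tuples; this layout is hypothetical. Returns a dict that maps
    (window_start, parameter) to the mean of the values in that window.
    """
    buckets = defaultdict(list)
    for timestamp, parameter, value in samples:
        # Align each timestamp to the start of its window.
        window_start = int(timestamp // window) * window
        buckets[(window_start, parameter)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical raw values, as services and jobs on a site might report them.
samples = [
    (0, "cpu_usage", 40.0),
    (60, "cpu_usage", 60.0),
    (130, "cpu_usage", 80.0),
    (30, "running_jobs", 10),
    (90, "running_jobs", 14),
]
site_profile = aggregate(samples)
```

In this picture, only `site_profile` would be forwarded for long-term storage, while the raw samples would stay in the short in-memory history on the site.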

 

[Figure: monitoring architecture in AliEn]

 

Deployment and configuration

For AliEn monitoring, MonALISA is packaged and prepared for installation by the AliEn Build and Test System. When you install AliEn, you can also install MonALISA simply by checking the monitor metapackage in the AliEn-Installer.

[Figure: alien-installer_monalisa.png]

Configuration files for MonALISA are generated automatically from the AliEn LDAP at startup. If no MonALISA entry for the site is present in LDAP, MonALISA will not start. Otherwise, MonALISA behaves like any other AliEn service:

  • Start it with alien StartMonaLisa
  • Stop it with alien StopMonaLisa
  • Check status with alien StatusMonaLisa
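For scripted use, the three commands above can be wrapped in a small Python helper. This is only a sketch: the helper names and the short action keys are hypothetical, it assumes the `alien` executable is on the PATH, and it makes no assumption about what the exit codes mean.

```python
import subprocess

# Short action keys (hypothetical) mapped to the AliEn sub-commands listed above.
ACTIONS = {
    "start": "StartMonaLisa",
    "stop": "StopMonaLisa",
    "status": "StatusMonaLisa",
}


def monalisa_command(action):
    """Return the argument list for the given action."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return ["alien", ACTIONS[action]]


def run_monalisa(action):
    """Run the command and return its exit code.

    Assumes `alien` is on the PATH; the meaning of the exit code
    is not specified here.
    """
    return subprocess.run(monalisa_command(action), check=False).returncode
```

For example, `run_monalisa("status")` would invoke `alien StatusMonaLisa` on a machine where AliEn is installed.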

 

Sample plots

Here are some of the key monitoring points accessible via the MonALISA ALICE Repository:

  • Job status - summaries per site and per user, errors, running and cumulative parameters
  • Job resource usage - consumed CPU, network traffic, disk usage etc.
  • AliEn and LCG services - current status of all AliEn and LCG services running on each site
  • VO Boxes - machine parameters for all head (master) nodes on sites, pledged resources etc.