Debugging & Troubleshooting the ALICE LCG VO-Box

This document is a repository of common problems (with solutions) pertaining to the configuration and operation of the ALICE LCG VO-Box. For details on the VO-Box description and installation, please see the LCGVO-Box_HowTo and the ALICE-specific installation guide. For xrootd-over-DPM installation and debugging, please check the relevant HowTo.

 

 

General Notes

Please remember that the proxy you use to log in to the VO-Box via gsissh (the login proxy) is not the same as the one used to start AliEn services and submit jobs to the LCG (the user proxy). Please check HowToManageVOBoxProxies for definitions and many more details about proxy management on the VO-Box.

In order to run AliEn with your personal login proxy, do

 

VO-Box> alien [options] [commands]

and to use the registered renewable user proxy used for services do

 

VO-Box> lcgAlien.sh [options] [commands]

A very useful option to AliEn is --debug 10. Please see the AliEn User Guide for more information about AliEn itself.

 

How to find LCG, AliEn and JobAgent job IDs

Coming soon...

 

Where to find log files, configurations & such

 

  • All the AliEn services log files (ClusterMonitor.log, CE.log, SE.log, PackMan.log, MonaLisa.log), plus some other interesting stuff are on the VO-Box, in the ~alicesgm/alien-logs/ directory.

  • The JobAgent execution logs are in ~alicesgm/alien-logs/proc/<JobId>, where <JobId> is the AliEn JobId (not the LCG one, nor the JobAgent ID).

  • The DB with the correspondence between JobAgentId, LCG JobID and (when available) Alien JobId, plus some more information about currently running and queued JobAgents, is in ~alicesgm/alien-logs/CE.db/JOBAGENT; another file in that directory, ~alicesgm/alien-logs/CE.db/JOBIDS holds a list of all LCG JobIds ever submitted through the VO-Box.

  • The local configuration files are ~alicesgm/.alien/Environment (which is sourced just before the execution of any AliEn command) and, when relevant, ~alicesgm/.alien/alice.conf. See the VO-Box installation guide for details about those two files.

  • The central configuration repository is ldap://aliendb5.cern.ch:8389/o=alice,dc=cern,dc=ch; you can use ldapsearch or any LDAP browser (like GQ) to browse the entries.

 

Downloading the stdout/stderr of a JobAgent

To download the stdout/stderr of a given JobAgent, you first need to locate its LCG unique JobId, which is an HTTPS URL with the RB address and a GUID, e.g. https://gridit-rb-01.cnaf.infn.it:9000/ukVYsNY0m5IlN4cdtS88BA. This is printed in the CE log file ~alicesgm/alien-logs/CE.log when the job is first submitted, and in the ClusterMonitor log on a few other occasions, for example

 

May 23 23:21:04 info Agent https://gdrb11.cern.ch:9000/fTVE7zgkjQmvzA3NROBlXQ is dead!!

It is also saved in ~alicesgm/alien-logs/CE.db/JOBAGENT and ~alicesgm/alien-logs/CE.db/JOBIDS (see above). You can download the output of the JobAgent by issuing

 

VO-Box> edg-job-get-output [--dir <path>] <LCG JobId>

Example:

[alicesgm@alibox alicesgm]$ edg-job-get-output --dir /tmp https://egee-rb-01.cnaf.infn.it:9000/l137l6zjPlFM8gBH5TVhWw

Retrieving files from host: egee-rb-01.cnaf.infn.it ( for https://egee-rb-01.cnaf.infn.it:9000/l137l6zjPlFM8gBH5TVhWw )

*********************************************************************************
JOB GET OUTPUT OUTCOME

Output sandbox files for the job:
- https://egee-rb-01.cnaf.infn.it:9000/l137l6zjPlFM8gBH5TVhWw
have been successfully retrieved and stored in the directory:
/tmp/alicesgm_l137l6zjPlFM8gBH5TVhWw

*********************************************************************************
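When many JobAgents are involved, the LCG JobIds can be pulled out of a log file in one pass. The following is only a minimal sketch, assuming the IDs always have the form shown in the examples above (an HTTPS URL on port 9000 followed by an identifier). The function name extract_lcg_jobids is hypothetical and is not part of AliEn or LCG.

```shell
# Print each distinct LCG JobId found in the log text read from stdin.
# Assumption: IDs look like https://<rb-host>:9000/<identifier>,
# as in the examples quoted in this document.
extract_lcg_jobids() {
    grep -oE 'https://[A-Za-z0-9.-]+:9000/[A-Za-z0-9_-]+' | sort -u
}
```

For instance, running extract_lcg_jobids < ~alicesgm/alien-logs/CE.log would list candidate IDs to pass to edg-job-get-output.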

 

The jobs are owned by the owner of the user proxy that was used to submit them. If the subject of that user proxy is not the same as yours (i.e. the same as that of your login proxy), you will not have the privilege to download those jobs' output. Try to guess which proxy was used (very likely one of those in /opt/vobox/alice/proxy_repository/) and use that one:

 

VO-Box> env X509_USER_PROXY=<proxy file> edg-job-get-output <LCG JobId>

where <proxy file> will be something like /opt/vobox/alice/proxy_repository/+2fC+3dCH+2fO+3dCERN+2fOU+3dGRID+2fCN+3dPatricia+20Mendez+20Lorenzo-ALICE+2fCN+3dproxy.
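Judging only from the file names quoted in this document, the proxy file name appears to be the certificate subject with "/", "=" and the space character replaced by "+2f", "+3d" and "+20". The sketch below reproduces that mapping for those three characters; how other special characters are encoded is not known here, and the function name dn_to_proxy_file is hypothetical.

```shell
# Turn a certificate subject (DN) into the file name that seems to be used
# in the proxy repository. Assumption: only "/", "=" and " " are encoded,
# which is what the examples in this document show.
dn_to_proxy_file() {
    printf '%s\n' "$1" | sed -e 's|/|+2f|g' -e 's/=/+3d/g' -e 's/ /+20/g'
}
```

The result is a bare file name; it still has to be prefixed with the repository directory before being used as X509_USER_PROXY.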

 

Restarting the services

To restart all of the services at once, use the aliend startup script:

 

VO-Box> $VO_ALICE_SW_DIR/alien/etc/rc.d/init.d/aliend [stop|start]

To start a single service, use the script that sets the right user proxy:

 

VO-Box> $VO_ALICE_SW_DIR/alien/scripts/lcg/lcgAlien.sh Start<Service>

where <Service> can be any of CE, SE, Monitor, MonaLisa, PackMan.

Whenever restarting the services at any site, please check that the ALIEN_USER set in the ~alicesgm/.alien/Environment file matches one of the proxies registered in the VO-Box proxy DB. You can check them by issuing

 

VO-Box> vobox-proxy --vo alice --dn all query

Example:

[alicesgm@alibox alicesgm]$ vobox-proxy --vo alice --dn all query
*******************************************************************************
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Torino/CN=Stefano Bagnasco/CN=proxy/CN=proxy/CN=proxy
File: /opt/vobox/alice/proxy_repository/+2fC+3dIT+2fO+3dINFN+2fOU+3dPersonal+20Certificate+2fL+3dTorino+2fCN+3dStefano+20Bagnasco+2fCN+3dproxy
Proxy Expiration Trigger (seconds): 0
Myproxy Expiration Trigger (seconds): 0
Email Address: N/A
Proxy Time Left (seconds): 97055
Myproxy Time Left (seconds): 1069790
Status: OK
*******************************************************************************
*******************************************************************************
DN: /C=CH/O=CERN/OU=GRID/CN=Patricia Mendez Lorenzo-ALICE/CN=proxy/CN=proxy/CN=proxy
File: /opt/vobox/alice/proxy_repository/+2fC+3dCH+2fO+3dCERN+2fOU+3dGRID+2fCN+3dPatricia+20Mendez+20Lorenzo-ALICE+2fCN+3dproxy
Proxy Expiration Trigger (seconds): 0
Myproxy Expiration Trigger (seconds): 0
Email Address: N/A
Proxy Time Left (seconds): 95130
Myproxy Time Left (seconds): 750812
Status: OK
*******************************************************************************
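The query output can also be scanned automatically for proxies that are about to expire. This is a rough sketch that relies only on the "DN:" and "Proxy Time Left (seconds):" lines shown in the example above; the function name proxies_expiring_soon and the threshold are illustrative choices, not part of the vobox-proxy tool.

```shell
# Read the output of the proxy query on stdin and print the DN of every
# proxy whose "Proxy Time Left" is below the threshold (seconds) given as $1.
proxies_expiring_soon() {
    awk -v min="$1" '
        index($0, "DN: ") == 1 { dn = substr($0, 5) }
        index($0, "Proxy Time Left (seconds): ") == 1 {
            if ($NF + 0 < min + 0) print dn
        }'
}
```

A possible use is vobox-proxy --vo alice --dn all query | proxies_expiring_soon 86400, which would list the proxies with less than one day left.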

 

You can find the certificate subject corresponding to a given AliEn user name by issuing

 

VO-Box> ldapsearch -x -LLL -H ldap://aliendb5.cern.ch:8389 \
-b uid=<alien-username>,o=people,o=alice,dc=cern,dc=ch \
subject | perl -p -00 -e 's/\n //g'

The perl command at the end is needed because ldapsearch splits long lines: it rejoins each continuation line (which starts with a single space) to the line before it.
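If perl is not at hand, the same line joining can be done with awk. This is a sketch under the assumption that the input is standard LDIF, where a continuation line begins with exactly one space; the function name unfold_ldif is hypothetical.

```shell
# Rejoin LDIF lines that were folded across several lines.
# A line starting with a single space continues the previous line.
unfold_ldif() {
    awk '
        NR > 1 && /^ / { buf = buf substr($0, 2); next }
        NR > 1         { print buf }
                       { buf = $0 }
        END            { if (NR) print buf }'
}
```

It reads from stdin, so it can take the place of the perl stage at the end of the pipeline.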

If none of the proxies matches the ALIEN_USER, the simplest thing is to change it to your AliEn user name and register another proxy for yourself. For more details about proxy management on the VO-Box, please see the HowToManageVOBoxProxies document.

Important: If you change the ALIEN_USER in ~alicesgm/.alien/Environment, it is mandatory to restart all the services, in order to have them running with the appropriate credentials.

 

Useful AliEn commands

Here's a synopsis of some AliEn commands that can be useful while debugging the system:

 

  • top -id <jobid>
  • top [-status <status>|-all_status] [-host <host>]
  • registerOutput <jobId>
  • ps [options]
  • ps jdl <jobId>
  • ps trace [all|trace|...] <jobId>
  • spy
  • queueinfo [-jdl] [<queuename>]

 

Useful LCG commands

Here's a synopsis of some LCG/gLite commands that can be useful while debugging the system:

 

  • lcg-infosites --vo alice [ce|se|rb|...]
  • lcg-version

 


 

Problem: Cannot start the services

Symptoms: You cannot start any service: AliEn dies with messages similar to the following:

 

May 22 10:39:01 info Could not connect to database: Cannot log in to
DBI::ProxyServer: Refused by server: Error user aliprod is not authorised to log in to database at
/opt/exp_soft/alice/alien/lib/perl5/site_perl/5.8.7/RPC/PlClient.pm line 87.

Database: In _validateUser validation of user dainesea (as aliprod) failed.
May 22 10:39:01 error Not able to get database with SSH. You cant authenticate as aliprod

Diagnosis: This is probably a problem with the central services; a possible cause is a crash of the Authen server, but there are others. You can check the status of the central services by using the central services page of the ALICE MonALISA repository.

 


Symptoms: You cannot start any service. The aliend startup script complains about proxies

 

Doing Start for ALICE
Service MonitorNo matching DN found
Error setting proxy
ERROR 3 [FAILED]
Service SENo matching DN found
Error setting proxy
ERROR 3 [FAILED]
Service CENo matching DN found
Error setting proxy
ERROR 3 [FAILED]
Service PackManNo matching DN found
Error setting proxy
ERROR 3 [FAILED]
Service MonaLisaNo matching DN found
Error setting proxy
ERROR 3 [FAILED]

Diagnosis: There is no valid proxy registered in the VO-Box database that matches the ALIEN_USER set in ~alicesgm/.alien/Environment. Check it by issuing the following commands:

 

VO-Box> grep ALIEN_USER ~alicesgm/.alien/Environment
export ALIEN_USER=sbagnasc
VO-Box> vobox-proxy --vo alice --dn all query
*******************************************************************************
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Torino/CN=Stefano Bagnasco/CN=proxy/CN=proxy/CN=proxy
File: /opt/vobox/alice/proxy_repository/+2fC+3dIT+2fO+3dINFN+2fOU+3dPersonal+20Certificate+2fL+3dTorino+2fCN+3dStefano+20Bagnasco+2fCN+3dproxy
Proxy Expiration Trigger (seconds): 0
Myproxy Expiration Trigger (seconds): 0
Email Address: N/A
Proxy Time Left (seconds): 172773
Myproxy Time Left (seconds): 13040497
Status: OK
*******************************************************************************

Solution: If there is no proxy registered (i.e. the above command returns nothing) register one and check that the ALIEN_USER in ~alicesgm/.alien/Environment matches the owner of the proxy. Alternatively, if there is a valid proxy, change ALIEN_USER to the owner of that proxy.

 

In some cases, the proxy DB on the VO-Box (/opt/vobox/alice/_registerer_proxies.db) can become corrupt. Evidence for this is usually that the list of proxies returned by vobox-proxy --vo alice --dn all query does not match the files in /opt/vobox/alice/proxy_repository, i.e. there are some extra files in the repository.
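The comparison between the query output and the repository contents can be scripted. The following is a minimal sketch that assumes the "File:" lines have the format shown in the query examples in this document; the function name find_unregistered_proxy_files is hypothetical.

```shell
# $1: proxy repository directory. stdin: output of the proxy query.
# Prints every file in the repository that no "File:" line refers to.
find_unregistered_proxy_files() {
    repo_dir=$1
    # Base names of all files known to the proxy DB.
    registered=$(sed -n 's/^File: //p' | sed 's|.*/||')
    for proxy_file in "$repo_dir"/*; do
        [ -e "$proxy_file" ] || continue
        base=$(basename -- "$proxy_file")
        printf '%s\n' "$registered" | grep -qxF -- "$base" || printf '%s\n' "$proxy_file"
    done
}
```

An empty result means the two lists agree; any file printed is a candidate extra file of the kind described above.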

Solution: Clean up the repository and register a fresh proxy again:

 

VO-Box> rm /opt/vobox/alice/_registerer_proxies.db
VO-Box> rm /opt/vobox/alice/proxy_repository/*
VO-Box> vobox-proxy --vo alice --dn all register

 

Please check also HowToManageVOBoxProxies for (many) more details on proxy management for the VO-Box.

 


Symptoms: AliEn services do not start at boot time. There are messages in the boot log that complain about proxies:

 

ERROR: Couldn't find a valid proxy.
Use -debug for further information

Diagnosis: The script that starts the VO-Box services (/opt/vobox/alice/agents/alice-box-proxyrenewal) goes through the startup script in /opt/vobox/alice/start/ before starting the Proxy Renewal Service. So if the proxy expired while the service or the machine was down, the services cannot start.

Solution: The startup script will likely be changed; in the meantime, the services need to be started by hand after the boot sequence:

 

VO-Box> $VO_ALICE_SW_DIR/alien/etc/rc.d/init.d/aliend start

 


 

Problem: Services up but no jobs running

Symptoms: All the services are running on the VO-Box, but it does not submit any JobAgent, i.e. no ALICE jobs seen by the LRMS.

Diagnosis: One possibility is a problem with the CE GRIS (or BDII). This can be checked by looking at the CE log ~alicesgm/alien-logs/CE.log, where periodically a message like this can be found:

 

CPUs for t2-ce-02.lnl.infn.it:2119/jobmanager-lcglsf-alice: 40/160, (R:7, W:0)
Total for this VO Box: 40/160 (R:7, W:2, JA:9)

If all the numbers (with the possible exception of JA:) are 0, then it's highly likely that the VO-Box can't get these numbers from the CE GRIS. Another possibility is having all zeroes, except W:4444. This is a problem with the site LRMS.

Solution: Ask the LCG site admins to check the CE GRIS.

You can find out the name and status of CEs, SEs or other LCG services using the lcg-infosites command. For example (please try lcg-infosites --help for many more options):
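The two patterns described above can be recognised mechanically. The sketch below classifies a single "Total for this VO Box" line copied from CE.log, assuming the line has exactly the layout of the example; the function name diagnose_ce_totals and its messages are illustrative only.

```shell
# $1: a "Total for this VO Box: a/b (R:x, W:y, JA:z)" line from CE.log.
# Prints which of the two problem patterns described above it matches, if any.
diagnose_ce_totals() {
    ce_fields=$(printf '%s\n' "$1" | sed -n \
        's|.*: \([0-9][0-9]*\)/\([0-9][0-9]*\) (R:\([0-9][0-9]*\), W:\([0-9][0-9]*\).*|\1 \2 \3 \4|p')
    if [ -z "$ce_fields" ]; then
        echo "unrecognised line"
        return 1
    fi
    # Positional parameters: the two CPU counters, then R and W (JA is ignored).
    set -- $ce_fields
    if [ "$1" -eq 0 ] && [ "$2" -eq 0 ] && [ "$3" -eq 0 ]; then
        if [ "$4" -eq 0 ]; then
            echo "all zero: check the CE GRIS"
            return 0
        elif [ "$4" -eq 4444 ]; then
            echo "W:4444: check the site LRMS"
            return 0
        fi
    fi
    echo "no known problem pattern"
}
```

A line such as the healthy example above falls through to the last message.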

 

LCG-UI> lcg-infosites --vo alice ce | grep to.infn.it
60 5 43 43 0 t2-ce-01.to.infn.it:2119/jobmanager-lcgpbs-alice

 


Symptoms: The system submits no JobAgent to the RB, and there are messages like "No job matches your JDL" in ~alicesgm/alien-logs/CE.log. Furthermore, there is no package listing at the head of the same file.

Diagnosis: The PackMan is dead or somehow stuck. This can happen e.g. if the CE is restarted with a different proxy. In normal conditions, the site JDL which is printed out at the beginning of the CE logfile contains package info from the PackMan:

 

Coming soon...

The full site JDL is also printed out when you issue alien login. You can also check that PackMan is working by issuing (from the AliEn command line) packman list or packman listInstalled.

Solution: Restart the PackMan:

 

VO-Box> $VO_ALICE_SW_DIR/alien/scripts/lcg/lcgAlien.sh StartPackMan

 


Symptoms: The certificate files look OK, and the error in the ~alicesgm/alien-logs/CE.log is different:

 

Selected Virtual Organisation name (from --config-vo option): alice
Connecting to host gridit-rb-01.cnaf.infn.it, port 7772
Logging to host egee-rb-01.cnaf.infn.it, port 9002
**** Error: API_NATIVE_ERROR ****
Error while calling the "edg_wll_RegisterJobSync" native api
Unable to Register the Job:
https://egee-rb-02.cnaf.infn.it:9000/SnDmy_yTnj_bKLk32Bvo7g
to the LB logger at: egee-rb-01.cnaf.infn.it:9002
SSL Error (certificate verify failed)

The lines starting with 'Connecting to host' and 'Logging to host' show different host names (i.e. the Network Server and the Logging & Bookkeeping hosts are different); the actual error messages may be different.

Diagnosis: There is very likely a problem with the RB. You can try using a different one (see below). On top of that, the system is trying to send logging information to a different default host, which can confuse matters further.

Solution: The submission commands take their configuration (including the RB to use) from a file which by default is in /opt/edg/etc/alice/edg_wl_ui.conf. The file looks something like

 

[
VirtualOrganisation = "alice";
NSAddresses = "egee-rb-01.cnaf.infn.it:7772";
LBAddresses = "egee-rb-01.cnaf.infn.it:9000";
## HLR location is optional.
#HLRLocation = "fake HLR Location"
MyProxyServer = "myproxy.cern.ch"
]

Root privileges are needed to change the default file; however, it is possible to tell the system to use another file. Copy the relevant configuration file to some place where you can edit it, change it to point to another RB and instruct AliEn to pass it to edg-job-submit:

 

  1. Copy /opt/edg/etc/alice/edg_wl_ui.conf to e.g. /home/alicesgm/.alien/edg-wl-ui.conf

  2. Edit it to point to your favorite RB. Be sure to change both entries (NSAddresses and LBAddresses).

  3. Add to the ~alicesgm/.alien/alice.conf a line like

 

CE_EDG_WL_UI_CONF /home/alicesgm/.alien/edg-wl-ui.conf

  4. Restart the CE: $VO_ALICE_SW_DIR/alien/scripts/lcg/lcgAlien.sh StartCE

To check whether you're passing any option to edg-job-submit, look in the CE log ~alicesgm/alien-logs/CE.log for a line like

Jul 11 10:06:43 info Submitting to LCG with '--config-vo /home/alicesgm/.alien/edg-wl-ui.conf'.
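Since both NSAddresses and LBAddresses have to be changed, it can help to print the hosts the edited file actually points to before restarting the CE. This is a small sketch, assuming the file has the layout shown above (one quoted host:port value per entry); the function name rb_addresses is hypothetical.

```shell
# $1: path to a UI configuration file in the format shown above.
# Prints "NSAddresses <host>" and "LBAddresses <host>", without ports.
rb_addresses() {
    awk -F'"' '
        /^[ \t]*(NSAddresses|LBAddresses)[ \t]*=/ {
            key = $1
            gsub(/[ \t=]/, "", key)   # keep only the entry name
            host = $2
            sub(/:.*/, "", host)      # drop the port number
            print key, host
        }' "$1"
}
```

If the two printed hosts differ after editing, one of the two entries was probably left pointing at the old RB.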

To fix the logging problem, the LoggingDestination entry in the /opt/edg/etc/edg_wl_ui_cmd_var.conf file should be removed (this requires root access).

 


Symptoms: If you start the CE with the --debug 10 option, you may see messages such as "No match found". (Note: this discussion should be moved to the installation HowTo.)

 

Diagnosis: AliEn can't match the user's requirements to your site. Check the user's JDL file, as reported by ~alicesgm/alien-logs/CE.log, against "your CE jdl" printed at the beginning of CE.log. Look, for example, at the LocalDiskSpace entry, which may be less than the user's JDL requires.

Solution: Define in your ~/.alien/Environment

 

export ALIEN_WORKDIR=<some path>

where <some path> is a directory with a few GB of free space, enough to satisfy most requirements.

There are actually two different environment files. The ~/.alien/Environment, which we usually use for configuration on VO-Boxes, is local to the VO-Box and takes precedence over the $VO_ALICE_SW_DIR/alien/.Environment, which (since it resides on the shared partition) is also visible from the Worker Nodes.

 

 


Symptoms: All the services are running on the VO-Box and JobAgents are submitted (i.e. the LCG site sees many ALICE jobs), but they don't pick up any real jobs (i.e. they are about 3 minutes long and use next to no CPU; from MonaLisa, you see no AliEn jobs running on that site).

Diagnosis: The most likely problem (provided all the correct ports between the WNs, the VO-Box and the AliEn server are open) is that you forgot the -t 48 option while registering the MyProxy, so the proxy renewal gets short proxies. You can check this by downloading (with edg-job-get-output, see above) the output of one of the JobAgents. The file should contain something like

 

May 11 14:41:39 info We still have 99871 seconds to live
May 11 14:41:39 info The proxy is valid for 8927 seconds

If the second number is much less than the first, this is likely to be the problem.

From AliEn 2.10, an error message will be printed in the ~alicesgm/alien-logs/CE.log file if the renewed proxy is too short because of this problem.
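The comparison of the two numbers can be automated on a downloaded JobAgent output. The sketch below assumes the two log lines have exactly the wording shown above, and it flags any proxy that is shorter than the remaining lifetime, which is a stricter test than "much less". The function name check_proxy_lifetime is hypothetical.

```shell
# Read JobAgent output on stdin and compare the remaining lifetime with the
# proxy validity, using the two log lines quoted above.
check_proxy_lifetime() {
    awk '
        /We still have [0-9]+ seconds to live/ {
            for (i = 1; i < NF; i++) if ($i == "have") ttl = $(i + 1)
        }
        /The proxy is valid for [0-9]+ seconds/ {
            for (i = 1; i < NF; i++) if ($i == "for") valid = $(i + 1)
        }
        END {
            if (ttl == "" || valid == "") { print "lines not found"; exit 1 }
            if (valid + 0 < ttl + 0) print "proxy shorter than time to live"
            else print "proxy lifetime looks sufficient"
        }'
}
```

It only looks at the last occurrence of each line, so it is best run on the output of a single JobAgent.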

Solution: Re-register the MyProxy. From any LCG UI, issue

 

LCG-UI> myproxy-init -s myproxy.cern.ch -d -n -t 48 -c 720

Then log in on the VO-Box and issue

 

VO-Box> vobox-proxy --vo alice --proxy-safe 3600 --myproxy-safe 259200 --email <email> register

answering 'yes' (twice) when asked whether you want to overwrite the existing proxy. No AliEn services need restarting.

 


 

Problem: Cannot start the ClusterMonitor

Symptoms: In the ClusterMonitor.log you can see an error like:

 

4:45 error The Manager/Job returned an error: In InsertHost domain sinp.msu.ru not known in the LDAP server

Diagnosis: The central IS does not recognize your site as an ALICE site.

Solution: The domain of the site has to be added to the entry 'ou=LCG,ou=Sites,o=alice,dc=cern,dc=ch' in the LDAP server.

 


 

 

Problem: Jobs finish with status ERROR_V

Symptoms: A lot of jobs at your site finish with 'ERROR_V' (error validating).

Diagnosis: ERROR_V means that, from the point of view of AliEn, the job ran without problems. However, the job has a validation procedure, and that procedure found something wrong. For instance, this error can happen if the job tries to access a library that is not on the worker node.

Solution: The first thing to do is to check the output of any of those jobs. The jobs that finish with ERROR_V do not register the output in the catalogue. However, you can use the command registerOutput <jobid> to create the links and check what happened.