New VO troubleshooting

This is a collection of common errors, how to investigate them, and their most effective solutions. Consult this list only after you have made sure that your error is not caused by a trivial misconfiguration. Please reread the relevant installation wikis first.

 

 

 

Expired alien proxy

Your alien proxy might have expired - by default it lasts only 12 hours. You can make it last longer, say 100h with

alien proxy-init -valid 100:0

or have it renewed automatically by a crontab job like in this example

-bash-3.00$ crontab -l
0 0 * * * /home/pandaprod/installation/alien/bin/alien proxy-init -valid 100:0 >> /home/pandaprod/PANDA/proxy_refresh.log 2>&1

Use the command crontab -e to edit your crontab. If your certificate requires a passphrase, use the -pwstdin option:

-bash-3.00$ crontab -l
0 0 * * * /home/pandaprod/installation/alien/bin/alien proxy-init -valid 100:0 -pwstdin > /home/pandaprod/PANDA/proxy_refresh.log 2>&1

and store your certificate passphrase in clear text in the file ~/.alien/.my.secret (not very safe).
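Because the passphrase is stored unencrypted, it is worth at least making the file readable by its owner only. A minimal sketch, assuming the file location given above; the helper name store_secret is hypothetical and not part of AliEn:

```shell
# Hypothetical helper: write a passphrase read from stdin to FILE,
# readable and writable by the owner only.
store_secret() {
    secret_file=$1
    (
        umask 077                              # new files and directories are owner-only
        mkdir -p "$(dirname "$secret_file")"
        cat > "$secret_file"
    )
    chmod 600 "$secret_file"                   # also tighten a pre-existing file
}
```

Run it as store_secret ~/.alien/.my.secret, type the passphrase, and finish with Ctrl-D. This does not make the clear-text storage safe; it only keeps other local users from reading the file.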

 

No sites with domain ...

If you get the error 'No sites with domain XXX in panda', where 'XXX' is a domain other than your site's domain, add the following line to .alien/Environment:

ALIEN_DOMAIN=your.domain

where 'your.domain' is the internet domain of your site (e.g. gsi.de).

 

The server_step error

You may get a message like this even when everything seems to be in place (your certificate is valid and corresponds to your alien uid, usercert.pem and userkey.pem have the right permissions, and the alien proxy is valid):

 

An error occured during AliEn authentication:
Jul 2 14:38:49 info Could not connect to database: Not able to authenticate (1)
Database: In _validateUser validation of user pbarprod (as pbarprod) failed.

and then, retrying with the option -debug=5, you find this bit about a server_step error in the log:

 

Server said: server_step error. Error code is 0 Outis: 2502

There can be several sources for this error:

  • Your system time does not match the grid server time, and the mismatch throws off the GSS authentication process. Solution: start the ntpd service on your machine.
  • The CA that signed your certificate is not recognized by AliEn. Grep for it in the $ALIEN_ROOT/globus/share/certificates/*= files and notify the grid admin if it is not there.
  • Your certificate has expired. Check its validity, for example:

-bash-3.00$ export ALIEN_ROOT=/your/path/to/alien
-bash-3.00$ export LD_LIBRARY_PATH=$ALIEN_ROOT/api/lib:$ALIEN_ROOT/globus/lib:$ALIEN_ROOT/lib:$ALIEN_ROOT/lib64
-bash-3.00$ openssl verify -CApath $ALIEN_ROOT/globus/share/certificates/ ~/.globus/usercert.pem

If it is expired, obtain a new certificate from your local CA, preferably with an empty passphrase.
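The expiry date can also be checked on its own, without verifying the CA chain. A minimal sketch using the -checkend option of openssl x509; the helper name cert_valid_for is hypothetical, and the certificate path in the usage note is the one from the example above:

```shell
# Hypothetical helper: succeed if the certificate in FILE is still valid
# SECONDS from now (default 0, i.e. valid right now).
cert_valid_for() {
    cert_file=$1
    seconds=${2:-0}
    # -checkend exits non-zero if the certificate expires within SECONDS
    openssl x509 -noout -checkend "$seconds" -in "$cert_file" >/dev/null
}
```

For example, cert_valid_for ~/.globus/usercert.pem 604800 || echo "certificate expires within a week" gives an early warning, and openssl x509 -noout -enddate -in ~/.globus/usercert.pem prints the expiry date itself.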

 

Dialog missing

Sometimes during installation, the alien-installer script aborts with 'ERROR: Dialog could not be invoked. Is dialog installed ?'. Your terminal may then become unresponsive, so that you have to reset it. When trying

-bash-3.00$ which dialog
~/installation/alien/bin/dialog

you see that it uses the alien dialog binary.

The solution I found was to remove the alien path link (and possibly the installer.rc file) before running alien-installer:

-bash-3.00$ rm ~/installation/alien
-bash-3.00$ which dialog
/usr/bin/dialog
(-bash-3.00$ rm ~/.alien/installer.rc)
-bash-3.00$ ./alien-installer

 

Cannot do `initialize' in Term::ReadLine::Gnu

On versions newer than 2.14.27 at least on x86_64 machines, the alien session may fail with:

Cannot do `initialize' in Term::ReadLine::Gnu at /opt/alien/lib/perl5/site_perl/5.8.8/AliEn/UI/Catalogue.pm line 522

The solution seems to be to add the following lines to $ALIEN_ROOT/.Environment:

ALIEN_PATH=$ALIEN_ROOT/api/bin:$ALIEN_ROOT/globus/bin
ALIEN_LD_LIBRARY_PATH=$ALIEN_ROOT/api/lib:$ALIEN_ROOT/globus/lib:$ALIEN_ROOT/lib:$ALIEN_ROOT/lib64:$ALIEN_ROOT/lib/mysql
GLOBUS_LOCATION=$ALIEN_ROOT/globus
X509_CERT_DIR=$ALIEN_ROOT/globus/share/certificates
GAPI_LOCATION=$ALIEN_ROOT/api
MYPROXY_LOCATION=$ALIEN_ROOT/globus

 

Daemons listening on local ports

There is a bug in the SL default configuration, so please check that your /etc/hosts is correct. It should be like:

[protopop@panda ~]$ cat /etc/hosts
#127.0.0.1 localhost.localdomain localhost panda.gla.ac.uk panda # WRONG!
127.0.0.1 localhost.localdomain localhost # OK
# Add here
#'your IP number' 'your computer's name' 'your computer's shortname'
130.209.45.237 panda.gla.ac.uk panda # OK

The line commented out and labeled 'WRONG!' would make your daemons listen only on the loopback address, so they would not be reachable from outside.
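The check can be automated. A minimal sketch, assuming the usual hosts file format (address followed by names, '#' starting a comment); the helper name host_on_loopback is hypothetical, and only IPv4 loopback addresses (127.x.x.x) are considered:

```shell
# Hypothetical helper: succeed if HOST is listed on a 127.x.x.x line
# of the given hosts file, i.e. the misconfiguration described above.
host_on_loopback() {
    hosts_file=$1
    host=$2
    awk -v h="$host" '
        { sub(/#.*/, "") }                  # strip comments, fields are re-split
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++)
                if ($i == h) found = 1      # host name appears on a loopback line
        }
        END { exit(found ? 0 : 1) }
    ' "$hosts_file"
}
```

Typical use would be host_on_loopback /etc/hosts "$(hostname)" && echo "WRONG: host name is mapped to the loopback address".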

 

MonaLisa does not start anymore

The MonaLisa service does not start, and the log file shows something like:

The JVM Cannot take the lock for /tmp/panda/log/MonaLisa/.ml.lock because got an exception: java.io.IOException: No locks available
at sun.nio.ch.FileChannelImpl.lock0(Native Method)
at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:882)
at java.nio.channels.FileChannel.tryLock(FileChannel.java:962)
at lia.Monitor.JiniSerFarmMon.RegFarmMonitor.m(RegFarmMonitor.java)
at lia.Monitor.JiniSerFarmMon.RegFarmMonitor.main(RegFarmMonitor.java)
Possible reasons are:
1. ML already runns in this account and it is a bug in the ML_SER script. Please email developers: developers@monalisa.cern.ch
2. ML already runns in another account on this machine and has the same FARM_HOME defined in CMD/ml_env.
3. If the FARM_HOME defined in CMD/ml_env is a network file system (AFS, NFS, etc) it is possible that
the service already runs on another machine.
Please check the file /tmp/panda/log/MonaLisa/.ml.lock to see when was the service started last time

Sometimes this happens after an update, but this is of no consequence. Check the items in the list above. If it is neither 1 nor 2, then the problem is 3, i.e. your network file system (NFS) either does not provide the locking mechanism or has run out of lock handles. One solution is to disable the check by adding the following property to the local configuration file (or in LDAP, to the "addProperties" key of the MonaLisa service):

lia.Monitor.JiniSerFarmMon.RegFarmMonitor.disableLockChecking=true

 

Can not stop processes

If alien processes cannot be stopped with ~/aliend stop, you can force-kill all of them with something like:

kill -9 `ps aux | grep alien | grep -v ssh | awk -F" " '{print $2}'`
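kill -9 gives the processes no chance to clean up. A gentler variant, sketched here with hypothetical function names, sends the default SIGTERM first and uses kill -9 only on processes that survive. It uses the same matching rule as the command above (any process line containing "alien" but not "ssh"), so it shares its main caveat: unrelated processes whose command line contains "alien" are hit as well.

```shell
# Hypothetical helper: read `ps aux` output on stdin and print the PIDs
# of lines mentioning "alien", skipping ssh sessions, the awk filter
# itself, and the calling shell.
alien_pids() {
    awk -v self="$$" '/alien/ && !/ssh/ && !/awk/ && $2 != self { print $2 }'
}

# Hypothetical helper: ask politely first, force-kill the survivors.
stop_alien() {
    pids=$(ps aux | alien_pids)
    [ -n "$pids" ] || return 0
    kill $pids 2>/dev/null                  # SIGTERM
    sleep 5                                 # grace period, arbitrary value
    pids=$(ps aux | alien_pids)
    [ -z "$pids" ] || kill -9 $pids 2>/dev/null
    return 0
}
```

Running ps aux | alien_pids on its own first shows which PIDs would be affected before anything is killed.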

 

Recover lost files

You can recover lost catalogue entries by running a script on the machine where you have the SE that contains the PFN. Edit the variables indicated in the orphan.pl script (attached below) and run it with:

alien -x orphan.pl

ALERT! Only do this if you are sure you understand what you are doing.

           
           

SE Table

When a new SE is defined in LDAP, part of the newly registered information is automatically written into the database. The output of the command getSEio, executed as admin inside alien, is shown below. One can see that seStoragePath and seioDaemons are missing.

 

seQoS NULL
seName CBM::GSI::xrdlustre
seNumber 7
seUsedSpace 0
seStoragePath
seNumFiles 0
seioDaemons

To solve the problem, one can set these two values with the command setSEio. In principle there is a command addSE to add a new SE to the database, but I was not brave enough to test it when I already had an existing entry in the database.

Alex, could you please document your experience here ?

 

Packages Definition

To use the package manager packman, you need to create a package definition. The following description has to go into the /<organisation>/tags directory, where <organisation> has to be replaced by the name of your organisation. The file name has to be PackageDef.

dependencies varchar(255), executable varchar(255), description varchar(255), size int(10), md5sum int(1), setup varchar(255), unpack varchar(255), compile varchar(255), install varchar(255), pre_install varchar(255), post_install varchar(255), pre_rm varchar(255), post_rm varchar(255), config varchar(255), path varchar(255), shared int(1)

These entries are part of the entries in the table TcbmgridVPackageDef in the alien_system section of the MySQL database. I am not sure why the values are not created automatically when they are essentially default values. Maybe Pablo can comment on this.

 

The services running under https

  • Manage the config files

The original copies of httpd.conf and startup.pl are in
$ALIEN_ROOT/httpd/conf/httpd.conf
$ALIEN_ROOT/httpd/conf/startup.pl
1) If you want to change the configuration of only one service, find the Listen port of this service and modify the files in this path
$HOME/.alien/httpd/conf."ListenPort"
2) After installing a new version of AliEn, just delete the files in this path and restart the service
rm -rf $HOME/.alien/httpd/conf."ListenPort"
alien Start"ServiceName"

  • Solution for different errors
At the end of httpd.conf:
PerlPassEnv HOME
PerlPassEnv ALIEN_ROOT
PerlPassEnv ALIEN_DOMAIN
PerlPassEnv ALIEN_USER
PerlPassEnv ALIEN_HOME
PerlPassEnv ALIEN_ORGANISATION
PerlPassEnv GLOBUS_LOCATION
PerlPassEnv X509_USER_PROXY
PerlPassEnv X509_CERT_DIR
We use PerlPassEnv to pass specific shell environment variables to Apache. First mod_ssl checks the validity of the proxy certificate, then the Authen service checks the authorization of the user.
The following four environment variables are important for connecting to the security services:
PerlPassEnv ALIEN_USER
PerlPassEnv GLOBUS_LOCATION
PerlPassEnv X509_USER_PROXY
PerlPassEnv X509_CERT_DIR
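PerlPassEnv can only pass on variables that are set in the shell that starts the service, so it can help to check them beforehand. A minimal sketch; the helper name check_env_vars is hypothetical, and the variable names are the ones listed above:

```shell
# Hypothetical helper: print the name of every unset or empty variable
# and return non-zero if at least one is missing.
check_env_vars() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"            # indirect lookup, POSIX sh compatible
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return $missing
}
```

For the security-related variables this would be called as check_env_vars ALIEN_USER GLOBUS_LOCATION X509_USER_PROXY X509_CERT_DIR before starting the service.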
 
 
ERROR1: info Error reason: SSL negotiation failed!
1) Use openssl to check the validity of your X509 certificate
2) Use alien proxy-info to check the validity of your proxy certificate
3) Check X509_USER_PROXY
4) Check GLOBUS_LOCATION
5) Check X509_CERT_DIR
 
ERROR2: info You do not have permissions in /alice/user/******
That means your certificate does not match your user name
1) Check ALIEN_USER
2) Check the information published in LDAP
 
ERROR3: info Enter GRID pass phrase for this identity:
That means the service could not find the right path for your proxy certificate, so it tries to use your X509 certificate instead
1) Check $ENV{GLOBUS_LOCATION}
 
ERROR4: Error retrieving pidfile /tmp/pcalice57/log/httpdTransferManager.pid
That means you stopped the service without using the command
alien Stop"ServiceName"
1) Delete this pidfile
rm /tmp/pcalice57/log/httpdTransferManager.pid
2) alien Start"ServiceName"
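Before deleting a pidfile it is safer to confirm that the process it names is really gone. A minimal sketch, assuming the pidfile contains just the process ID (as Apache pidfiles normally do) and that the service runs under your own account; the helper name remove_stale_pidfile is hypothetical:

```shell
# Hypothetical helper: remove PIDFILE only if the process it names
# no longer exists; otherwise leave it alone and return non-zero.
remove_stale_pidfile() {
    pidfile=$1
    [ -f "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether we could signal the PID.
    # It also fails for processes owned by another user.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is still running; not removing $pidfile"
        return 1
    fi
    rm -f "$pidfile"
}
```

With the example above this would be remove_stale_pidfile /tmp/pcalice57/log/httpdTransferManager.pid, followed by the usual alien Start"ServiceName".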