htCondor-based AliEn site installation

htCondor installation

This howto assumes that you are using Scientific Linux (SL) 6.8 or later.

1. Go to the repositories folder:

cd /etc/yum.repos.d/

2. If the UMD repositories (base, testing, untested, updates) are present and enabled, disable them:

perl -pi -e 's/enabled=1/enabled=0/g' ./UMD-3-*.repo

yum update

3. Install htCondor repositories:

wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo

wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-development-rhe...

(cd /etc/pki/rpm-gpg/ && wget http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor)

yum update
yum install condor

This should install htCondor 8.5.5 or later; older versions are not covered by this howto.
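The version requirement above can be checked after installation. The following is a minimal sketch, not part of the original procedure: the helper name version_ge is hypothetical, and the way the version number is extracted assumes that the first line of the condor_version output has the form "$CondorVersion: 8.5.5 ... $".

```shell
#!/bin/bash
# Sketch: succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: the first line of the condor_version output looks like
#   $CondorVersion: 8.5.5 Jun 03 2016 BuildID: 369249 $
# so the version number is the second whitespace-separated field.
if command -v condor_version > /dev/null 2>&1
then
    installed=$(condor_version | awk 'NR == 1 { print $2 }')
    if version_ge "$installed" 8.5.5
    then
        echo "htCondor $installed is recent enough"
    else
        echo "htCondor $installed is older than 8.5.5" >&2
    fi
fi
```

Version sort is used rather than a plain string comparison so that, for example, 8.10.0 is correctly treated as newer than 8.5.5.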


htCondor configuration for AliEn

 

This configuration is for htCondor running JobRouter.

1. Go to the htCondor config folder:

cd /etc/condor

2. Create the certificate_mapfile file:

echo -e "KERBEROS ^([^@/]*)@(.*)\$ \1@\2 \n\
GSI \"/DC=ch/DC=cern/OU=computers/CN=(.*).*\" \1@cern.ch\n\
FS (.*) \1@fsauth" > ./certificate_mapfile

3. Create the local configuration for htCondor: 

touch config.d/01_alice_jobrouter.config

4. Add the following content to config.d/01_alice_jobrouter.config:

 

DAEMON_LIST = MASTER, SCHEDD, JOB_ROUTER, COLLECTOR

CERTIFICATE_MAPFILE = /etc/condor/certificate_mapfile
GSI_DAEMON_DIRECTORY = /etc/grid-security
GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem
GSI_DAEMON_KEY  = $(GSI_DAEMON_DIRECTORY)/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR = $(GSI_DAEMON_DIRECTORY)/certificates

SEC_CLIENT_AUTHENTICATION_METHODS = FS, GSI
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
SEC_DAEMON_AUTHENTICATION_METHODS = FS, GSI

COLLECTOR.ALLOW_ADVERTISE_MASTER = condor@fsauth/$(FULL_HOSTNAME)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(FULL_HOSTNAME)

GRIDMAP = /etc/grid-security/grid-mapfile

ALL_DEBUG = D_FULLDEBUG D_COMMAND
SCHEDD_DEBUG = D_FULLDEBUG

# note: the max jobs parameters below will need to be increased

JOB_ROUTER_DEFAULTS = \
   [ requirements=target.WantJobRouter is True; \
     EditJobInPlace = True; \
     MaxIdleJobs = 50; \
     MaxJobs = 200; \
     delete_WantJobRouter = true; \
     delete_JobLeaseDuration = True; \
     set_JobUniverse = 9; \
     set_remote_jobuniverse = 5; \
   ]

# note: it typically is better _not_ to use such static entries, but rather the command below

JOB_ROUTER_ENTRIES = \
   [ GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \
     eval_set_GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \
     name = "Site 4"; \
   ]

# configure a script to get the proper entries from the ALICE LDAP server

JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh

JOB_ROUTER_ENTRIES_REFRESH = 60

JOB_ROUTER_POLLING_PERIOD = 10

JOB_ROUTER_SCHEDD2_NAME = $(FULL_HOSTNAME)
JOB_ROUTER_SCHEDD2_POOL = $(FULL_HOSTNAME):9618
JOB_ROUTER_DEBUG = D_FULLDEBUG

GRIDMANAGER_DEBUG = D_FULLDEBUG
JOB_ROUTER_SCHEDD2_SPOOL=/var/lib/condor/spool

FRIENDLY_DAEMONS = condor@fsauth/$(FULL_HOSTNAME), root@fsauth/$(FULL_HOSTNAME), $(FULL_HOSTNAME)

ALLOW_DAEMON = $(FRIENDLY_DAEMONS)

SCHEDD.ALLOW_WRITE = $(FRIENDLY_DAEMONS), *@cern.ch/$(FULL_HOSTNAME)
ALLOW_DAEMON = $(ALLOW_DAEMON) $(FRIENDLY_DAEMONS)

# ========== FULL DEBUGS =============

GRIDMANAGER_DEBUG = D_FULLDEBUG

 

5. Restart htCondor:

condor_restart

 

6. Check that htCondor is running:

 pstree | grep condor

 

The output should look like this:

     |-condor_master-+-condor_collecto
     |               |-condor_job_rout
     |               |-condor_procd
     |               |-condor_schedd---condor_gridmana---condor_c-gahp---2*[condor_c-gahp_w]
     |               `-condor_shared_p


LDAP configuration for htCondor-based AliEn site

In the Environment section:

# whether it is necessary to use the job router service

USE_JOB_ROUTER=(1 | 0)

# htCondor resource explicitly defined for submission to the vanilla universe; if not set, the system default resource is selected
GRID_RESOURCE=condor your-CE.your-domain your-CE.your-domain:9619 

# routes list example
ROUTES_LIST=[GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; eval_set_GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; name = "Site 4"; ]  

# whether to use external cloud
USE_EXTERNAL_CLOUD=(1 | 0) 

# specify extra options for the condor_submit command. Example: to add extra ClassAds to the job description: SUBMIT_ARGS=-append "+TestClassAd=1"

SUBMIT_ARGS=<String>

 

Multiple routes example:

ROUTES_LIST = [ TargetUniverse = 5; name = "Route jobs to HTCondor"; ] [ GridResource = "batch pbs"; TargetUniverse = 9; name = "Route jobs to PBS"; ]
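When several CEs are involved, writing the ROUTES_LIST value by hand is error-prone. The following sketch generates it from a list of CE host names. The function name routes_list is hypothetical, port 9619 is taken from the examples above, and using the host name as the route name is an assumption made for illustration.

```shell
#!/bin/bash
# Sketch: print a ROUTES_LIST value with one route per CE host given
# as argument. Port 9619 follows the examples in this document; using
# the host name as the route name is an assumption for illustration.
routes_list() {
    local port host resource out
    port=9619
    out=
    for host in "$@"
    do
        resource="condor $host $host:$port"
        out="${out}[ GridResource = \"$resource\"; eval_set_GridResource = \"$resource\"; name = \"$host\"; ] "
    done
    # strip the trailing space left by the loop
    printf 'ROUTES_LIST=%s\n' "${out% }"
}

routes_list ce1.example.org ce2.example.org
```

The host names ce1.example.org and ce2.example.org are placeholders; replace them with the CEs of your site.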

 


Miscellaneous scripts for htCondor-based AliEn site

Script to fill the routes list from LDAP:

#!/bin/bash
# print HTCondor job routes obtained from the ALICE LDAP server
#
# example settings in /etc/condor/config.d:
#
# JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh
# JOB_ROUTER_ENTRIES_REFRESH = 600
#
# version 1.3 (2017/04/04)
# author: Maarten Litmaath

usage()
{
    echo "Usage: $0 [-n] [ FQHN ]" >&2
    exit 1
}

LOG=/tmp/job-routes-$(date '+%y%m%d').log
LDAP_ADDR=alice-ldap.cern.ch:8389
h=$(hostname -f)

case $1 in
-n)
    LOG=
    shift
esac

case $1 in
-*)
    usage
    ;;
?*.?*.?*)
    h=$1
    ;;
?*)
    usage
esac

f="(&(objectClass=AlienCE)(host=$h))"

#
# wrapped example output lines returned by the ldapsearch:
#

# environment: ROUTES_LIST=\
# [ "condor ce503.cern.ch ce503.cern.ch:9619" ] \
# [ "condor ce504.cern.ch ce504.cern.ch:9619"; optional extra stuff ] \
# [ "condor ce505.cern.ch ce505.cern.ch:9619" ] \
# [ "condor ce506.cern.ch ce506.cern.ch:9619" ]
#
# or a simpler format (the port currently is needed for the SAM VO feed):
#
# environment: ROUTES_LIST=\
# [ ce503.cern.ch:9619 ] \
# [ ce504.cern.ch:9619; optional extra stuff ] \
# [ ce505.cern.ch:9619 ] \
# [ ce506.cern.ch:9619 ]
#
# the next line may even be absent:
#
# environment: USE_EXTERNAL_CLOUD=0
#

if [ "x$LOG" = x ]
then
    LOG=/dev/null
else
    echo == $(date) >> $LOG
    exec 2>> $LOG
fi

ldapsearch -LLL -x -h $LDAP_ADDR -b o=alice,dc=cern,dc=ch "$f" environment |
    perl -p00e 's/\r?\n //g' | perl -ne '
        if (s/^environment: ROUTES_LIST *= *//i) {
            s/\[ *([^]" ]+)(:\d+) *([];])/[ "condor $1 $1$2" $3/g;
            s/\[ *([^]" ]+) *([];])/[ "condor $1 $1:9619" $2/g;
            s/\[ *[^"]*"/[ "/g;
            s/\[ *("[^"]+")/[ GridResource = $1; eval_set_GridResource = $1/g;
            $routes = $_;
            next;
        }
        if (s/^environment: USE_EXTERNAL_CLOUD *= *//i) {
            $extern = "; set_WantExternalCloud = True" if /1/;
            next;
        }
        END {
            $extern .= " ]";
            $routes =~ s/;? *]/$extern/eg;
            print $routes;
        }
    ' | tee -a $LOG
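To see what the script does with the simple "[ host:port ]" format without contacting the LDAP server, the core of the rewrite can be reproduced with sed. This is a simplified stand-in, not a replacement for the Perl filter: the function name expand_simple_routes is hypothetical, and only the first and fourth substitutions are mirrored, so entries without a port, entries that already carry a GridResource prefix and the USE_EXTERNAL_CLOUD handling are left out.

```shell
#!/bin/bash
# Sketch: expand the simple "[ host:port ]" route format into full
# GridResource entries. Only two of the Perl substitutions of the
# script are mirrored here; the other cases are not handled.
expand_simple_routes() {
    sed -E \
        -e 's/\[ *([^]" ]+)(:[0-9]+) *([];])/[ "condor \1 \1\2" \3/g' \
        -e 's/\[ *("[^"]+")/[ GridResource = \1; eval_set_GridResource = \1/g'
}

echo '[ ce503.cern.ch:9619 ] [ ce504.cern.ch:9619 ]' | expand_simple_routes
```

Each entry comes out with both GridResource and eval_set_GridResource set to "condor HOST HOST:PORT", which is the form used in the static JOB_ROUTER_ENTRIES example.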

Cleanup script that compresses and then removes old job logs and stdout/stderr files:

#!/bin/sh

cd ~/htcondor || exit

GZ_SIZE=10k
GZ_MINS=60
GZ_DAYS=2
RM_DAYS=7

STAMP=.stamp
prefix=cleanup-
log=$prefix`date +%y%m%d`
exec >> $log 2>&1 < /dev/null
echo === START `date`
for d in `ls -d 20??-??-??`
do
    (

        echo === $d
        stamp=$d/$STAMP
        [ -e $stamp ] || touch $stamp || exit
        if find $stamp -mtime +$RM_DAYS | grep . > /dev/null
        then
            echo removing...
            /bin/rm -r $d < /dev/null
            exit
        fi
        cd $d || exit
        find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \
            -size +$GZ_SIZE -mmin +$GZ_MINS \) -exec gzip -9v {} \;
    )
done

find $prefix* -mtime +$RM_DAYS -exec /bin/rm {} \;
echo === READY `date`

 

Crontab line for cleanup script:
37 * * * * /bin/sh /home/alicesgm/htcondor-cleanup.sh
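The cleanup script decides whether a directory is due for removal by testing the age of its stamp file with find and grep. The sketch below isolates that idiom so that it can be tried on a scratch file. The function name older_than_days is hypothetical, and the demonstration relies on touch -d, which is a GNU extension.

```shell
#!/bin/sh
# Sketch: succeed if FILE was last modified more than DAYS days ago,
# using the same "find -mtime | grep ." idiom as the cleanup script.
older_than_days() {
    find "$1" -mtime +"$2" | grep . > /dev/null
}

# Demonstration on a temporary file (touch -d is a GNU extension).
demo=$(mktemp)
older_than_days "$demo" 7 && echo "fresh file: old" || echo "fresh file: recent"
touch -d '10 days ago' "$demo"
older_than_days "$demo" 7 && echo "aged file: old" || echo "aged file: recent"
rm -f "$demo"
```

Note that find counts age in whole 24-hour periods, so -mtime +7 only matches files that are at least 8 days old.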
