HTCondor-based AliEn site installation

HTCondor installation on the VOBOX

Note: this documentation describes how to configure a VOBOX so that it can submit ALICE jobs to HTCondor CEs. The VOBOX runs its own HTCondor services, which are independent of the HTCondor services of your CE and batch system.

This how-to assumes that you are using SL 6.8+ or CentOS/EL 7.5+.

Go to the repositories folder and install the HTCondor repository file:

cd /etc/yum.repos.d/

For SL6:

    wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.r…

For CentOS/EL 7:

    wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel7.r…

cd /etc/pki/rpm-gpg/

wget http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor

rpm --import RPM-GPG-KEY-HTCondor

cd

yum update
yum install condor

This must install HTCondor 8.5.5 or later.
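
The version requirement can be checked with a short script. The following is a minimal sketch, assuming GNU sort (for the -V option) and assuming that the first line of the condor_version output has the form "$CondorVersion: X.Y.Z ...". The helper name version_ge is made up for this example.

```shell
#!/bin/sh
# version_ge A B: succeed if version A >= version B (hypothetical helper).
# Relies on GNU sort -V for version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query HTCondor when it is actually installed.
if command -v condor_version >/dev/null 2>&1; then
    # Assumed output format: "$CondorVersion: X.Y.Z <build date> ... $"
    v=$(condor_version | awk '/CondorVersion/ { print $2 }')
    if version_ge "$v" 8.5.5; then
        echo "HTCondor $v is recent enough"
    else
        echo "HTCondor $v is older than 8.5.5" >&2
    fi
fi
```

If the script reports an older version, check that the HTCondor repository installed above is enabled before repeating the yum commands.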


HTCondor configuration for AliEn

This configuration is for HTCondor running the JobRouter.

1. Go to the HTCondor configuration folder:

cd /etc/condor

2. NOT NEEDED - (was: Create the certificate_mapfile)

3. Create the local configuration file for HTCondor:

touch config.d/01_alice_jobrouter.config

4. Add the following content to config.d/01_alice_jobrouter.config:


DAEMON_LIST = MASTER, SCHEDD, JOB_ROUTER, COLLECTOR

# the next line is needed as of recent HTCondor versions

COLLECTOR_HOST = $(FULL_HOSTNAME)

CERTIFICATE_MAPFILE = /etc/condor/certificate_mapfile
GSI_DAEMON_DIRECTORY = /etc/grid-security
GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem
GSI_DAEMON_KEY  = $(GSI_DAEMON_DIRECTORY)/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR = $(GSI_DAEMON_DIRECTORY)/certificates

SEC_CLIENT_AUTHENTICATION_METHODS = FS, GSI
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
SEC_DAEMON_AUTHENTICATION_METHODS = FS, GSI

COLLECTOR.ALLOW_ADVERTISE_MASTER = condor@fsauth/$(FULL_HOSTNAME)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(FULL_HOSTNAME)

GRIDMAP = /etc/grid-security/grid-mapfile

ALL_DEBUG = D_FULLDEBUG D_COMMAND
SCHEDD_DEBUG = D_FULLDEBUG

# NOTE: the max jobs parameters below will need to be increased

# MaxJobs: typically ~10% more than the number of 1-core slots in the batch system

JOB_ROUTER_DEFAULTS = \
   [ requirements=target.WantJobRouter is True; \
     EditJobInPlace = True; \
     MaxIdleJobs = 50; \
     MaxJobs = 200; \
     delete_WantJobRouter = true; \
     delete_JobLeaseDuration = True; \
     set_JobUniverse = 9; \
     set_remote_jobuniverse = 5; \
   ]

# NOTE: it typically is better _not_ to use such static entries, but rather the command below

#JOB_ROUTER_ENTRIES = \
#   [ GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \
#     eval_set_GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \
#     name = "My cluster"; \
#   ]

# configure a script to get the proper entries from the ALICE LDAP server (provided below)

JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh

JOB_ROUTER_ENTRIES_REFRESH = 300

JOB_ROUTER_POLLING_PERIOD = 10

JOB_ROUTER_ROUND_ROBIN_SELECTION = True

JOB_ROUTER_SCHEDD2_NAME = $(FULL_HOSTNAME)
JOB_ROUTER_SCHEDD2_POOL = $(FULL_HOSTNAME):9618
JOB_ROUTER_DEBUG = D_FULLDEBUG

GRIDMANAGER_DEBUG = D_FULLDEBUG
JOB_ROUTER_SCHEDD2_SPOOL=/var/lib/condor/spool

FRIENDLY_DAEMONS = condor@fsauth/$(FULL_HOSTNAME), root@fsauth/$(FULL_HOSTNAME), $(FULL_HOSTNAME)

ALLOW_DAEMON = $(FRIENDLY_DAEMONS)

SCHEDD.ALLOW_WRITE = $(FRIENDLY_DAEMONS), *@cern.ch/$(FULL_HOSTNAME)
ALLOW_DAEMON = $(ALLOW_DAEMON) $(FRIENDLY_DAEMONS)

# ========== FULL DEBUGS =============

GRIDMANAGER_DEBUG = D_FULLDEBUG

# more stuff from the CERN VOBOXes

CONDOR_FSYNC = False
# to be increased (see MaxJobs above)
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000

GRIDMANAGER_JOB_PROBE_INTERVAL = 600

GRIDMANAGER_MAX_PENDING_REQUESTS = 500
GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
# 2 should be enough already
GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
GRIDMANAGER_DEBUG =
ALLOW_DAEMON = $(ALLOW_DAEMON), $(FULL_HOSTNAME), $(IP_ADDRESS), unauthenticated@unmapped
COLLECTOR.ALLOW_ADVERTISE_MASTER = $(COLLECTOR.ALLOW_ADVERTISE_MASTER), $(ALLOW_DAEMON)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(COLLECTOR.ALLOW_ADVERTISE_SCHEDD), $(ALLOW_DAEMON)

DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 0

GSI_SKIP_HOST_CHECK = true
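
The comments in the configuration suggest setting MaxJobs to roughly 10% more than the number of 1-core slots in the batch system. The arithmetic can be sketched as follows. The helper name max_jobs_for_slots is made up for this example, rounding up is an assumption of the sketch, and the slot count has to come from your own batch system.

```shell
#!/bin/sh
# max_jobs_for_slots N: print N plus 10%, rounded up (hypothetical helper).
# Uses integer arithmetic only: ceil(N * 11 / 10).
max_jobs_for_slots() {
    echo $(( ($1 * 11 + 9) / 10 ))
}

# Example: a batch system with 2000 single-core slots.
max_jobs_for_slots 2000    # prints 2200
```

The resulting number is a starting point for MaxJobs in JOB_ROUTER_DEFAULTS. The configuration comments also ask for GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE to be increased along with it.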

 

5. Restart HTCondor now and enable it to start automatically at boot time:

service condor restart

chkconfig condor on

 

6. Check that HTCondor is running:

 pstree | grep condor

Initially the output should look like this:

     |-condor_master-+-condor_collecto
     |               |-condor_job_rout
     |               |-condor_procd
     |               |-condor_schedd
     |               `-condor_shared_p
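
The same check can be scripted. The sketch below reads a process listing on standard input and prints every daemon from DAEMON_LIST that does not appear in it. Names are compared on their first 15 characters only, because the listing above shows them truncated to that length. The helper name missing_daemons is made up for this example.

```shell
#!/bin/sh
# missing_daemons: read a process listing on stdin and print the expected
# HTCondor daemons that are absent from it (hypothetical helper).
missing_daemons() {
    procs=$(cat)
    for d in condor_master condor_collector condor_job_router condor_schedd; do
        # Process names show up truncated to 15 characters.
        short=$(printf '%s' "$d" | cut -c1-15)
        printf '%s\n' "$procs" | grep -q "$short" || echo "$d"
    done
}

# Usage on the VOBOX:
#   pstree | missing_daemons
```

No output means that all four daemons were found.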


LDAP and VOBOX configuration for HTCondor-based AliEn site

In the Environment section for the AliEn CE in the LDAP server:

# whether it is necessary to use the job router service

USE_JOB_ROUTER=(1 | 0)

# HTCondor resource explicitly defined for submission to the vanilla universe; otherwise the system default resource will be selected
GRID_RESOURCE=condor your-CE.your-domain your-CE.your-domain:9619

# routes list example
ROUTES_LIST=[ your-ce01.your-domain:9619 ] [ your-ce02.your-domain:9619 ]

# whether to use an external cloud
USE_EXTERNAL_CLOUD=(1 | 0)

# specify extra options for the condor_submit command. Example: add extra ClassAds to the job description: SUBMIT_ARGS=-append "+TestClassAd=1"

SUBMIT_ARGS=<String>
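
For sites with several CEs, the ROUTES_LIST value can be generated rather than typed by hand. The following is a minimal sketch that prints a line in the simple host:port format shown above. The helper name routes_list is made up for this example, and port 9619 is taken from the examples in this document.

```shell
#!/bin/sh
# routes_list HOST...: print a ROUTES_LIST line with one [ host:9619 ]
# entry per CE host name given (hypothetical helper).
routes_list() {
    port=9619
    out=
    for ce in "$@"; do
        # Separate entries with a single space, no leading space.
        out="$out${out:+ }[ $ce:$port ]"
    done
    printf 'ROUTES_LIST=%s\n' "$out"
}

routes_list your-ce01.your-domain your-ce02.your-domain
```

The printed line can then be entered in the Environment section of the AliEn CE in LDAP.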

 

In ~/.alien/Environment on the VOBOX:

d=$HOME/htcondor
mkdir -p $d

export HTCONDOR_LOG_PATH=$d

 

Mind the firewall settings on the VOBOX:

https://alien.web.cern.ch/content/alice-vo-box-setup-and-configuration (section 1.1)

 


Miscellaneous scripts for HTCondor-based AliEn site

Script to fill the routes list from LDAP:

#!/bin/bash
# print HTCondor job routes obtained from the ALICE LDAP server
#
# example settings in /etc/condor/config.d:
#
# JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh
# JOB_ROUTER_ENTRIES_REFRESH = 600
#
# version 1.3 (2017/04/04)
# author: Maarten Litmaath

usage()
{
    echo "Usage: $0 [-n] [ FQHN ]" >&2
    exit 1
}

LOG=/tmp/job-routes-$(date '+%y%m%d').log
LDAP_ADDR=alice-ldap.cern.ch:8389
h=$(hostname -f)

case $1 in
-n)
    LOG=
    shift
esac

case $1 in
-*)
    usage
    ;;
?*.?*.?*)
    h=$1
    ;;
?*)
    usage
esac

f="(&(objectClass=AlienCE)(host=$h))"

#
# wrapped example output lines returned by the ldapsearch:
#

# environment: ROUTES_LIST=\
# [ "condor ce503.cern.ch ce503.cern.ch:9619" ] \
# [ "condor ce504.cern.ch ce504.cern.ch:9619"; optional extra stuff ] \
# [ "condor ce505.cern.ch ce505.cern.ch:9619" ] \
# [ "condor ce506.cern.ch ce506.cern.ch:9619" ]
#
# or a simpler format (the port currently is needed for the SAM VO feed):
#
# environment: ROUTES_LIST=\
# [ ce503.cern.ch:9619 ] \
# [ ce504.cern.ch:9619; optional extra stuff ] \
# [ ce505.cern.ch:9619 ] \
# [ ce506.cern.ch:9619 ]
#
# the next line may even be absent:
#
# environment: USE_EXTERNAL_CLOUD=0
#

if [ "x$LOG" = x ]
then
    LOG=/dev/null
else
    echo == $(date) >> $LOG
    exec 2>> $LOG
fi

ldapsearch -LLL -x -h $LDAP_ADDR -b o=alice,dc=cern,dc=ch "$f" environment |
    perl -p00e 's/\r?\n //g' | perl -ne '
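        if (s/^environment: ROUTES_LIST *= *//i) {
            # host:port entries become a quoted condor host host:port string
            s/\[ *([^]" ]+)(:\d+) *([];])/[ "condor $1 $1$2" $3/g;
            # bare host entries get the default port 9619
            s/\[ *([^]" ]+) *([];])/[ "condor $1 $1:9619" $2/g;
            # drop anything between the bracket and the opening quote
            s/\[ *[^"]*"/[ "/g;
            # turn each quoted resource into the two GridResource attributes
            s/\[ *("[^"]+")/[ GridResource = $1; eval_set_GridResource = $1/g;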
            $routes = $_;
            next;
        }
        if (s/^environment: USE_EXTERNAL_CLOUD *= *//i) {
            $extern = "; set_WantExternalCloud = True" if /1/;
            next;
        }
        END {
            $extern .= " ]";
            $routes =~ s/;? *]/$extern/eg;
            print $routes;
        }
    ' | tee -a $LOG

Cleanup script that compresses and then removes old job logs and stdout/stderr files:

#!/bin/sh

cd ~/htcondor || exit

GZ_SIZE=10k
GZ_MINS=60
GZ_DAYS=2
RM_DAYS=7

STAMP=.stamp
prefix=cleanup-
log=$prefix`date +%y%m%d`
exec >> $log 2>&1 < /dev/null
echo === START `date`
for d in `ls -d 20??-??-??`
do
    (

        echo === $d
        stamp=$d/$STAMP
        [ -e $stamp ] || touch $stamp || exit
        if find $stamp -mtime +$RM_DAYS | grep . > /dev/null
        then
            echo removing...
            /bin/rm -r $d < /dev/null
            exit
        fi
        cd $d || exit
        find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \
            -size +$GZ_SIZE -mmin +$GZ_MINS \) -exec gzip -9v {} \;
    )
done

find $prefix* -mtime +$RM_DAYS -exec /bin/rm {} \;
echo === READY `date`

 

Crontab line for the cleanup script:
37 * * * * /bin/sh $HOME/htcondor-cleanup.sh
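
Before enabling the cron job, it can be useful to preview which files the cleanup script would compress. The following is a minimal sketch that reuses the find expression and thresholds from the cleanup script but only prints the matches. The helper name list_gzip_candidates is made up for this example, and the -type f test is an addition of this sketch.

```shell
#!/bin/sh
# Thresholds copied from the cleanup script above.
GZ_SIZE=10k
GZ_MINS=60
GZ_DAYS=2

# list_gzip_candidates DIR: print the files in DIR that the cleanup script
# would compress, without modifying anything (hypothetical helper).
# -type f is an addition of this sketch; it skips directories.
list_gzip_candidates() {
    (
        cd "$1" || exit
        find . -type f ! -name '.*' ! -name '*.gz' \( -mtime +$GZ_DAYS -o \
            -size +$GZ_SIZE -mmin +$GZ_MINS \) -print
    )
}

# Usage, with one of the dated subdirectories the cleanup script loops over:
#   list_gzip_candidates ~/htcondor/2017-04-04
```

Files are listed when they are older than GZ_DAYS days, or larger than GZ_SIZE and untouched for more than GZ_MINS minutes.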