htCondor-based AliEn site installation

htCondor installation

This how-to assumes that you are using Scientific Linux (SL) 6.8 or later.

1. Go to repositories folder:

cd /etc/yum.repos.d/

2. If UMD-3 repos (base, testing, untested, updates) are present and enabled, disable them:

perl -pi -e 's/enabled=1/enabled=0/g' ./UMD-3-*.repo

yum update

3. Install htCondor repositories:

wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo

wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-development-rhe...

cd /etc/pki/rpm-gpg/

wget http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor

rpm --import RPM-GPG-KEY-HTCondor

cd

yum update
yum install condor

This must install htCondor 8.5.5 or later.
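
The version requirement can be checked with a short script. The following is only a sketch: it assumes GNU `sort -V` is available and that `condor_version` prints a line of the form `$CondorVersion: X.Y.Z ...`. The helper name `version_ge` is made up for this example.

```shell
#!/bin/sh
# version_ge A B: succeed when version A is at least version B.
# Hypothetical helper; relies on GNU "sort -V" for version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v condor_version >/dev/null 2>&1; then
    # Assumed output format: "$CondorVersion: 8.5.5 <build date> ... $"
    installed=$(condor_version | awk '/CondorVersion/ { print $2 }')
    if version_ge "$installed" 8.5.5; then
        echo "htCondor $installed is recent enough"
    else
        echo "htCondor $installed is older than 8.5.5" >&2
    fi
fi
```

If the installed version turns out to be too old, it is worth checking which of the repositories from step 3 are enabled.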


htCondor configuration for AliEn

 

This configuration is for htCondor running JobRouter.

1. Go to htCondor config folder:

cd /etc/condor

2. NOT NEEDED - Create the certificate_mapfile:

echo -e "KERBEROS ^([^@/]*)@(.*)\$ \1@\2 \n\
GSI \"/DC=ch/DC=cern/OU=computers/CN=(.*).*\" \1@cern.ch\n\
FS (.*) \1@fsauth" > ./certificate_mapfile

3. Create the local configuration for htCondor: 

touch config.d/01_alice_jobrouter.config

4. Add the following content to config.d/01_alice_jobrouter.config:

 

DAEMON_LIST = MASTER, SCHEDD, JOB_ROUTER, COLLECTOR

CERTIFICATE_MAPFILE = /etc/condor/certificate_mapfile
GSI_DAEMON_DIRECTORY = /etc/grid-security
GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem
GSI_DAEMON_KEY  = $(GSI_DAEMON_DIRECTORY)/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR = $(GSI_DAEMON_DIRECTORY)/certificates

SEC_CLIENT_AUTHENTICATION_METHODS = FS, GSI
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
SEC_DAEMON_AUTHENTICATION_METHODS = FS, GSI

COLLECTOR.ALLOW_ADVERTISE_MASTER = condor@fsauth/$(FULL_HOSTNAME)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(FULL_HOSTNAME)

GRIDMAP = /etc/grid-security/grid-mapfile

ALL_DEBUG = D_FULLDEBUG D_COMMAND
SCHEDD_DEBUG = D_FULLDEBUG

# NOTE: the max jobs parameters below will need to be increased

# MaxJobs: typically ~10% more than the number of 1-core slots in the batch system

JOB_ROUTER_DEFAULTS = \
   [ requirements=target.WantJobRouter is True; \
     EditJobInPlace = True; \
     MaxIdleJobs = 50; \
     MaxJobs = 200; \
     delete_WantJobRouter = true; \
     delete_JobLeaseDuration = True; \
     set_JobUniverse = 9; \
     set_remote_jobuniverse = 5; \
   ]

# NOTE: it typically is better _not_ to use such static entries, but rather the command below

JOB_ROUTER_ENTRIES = \
   [ GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \
     eval_set_GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \
     name = "My cluster"; \
   ]

# configure a script to get the proper entries from the ALICE LDAP server

JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh

JOB_ROUTER_ENTRIES_REFRESH = 300

JOB_ROUTER_POLLING_PERIOD = 10

JOB_ROUTER_ROUND_ROBIN_SELECTION = True

JOB_ROUTER_SCHEDD2_NAME = $(FULL_HOSTNAME)
JOB_ROUTER_SCHEDD2_POOL = $(FULL_HOSTNAME):9618
JOB_ROUTER_DEBUG = D_FULLDEBUG

GRIDMANAGER_DEBUG = D_FULLDEBUG
JOB_ROUTER_SCHEDD2_SPOOL=/var/lib/condor/spool

FRIENDLY_DAEMONS = condor@fsauth/$(FULL_HOSTNAME), root@fsauth/$(FULL_HOSTNAME), $(FULL_HOSTNAME)

ALLOW_DAEMON = $(FRIENDLY_DAEMONS)

SCHEDD.ALLOW_WRITE = $(FRIENDLY_DAEMONS), *@cern.ch/$(FULL_HOSTNAME)
ALLOW_DAEMON = $(ALLOW_DAEMON) $(FRIENDLY_DAEMONS)

# ========== FULL DEBUGS =============

GRIDMANAGER_DEBUG = D_FULLDEBUG

# more stuff from the CERN VOBOXes

CONDOR_FSYNC = False
# to be increased (see MaxJobs above)
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000

GRIDMANAGER_JOB_PROBE_INTERVAL = 600

GRIDMANAGER_MAX_PENDING_REQUESTS = 500
GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
GRIDMANAGER_SELECTION_EXPR = (ClusterId % 5)
GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
GRIDMANAGER_DEBUG =
ALLOW_DAEMON = $(ALLOW_DAEMON), $(FULL_HOSTNAME), $(IP_ADDRESS), unauthenticated@unmapped
COLLECTOR.ALLOW_ADVERTISE_MASTER = $(COLLECTOR.ALLOW_ADVERTISE_MASTER), $(ALLOW_DAEMON)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(COLLECTOR.ALLOW_ADVERTISE_SCHEDD), $(ALLOW_DAEMON)
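
The MaxJobs note in the configuration above suggests a value about 10% larger than the number of 1-core slots. As a rough illustration, the arithmetic could be done as follows. The helper name `max_jobs` and the slot count of 180 are placeholders for this example, not values from any real site.

```shell
#!/bin/sh
# max_jobs SLOTS: print the slot count plus 10%, rounded up.
# Hypothetical helper; integer arithmetic only.
max_jobs() {
    echo $(( ($1 * 110 + 99) / 100 ))
}

# Placeholder: a batch system with 180 single-core slots.
max_jobs 180
```

The result would then be used for MaxJobs in JOB_ROUTER_DEFAULTS, with GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE raised accordingly.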

 

 

5. Restart htCondor:

service condor restart

 

6. Check that htCondor is running:

 pstree | grep condor

 

Initially the output should look like this:

     |-condor_master-+-condor_collecto
     |               |-condor_job_rout
     |               |-condor_procd
     |               |-condor_schedd
     |               `-condor_shared_p
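
The same check can be scripted instead of read off the pstree output. The sketch below assumes a Linux `ps` that accepts `-e -o comm=` and truncates command names to 15 characters, as in the listing above. The helper name `missing_daemons` is hypothetical.

```shell
#!/bin/sh
# Daemons shown in the pstree output above.
EXPECTED_DAEMONS="condor_master condor_collector condor_job_router condor_procd condor_schedd condor_shared_port"

# missing_daemons: read process names (one per line) on stdin and print
# every expected daemon that is absent. Hypothetical helper.
missing_daemons() {
    md_running=$(cat)
    for md_name in $EXPECTED_DAEMONS; do
        # Compare against the 15-character truncated name reported by ps.
        md_short=$(printf '%s' "$md_name" | cut -c1-15)
        printf '%s\n' "$md_running" | grep -qxF "$md_short" || echo "$md_name"
    done
}

# Prints nothing when all expected daemons are running.
ps -e -o comm= | missing_daemons
```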


LDAP configuration for htCondor-based AliEn site

In Environment section:

# whether it is necessary to use the job router service

USE_JOB_ROUTER=(1 | 0)

# htCondor resource explicitly defined for submission to the vanilla universe; otherwise the system default resource will be selected
GRID_RESOURCE=condor your-CE.your-domain your-CE.your-domain:9619 

# routes list example
ROUTES_LIST=[ your-ce01.your-domain:9619 ] [ your-ce02.your-domain:9619 ]  

# whether to use external cloud
USE_EXTERNAL_CLOUD=(1 | 0) 

# specify extra options for the condor_submit command. Example: add extra ClassAds to the job description:   SUBMIT_ARGS=-append "+TestClassAd=1"

SUBMIT_ARGS=<String>
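
To illustrate what the short ROUTES_LIST form stands for, the sketch below expands each `[ host:port ]` entry into the full `condor host host:port` grid resource string used in GRID_RESOURCE. It is a simplified illustration only: the helper name `expand_routes` is hypothetical, and entries that carry extra attributes after a semicolon are left untouched.

```shell
#!/bin/sh
# expand_routes: rewrite "[ host:port ]" entries read from stdin as
# '[ "condor host host:port" ]'. Hypothetical helper; simple entries only.
expand_routes() {
    sed 's/\[ *\([^]" :;]*\):\([0-9][0-9]*\) *]/[ "condor \1 \1:\2" ]/g'
}

# The routes list example from the Environment section above.
echo '[ your-ce01.your-domain:9619 ] [ your-ce02.your-domain:9619 ]' | expand_routes
```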

 


Miscellaneous scripts for htCondor-based AliEn site

Script to fill the routes list from LDAP:

#!/bin/bash
# print HTCondor job routes obtained from the ALICE LDAP server
#
# example settings in /etc/condor/config.d:
#
# JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh
# JOB_ROUTER_ENTRIES_REFRESH = 600
#
# version 1.3 (2017/04/04)
# author: Maarten Litmaath

usage()
{
    echo "Usage: $0 [-n] [ FQHN ]" >&2
    exit 1
}

LOG=/tmp/job-routes-$(date '+%y%m%d').log
LDAP_ADDR=alice-ldap.cern.ch:8389
h=$(hostname -f)

case $1 in
-n)
    LOG=
    shift
esac

case $1 in
-*)
    usage
    ;;
?*.?*.?*)
    h=$1
    ;;
?*)
    usage
esac

f="(&(objectClass=AlienCE)(host=$h))"

#
# wrapped example output lines returned by the ldapsearch:
#

# environment: ROUTES_LIST=\
# [ "condor ce503.cern.ch ce503.cern.ch:9619" ] \
# [ "condor ce504.cern.ch ce504.cern.ch:9619"; optional extra stuff ] \
# [ "condor ce505.cern.ch ce505.cern.ch:9619" ] \
# [ "condor ce506.cern.ch ce506.cern.ch:9619" ]
#
# or a simpler format (the port currently is needed for the SAM VO feed):
#
# environment: ROUTES_LIST=\
# [ ce503.cern.ch:9619 ] \
# [ ce504.cern.ch:9619; optional extra stuff ] \
# [ ce505.cern.ch:9619 ] \
# [ ce506.cern.ch:9619 ]
#
# the next line may even be absent:
#
# environment: USE_EXTERNAL_CLOUD=0
#

if [ "x$LOG" = x ]
then
    LOG=/dev/null
else
    echo == $(date) >> $LOG
    exec 2>> $LOG
fi

ldapsearch -LLL -x -h $LDAP_ADDR -b o=alice,dc=cern,dc=ch "$f" environment |
    perl -p00e 's/\r?\n //g' | perl -ne '
        if (s/^environment: ROUTES_LIST *= *//i) {
            s/\[ *([^]" ]+)(:\d+) *([];])/[ "condor $1 $1$2" $3/g;
            s/\[ *([^]" ]+) *([];])/[ "condor $1 $1:9619" $2/g;
            s/\[ *[^"]*"/[ "/g;
            s/\[ *("[^"]+")/[ GridResource = $1; eval_set_GridResource = $1/g;
            $routes = $_;
            next;
        }
        if (s/^environment: USE_EXTERNAL_CLOUD *= *//i) {
            $extern = "; set_WantExternalCloud = True" if /1/;
            next;
        }
        END {
            $extern .= " ]";
            $routes =~ s/;? *]/$extern/eg;
            print $routes;
        }
    ' | tee -a $LOG

Cleanup script that compresses and removes old job logs and stdout/stderr files:

#!/bin/sh

cd ~/htcondor || exit

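GZ_SIZE=10k   # compress files larger than this once they are more than GZ_MINS minutes old
GZ_MINS=60
GZ_DAYS=2     # compress any file more than this many days old
RM_DAYS=7     # remove directories and cleanup logs more than this many days old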

STAMP=.stamp
prefix=cleanup-
log=$prefix`date +%y%m%d`
exec >> $log 2>&1 < /dev/null
echo === START `date`
for d in `ls -d 20??-??-??`
do
    (

        echo === $d
        stamp=$d/$STAMP
        [ -e $stamp ] || touch $stamp || exit
        if find $stamp -mtime +$RM_DAYS | grep . > /dev/null
        then
            echo removing...
            /bin/rm -r $d < /dev/null
            exit
        fi
        cd $d || exit
        find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \
            -size +$GZ_SIZE -mmin +$GZ_MINS \) -exec gzip -9v {} \;
    )
done

find $prefix* -mtime +$RM_DAYS -exec /bin/rm {} \;
echo === READY `date`

 

Crontab line for cleanup script:
37 * * * * /bin/sh /home/alicesgm/htcondor-cleanup.sh
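
Before enabling the cron job, the selection can be previewed without compressing anything. The sketch below reuses the find expression and thresholds from the cleanup script, with `-print` in place of the gzip call. The helper name `gzip_candidates` is hypothetical.

```shell
#!/bin/sh
# Same thresholds as the cleanup script above.
GZ_SIZE=10k
GZ_MINS=60
GZ_DAYS=2

# gzip_candidates DIR: list the files in DIR that the cleanup script's
# find expression would select, without compressing anything.
gzip_candidates() {
    (
        cd "$1" || exit
        find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \
            -size +$GZ_SIZE -mmin +$GZ_MINS \) -print
    )
}

# Preview every dated directory under ~/htcondor, if any exist.
for d in "$HOME"/htcondor/20??-??-??; do
    if [ -d "$d" ]; then
        echo "=== $d"
        gzip_candidates "$d"
    fi
done
```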
