HTCondor installation on the VOBOX
Note: this documentation describes how to configure a VOBOX so that it can submit ALICE jobs to HTCondor CEs. The VOBOX runs its own HTCondor services, which are independent of the HTCondor services for your CE and batch system.
This howto assumes that you are using SL 6.8+ or CentOS/EL 7.5+.
Go to the repositories folder and install the HTCondor repositories:
cd /etc/yum.repos.d/
For SL6:
wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.r…
For CentOS/EL 7:
wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel7.r…
cd /etc/pki/rpm-gpg/
wget http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
rpm --import RPM-GPG-KEY-HTCondor
cd
yum update
yum install condor
This should install HTCondor 8.5.5 or later.
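To confirm that the installed version meets this minimum, the version string can be compared in the shell. The following is a minimal sketch: the helper name version_ge is hypothetical, it relies on "sort -V" from GNU coreutils, and it assumes that condor_version prints a first line of the form "$CondorVersion: X.Y.Z ...".

```shell
# version_ge A B: succeed when version A is equal to or newer than version B.
# Hypothetical helper; relies on "sort -V" (GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=8.5.5
if command -v condor_version >/dev/null 2>&1; then
    # Assumed output format: "$CondorVersion: 8.8.12 Nov 24 2020 ... $"
    installed=$(condor_version | sed -n 's/^\$CondorVersion: \([0-9.]*\).*/\1/p')
    if [ -z "$installed" ]; then
        echo "could not parse the condor_version output" >&2
    elif version_ge "$installed" "$required"; then
        echo "HTCondor $installed is recent enough"
    else
        echo "HTCondor $installed is older than $required" >&2
    fi
fi
```

The check is skipped silently when condor_version is not in the PATH, so the snippet can be sourced safely before the package is installed.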
HTCondor configuration for AliEn
This configuration is for HTCondor running the JobRouter.
1. Go to the HTCondor config folder:
cd /etc/condor
2. NOT NEEDED - (was: Create the certificate_mapfile)
3. Create the local configuration file for HTCondor:
touch config.d/01_alice_jobrouter.config
4. Add the following content to config.d/01_alice_jobrouter.config :
DAEMON_LIST = MASTER, SCHEDD, JOB_ROUTER, COLLECTOR
# the next line is needed since recent HTCondor versions
COLLECTOR_HOST = $(FULL_HOSTNAME)
CERTIFICATE_MAPFILE = /etc/condor/certificate_mapfile
GSI_DAEMON_DIRECTORY = /etc/grid-security
GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem
GSI_DAEMON_KEY = $(GSI_DAEMON_DIRECTORY)/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR = $(GSI_DAEMON_DIRECTORY)/certificates
SEC_CLIENT_AUTHENTICATION_METHODS = FS, GSI
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
SEC_DAEMON_AUTHENTICATION_METHODS = FS, GSI
COLLECTOR.ALLOW_ADVERTISE_MASTER = condor@fsauth/$(FULL_HOSTNAME)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(FULL_HOSTNAME)
GRIDMAP = /etc/grid-security/grid-mapfile
ALL_DEBUG = D_FULLDEBUG D_COMMAND
SCHEDD_DEBUG = D_FULLDEBUG
# NOTE: the max jobs parameters below will need to be increased
# MaxJobs: typically ~10% more than the number of 1-core slots in the batch system
JOB_ROUTER_DEFAULTS = \
[ requirements=target.WantJobRouter is True; \
EditJobInPlace = True; \
MaxIdleJobs = 50; \
MaxJobs = 200; \
delete_WantJobRouter = true; \
delete_JobLeaseDuration = True; \
set_JobUniverse = 9; \
set_remote_jobuniverse = 5; \
]
# NOTE: it typically is better _not_ to use such static entries, but rather the command below
#JOB_ROUTER_ENTRIES = \
# [ GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \
# eval_set_GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \
# name = "My cluster"; \
# ]
# configure a script to get the proper entries from the ALICE LDAP server (provided below)
JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh
JOB_ROUTER_ENTRIES_REFRESH = 300
JOB_ROUTER_POLLING_PERIOD = 10
JOB_ROUTER_ROUND_ROBIN_SELECTION = True
JOB_ROUTER_SCHEDD2_NAME = $(FULL_HOSTNAME)
JOB_ROUTER_SCHEDD2_POOL = $(FULL_HOSTNAME):9618
JOB_ROUTER_DEBUG = D_FULLDEBUG
GRIDMANAGER_DEBUG = D_FULLDEBUG
JOB_ROUTER_SCHEDD2_SPOOL=/var/lib/condor/spool
FRIENDLY_DAEMONS = condor@fsauth/$(FULL_HOSTNAME), root@fsauth/$(FULL_HOSTNAME), $(FULL_HOSTNAME)
ALLOW_DAEMON = $(FRIENDLY_DAEMONS)
SCHEDD.ALLOW_WRITE = $(FRIENDLY_DAEMONS), *@cern.ch/$(FULL_HOSTNAME)
ALLOW_DAEMON = $(ALLOW_DAEMON) $(FRIENDLY_DAEMONS)
# ========== FULL DEBUGS =============
GRIDMANAGER_DEBUG = D_FULLDEBUG
# more stuff from the CERN VOBOXes
CONDOR_FSYNC = False
# to be increased (see MaxJobs above)
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000
GRIDMANAGER_JOB_PROBE_INTERVAL = 600
GRIDMANAGER_MAX_PENDING_REQUESTS = 500
GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
# 2 should be enough already
GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
GRIDMANAGER_DEBUG =
ALLOW_DAEMON = $(ALLOW_DAEMON), $(FULL_HOSTNAME), $(IP_ADDRESS), unauthenticated@unmapped
COLLECTOR.ALLOW_ADVERTISE_MASTER = $(COLLECTOR.ALLOW_ADVERTISE_MASTER), $(ALLOW_DAEMON)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(COLLECTOR.ALLOW_ADVERTISE_SCHEDD), $(ALLOW_DAEMON)
DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 0
GSI_SKIP_HOST_CHECK = true
5. Restart HTCondor now and enable it to start automatically at boot time:
service condor restart
chkconfig condor on
6. Check that HTCondor is running:
pstree | grep condor
Initially the output should look like this:
|-condor_master-+-condor_collecto
|               |-condor_job_rout
|               |-condor_procd
|               |-condor_schedd
|               `-condor_shared_p
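The same check can be scripted. The sketch below is only an illustration: the function name missing_daemons is hypothetical, and the list of expected daemons is taken from DAEMON_LIST in the configuration above. Because the kernel truncates process names to 15 characters (which is why pstree shows condor_collecto and condor_job_rout), the names are compared after the same truncation.

```shell
# Print every daemon from DAEMON_LIST that is absent from the process names
# read on standard input (one name per line). Hypothetical helper.
missing_daemons() {
    procs=$(cat)
    for d in condor_master condor_schedd condor_job_router condor_collector
    do
        # process names are reported truncated to 15 characters
        short=$(printf '%s' "$d" | cut -c1-15)
        printf '%s\n' "$procs" | grep -qxF "$short" || echo "$d"
    done
}

# Usage on the VOBOX:
#   ps -eo comm= | missing_daemons
```

No output means that all four daemons are present. On CentOS/EL 7 the service and chkconfig commands of step 5 are forwarded to systemd; the native equivalents are "systemctl restart condor" and "systemctl enable condor".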
LDAP and VOBOX configuration for HTCondor-based AliEn site
In the Environment section for the AliEn CE in the LDAP server:
# whether it is necessary to use the job router service
USE_JOB_ROUTER=(1 | 0)
# HTCondor resource explicitly defined for submission to the vanilla universe; otherwise the system default resource is selected
GRID_RESOURCE=condor your-CE.your-domain your-CE.your-domain:9619
# routes list example
ROUTES_LIST=[ your-ce01.your-domain:9619 ] [ your-ce02.your-domain:9619 ]
# whether to use external cloud
USE_EXTERNAL_CLOUD=(1 | 0)
# specify extra options for the condor_submit command. Example: add extra ClassAds to the job description: SUBMIT_ARGS=-append "+TestClassAd=1"
SUBMIT_ARGS=<String>
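To see what is currently stored for a VOBOX, the entry can be read back with ldapsearch. This sketch reuses the server address, base DN and filter from the route script further down in this document; the two function names are hypothetical.

```shell
# Build the LDAP filter selecting the AliEn CE entry of one VOBOX host.
alien_ce_filter() {
    printf '(&(objectClass=AlienCE)(host=%s))' "$1"
}

# Print the environment attributes stored for the given host
# (default: this machine). Needs the ldapsearch client and network access.
show_ce_environment() {
    ldapsearch -LLL -x -h alice-ldap.cern.ch:8389 -b o=alice,dc=cern,dc=ch \
        "$(alien_ce_filter "${1:-$(hostname -f)}")" environment
}
```

Calling show_ce_environment on the VOBOX should list the "environment:" lines for that host, including ROUTES_LIST when it is defined.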
In ~/.alien/Environment on the VOBOX:
d=$HOME/htcondor
mkdir -p $d
export HTCONDOR_LOG_PATH=$d
Mind the firewall settings on the VOBOX:
https://alien.web.cern.ch/content/alice-vo-box-setup-and-configuration (section 1.1)
Miscellaneous scripts for HTCondor-based AliEn site
Script to fill the routes list from LDAP:
#!/bin/bash
# print HTCondor job routes obtained from the ALICE LDAP server
#
# example settings in /etc/condor/config.d:
#
# JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh
# JOB_ROUTER_ENTRIES_REFRESH = 600
#
# version 1.3 (2017/04/04)
# author: Maarten Litmaath
usage()
{
    echo "Usage: $0 [-n] [ FQHN ]" >&2
    exit 1
}
LOG=/tmp/job-routes-$(date '+%y%m%d').log
LDAP_ADDR=alice-ldap.cern.ch:8389
h=$(hostname -f)
case $1 in
-n)
    LOG=
    shift
esac
case $1 in
-*)
    usage
    ;;
?*.?*.?*)
    h=$1
    ;;
?*)
    usage
esac
f="(&(objectClass=AlienCE)(host=$h))"
#
# wrapped example output lines returned by the ldapsearch:
#
# environment: ROUTES_LIST=\
# [ "condor ce503.cern.ch ce503.cern.ch:9619" ] \
# [ "condor ce504.cern.ch ce504.cern.ch:9619"; optional extra stuff ] \
# [ "condor ce505.cern.ch ce505.cern.ch:9619" ] \
# [ "condor ce506.cern.ch ce506.cern.ch:9619" ]
#
# or a simpler format (the port currently is needed for the SAM VO feed):
#
# environment: ROUTES_LIST=\
# [ ce503.cern.ch:9619 ] \
# [ ce504.cern.ch:9619; optional extra stuff ] \
# [ ce505.cern.ch:9619 ] \
# [ ce506.cern.ch:9619 ]
#
# the next line may even be absent:
#
# environment: USE_EXTERNAL_CLOUD=0
#
if [ "x$LOG" = x ]
then
    LOG=/dev/null
else
    echo == $(date) >> $LOG
    exec 2>> $LOG
fi
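# query the CE entry, join LDIF continuation lines, then build the route entries
ldapsearch -LLL -x -h $LDAP_ADDR -b o=alice,dc=cern,dc=ch "$f" environment |
    perl -p00e 's/\r?\n //g' | perl -ne '
        if (s/^environment: ROUTES_LIST *= *//i) {
            # "host:port" becomes "condor host host:port"
            s/\[ *([^]" ]+)(:\d+) *([];])/[ "condor $1 $1$2" $3/g;
            # a bare "host" gets the default port 9619
            s/\[ *([^]" ]+) *([];])/[ "condor $1 $1:9619" $2/g;
            # drop anything between the opening bracket and the first quote
            s/\[ *[^"]*"/[ "/g;
            s/\[ *("[^"]+")/[ GridResource = $1; eval_set_GridResource = $1/g;
            $routes = $_;
            next;
        }
        if (s/^environment: USE_EXTERNAL_CLOUD *= *//i) {
            $extern = "; set_WantExternalCloud = True" if /1/;
            next;
        }
        END {
            # close every entry, adding the external cloud flag when requested
            $extern .= " ]";
            $routes =~ s/;? *]/$extern/eg;
            print $routes;
        }
    ' | tee -a $LOG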
Cleanup script for job logs and stdout/stderr files removal:
#!/bin/sh
cd ~/htcondor || exit
GZ_SIZE=10k
GZ_MINS=60
GZ_DAYS=2
RM_DAYS=7
STAMP=.stamp
prefix=cleanup-
log=$prefix`date +%y%m%d`
exec >> $log 2>&1 < /dev/null
echo === START `date`
for d in `ls -d 20??-??-??`
do
    (
        echo === $d
        stamp=$d/$STAMP
        [ -e $stamp ] || touch $stamp || exit
        if find $stamp -mtime +$RM_DAYS | grep . > /dev/null
        then
            echo removing...
            /bin/rm -r $d < /dev/null
            exit
        fi
        cd $d || exit
        find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \
            -size +$GZ_SIZE -mmin +$GZ_MINS \) -exec gzip -9v {} \;
    )
done
find $prefix* -mtime +$RM_DAYS -exec /bin/rm {} \;
echo === READY `date`
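Before enabling the script it can be useful to see which files it would compress. The following dry-run sketch applies the same find conditions as the script but prints the matches instead of running gzip; the function name list_gzip_candidates is hypothetical, and the defaults mirror GZ_DAYS, GZ_SIZE and GZ_MINS above.

```shell
# Print the files that the cleanup script would compress in one day directory:
# files older than GZ_DAYS days, or larger than GZ_SIZE and untouched for more
# than GZ_MINS minutes. Hidden files and *.gz files are skipped.
list_gzip_candidates() {
    dir=$1
    (
        cd "$dir" || exit
        find . ! -name '.*' ! -name '*.gz' \( -mtime +"${GZ_DAYS:-2}" -o \
            -size +"${GZ_SIZE:-10k}" -mmin +"${GZ_MINS:-60}" \) -print
    )
}

# Usage on the VOBOX (the directory name is only an example):
#   list_gzip_candidates ~/htcondor/2024-01-31
```

The thresholds can be overridden per call, for example "GZ_DAYS=0 list_gzip_candidates DIR", to preview the effect of a more aggressive setting.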
Crontab line for cleanup script:
37 * * * * /bin/sh $HOME/htcondor-cleanup.sh
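The line can be added by hand with "crontab -e". As a sketch of a scripted alternative, the hypothetical helper below appends the line to an existing crontab only when it is not there yet, so that it can be run repeatedly without creating duplicates.

```shell
# Print the crontab read from standard input, followed by LINE unless an
# identical line is already present. Hypothetical helper.
merge_cron_line() {
    line=$1
    current=$(cat)
    if [ -n "$current" ]; then
        printf '%s\n' "$current"
    fi
    printf '%s\n' "$current" | grep -qxF -- "$line" || printf '%s\n' "$line"
}

CRON_LINE='37 * * * * /bin/sh $HOME/htcondor-cleanup.sh'

# Usage on the VOBOX, as the user that runs the ALICE services:
#   crontab -l 2>/dev/null | merge_cron_line "$CRON_LINE" | crontab -
```

The single quotes keep $HOME unexpanded, so that the entry matches the line shown above exactly.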