Titan development

ORNL Titan architecture
 

On ORNL Titan users can access only the interactive nodes (IN) and LustreFS via SSH. There is no access to the Service Nodes, which prepare the PBS jobs. Worker nodes (WN) run the Cray Linux microkernel and have neither inbound nor outbound network connectivity. WNs are interconnected by an InfiniBand network.

The communication between INs and WNs is only possible through a shared filesystem - LustreFS.

To access ORNL Titan one needs an RSA token obtained from ORNL.

 

[Figure: titan_architecture - ORNL Titan architecture overview]
 
 

Titan/ALICE services interaction

 
As there is no network connectivity on the WNs, a JobService identifies when a batch is ready to accept new jobs, takes jobs from the AliEn Central Queue, transmits their heartbeats to the monitoring services, uploads the job results after a job has completed, and marks the job as DONE/ERROR_*.
 
 
[Figure: architecture - Titan/ALICE services interaction]
 
Computing time on ORNL Titan is obtained through the PBS interface.
 
A simple example of a PBS script for Titan:

#!/bin/bash
#    Begin PBS directives
#PBS -A CSC108
#PBS -N sleep_test
#PBS -j oe
#PBS -l walltime=00:20:00,nodes=1
#PBS -l gres=atlas1
#    End PBS directives and begin shell commands

cd $MEMBERWORK/csc108

module load cray-mpich/7.2.5
module load python/3.4.3
module load python_mpi4py/1.3.1

aprun -n 1 ./get_rank_and_exec_job.py

 
The aprun command starts "-n" copies of a program on Titan WNs through the MPI interface. See the complete documentation for aprun on the OLCF website: https://www.olcf.ornl.gov/kb_articles/using-the-aprun-command/
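For illustration, a minimal sketch (not the actual get_rank_and_exec_job.py) of what each copy started by aprun can do, assuming the python and python_mpi4py modules from the PBS example above are loaded:

#!/usr/bin/env python
# Minimal sketch: every copy started by "aprun -n <N>" executes this file
# and learns which of the N copies it is from its MPI rank.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # 0 .. N-1, unique for every copy
size = comm.Get_size()   # total number of copies started by aprun

print("copy %d of %d started" % (rank, size))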
 
The batch handling for the ALICE workflow is implemented in the Titan.pm module, which submits the generated PBS script through the regular qsub interface (no stdin input is allowed on Titan, so the module uses the "qsub <filename>" form).
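Titan.pm itself is a Perl module; the following Python sketch only illustrates the submit-by-file approach it relies on (function and variable names are placeholders):

import subprocess
import tempfile

def submit_pbs_script(script_text):
    # No stdin submission is allowed on Titan, so the generated PBS script
    # is written to a file first and then passed to qsub as "qsub <filename>".
    with tempfile.NamedTemporaryFile('w', suffix='.pbs', delete=False) as f:
        f.write(script_text)
        path = f.name
    # qsub prints the id of the newly created PBS job on stdout
    return subprocess.check_output(['qsub', path]).strip()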
 
For more information on Titan Scheduling policy please visit: https://www.olcf.ornl.gov/support/system-user-guides/titan-user-guide/#358
 

ALICE PBS batch life cycle on Titan

 
After an ALICE batch is submitted, it spends some time in the common PBS queue.
Once it starts, it creates its working folder (inside the working folder of a JobService) and an SQLite database (jobagent.db) inside it containing the batch info (the jobagent_info table), and waits for a response from the JobService that operates on this working folder. The service responds with its own part of the database structure: a table (alien_jobs) that describes the jobs assigned to the processes in this batch (every process is identified by its own MPI rank within the batch), and another SQLite database for storing job heartbeats (jobagent.db.monitoring).
 
See descriptions for the databases below.
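A sketch of that handshake from the batch side, assuming the schemas shown in the "SQLite databases structure" section below (the folder name, the ttl/cores values, and the polling interval are illustrative):

import os
import sqlite3
import time

# Sketch of the batch side: create the working folder and jobagent.db,
# publish the batch info, then wait until the JobService has created
# the alien_jobs table for this batch.
workdir = os.path.join(os.environ['WORKDIR'], 'batch_0001')   # illustrative name
os.makedirs(workdir, exist_ok=True)

db = sqlite3.connect(os.path.join(workdir, 'jobagent.db'))
db.execute('CREATE TABLE jobagent_info (ttl INT NOT NULL, cores INT NOT NULL, '
           'started INT, max_wait_retries INT)')
db.execute('INSERT INTO jobagent_info VALUES (?, ?, ?, ?)',
           (36000, 16, int(time.time()), 10))
db.commit()

# Wait until the JobService answers with its part of the structure.
while not db.execute("SELECT name FROM sqlite_master "
                     "WHERE type='table' AND name='alien_jobs'").fetchall():
    time.sleep(10)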
 
[Figure: workflow - ALICE PBS batch life cycle on Titan]
 
 
When there are MPI processes in the batch that are not running any ALICE job, the service fetches jobs from the Central Queue, downloads all of the required files, and assigns the job information to an available process.
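A sketch of that assignment from the service side (the real TitanJobService is part of jAliEn and written in Java; the status letter 'A' and the job dictionary keys are assumptions for illustration):

import sqlite3

def assign_job_to_rank(db_path, rank, job):
    # Sketch: write the description of one fetched job into alien_jobs for a
    # free MPI rank; the wrapper running under that rank picks it up later.
    db = sqlite3.connect(db_path)
    db.execute('INSERT INTO alien_jobs (rank, queue_id, user, masterjob_id, '
               'job_folder, status, executable, validation, environment) '
               'VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
               (rank, job['queue_id'], job['user'], job['masterjob_id'],
                job['job_folder'], 'A', job['executable'],
                job['validation'], job['environment']))
    db.commit()
    db.close()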
 
The general idea of the folder hierarchy and folder contents is the following:
[Figure: batch_folder - batch folder hierarchy and contents]
 
If a single JobService uses too many resources on an interactive node, this approach can be scaled to many interactive nodes. The VoBox CE module can use multiple working folders on LustreFS, each operated by its own JobService. The working folder for a new PBS batch can then be selected, for example, using a simple round-robin policy.
 
[Figure: scaling - scaling to multiple interactive nodes]
 
 
The working folder is selected by the VoBox CE service before batch submission (for example, using a simple round-robin).
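A minimal sketch of such a round-robin selection (the paths are placeholders for the per-JobService working folders):

import itertools

# Placeholder list: one working folder per JobService instance on LustreFS.
WORK_FOLDERS = [
    '/lustre/atlas/scratch/psvirin/csc108/workdir1',
    '/lustre/atlas/scratch/psvirin/csc108/workdir2',
    '/lustre/atlas/scratch/psvirin/csc108/workdir3',
]
_next_folder = itertools.cycle(WORK_FOLDERS)

def folder_for_new_batch():
    # Each new PBS batch gets the next folder in the cycle.
    return next(_next_folder)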
 

SQLite databases structure

The latest database structure is the following.

jobagent.db: contains general information about batch and running jobs.

>> .schema

CREATE TABLE jobagent_info (ttl INT NOT NULL, cores INT NOT NULL, started INT, max_wait_retries INT); 

CREATE TABLE alien_jobs (rank INTEGER NOT NULL,
                         queue_id VARCHAR(20),
                         user VARCHAR(20),
                         masterjob_id VARCHAR(20),
                         job_folder VARCHAR(256) NOT NULL,
                         status CHAR(1),
                         executable VARCHAR(256),
                         validation VARCHAR(256),
                         environment TEXT,
                         exec_code INTEGER DEFAULT -1,
                         val_code INTEGER DEFAULT -1);

 
jobagent.db.monitoring: contains a table for job heartbeats information.
 
>> .schema
 
CREATE TABLE alien_jobs_monitoring (queue_id VARCHAR(20), resources VARCHAR(100));
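
A sketch of pushing one heartbeat row from the wrapper side (the content of the resources string is an assumption):

import sqlite3

def push_heartbeat(monitoring_db, queue_id, resources):
    # Sketch: the wrapper periodically inserts a heartbeat row; the
    # JobService reads the table and forwards the data to monitoring.
    db = sqlite3.connect(monitoring_db)
    db.execute('INSERT INTO alien_jobs_monitoring (queue_id, resources) '
               'VALUES (?, ?)', (queue_id, resources))
    db.commit()
    db.close()

# Illustrative call:
# push_heartbeat('jobagent.db.monitoring', '123456789', 'cpu=98.5,rss=1200000')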
 

VoBox start
 
 
 
AliEn is installed locally in /ccs/home/psvirin/alien/bin.
To start VoBox do the following steps:
 
export PATH=$PATH:/lustre/atlas/proj-shared/csc108/psvirin/alice.cern.ch/alice.cern.ch/bin/
cd /ccs/home/psvirin/alien/bin
./alien proxy-init -valid 120:00
./alien StartMonitor
./alien StartMonaLisa
./alien StartCE
 
There must be a Titan.pm module for the VoBox CE service in /autofs/nccs-svm1_home1/psvirin/alien.v2-19.346/lib/perl5/site_perl/5.10.1/AliEn/LQ/Titan.pm
 

CVMFS on Titan
 
CVMFS on Titan is located at /lustre/atlas/proj-shared/csc108/psvirin/alice.cern.ch/alice.cern.ch
It is not a real CVMFS mount but a snapshot of it, updated hourly by a publisher script running on dtn04.ccs.ornl.gov.
 

ALICE job wrappers
 
Located in /lustre/atlas/proj-shared/csc108/psvirin/ALICE_TITAN_SCRIPTS/.
 
ls /lustre/atlas/proj-shared/csc108/psvirin/ALICE_TITAN_SCRIPTS/
alien_multijob_run.sh  get_rank_and_exec_job.py
 
get_rank_and_exec_job.py is used by the Titan.pm module. The script is spawned by MPI; it determines its rank and then executes alien_multijob_run.sh, which communicates with the job service through the SQLite databases, executes the payload, and pushes job heartbeats into the SQLite table. In other words, it is a simple networkless reimplementation of the corresponding parts of the AliEn JobAgent.
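A Python sketch of that per-rank logic (the real wrapper is the alien_multijob_run.sh shell script; the polling interval and the use of exec_code are assumptions consistent with the alien_jobs schema above):

import sqlite3
import subprocess
import time

def run_assigned_job(db_path, rank):
    # Sketch of what the wrapper does for one MPI rank: wait until a job
    # is assigned to this rank, run its executable in the job folder,
    # then store the exit code back into the database.
    db = sqlite3.connect(db_path)
    row = None
    while row is None:
        row = db.execute('SELECT job_folder, executable FROM alien_jobs '
                         'WHERE rank=?', (rank,)).fetchone()
        if row is None:
            time.sleep(30)
    job_folder, executable = row
    code = subprocess.call(executable, shell=True, cwd=job_folder)
    db.execute('UPDATE alien_jobs SET exec_code=? WHERE rank=?', (code, rank))
    db.commit()
    db.close()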
 

TitanJobService compilation
 
The TitanJobService code currently lives in a separate git branch:
 
git checkout Titan
 
Running compilation:
 
export PATH=/lustre/atlas/proj-shared/csc108/psvirin/jre1.8.0_74/bin/:$PATH
cd /lustre/atlas/proj-shared/csc108/psvirin/jalien-latest-ja/jalien
./compile.sh
 

Running TitanJobService
 
Environment variables to be set:
 
export installationMethod=CVMFS
export site=ORNL
export CE=ALICE::ORNL::Titan
export WORKDIR=/lustre/atlas/scratch/psvirin/csc108/workdir2        # or a path to a folder where PBS batches will create their folders
export TTL=36000    #(optional/unused currently)
export cerequirements='other.user=="psvirin"' # (optional)
 
Path variables:
 
The Job Service uses an alternate JDK8 (it has some security extensions installed):
 
export PATH=/lustre/atlas/proj-shared/csc108/psvirin/jre1.8.0_74/bin/:$PATH
# adding path to alienv
export PATH=$PATH:/lustre/atlas/proj-shared/csc108/psvirin/alice.cern.ch/alice.cern.ch/bin/
 
Running job service:
 
You should have your Grid credentials located at ~/.globus/
 
Currently the Titan Job Service is located at: /lustre/atlas/proj-shared/csc108/psvirin/jalien-latest-ja/jalien
 
cd /lustre/atlas/proj-shared/csc108/psvirin/jalien-latest-ja/jalien
./jalien TitanJobService
 
(Enter your Grid certificate password, as jAliEn does not currently support passwordless certificates or X509 proxy certificates.)
 
The service can be started on any of the Titan interactive nodes (e.g., titan-ext3.ccs.ornl.gov, dtn04.ccs.ornl.gov, etc.)
 

To be done
 
  • introduce X509 proxies and passwordless certificates on the jAliEn client and server sides
  • introduce new variables for Titan.pm that define the number of cores and the time slot length
  • some jobs (1-3%) go to the ERROR_ASSIGNED state when fetching a large number of JDLs (2000); this needs to be investigated
  • alimonitor understands the RUNNING state for the jobs but does not see the DONE/ERROR_* states; this has to be fixed
  • X509 proxy support has to be introduced into jAliEn
  • fetching multiple JDLs within one request could be simulated with an HTTP/JSON request to a Tomcat instance that translates it into jAliEn entities (just a suggestion so far)