User How-To - Collections

This document describes how to work with AliEn collections.

What is a collection?

A collection is a group of files. Collections make it easier to work with large groups of files. For instance, a single command can replicate all the files of a collection to another storage element (SE). Collections can also be used as input for jobs. All of these topics are described in this document.

 

How to create a collection

There are three ways of creating a collection:

 

  • Manual: the command createCollection creates an empty collection in the catalogue. You can then use addFileToCollection and removeFileFromCollection to add files to the collection or remove them from it. This example assumes that the files /alice/cern.ch/user/p/psaiz/tutorial/collections/file1 and /alice/cern.ch/user/p/psaiz/tutorial/collections/file2 are already registered in the catalogue:

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > createCollection myFirstCollection
Jul 25 14:35:41 info File /alice/cern.ch/user/p/psaiz/tutorial/collections/myFirstCollection inserted in the catalog
[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > addFileToCollection file1 myFirstCollection
Jul 25 14:41:06 info File '/alice/cern.ch/user/p/psaiz/tutorial/collections/file1' added to the collection!
[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > addFileToCollection file2 myFirstCollection
Jul 25 14:41:31 info File '/alice/cern.ch/user/p/psaiz/tutorial/collections/file2' added to the collection!

 

  • Find: You can also use find -c <collection> to put all the files that the command finds in a new collection:

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > find -c find_collection . file
Jul 25 14:43:34 info File /alice/cern.ch/user/p/psaiz/tutorial/collections/find_collection inserted in the catalog
Jul 25 14:43:34 info File '/alice/cern.ch/user/p/psaiz/tutorial/collections/file1' added to the collection!
Jul 25 14:43:35 info File '/alice/cern.ch/user/p/psaiz/tutorial/collections/file2' added to the collection!
[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ >

 

  • XML file: If you already have an XML file describing the collection, you can use createCollection -xml to create a collection with all the files listed in the XML file:

 

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > createCollection -xml \
/alice/sim/2006/pp_minbias/collections/collection.5169_001-5169_999.973.xml my_xml_collection
Jul 25 15:01:23 info File /alice/cern.ch/user/p/psaiz/tutorial/collections/my_xml_collection inserted in the catalog
Jul 25 15:01:23 info access: warning - we are using the backdoor ....
And the file is /tmp/ALICE/cache/4c428d0a-e823-11db-b0b4-0016768a4ba4.276301185367708
Jul 25 15:01:55 info File '/alice/sim/2006/pp_minbias/5169/570/AliESDs.root' added to the collection!
Jul 25 15:01:55 info File '/alice/sim/2006/pp_minbias/5169/570/Kinematics.root' added to the collection!
Jul 25 15:01:55 info File '/alice/sim/2006/pp_minbias/5169/570/galice.root' added to the collection!
Jul 25 15:01:55 info File '/alice/sim/2006/pp_minbias/5169/129/AliESDs.root' added to the collection!
Jul 25 15:01:55 info File '/alice/sim/2006/pp_minbias/5169/129/Kinematics.root' added to the collection!
Jul 25 15:01:55 info File '/alice/sim/2006/pp_minbias/5169/129/galice.root' added to the collection!
Jul 25 15:01:56 info File '/alice/sim/2006/pp_minbias/5169/034/AliESDs.root' added to the collection!
...

 

We could even go one step further, and create a 'collection of collections':

 

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > createCollection GlobalCollection
Jul 25 17:58:07 info Ready to insert the info
Jul 25 17:58:09 info File /alice/cern.ch/user/p/psaiz/tutorial/collections/GlobalCollection inserted in the catalog
[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > addFileToCollection myFirstCollection GlobalCollection
Jul 25 17:58:56 info File '/alice/cern.ch/user/p/psaiz/tutorial/collections/myFirstCollection' added to the collection!
[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > addFileToCollection find_collection GlobalCollection
Jul 25 17:59:04 info File '/alice/cern.ch/user/p/psaiz/tutorial/collections/find_collection' added to the collection!

 

Advanced features for adding files to collections

There are some options that you can specify when adding a file to a collection. Running man addFileToCollection prints their description:

 

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > man addFileToCollection
AddFileToCollection: inserts a file into a collection of files
Usage:
addFileToCollection [-gn] [-name <name>] <file> <collection> [<extra>]
Options:
-g: use the file as guid instead of lfn
-n: do not update the collection after adding the file

In short, there are three things you can control:

  • Usually, when you add a file to a collection, you specify the lfn of the file. AliEn then translates the lfn into its GUID and stores the GUID in the catalogue. Thus, even if the lfn changes, the collection still points to the original file. With the option -g, you can add the file by specifying the GUID directly.
  • With -name <name>, you can set the local filename that the file will get when the collection is retrieved.
  • You can also attach any other kind of information to the entry (the <extra> argument in the usage line). This can be useful for storing things like event lists.

 

Moreover, note that the same file can be in several collections.
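
The effect of storing GUIDs instead of lfns can be illustrated with a toy model. The sketch below is plain Python, not AliEn code: the catalogue dictionary, the helpers add_file_to_collection and resolve, and the renamed path are all hypothetical. Only the two GUIDs and the base path are taken from the examples in this document.

```python
# Toy model (not AliEn code): a collection stores GUIDs, not lfns.
BASE = "/alice/cern.ch/user/p/psaiz/tutorial/collections"

# Hypothetical catalogue: lfn -> GUID
catalogue = {
    f"{BASE}/file1": "079FD442-3AAC-11DC-94C4-0016768A4BA4",
    f"{BASE}/file2": "203A0B3A-3AAC-11DC-B78D-0016768A4BA4",
}


def add_file_to_collection(collection, name, by_guid=False):
    """Append an entry; with by_guid=True the name is already a GUID (like -g)."""
    guid = name if by_guid else catalogue[name]
    collection.append(guid)


def resolve(collection):
    """Return the current lfns of the GUIDs stored in the collection."""
    guid_to_lfn = {guid: lfn for lfn, guid in catalogue.items()}
    return [guid_to_lfn[guid] for guid in collection]


my_first_collection = []
# Added by lfn: the model looks up the GUID and stores it.
add_file_to_collection(my_first_collection, f"{BASE}/file1")
# Added by GUID directly, as with the -g option.
add_file_to_collection(
    my_first_collection, "203A0B3A-3AAC-11DC-B78D-0016768A4BA4", by_guid=True
)

# Rename file1 in the catalogue: the lfn changes, the GUID does not.
catalogue[f"{BASE}/renamed_file1"] = catalogue.pop(f"{BASE}/file1")

current_lfns = resolve(my_first_collection)
print(current_lfns)
```

After the rename, the collection still resolves to both files, because its entries never referred to the old lfn.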

How to see the contents of a collection

In the catalogue, a collection looks like any other file. If you run ls -l and look at the first character of the entry, you will see a 'c' (short for collection). Moreover, the size of a collection is the sum of the sizes of all the files it contains.
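
As a quick check of the size rule, the sizes from the ls -al listing further down in this section can be added up in a few lines of Python. The numbers are copied from that listing; the variable names are made up for illustration. The last check is only an observation: the size reported for my_xml_collection equals 2^31 - 1, which may mean that the value reached the limit of a signed 32-bit field instead of being the real sum.

```python
# Sizes in bytes, copied from the `ls -al` listing in this section.
file_sizes = {"file1": 610514, "file2": 2685}

# myFirstCollection and find_collection both contain file1 and file2.
simple_collection_size = sum(file_sizes.values())

# GlobalCollection contains the two collections above.
global_collection_size = 2 * simple_collection_size

print(simple_collection_size)  # 613199, as listed for myFirstCollection
print(global_collection_size)  # 1226398, as listed for GlobalCollection

# Observation (an assumption, not documented behaviour): the size listed for
# my_xml_collection is exactly the largest signed 32-bit integer.
reported_xml_size = 2147483647
print(reported_xml_size == 2**31 - 1)  # True
```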

You can use the command listFilesFromCollection to see all the files that are inside a collection:

 

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > ls -al
drwxr-xr-x psaiz psaiz 0 Jul 25 14:35 .
drwxr-xr-x psaiz psaiz 0 Jul 25 14:35 ..
-rwxr-xr-x psaiz psaiz 610514 Jul 25 14:39 file1
-rwxr-xr-x psaiz psaiz 2685 Jul 25 14:39 file2
crwxr-xr-x psaiz psaiz 613199 Jul 25 14:43 find_collection
crwxr-xr-x psaiz psaiz 1226398 Jul 25 17:59 GlobalCollection
crwxr-xr-x psaiz psaiz 613199 Jul 25 14:41 myFirstCollection
crwxr-xr-x psaiz psaiz 2147483647 Jul 25 15:01 my_xml_collection
[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > listFilesFromCollection myFirstCollection
079FD442-3AAC-11DC-94C4-0016768A4BA4 (from the file /alice/cern.ch/user/p/psaiz/tutorial/collections/file1)
203A0B3A-3AAC-11DC-B78D-0016768A4BA4 (from the file /alice/cern.ch/user/p/psaiz/tutorial/collections/file2)
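
If the output of listFilesFromCollection has to be processed by a script, each line can be split into its GUID and its lfn. The Python sketch below assumes that every entry has the shape "<GUID> (from the file <lfn>)" shown above; the function name parse_collection_listing is made up for this example.

```python
import re

# One entry per line: "<GUID> (from the file <lfn>)", as in the output above.
ENTRY = re.compile(r"^(?P<guid>[0-9A-Fa-f-]{36}) \(from the file (?P<lfn>.+)\)$")


def parse_collection_listing(text):
    """Return a list of (guid, lfn) pairs; lines in any other shape are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            entries.append((match.group("guid"), match.group("lfn")))
    return entries


listing = """\
079FD442-3AAC-11DC-94C4-0016768A4BA4 (from the file /alice/cern.ch/user/p/psaiz/tutorial/collections/file1)
203A0B3A-3AAC-11DC-B78D-0016768A4BA4 (from the file /alice/cern.ch/user/p/psaiz/tutorial/collections/file2)
"""

entries = parse_collection_listing(listing)
for guid, lfn in entries:
    print(guid, lfn)
```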

Finally, you can also check where a collection is. The command whereis tells you which SEs contain all the files of the collection:

 

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > whereis myFirstCollection
Jul 25 15:07:35 info The file tutorial/collections/myFirstCollection is in
SE => ALICE::CERN::se pfn =>auto

 

Data management operations

Retrieving a collection

If you run get <collection>, all the files inside the collection are copied to your local machine. By default, they are placed in a new directory in your temporary area, although you can specify any other directory:

 

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > get myFirstCollection /tmp/my_collection
Jul 25 15:19:21 info The file tutorial/collections/myFirstCollection is in
SE => ALICE::CERN::se pfn =>auto

Jul 25 15:19:21 info This is in fact a collection!! Let''s get all the files
Jul 25 15:19:21 info We have to rename it to /tmp/my_collection
Jul 25 15:19:21 info Getting the file 079FD442-3AAC-11DC-94C4-0016768A4BA4 from the collection
And the file is /tmp/my_collection/file1
Jul 25 15:19:21 info Getting the file 203A0B3A-3AAC-11DC-B78D-0016768A4BA4 from the collection
And the file is /tmp/my_collection/file2
Jul 25 15:19:21 info Got /tmp/my_collection/file1,/tmp/my_collection/file2
[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ >

 

We could even retrieve the 'collection of collections' that we created earlier on:

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > get GlobalCollection /tmp/my_global_collection
Jul 25 18:02:06 info The file tutorial/collections/GlobalCollection is in
SE => ALICE::CERN::se pfn =>auto

Jul 25 18:02:06 info access: warning - we are using the backdoor ....
Jul 25 18:02:06 info This is in fact a collection!! Let''s get all the files
Jul 25 18:02:06 info Getting the file 8E0182E8-3AAB-11DC-8327-000CF1D109D8 from the collection
Jul 25 18:02:06 info This is in fact a collection!! Let''s get all the files
Jul 25 18:02:06 info Getting the file 079FD442-3AAC-11DC-94C4-0016768A4BA4 from the collection
And the file is /tmp/my_global_collection/myFirstCollection/file1
Jul 25 18:02:06 info Getting the file 203A0B3A-3AAC-11DC-B78D-0016768A4BA4 from the collection
And the file is /tmp/my_global_collection/myFirstCollection/file2
Jul 25 18:02:06 info Got /tmp/my_global_collection/myFirstCollection/file1,/tmp/my_global_collection/myFirstCollection/file2
Jul 25 18:02:06 info Getting the file A860F46A-3AAC-11DC-8328-000CF1D109D8 from the collection
Jul 25 18:02:06 info This is in fact a collection!! Let''s get all the files
Jul 25 18:02:06 info Getting the file 079FD442-3AAC-11DC-94C4-0016768A4BA4 from the collection
And the file is /tmp/my_global_collection/find_collection/file1
Jul 25 18:02:06 info Getting the file 203A0B3A-3AAC-11DC-B78D-0016768A4BA4 from the collection
And the file is /tmp/my_global_collection/find_collection/file2
Jul 25 18:02:06 info Got /tmp/my_global_collection/find_collection/file1,/tmp/my_global_collection/find_collection/file2
[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > exit
[pcegee02] /home/psaiz > ls -R /tmp/my_global_collection/
/tmp/my_global_collection/:
find_collection myFirstCollection

/tmp/my_global_collection/find_collection:
file1 file2

/tmp/my_global_collection/myFirstCollection:
file1 file2
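
The directory layout above follows from expanding the collections recursively: every nested collection becomes a subdirectory, and every plain file ends up inside it. The Python sketch below reproduces that layout with a toy in-memory model. The collections dictionary and the function local_paths are hypothetical and do not call AliEn.

```python
import posixpath

# Toy model of the catalogue entries used in this document:
# each collection maps to the names of its members.
collections = {
    "myFirstCollection": ["file1", "file2"],
    "find_collection": ["file1", "file2"],
    "GlobalCollection": ["myFirstCollection", "find_collection"],
}


def local_paths(name, target_dir):
    """List the local paths produced by retrieving `name` into `target_dir`."""
    paths = []
    for member in collections[name]:
        if member in collections:
            # A nested collection gets its own subdirectory.
            paths.extend(local_paths(member, posixpath.join(target_dir, member)))
        else:
            paths.append(posixpath.join(target_dir, member))
    return paths


paths = local_paths("GlobalCollection", "/tmp/my_global_collection")
for path in paths:
    print(path)
```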

 

Mirroring a collection

Just as you can use the command mirror -t to transfer a file from one site to another, you can use mirror -t on a collection to schedule a transfer for every file in that collection. The request is split into one transfer per file, but you can still follow all of them through the master transfer id:

 

[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/tutorial/collections/ > mirror -t find_collection alice::gsi::se
Jul 26 11:49:34 info The file tutorial/collections/find_collection is in
SE => ALICE::CERN::se pfn =>auto

Jul 26 11:49:34 info Mirroring file /alice/cern.ch/user/p/psaiz/tutorial/collections/find_collection (from ALICE::CERN::se)
Jul 26 11:49:34 info The transfer has been scheduled!! (1542715)
[aliendb06a.cern.ch:3307] /alice/cern.ch/user/p/psaiz/ > listTransfer -id 1542715 -master
Jul 26 11:52:48 info Checking the transfer -id 1542715 -master
TransferId Status User Destination Size Source
1542715 DONE psaiz alice::gsi::se 0
1542716 DONE alienmaster alice::gsi::se 610514
1542717 DONE alienmaster alice::gsi::se 2685

You can also choose to replicate only the files that are not already in that SE. To do so, use mirror -tu.
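
Conceptually, mirror -tu skips the files that already have a replica in the target SE. Here is a minimal Python sketch of that selection. It assumes a hypothetical replicas table that maps each file to the SEs holding it, with replica locations invented for illustration, and it assumes that SE names can be compared case-insensitively.

```python
# Hypothetical replica table: file -> set of SEs that hold a copy.
replicas = {
    "file1": {"ALICE::CERN::se", "ALICE::GSI::se"},
    "file2": {"ALICE::CERN::se"},
}


def files_to_mirror(collection_files, target_se):
    """Return the files of the collection that have no replica in target_se."""
    # Assumption: SE names are compared case-insensitively, since this
    # document writes them both as ALICE::CERN::se and alice::gsi::se.
    target = target_se.lower()
    return [
        name
        for name in collection_files
        if target not in {se.lower() for se in replicas[name]}
    ]


missing = files_to_mirror(["file1", "file2"], "alice::gsi::se")
print(missing)
```

In this invented situation only file2 would be scheduled for transfer, because file1 already has a replica in the target SE.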

 

Collocating a collection

STILL TO BE IMPLEMENTED

It is usually good practice to keep all the files of a collection available at one site. You can use the AliEn command updateCollection to verify that at least one site contains all the files. If there is none, updateCollection can schedule the minimum number of transfers needed to fulfil that requirement.
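
Since updateCollection is not implemented yet, the following Python code is only a sketch of the idea, not AliEn code. It assumes that the minimum number of transfers is obtained by choosing the SE that already holds the most files of the collection. The function plan_collocation, the replicas table, and the names file3 and file4 are all invented for illustration.

```python
def plan_collocation(replicas):
    """Pick the SE that needs the fewest transfers to hold every file.

    `replicas` maps each file of the collection to the set of SEs holding it.
    Returns (best_se, files_to_transfer); (None, []) if there are no replicas.
    """
    # Sorting makes ties resolve alphabetically, so the result is deterministic.
    all_ses = sorted(set().union(*replicas.values()))
    if not all_ses:
        return None, []

    def missing_in(se):
        return sorted(name for name, ses in replicas.items() if se not in ses)

    best_se = min(all_ses, key=lambda se: len(missing_in(se)))
    return best_se, missing_in(best_se)


# Hypothetical replica locations.
example = {
    "file1": {"ALICE::CERN::se", "ALICE::GSI::se"},
    "file2": {"ALICE::CERN::se"},
    "file3": {"ALICE::GSI::se"},
    "file4": {"ALICE::GSI::se"},
}

best_se, transfers = plan_collocation(example)
print(best_se, transfers)
```

A real implementation might also want to take file sizes into account, since one large transfer can cost more than several small ones.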

Submitting jobs with collections

There are several ways of working with collections in jobs, depending mainly on the number of files that the collection contains. We assume that you are already familiar with the JDL syntax used to submit a job. If you are not, please read the JDL documentation first.

 

Analyzing all the files in the same job

The simplest way is to put the collection as InputData in your job. Then only sites that have that collection will be able to pick up the job. This is recommended if the number of files in the collection is not too large. If the collection has a lot of files, see the section on splitting a collection.

 

Splitting a collection

You can put the collection as InputDataCollection and specify any of the possible Split options (per file, per directory, ...). AliEn will then take care of splitting the job into subjobs.
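
To make the difference between the split options concrete, here is a toy Python model of splitting per file and per directory. It is not how AliEn implements splitting: the function split_inputs and the mode names "file" and "directory" are labels chosen for this sketch, so check the JDL documentation for the actual Split values. The lfns are the first seven entries of the XML collection example earlier in this document.

```python
import posixpath
from collections import defaultdict


def split_inputs(lfns, mode):
    """Group input lfns into subjobs: one per file, or one per parent directory."""
    if mode == "file":
        return [[lfn] for lfn in lfns]
    if mode == "directory":
        groups = defaultdict(list)
        for lfn in lfns:
            groups[posixpath.dirname(lfn)].append(lfn)
        # Dictionaries keep insertion order, so subjobs follow the input order.
        return list(groups.values())
    raise ValueError(f"unsupported split mode: {mode!r}")


RUN = "/alice/sim/2006/pp_minbias/5169"
lfns = [
    f"{RUN}/570/AliESDs.root",
    f"{RUN}/570/Kinematics.root",
    f"{RUN}/570/galice.root",
    f"{RUN}/129/AliESDs.root",
    f"{RUN}/129/Kinematics.root",
    f"{RUN}/129/galice.root",
    f"{RUN}/034/AliESDs.root",
]

per_file = split_inputs(lfns, "file")
per_directory = split_inputs(lfns, "directory")
print(len(per_file), [len(group) for group in per_directory])
```

Splitting per file gives one subjob for each of the seven files, while splitting per directory gives three subjobs, one for each of the directories 570, 129 and 034.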

 

Hierarchical splitting

STILL TO BE IMPLEMENTED

Since a collection can point to any lfn (including other collections!), many new possibilities arise. Imagine, for instance, that you have several runs, each with several files. If you have one collection per run and another collection containing all of those collections, you could submit a single job that analyzes the top collection; it would be split into one job per run collection, and each of those jobs would be split again...
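
Because this feature does not exist yet, the following Python code is only a thought experiment about what such a split could look like. The function names, the catalogue dictionary and the run and file names are all invented. The sketch also refuses collections that contain themselves, directly or indirectly, since a real implementation would need some protection against endless splitting.

```python
def split_hierarchically(name, catalogue, _path=()):
    """Return a nested job tree: a collection becomes a job with one subjob
    per member, and a plain file becomes a leaf (its name as a string)."""
    if name in _path:
        raise ValueError("collection cycle: " + " -> ".join(_path + (name,)))
    members = catalogue.get(name)
    if members is None:
        return name  # not a collection: leaf job on a single file
    return {
        "input": name,
        "subjobs": [
            split_hierarchically(member, catalogue, _path + (name,))
            for member in members
        ],
    }


def count_leaf_jobs(job):
    """Count the single-file jobs at the bottom of the tree."""
    if isinstance(job, str):
        return 1
    return sum(count_leaf_jobs(sub) for sub in job["subjobs"])


# Hypothetical catalogue: one collection per run, plus a top collection.
catalogue = {
    "all_runs": ["run_a", "run_b"],
    "run_a": ["a/file1", "a/file2"],
    "run_b": ["b/file1"],
}

tree = split_hierarchically("all_runs", catalogue)
print(tree)
print(count_leaf_jobs(tree))
```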