NICA-Scheduler

1st experiment of the NICA project

If you have time-consuming tasks, many simple tasks or a lot of files to process, you can use batch systems on distributed clusters, such as SLURM on the HybriLIT platform including supercomputer Govorun, Sun Grid Engine (SGE) on the NICA ncx-cluster or Torque on the JINR CICC complex to essentially accelerate your work.
If you know how to work with SLURM (SLURM on HybriLIT), Sun Grid Engine (SGE user guide) and Torque systems (doc index), you can use sbatch or qsub commands on the clusters to parallel data processing. Simple examples of user jobs for SLURM, SGE and Torque can be found in BmnRoot here. Otherwise, NICA-Scheduler has been developed to simplify running user tasks in parallel.

NICA-Scheduler is a module of the BmnRoot software. It uses an existing batch system (SLURM, SGE and Torque are supported) to distribute user jobs on the cluster and simplifies parallel job executing without knowledge of the batch system. Jobs for distributed execution are described and passed to NICA-Scheduler as XML files, e.g.:
$ nica-scheduler my_job.xml

NICA-Scheduler Job Example:

<job name="reco_job">
   <macro path="$VMCWORKDIR/macro/run/run_reco_bmn.C" start_event=”0” event_count=”1000”>
    <file input="$VMCWORKDIR/macro/run/bmnsim1.root" output="$VMCWORKDIR/macro/run/bmndst1.root"/>
    <file input="$VMCWORKDIR/macro/run/bmnsim2.root" output="$VMCWORKDIR/macro/run/bmndst2.root"/>
    <file sim_input="energy=3,gen=dcmqgsm" output="$VMCWORKDIR/macro/run/bmndst_${counter}.root"/>
   </macro>
   <run mode=“global" count=“25" config=“~/bmnroot/build/config.sh"/>
</job>

An XML description of a job to run in a batch system starts with <job> tag and ends with closing </job> tag. The attribute 'name' defines the name of the job to identify it. The attribute 'batch' defines batch system of the cluster if needed. The possible values: ‘sge’ for Sun Grid Engine (LHEP ncx-farm), ‘slurm’ for SLURM (HybriLIT, Govorun) and ‘torque’ for Torque batch system (LIT CICC). Attention! The default value is ‘sge’ (Sun Grid Engine batch system) to run on the ncx-farm.

Tag <macro> sets information about a macro being executed by BmnRoot or MpdRoot :

  • path – path of the ROOT macro for distributed execution.
    Important! Since a job can be started by the scheduler on any machine of the cluster, the path must point to a shared space (e.g. /eos/nica/bmn/users/my_user/ volumes on the NICA cluster). This argument is required.
  • start_event – number of the start event to process for all input files. This argument is optional.
  • event_count – count of events to process for all input files. This argument is optional.
  • add_args – additional last arguments of the ROOT macro, if required.
  • name – conventional name of the macro if you want to use dependencies. This argument is optional and usually not used.

Tag <file> is included inside the <macro> tag and contains information on input and output files to process by the macro:

  • input – path to one input file or a set of files in case of regular expressions (?,*,+) to be processed.
  • file_input – path to the text file containing a list of input files separated by new lines.
  • job_input – name of the one of the previous jobs that resulting files are input files for the current macro.
  • macro_input – name of the one of the previous macros in the same job that resulting files are input files for the current macro.
  • sim_input – string to specify a list of input simulation files forming from the BM@N Unified database.
  • exp_input – string to specify a list of input experimental (raw) files forming from the BM@N Unified database.
  • output – path to result files.
    Important! Since a user job can be started by the NICA-Scheduler on any machine of the cluster, path must point to a shared space, e.g. EOS volume being available on all cluster nodes.
  • start_event – number of the start event specific for this file. This argument is optional.
  • event_count – count of events to process specific for this file. This argument is optional.
    Important! If start_event (or event_count) is set in <macro> and <file> tags then the value in the <file> tag is selected as preferable.
  • parallel_mode – processor count to parallel event processing for the given input file. This argument is optional.
  • merge – whether merge partial result files in the parallel_mode. Default value is true, possible values of the attribute: "false", "true", "nodel", "chain".

An example of sim_input string is "energy=3,gen=urqmd", where the following parameters: collision energy of selected events is equal to 3 GeV and event generator is UrQMD. All possible parameters are described in database console utilities. As a list of input files to be processed can include a lot of files in the one <file> tag, special variables can be used in the output argument to set a list of output files:

${counter} = counter is corresponding to the regular input file to be processed, the counter starts with 1 and increases by 1 for the next input file.
${input} = absolute path of the input file.
${file_name} = input file name without extension.
${file_name_with_ext} = input file name with extension.
${file_dir_name} = directory name of the input file.
${batch_temp_dir} = path to a batch temp directory.
${user} = username who runs the job.
${first_number} = first number in the input file name.
${last_number} = last number in the input file name.

Also, you can use an additional possibility to exclude first ('~' symbol after colon) or last ('-' symbol after colon) characters for the above special variables, e.g. ${file_name_with_ext:~N} is used as an input file name with extension but without first N chars, or ${file_name:-N} - input file name without last N chars.

In case of mass data processing or production, input files are often pre-copied to an intermediate fast storage before processing on a cluster. This possibility is implemented via <put> and <get> tags. The <put> tag is used to preliminary copy input files to a temporary storage, and the <get> tag – copy the result files back from the storage. In addition, the <clean> tag can be used to list auxiliary files, which will be deleted after the processing. The <put>, <get> and <clean> tags are included inside the <file> tag and contain the following attributes:

  • command – copy command (possible with parameters) used to transfer files to the intermediate storage (<put> tag) and back (<get> tag), e.g. "xrdcp -f -p -N".
  • path – destination file path of the copy command. The <clean> tag uses only this path attribute for files to be deleted after processing.

Tag <run> describes run parameters and allocated resources for the job. It can contain the following arguments:

  • mode – execution mode. Value ‘global’ sets distributed processing on a cluster with a batch system, ‘local’ – multi-threaded execution on a user multi-core computer. The default value is ‘local’.
  • count – maximum count of processors allocated for the job. If the count is greater than the number of files to be processed, the count of the allocated processors is assigned to the number of files. The default value is 1. You can assign the processor count to 0, then all cores of the user machine will be employed in case of the local mode or all processors of the batch system will be requested in case of the global mode.
  • config – path to the bash file with environment variables (including ROOT environment variables) being executed before running the macros. This argument is optional.
  • queue – queue name of the batch system to be used for running the job if it is not as default, e.g. "mpd@bfsrv.jinr-t1.ru" for LIT CICC cluster queue.
  • qos – quality of service (QOS) for each job if required.
  • hosts – selected host names separated by comma to process the job, e.g. "ncx201.jinr.ru,ncx202.jinr.ru" - for running the job only on ncx201.jinr.ru and ncx202.jinr.ru nodes of the NICA cluster. The parameter is not usually used.
  • priority – priority of the job (an integer in the range -1023 to 1024 with default value: 0). It is optional argument.
  • memory1 – a number of operative memory allocated per one processor slot, e.g. "4 GB", but not less than 1 MB. It is optional argument.
  • logs – log file path for multi-threaded (local) mode (optional).
  • core_dump - whether dump core to 'core.*' files in case of task failures. Default value is false, possible values: "false", "true".

If you want to run non-ROOT (arbitrary) command by the NICA-Scheduler, use <command> tag instead of the <macro> one with argument line – command line for distributed execution by a scheduling system. An example of the job with <command> tag:

<job>
   <command line="show_simulation_files energy=5-9"/>
   <run mode="global" config="~/bmnroot/build/config.sh"/>
</job>

Another example of the job for local multi-threaded execution:

<job>
   <macro path=“$VMCWORKDIR/macro/run/run_reco_bmn.C">
    <file input=“$VMCWORKDIR/macro/run/bmnsim1.root" output="$VMCWORKDIR/macro/run/bmndst1.root“ start_event=”0” event_count=”0”/>
    <file input="$VMCWORKDIR/macro/run/bmnsim2.root" output="$VMCWORKDIR/macro/run/bmndst2.root“ start_event=”0” event_count=”1000” parallel_mode=“5” merge=“true”/>
   </macro>
   <run mode="local" count=“6" config=“~/bmnroot/build/config.sh" logs="processing.log"/>
</job>

An XML job description for the NICA-Scheduler can contain more than one job, and a list of user jobs for distributed execution. In this case <job> tags are included in a general <jobs> tag. Some dependencies can be set between different jobs, so that job depending on another one will not start its execution until the latter ends. To set dependency, use 'dependency' attribute assigned to the name of another job in the <job> tag. In addition, a NICA-Scheduler job can contain more than one macro to execute, just use several <macro> tags in the job.

The examples directory of the NICA-Scheduler contains various XML-examples. If you have any questions on the NICA-Scheduler, please, email: gertsen@jinr.ru.

Leave a Reply