Site installation, configuration and testing for parallel jobs

Within EuMedGRID, it is strongly advised that sites support parallel (MPI) jobs.

This page describes the steps needed to enable MPI support at a site. It also provides instructions for testing MPI support by submitting a test job.

In what follows, we assume that you are about to install a CREAM CE following the InfnGrid procedure. If instead you are installing your CREAM CE from the EGEE repositories, please follow the relevant EGEE documentation, namely this chapter and the MPI-specific link contained therein, and this other page.

Configuring support for MPI jobs

A lot of related material exists on the web: see, for example, the links at the bottom of the Parallel Application Support page on the EuMedGRID application site. Note, however, that a large fraction of that documentation may be obsolete, since it refers to gLite3.1.

Specifically, it looks to me that the MPICH and LAM flavors are now obsolete, and that the only MPI flavors currently supported are MPICH2 and OPENMPI.

Adding relevant packages

Assuming you are pointing to InfnGrid repositories, make sure you have the files /etc/yum.repos.d/ig.repo and /etc/yum.repos.d/glite-generic.repo on your CE/WN (please refer to the EuMedGridInstallation page and links therein, for details as to where to find such files). Run the following command:

yum install ig_MPI_utils

The above command will install all required packages, such as mpich2 and mpi-start, together with the example files which you will need to customize before running ig_yaim.

Configuring using yaim

Copy all example MPI-related yaim files to your siteinfo directory (we'll assume it is /root/siteinfo/).

cp /opt/glite/yaim/examples/siteinfo/services/*mpi* /root/siteinfo/services/

If you have a mixed environment, namely CREAM-CE configured according to InfnGrid and WNs configured according to EGEE, to avoid getting lost between yaim and ig_yaim configuration files, I advise you do the following:

cd /root/siteinfo/services/
ln -sf ig-mpi_wn glite-mpi_wn
ln -sf ig-mpi_ce glite-mpi_ce
ln -sf ig-mpi glite-mpi

Now you are ready to set up your environment:

  • make sure you have set ENABLE_MPI=yes in your ig-site-info.def file (no spaces around =, as the file is read by the shell)
  • make sure you remove any MPI-related string from CE_RUNTIMEENV: all needed strings will be added automatically after running yaim
  • make sure you have set CE_BATCH_SYS=pbs rather than torque in your ig-site-info.def file
  • edit file services/ig-mpi and set MPI_MPICH_MPIEXEC so as to overcome a bug in the yaim scripts:
         # MPI_MPICH_MPIEXEC is needed just to fix the buggy yaim scripts, set it to $MPI_MPICH2_MPIEXEC
         MPI_MPICH_MPIEXEC="/usr/bin/mpiexec"
         
  • edit file services/ig-mpi and set MPI_<flavor>_ENABLE, MPI_<flavor>_PATH, MPI_<flavor>_VERSION and MPI_<flavor>_MPIEXEC for each MPI flavor you intend to support. For example:
         MPI_MPICH2_ENABLE="yes"
         MPI_OPENMPI_ENABLE="yes"
         MPI_MPICH2_PATH="/usr"
         MPI_MPICH2_VERSION="1.1.1p1"
         MPI_OPENMPI_PATH="/usr/lib64/openmpi/1.3.2-gcc"
         MPI_OPENMPI_VERSION="1.3.2p2"
         MPI_MPICH2_MPIEXEC="/usr/bin/mpiexec"
         MPI_OPENMPI_MPIEXEC="/usr/lib64/openmpi/1.3.2-gcc/bin/mpiexec"
         

If you want to use a different path for mpich2, see also the slides about the mpi-start problem.

Needless to say, verify that yum installed exactly the above versions of the programs, and that the above paths are correct.
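The path check can be scripted. The following is only a minimal sketch, assuming that services/ig-mpi consists of plain shell variable assignments as in the example above; the function name check_mpiexec_paths is hypothetical and is not part of yaim.

```shell
#!/bin/sh
# Hypothetical helper (not part of yaim): source an ig-mpi style file and
# report whether each MPI_<flavor>_MPIEXEC value points to an executable.
# Returns 0 if every configured path is executable, 1 otherwise.
check_mpiexec_paths() {
    conf="$1"
    # Run in a subshell so the sourced variables do not leak to the caller.
    (
        . "$conf" || exit 2
        rc=0
        for var in MPI_MPICH2_MPIEXEC MPI_OPENMPI_MPIEXEC; do
            eval "path=\${$var:-}"
            if [ -z "$path" ]; then
                echo "$var: not set"
            elif [ -x "$path" ]; then
                echo "$var: OK ($path)"
            else
                echo "$var: MISSING ($path)"
                rc=1
            fi
        done
        exit $rc
    )
}
```

For example, check_mpiexec_paths /root/siteinfo/services/ig-mpi prints one line per flavor. Checking the installed versions still has to be done separately.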

  • set variable MPI_SHARED_HOME if you have a shared area which is accessible to all worker nodes (recommended, as it improves performance). This is rather tricky, as you should set the variable to:
    • no (which is the default) if such a shared area does not exist
    • yes if the home directories are shared (namely, if a generic user eumed123 on wnA has the same home directory as eumed123 on wnB)
    • anything other than yes or no will be interpreted as the physical path of the shared area, for example MPI_SHARED_HOME=/share/cluster/mpi. In this case, remember to configure the area so that it is world-writable (namely, configured like /tmp: chmod 1777 /share/cluster/mpi)
  • set variable MPI_SSH_HOST_BASED_AUTH to yes (the default is no) if you allow ssh host-based authentication between worker nodes. This is useful, as it lets mpiexec start the job's executable on other nodes:
    • add the following two lines to the top of /etc/ssh/sshd_config, on the CE, UI and WNs:
              HostbasedAuthentication yes
              IgnoreUserKnownHosts yes
    • create file /etc/ssh/shosts.equiv: it should contain the full names of the CE, UI, WNs, one on each line, like the following
              ce01.my.full.domain
              ui01.my.full.domain
              wn01.my.full.domain
              wn02.my.full.domain
              ...
    • restart the SSH daemon:
              service sshd restart
    • add the following two lines at the top of /etc/ssh/ssh_config on the CE, UI and WNs:
              StrictHostKeyChecking no
              UserKnownHostsFile /dev/null
    • Due to a bug in the script /opt/i2g/etc/mpi-start/mpich2.mpi, scp between worker nodes does not work, so you should copy the corrected version to all worker nodes (this is not necessary if you use MPI_SHARED_HOME=yes, as suggested)
  • contrary to what was stated in a previous version of this document, an instruction like MPI_SUBMIT_FILTER=yes should no longer be needed: I have recently re-configured my CE without it. Let me know whether your experience is different. My setup involves the latest (as of 2011-02-10) CE version, configured "a la EGEE"; in particular, I have the following packages installed:
       glite-CREAM-3.2.9-1.sl5.x86_64
       lcg-info-dynamic-pbs-1.0.13-1.noarch
       lcg-info-dynamic-scheduler-pbs-2.0.1-1.noarch
       lcg-pbs-utils-1.0.0-1.noarch
       torque-2.3.6-2cri.el5.x86_64
       torque-client-2.3.6-2cri.el5.x86_64
       torque-server-2.3.6-2cri.el5.x86_64
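
Returning to MPI_SHARED_HOME, the three cases listed above can be summarised in a short sketch. Both function names below are hypothetical helpers written for illustration; they are not part of yaim or mpi-start, and they only restate the rules given in this page.

```shell
#!/bin/sh
# Hypothetical helper: explain how a given MPI_SHARED_HOME value is
# interpreted, following the three cases described above.
describe_shared_home() {
    case "${1:-no}" in
        no)  echo "no shared area" ;;
        yes) echo "home directories are shared" ;;
        *)   echo "shared area at $1" ;;
    esac
}

# Hypothetical helper: create a shared area configured like /tmp
# (world-writable with the sticky bit set, i.e. mode 1777).
prepare_shared_area() {
    mkdir -p "$1" && chmod 1777 "$1"
}
```

For example, describe_shared_home /share/cluster/mpi reports a shared area at that path, and prepare_shared_area /share/cluster/mpi would be run on the machine exporting that area.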
    

More complete information about how to set MPI-related variables can be found at the official YAIM page or the MPI-specific YAIM page. Note that your site should advertise special interconnection between nodes, if available, in the form MPI-<Interconnect> where <Interconnect> might be Ethernet (which is the default), Infiniband, Myrinet, SCI: this is done by appending a line to CE_RUNTIMEENV like:

MPI-Ethernet

which will eventually be translated by yaim to the publishing of

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-Ethernet
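
Once yaim has run, it may be handy to check which MPI-related tags are actually published. The sketch below only filters LDIF text that you have already obtained from your information system (for example with an ldapsearch query, whose host and base DN depend on your site and are not shown here); the function name list_mpi_tags is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read LDIF text on standard input and print every
# published runtime environment tag that contains "MPI", one per line.
list_mpi_tags() {
    sed -n 's/^GlueHostApplicationSoftwareRunTimeEnvironment: \(.*MPI.*\)$/\1/p'
}
```

If MPI-Ethernet does not appear in the output, CE_RUNTIMEENV is probably worth re-checking before re-running yaim.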

At this point you are ready to run yaim. To update an already configured CREAM-CE, run something like:

/opt/glite/yaim/bin/ig_yaim -r -s ig-site-info.def -n ig_CREAM_torque -f config_mpi_ce -f config_cream_gip

(note that both functions need to be executed within the same ig_yaim call) whereas if you are installing a CREAM-CE from scratch you should run something like:

/opt/glite/yaim/bin/ig_yaim -c -s ig-site-info.def -n ig_CREAM_torque

as you would normally do. For EGEE-like installations, remember to add -n MPI_CE as the first -n option: the order of the -n switches is important!

For a WN, the easiest approach is probably to re-run the full configuration:

/opt/glite/yaim/bin/ig_yaim -c -s ig-site-info.def -n ig_WN_torque_noafs

or, if you are configuring "a la EGEE":

/opt/glite/yaim/bin/yaim -c -s site-info.def -n MPI_WN -n glite-WN  -n TORQUE_client

(the order of the -n switches is important!)

After configuring using yaim

Make sure the Maui scheduler is running. In my experience, MPI jobs are correctly queued to Torque, but they sit in the queue forever, in pending state, unless:

  • the Maui scheduler is running,
       service maui restart
       chkconfig maui on
       
  • the following line is present in the Maui configuration file /var/spool/maui/maui.cfg
       ENABLEMULTIREQJOBS TRUE
       

Note that, for reasons I do not understand, Maui needs to be running even if you configured Torque as your scheduler, namely even if you have set server scheduling = True in qmgr.
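
The second condition can be checked with a short sketch like the one below. The function name maui_multireq_enabled is hypothetical, and the match is deliberately strict: it assumes the line is written exactly as shown above, uncommented and in upper case.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given Maui configuration file
# contains an uncommented "ENABLEMULTIREQJOBS TRUE" line.
# The default path is the one quoted in this page.
maui_multireq_enabled() {
    cfg="${1:-/var/spool/maui/maui.cfg}"
    grep -Eq '^[[:space:]]*ENABLEMULTIREQJOBS[[:space:]]+TRUE[[:space:]]*$' "$cfg"
}
```

A typical use would be: maui_multireq_enabled || echo "ENABLEMULTIREQJOBS TRUE is missing from maui.cfg".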

Updating MPI packages to ensure correct allocation of processors

The Torque/Maui packages coming by default with gLite3.2, as of 2011-07-29, do not properly handle the allocation of processors to parallel jobs. This (and something more) is described in the following presentation:

After some investigation, and with the help of our Syrian colleagues who patiently tested a number of attempts at their site, we have come up with a set of updated RPMs for Torque and Maui. They are available as compressed archives (tar.gz) among the attachments at the bottom of this page (maui-3.3.1-1.tar.gz and torque-2.5.7-1.tar.gz). Make sure you install only the smallest set of RPMs needed on your CE/WNs (for example, there is no need for maui or torque_server on WNs).

After having installed the above packages, remember to:
  • on WNs, remove /var/spool/pbs/mom_priv/config, otherwise yaim will leave the default (wrong) settings in it
  • re-run yaim

Colleagues at Roma3 have also found it necessary to replace the file /opt/i2g/etc/mpi-start/mpich2.mpi with the one attached to this page (mpich2.mpi).

Actually, I managed to run MPI jobs by making a couple of small changes to the above mpich2.mpi:

   [root@wn03 eumed019]# diff /opt/i2g/etc/mpi-start/mpich2.mpi.bitelli /opt/i2g/etc/mpi-start/mpich2.mpi
   9c9
   < ##export COMM="--comm=pmi"
   ---
   > export COMM="--comm=pmi"
   45c45
   <             export I2G_MACHINEFILE_AND_NP="-machinefile $MPI_START_MACHINEFILE -np $I2G_MPI_NP"
   ---
   > #FG#            export I2G_MACHINEFILE_AND_NP="-machinefile $MPI_START_MACHINEFILE -np $I2G_MPI_NP"
and adding the following line to my .jdl file:
   Environment = { "I2G_MPI_FILE_DIST=ssh" };

Any volunteer to update this wiki with the ultimate (working) recipe?

Moreover, since Torque does not clearly distinguish between CPUs, cores and job slots, you will need to "force" Torque to believe it has a number of nodes at least equal to the maximum size of the parallel jobs you want to run. For example, if you have 2 machines with 8 cores each, you will need to tell Torque it can count on 16 nodes (2 times 8):

  qmgr -c 'set server resources_available.nodect=16'
This way the job will be accepted by Torque, and Maui will later optimize by translating the request into one for 16 job slots rather than 16 real nodes.
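
The arithmetic above can be wrapped in a small sketch that prints the qmgr command for a given cluster size. The function name nodect_command is hypothetical, and the sketch assumes that all machines have the same number of cores; it only prints the command, so you can inspect it before running it on the Torque server.

```shell
#!/bin/sh
# Hypothetical helper: given the number of machines and the number of
# cores per machine (assumed identical on all machines), print the qmgr
# command that sets resources_available.nodect to their product.
nodect_command() {
    total=$(($1 * $2))
    printf "qmgr -c 'set server resources_available.nodect=%s'\n" "$total"
}
```

For the example above, nodect_command 2 8 prints the same command shown earlier, with nodect=16.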

Testing support for MPI jobs

Based on examples found on a number of web pages (see for example this page or this other one), I came up with a set of files which may be used to test MPI support. To ease debugging, I recently added the feature that each node prints out its own name.

Please remember to modify the line for OutputSandboxURI in the .jdl file before submitting the job.
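
Since each node prints its own name, the job output can be used to see how processes were spread over the worker nodes. The sketch below is only an illustration: it assumes, hypothetically, that every process writes one line whose last word is the host name (for example "Hello from process 3 on wn02.my.full.domain"), which may not match the actual format of the test files; the function name procs_per_node is also hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count how many output lines end with each host name.
# Assumes one line per MPI process, with the host name as the last word.
# Reads the files given as arguments, or standard input if there are none.
procs_per_node() {
    awk 'NF { count[$NF]++ } END { for (h in count) print h, count[h] }' "$@" | sort
}
```

With 2 machines of 8 cores each and a 16-process job, one would hope to see two lines, each reporting 8 processes.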

-- FulvioGaleazzi - 2010-11-26

Topic attachments
  • MPI_SITE_IN_ROMA3.pptx (123.5 K, 2011-05-10 10:48, FedericoBitelli)
  • maui-3.3.1-1.tar.gz (2931.1 K, 2011-07-29 09:01, FulvioGaleazzi): Updated MAUI packages to address problem of CPU assignment to parallel jobs
  • mpich2.mpi (4.0 K, 2011-06-01 10:47, FedericoBitelli)
  • torque-2.5.7-1.tar.gz (7095.6 K, 2011-07-29 09:02, FulvioGaleazzi): Updated Torque packages to address problem of CPU assignment to parallel jobs
Topic revision: r17 - 2011-09-16 - 15:53:58 - FulvioGaleazzi
 