Sites installation, configuration and testing for parallel jobs
Within EuMedGRID, it is strongly advised that sites support parallel (MPI) jobs.
This page describes the steps needed to enable MPI support at a site. It also provides instructions for testing MPI support by submitting a test job.
We will assume in the following that you are about to install a CREAM CE according to InfnGrid. If this is not the case, and you are installing your CREAM CE pointing to EGEE repositories, please follow the relevant EGEE documentation, namely
this chapter and the MPI-specific link contained therein, and
this other page.
Configuring support for MPI jobs
A lot of related material exists on the web; see, for example, the links at the bottom of the
Parallel Application Support page on the EuMedGRID application site. Note, however, that a large fraction of that documentation may be obsolete, since it refers to gLite3.1.
To be specific, it looks to me that the MPICH and LAM flavors are now obsolete, and that the only MPI flavors currently supported are MPICH2 and OPENMPI.
Adding relevant packages
Assuming you are pointing to InfnGrid repositories, make sure you have the files
/etc/yum.repos.d/ig.repo and /etc/yum.repos.d/glite-generic.repo on your CE/WN (please refer to the EuMedGridInstallation page and links therein, for details as to where to find such files). Run the following command:
yum install ig_MPI_utils
The above command will install all required packages, like
mpich2,
mpi-start and the example files which you will need to customize before running
ig_yaim.
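Before moving on, it may be worth checking that both repository files are in place and that the packages were actually installed. The following is only a minimal sketch, not part of the InfnGrid tooling: the function names check_repos and check_packages are made up for illustration, and the file and package names are simply those mentioned above.

```shell
#!/bin/sh
# check_repos [DIR]: succeed only if both repository files exist in DIR
# (DIR defaults to /etc/yum.repos.d).
check_repos() {
    dir=${1:-/etc/yum.repos.d}
    status=0
    for f in ig.repo glite-generic.repo; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return $status
}

# check_packages PKG...: report every package that rpm does not know about.
check_packages() {
    status=0
    for p in "$@"; do
        if ! rpm -q "$p" >/dev/null 2>&1; then
            echo "not installed: $p" >&2
            status=1
        fi
    done
    return $status
}

# Possible usage on a CE or WN:
#   check_repos && check_packages mpich2 mpi-start
```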
Configuring using yaim
Copy all example MPI-related yaim files to your
siteinfo directory (we'll assume it is
/root/siteinfo/).
cp /opt/glite/yaim/examples/siteinfo/services/*mpi* /root/siteinfo/services/
If you have a mixed environment, namely CREAM-CE configured according to InfnGrid and WNs configured according to EGEE, to avoid getting lost between
yaim and
ig_yaim configuration files, I advise you do the following:
cd /root/siteinfo/services/
ln -sf ig-mpi_wn glite-mpi_wn
ln -sf ig-mpi_ce glite-mpi_ce
ln -sf ig-mpi glite-mpi
Now you are ready to set up your environment:
- make sure you have set
ENABLE_MPI = yes in your ig-site-info.def file
- make sure you remove any MPI-related string from CE_RUNTIMEENV: all needed strings will be added automatically after running
yaim
- make sure you have set
CE_BATCH_SYS = pbs rather than torque in your ig-site-info.def file
- edit file
services/ig-mpi and set MPI_MPICH_MPIEXEC so as to overcome a bug in the yaim scripts:
# MPI_MPICH_MPIEXEC is needed just to fix the buggy yaim scripts, set it to $MPI_MPICH2_MPIEXEC
MPI_MPICH_MPIEXEC="/usr/bin/mpiexec"
- edit file
services/ig-mpi and set MPI_<flavor>_ENABLE, MPI_<flavor>_PATH, MPI_<flavor>_VERSION and MPI_<flavor>_MPIEXEC for each MPI flavor you intend to support. For example:
MPI_MPICH2_ENABLE="yes"
MPI_OPENMPI_ENABLE="yes"
MPI_MPICH2_PATH="/usr"
MPI_MPICH2_VERSION="1.1.1p1"
MPI_OPENMPI_PATH="/usr/lib64/openmpi/1.3.2-gcc"
MPI_OPENMPI_VERSION="1.3.2p2"
MPI_MPICH2_MPIEXEC="/usr/bin/mpiexec"
MPI_OPENMPI_MPIEXEC="/usr/lib64/openmpi/1.3.2-gcc/bin/mpiexec"
If you want to use a different path for mpich2, see also the
slides about the mpi-start problem.
Needless to say, verify that
yum installed exactly the above versions of the programs, and that the above paths are correct.
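The path check can be automated with a few lines of shell. This is a sketch under some assumptions: the function name check_mpi_conf is hypothetical, the file is assumed to be plain shell assignments like the example above, and only the two flavors discussed on this page are examined. It does not compare version strings, which still need to be checked by hand (for instance with rpm -q).

```shell
#!/bin/sh
# check_mpi_conf FILE: source a yaim-style MPI settings file in a subshell and
# verify that, for every enabled flavor, the PATH directory exists and the
# MPIEXEC binary is executable. Exit status: 0 = ok, 1 = problem found,
# 2 = file could not be read.
check_mpi_conf() {
    (
        . "$1" || exit 2
        status=0
        for flavor in MPICH2 OPENMPI; do
            eval "enabled=\${MPI_${flavor}_ENABLE:-no}"
            [ "$enabled" = "yes" ] || continue
            eval "path=\${MPI_${flavor}_PATH:-}"
            eval "mpiexec=\${MPI_${flavor}_MPIEXEC:-}"
            if [ ! -d "$path" ]; then
                echo "$flavor: no such directory: $path" >&2
                status=1
            fi
            if [ ! -x "$mpiexec" ]; then
                echo "$flavor: not executable: $mpiexec" >&2
                status=1
            fi
        done
        exit $status
    )
}

# Possible usage:
#   check_mpi_conf /root/siteinfo/services/ig-mpi
```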
More complete information about how to set MPI-related variables can be found at
the official YAIM page or
the MPI-specific YAIM page. Note that your site should advertise any special interconnect between nodes, if available, in the form
MPI-<Interconnect> where
<Interconnect> might be Ethernet (which is the default), Infiniband, Myrinet or SCI. This is done by appending a line to
CE_RUNTIMEENV like:
MPI-Ethernet
which will eventually be translated by
yaim to the publishing of
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-Ethernet
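Once yaim has been run, the published tags can be inspected by filtering the LDIF returned by the information system. The filter below is a small sketch: the function name mpi_tags is made up, and the host name, port and search base shown in the usage comment are assumptions (the usual resource BDII defaults) which may differ at your site.

```shell
#!/bin/sh
# mpi_tags: read LDIF on stdin and print every runtime environment tag that
# contains the string "MPI", one per line, sorted and de-duplicated.
mpi_tags() {
    sed -n 's/^GlueHostApplicationSoftwareRunTimeEnvironment: *\(.*MPI.*\)$/\1/p' | sort -u
}

# Possible usage (ce.example.org is a placeholder for your CE):
#   ldapsearch -x -LLL -H ldap://ce.example.org:2170 \
#       -b mds-vo-name=resource,o=grid \
#       GlueHostApplicationSoftwareRunTimeEnvironment | mpi_tags
```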
At this point you are ready to run yaim. To update an already configured CREAM-CE, run something like:
/opt/glite/yaim/bin/ig_yaim -r -s ig-site-info.def -n ig_CREAM_torque -f config_mpi_ce -f config_cream_gip
(note that both functions need to be executed within the same
ig_yaim call) whereas if you are installing a CREAM-CE from scratch you should run something like:
/opt/glite/yaim/bin/ig_yaim -c -s ig-site-info.def -n ig_CREAM_torque
as you would normally do. For EGEE-like installations, remember to add
-n MPI_CE as the first
-n option: the order of the
-n switches
is important!
For a WN, the easiest approach is probably to re-run the full configuration:
/opt/glite/yaim/bin/ig_yaim -c -s ig-site-info.def -n ig_WN_torque_noafs
or, if you are configuring "a la EGEE":
/opt/glite/yaim/bin/yaim -c -s site-info.def -n MPI_WN -n glite-WN -n TORQUE_client
(the order of the
-n switches
is important!)
After configuring using yaim
Make sure the
maui scheduler is running. In my experience, MPI jobs are correctly queued to Torque, but they sit in the queue forever, in pending state, unless Maui is running.
Note that, for some reason which goes beyond my understanding, Maui needs to be started even if you configured Torque as your scheduler, namely even if you have
set server scheduling = True in
qmgr.
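Both conditions can be checked from the command line on the CE. The helpers below are only a sketch: the function names are hypothetical, and the process name maui is an assumption about how the daemon shows up on your system.

```shell
#!/bin/sh
# scheduling_enabled: read the output of `qmgr -c 'print server'` on stdin and
# succeed if the server has scheduling switched on.
scheduling_enabled() {
    grep -Eq '^set server scheduling = [Tt]rue[[:space:]]*$'
}

# maui_running: succeed if a process named exactly "maui" exists.
maui_running() {
    pgrep -x maui >/dev/null 2>&1
}

# Possible usage on the CE:
#   qmgr -c 'print server' | scheduling_enabled || echo "Torque scheduling is off"
#   maui_running || echo "maui is not running: MPI jobs will stay queued"
```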
Updating MPI packages to ensure correct allocation of processors
The Torque/Maui packages coming by default with gLite3.2, as of 2011-07-29, do not properly handle the allocation of processors to parallel jobs. This (and something more) is described in the following presentation:
After some investigation, and with the help of our Syrian colleagues who patiently tested a number of attempts at their site, we have come up with a set of updated RPMs for Torque and Maui. You can download them at the link below, as a compressed archive (tar.gz). Make sure you install only the smallest necessary set of RPMs on your CE/WNs (for example, WNs need neither maui nor torque_server):
After installing the above packages, remember to:
- on WN, remove
/var/spool/pbs/mom_priv/config otherwise yaim will leave the default (wrong) settings in it
- re-run
yaim
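A slightly more cautious variant of the first step is to move the file aside rather than delete it, so that the old settings can still be compared with what yaim regenerates. This is a sketch of that variation, not the procedure described above; the function name reset_mom_config and the .bak suffix are arbitrary choices.

```shell
#!/bin/sh
# reset_mom_config [FILE]: move FILE to FILE.bak so that the next yaim run
# writes a fresh one. Does nothing (and succeeds) if FILE is absent.
reset_mom_config() {
    cfg=${1:-/var/spool/pbs/mom_priv/config}
    if [ -f "$cfg" ]; then
        mv -f "$cfg" "$cfg.bak"
    fi
}

# Possible usage on a WN, followed by the usual WN configuration command:
#   reset_mom_config
#   /opt/glite/yaim/bin/ig_yaim -c -s ig-site-info.def -n ig_WN_torque_noafs
```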
Colleagues at Roma3 have also found it necessary to replace the file
/opt/i2g/etc/mpi-start/mpich2.mpi with the following one:
Actually, I managed to run MPI jobs by making a couple of small changes to the above
mpich2.mpi:
[root@wn03 eumed019]# diff /opt/i2g/etc/mpi-start/mpich2.mpi.bitelli /opt/i2g/etc/mpi-start/mpich2.mpi
9c9
< ##export COMM="--comm=pmi"
---
> export COMM="--comm=pmi"
45c45
< export I2G_MACHINEFILE_AND_NP="-machinefile $MPI_START_MACHINEFILE -np $I2G_MPI_NP"
---
> #FG# export I2G_MACHINEFILE_AND_NP="-machinefile $MPI_START_MACHINEFILE -np $I2G_MPI_NP"
and adding the following line to my
.jdl file
Environment = { "I2G_MPI_FILE_DIST=ssh" };
Any volunteer to update this wiki with the ultimate (working) recipe?
Moreover, given that Torque does not clearly distinguish between CPUs, cores and job slots, you will need to "force" Torque to believe it has a number of nodes at least equal to the maximum size of the parallel jobs you want to run. For example, if you have 2 machines with 8 cores each, you will need to tell Torque it can count on 16 nodes (2 times 8):
qmgr -c 'set server resources_available.nodect=16'
This way the job will be accepted by Torque, and Maui will later optimize by translating the request into one for 16 job slots rather than 16 real nodes.
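Rather than working out the number by hand, it can be derived from the Torque nodes file by summing the np= values. The snippet below is a sketch: the function name total_slots is made up, and the default path /var/spool/pbs/server_priv/nodes is an assumption based on the spool directory used elsewhere on this page.

```shell
#!/bin/sh
# total_slots [FILE]: sum the np=<N> values of a Torque nodes file. A node
# listed without an explicit np= counts as 1. Blank lines and comment lines
# are skipped.
total_slots() {
    awk '
        /^[ \t]*(#|$)/ { next }
        {
            np = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) np = substr($i, 4) + 0
            total += np
        }
        END { print total + 0 }
    ' "${1:-/var/spool/pbs/server_priv/nodes}"
}

# Possible usage on the CE:
#   qmgr -c "set server resources_available.nodect=$(total_slots)"
```

With two nodes declared as np=8 each, the function prints 16, which matches the example above.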
Testing support for MPI jobs
Based on examples found on a number of web pages (see for example
this page or
this other one), I came up with a set of files which may be used to test MPI support. To ease debugging, I recently added the feature that each node prints out its own name.
Please remember to modify the line for
OutputSandboxURI in the
.jdl file before submitting the job.
--
FulvioGaleazzi - 2010-11-26