The EumedGrid Support Project recently introduced some GPU nodes at one of its sites.

Sites that have such devices and want to share these resources on the grid should follow the steps suggested below.

We assume that the node is correctly installed, that the driver for the NVIDIA GPU is correctly installed, and that the CUDA software is working properly (we have installed the CUDA Toolkit package: http://developer.nvidia.com/cuda-downloads).
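As a quick sanity check, the driver and the CUDA installation can be verified directly on the node, for example along these lines (a minimal sketch; the deviceQuery sample location depends on where the CUDA Toolkit/SDK samples were installed):

# Check that the NVIDIA driver sees all the GPUs on board
nvidia-smi

# Build and run the deviceQuery sample shipped with the CUDA Toolkit/SDK
# (the path below is only an example, adjust it to the actual installation)
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery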

On the operating system of the GPU node, install the middleware as on a standard Worker Node.
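For example, on an EMI-based installation this typically means installing the Worker Node metapackage and configuring it with YAIM (only a sketch; the repository set-up, metapackage name and YAIM node type depend on the middleware release and batch-system integration used at the site):

# Install the Worker Node metapackage (name depends on the middleware release)
yum install emi-wn

# Configure it with YAIM using the site configuration
# (the node type may differ, e.g. on a Torque-integrated node)
/opt/glite/yaim/bin/yaim -c -s /root/site-info.def -n WN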

These instructions are for the MAUI and TORQUE packages, where GPUs are not supported as a native resource.

Create a dedicated queue in PBS:

create queue gpu

set queue gpu queue_type = Execution

set queue gpu max_user_queuable = 10

set queue gpu max_running = 3

set queue gpu acl_hosts = wn-01-01-gpu.cluster.roma3

set queue gpu resources_max.cput = 72:00:00

set queue gpu resources_max.walltime = 90:00:00

set queue gpu resources_default.neednodes = gpu

set queue gpu max_user_run = 3

(We are assuming that the node has 3 GPUs on board and a total of 24 cores.)
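The queue definition above can be loaded into the Torque server with qmgr, either by feeding it a file or one directive at a time (the file name gpu-queue.qmgr is chosen here only for illustration):

# Apply the whole queue definition in one shot
qmgr < gpu-queue.qmgr

# Or issue single directives from the command line, e.g.
qmgr -c "create queue gpu"
qmgr -c "set queue gpu queue_type = Execution"

# Remember to enable and start the queue so that it accepts and runs jobs
qmgr -c "set queue gpu enabled = True"
qmgr -c "set queue gpu started = True"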

To optimize the usage of all the 24 cores available, the server hosting the GPUs (wn-01-01-gpu) has been configured in the Torque manager with the properties gpu and lcgpro.

create node wn-01-01-gpu.cluster.roma3

set node wn-01-01-gpu.cluster.roma3 np = 24

set node wn-01-01-gpu.cluster.roma3 properties = gpu

set node wn-01-01-gpu.cluster.roma3 properties += lcgpro
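Once the node has been defined, its configuration can be verified with pbsnodes (the output below is only indicative):

# Verify that Torque sees the node with the expected slots and properties
pbsnodes wn-01-01-gpu.cluster.roma3
# The output should report np = 24 and properties = gpu,lcgpro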

To prevent standard jobs from occupying all 24 cores, we also configured the scheduler (MAUI) with a "standing reservation", so that 6 cores are always reserved for jobs submitted to the gpu queue.

SRCFG[gpu] HOSTLIST=wn-01-01-gpu

SRCFG[gpu] PERIOD=INFINITY

SRCFG[gpu] TASKCOUNT=3

SRCFG[gpu] RESOURCES=PROCS:6

SRCFG[gpu] CLASSLIST=gpu
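These lines go into maui.cfg; after restarting MAUI the reservation can be inspected, for instance (a sketch, assuming MAUI was installed with its usual init script and client tools):

# Restart MAUI so that the new standing reservation is created
service maui restart

# List the active reservations: an entry named after the gpu standing reservation should appear
showres

# Show which node and processors the reservation is holding
showres -n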

After these configurations, all "local users" of the cluster are able to submit jobs to the GPU subsystem with a simple command:

qsub -q gpu
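For instance, a local test job could look like the following sketch (script name, resource requests and payload are purely illustrative):

#!/bin/bash
#PBS -q gpu
#PBS -l nodes=1:gpu
#PBS -l walltime=01:00:00

# Illustrative payload: report the GPUs visible to the job
nvidia-smi

to be submitted with "qsub test-gpu.sh".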

It was also decided to offer the GPUs to grid users.

The CREAM CE allows requirements to be forwarded to the batch system via the BLAH component.

The CERequirements expression received by the CE is forwarded (via the WMS) to BLAH, which handles it by setting some environment variables; these are available to, and can be used by, the $GLITE_LOCATION/bin/pbs_local_submit_attributes.sh script. Whatever this script prints is automatically added to the submit command file.

So we modified that file according to the following script:

# If the job carries software runtime requirements...
if [ -n "$GlueHostApplicationSoftwareRuntimeEnvironment" ]; then
    # ...check whether any of the requested tags ends with "-GPU" (case insensitive)
    CHECKGPUTAG=`echo $GlueHostApplicationSoftwareRuntimeEnvironment | grep -i "\-GPU"`
    if [ -n "$CHECKGPUTAG" ]; then
        # Route the job to the dedicated gpu queue
        echo "#PBS -q gpu"
    fi
fi

With this script, each job whose Requirements contain an expression ending with the string -GPU (case insensitive) is routed to the gpu queue.
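The hook can also be tested by hand on the CE, exporting the variable that BLAH would set and running the script directly (a minimal sketch; the actual value set by BLAH may contain more tags, here only an example one is used):

# Simulate a job requesting a GPU tag and run the hook manually
export GlueHostApplicationSoftwareRuntimeEnvironment="VO-eumed-GPU"
sh $GLITE_LOCATION/bin/pbs_local_submit_attributes.sh
# Expected output: #PBS -q gpu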

Note that the JDL file must contain a line like:

Requirements = Member("<anything>-GPU", other.GlueHostApplicationSoftwareRuntimeEnvironment);

To complete the configuration, some tags have been published for the VO eumed:

  • VO-eumed-GPU
  • VO-eumed-prod-gromacs-v040503-SL5-x86_64-gpu
  • VO-eumed-prod-namd-v020802-SL5-x86_64-gpu

In this way a member of the VO eumed can look for tags containing the gpu string; if he uses one of these tags in his job (as described above), the job is routed via the WMS to the right Computing Element (our CE) and, once it arrives at the CE, it is addressed (via BLAH) to the gpu queue.
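For instance, the Computing Elements publishing the generic GPU tag can be listed from a User Interface with lcg-info (a sketch; the exact information-system client available depends on the UI installation):

# List the CEs that publish the generic GPU tag for the eumed VO
lcg-info --vo eumed --list-ce --query 'Tag=VO-eumed-GPU'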

Note that any GPU application can be run using the generic tag VO-eumed-GPU,

so, for example, a typical JDL could be:

Type = "Job";

JobType = "Normal";

Executable = "launchprog.sh";

StdOutput = "data.dat";

StdError = "data.err";

InputSandbox = {"launchprog.sh","prog"};

OutputSandbox = {"data.dat","data.err"};

Requirements = Member("VO-eumed-GPU", other.GlueHostApplicationSoftwareRuntimeEnvironment);


-- FedericoBitelli - 2011-10-04
