Sites installation and setup tips: configuration tuning

Possible service-specific issues during configuration

Note on updating services

Before updating services, please make sure:

  • you have declared a downtime for your site
  • you have carefully read the official documentation, for example the EGEE gLite updates page. In particular, gLite3.2 updates 15, 16 and 17 need a few workarounds in order to solve package dependency problems.

Generic issues

Note that LDAP is extremely picky about syntax errors. If you have mistakes or typos in your site-info.def, they will most probably result in parts of the LDAP tree not being published. For example, I saw a case where quite a lot of information about a CE was not published due to a typo: I had CE_SF00 = 150O (note that the last character is the letter 'O' rather than the digit zero) whereas the correct string would have read CE_SF00 = 1500. If you suspect this kind of problem, do the following:

  • stop bdii
  • edit file /opt/bdii/etc/bdii.conf and set BDII_LOG_LEVEL = DEBUG
  • start bdii and check the logfile /var/log/bdii/bdii-update.log: it should not print anything abnormal, at least until the first occurrence of the Sleeping for 120 seconds line
  • when done, remember to change BDII_LOG_LEVEL back to ERROR
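
The checks above can be partly scripted. The following is only a sketch: the function names are made up for illustration, and the typo search is a heuristic that catches values like 150O but certainly not every possible mistake.

```shell
# Print the BDII update log up to and including the first "Sleeping for" line.
# The default path is the one quoted above; pass another file as $1 if needed.
bdii_first_cycle() {
    sed '/Sleeping for/q' "${1:-/var/log/bdii/bdii-update.log}"
}

# Heuristic: list lines of a site-info.def style file whose value starts with
# digits but then contains a look-alike letter (O, o, l, I), e.g. 150O.
suspicious_numbers() {
    grep -nE '=[[:space:]]*"?[0-9]+[OolI]+[0-9OolI]*"?[[:space:]]*$' "$1"
}
```

Running suspicious_numbers site-info.def before yaim, and bdii_first_cycle after starting bdii with the DEBUG level, should make it easier to spot anything abnormal.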

BDII installation

As stated above, starting from August 2010 it is no longer possible for the site BDII and the CE to share the same hostname. You can of course create a virtual machine on your CE, and configure the virtual machine as the site BDII. Note that, in this case, you will need to define a new IP name/number in your DNS for your site BDII.

Note that if you decide to configure the site BDII using ig_yaim, you need to remove the file /services/glite-bdii_site; otherwise your modifications, which I assume are contained in /services/ig-bdii_site, will not be taken into account.

DPM SE installation

At the time of writing (19-07-2010) there may be problems with configuring the DPM: in my tests I observed that the LDAP description is sometimes incomplete, resulting in the SE not appearing in lcg-infosites. You can check whether your SE is affected by looking at the value of GlueSEName: if it looks like the following

[root@gridsrv3-4 siteinfo]# ldapsearch -x -H ldap://<your_SE_name>:2170 -b mds-vo-name=resource,o=grid | grep GlueSEName
GlueSEName: <your_site_name> DPM server

you will have to follow the procedure described below.
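
If you have several SEs to check, the comparison can be scripted. This is a minimal sketch: classify_se_name is a hypothetical helper name, and it only knows the two forms of GlueSEName mentioned in this section (a value ending in "DPM server" for an affected SE, a value ending in ":srm" for a fixed one).

```shell
# Read ldapsearch output on stdin and classify the first GlueSEName value.
# Prints "affected" for "... DPM server", "ok" for "...:srm", else "unknown".
classify_se_name() {
    name=$(grep '^GlueSEName:' | head -n 1)
    case "$name" in
        *"DPM server") echo affected ;;
        *:srm)         echo ok ;;
        *)             echo unknown ;;
    esac
}
# Usage (same host placeholder as in the command above):
# ldapsearch -x -H ldap://<your_SE_name>:2170 -b mds-vo-name=resource,o=grid | classify_se_name
```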

I have opened a Savannah ticket for this. I suspect this problem is introduced by some new package like perl-dpm, which supersedes the packages DPM-interfaces and DPM-interfaces2.

UPDATE as of 2010-07-21: the problem is actually due to lcg-infosites, see bug 57787. A fix is ready and will hopefully go into production soon. However, the following recipe will result in LDAP publishing additional information, so I believe you'd better apply it.

UPDATE as of 2010-07-23: partial correction to the above... The fix will make lcg-infosites happy but cause Gstat to mildly complain. You thus have two options:

  • ask users to add VO: when running lcg-infosites or lcg-info (e.g., lcg-infosites --vo VO:eumed se)
  • apply the following fix, and un-apply it as soon as lcg-infosites gets officially patched

UPDATE as of 2010-08-25: fixed by gLite3.2 update 16; see the EGEE gLite updates page for further details. Applying the fix described below should no longer be needed.

After some investigation I found the following workaround will make both LDAP and lcg-infosites happy:

  • open file /opt/glite/yaim/functions/config_gip_dpm, find the line where dpm-listspaces is called and add the --legacy switch: in the end, the line should read more or less like
       /opt/lcg/bin/dpm-listspaces --legacy --gip --protocols --basedir <some_directory> --site <your_site_name>
  • run yaim (or ig_yaim, according to what you are using): you can run just the single function config_gip_dpm like the following
       ig_yaim -r -s ig-site-info.def -n ig_SE_dpm_mysql -f config_gip_dpm
  • check with ldapsearch that the value for GlueSEName is set similar to:
       [root@gridsrv3-4 siteinfo]# ldapsearch -x -H ldap:// -b mds-vo-name=resource,o=grid  | grep GlueSEName
       GlueSEName: GARR-01-DIR:srm
    (notice the :srm whereas it was DPM server earlier)
  • if your SE is still not shown by lcg-infosites, try stopping the bdii service, waiting 5 minutes and then starting it again (do NOT run service bdii restart, but rather stop, wait some time and then start)
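
The first and last steps of the workaround can be sketched as shell functions. Both names, as well as the BDII_SERVICE_CMD variable, are hypothetical and only serve the illustration. The first function does not touch the yaim file, it only prints what the edited file would look like so that you can review it before editing by hand.

```shell
# Print FILE with --legacy added to every dpm-listspaces call that lacks it.
# Nothing is modified: compare the output with the original before editing.
preview_legacy_switch() {
    sed '/dpm-listspaces/{
/--legacy/!s/dpm-listspaces/& --legacy/
}' "$1"
}

# Stop bdii, wait, then start it again (instead of "service bdii restart").
# $1 = seconds to wait (default 300, i.e. the 5 minutes suggested above).
# BDII_SERVICE_CMD is a hypothetical override hook, defaulting to "service".
bdii_cold_restart() {
    ${BDII_SERVICE_CMD:-service} bdii stop
    sleep "${1:-300}"
    ${BDII_SERVICE_CMD:-service} bdii start
}
```

For example, preview_legacy_switch /opt/glite/yaim/functions/config_gip_dpm | diff /opt/glite/yaim/functions/config_gip_dpm - shows exactly which line would change.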

CREAM-CE installation

If you followed the "generic" instructions, you should complete yaim configuration without any error. Make sure that:

  • tomcat is running
  • there are no broken links under either /usr/share/tomcat5/common/lib/ or /usr/share/tomcat5/server/lib/, apart from 3 links for jakarta-commons-logging (this will be fixed in a future gLite release)
  • you do not set up the blparser separately after running yaim: this has not been needed for quite some time. Indeed, while yaim was running you may have noticed a message saying BLPARSER_WITH_UPDATER_NOTIFIER = true. Blparser configuration is not required
  • you follow the instructions to verify CREAM-CE functionality
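
One possible way to look for broken links is sketched below. The function name is made up; it relies only on standard find and test, and prints every symbolic link whose target does not exist.

```shell
# List symbolic links under the given directories whose target is missing.
list_broken_links() {
    for dir in "$@"; do
        find "$dir" -type l ! -exec test -e {} \; -print
    done
}
# Usage:
# list_broken_links /usr/share/tomcat5/common/lib /usr/share/tomcat5/server/lib
```

According to the note above, the only entries you should expect in the output are the 3 jakarta-commons-logging links.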

maxOutputSandboxSize error: this error showed up at all sites where the CE was configured to use the French language: the parameter for MaxOutputSandboxSize is passed by the WMS as 1.00E08 but is received by CREAM as 1,00E08 (note the ',' rather than the '.'). While I personally discourage this kind of "localization" for machines which act as servers, there is a way to survive without reinstalling the machine. This kind of problem should be handled automatically by the tomcat libraries, but for some reason it is not. CREAM developers will make their code a bit more robust. Pending this, the solution is:

  • edit file $CATALINA_HOME/conf/tomcat5.conf (should be /usr/share/tomcat5/conf/tomcat5.conf), and un-comment the LANG line. You should have something like
          # You can change your tomcat locale here
    and you need to change it to:
          # You can change your tomcat locale here
  • restart tomcat
          /etc/rc.d/init.d/tomcat5 restart

This change will affect all tomcat-based applications on your host: since, most probably, CREAM is the only such application there should not be any side-effect.
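
After editing, you may want to verify that the LANG line is really active. The sketch below assumes that tomcat5.conf uses shell-style LANG=value assignments (the exact content of the line is not reproduced here, so adapt the pattern if your file differs); has_active_lang is a hypothetical helper name.

```shell
# Succeed if FILE contains an uncommented LANG= assignment.
# Assumes shell-style "LANG=value" syntax, optionally preceded by "export".
has_active_lang() {
    grep -Eq '^[[:space:]]*(export[[:space:]]+)?LANG=' "$1"
}
# Usage:
# has_active_lang /usr/share/tomcat5/conf/tomcat5.conf && echo "LANG is set"
```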

Unable to publish tag on the CE: I have opened a bug about this problem, which seems to be a duplicate of this other bug. The fix should go in production "soon". The problem is that yaim creates a wrong directory tree, under /opt/glite rather than /opt/edg. The fix is

rmdir /opt/edg/var/info/
ln -s /opt/glite/var/info /opt/edg/var/info
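
If you need to apply this on several hosts, a slightly more defensive variant of the two commands may help. It is only a sketch and relink_info_dir is a made-up name: it does nothing when the link already exists, and rmdir refuses to remove a directory which is not empty, so nothing gets lost.

```shell
# Replace LINK (an empty directory) with a symlink to TARGET.
# $1 = target directory, $2 = link to create.
relink_info_dir() {
    target=$1 link=$2
    if [ -L "$link" ]; then
        echo "already a symlink: $link"
    elif [ -d "$link" ]; then
        rmdir "$link" && ln -s "$target" "$link"   # rmdir fails if not empty
    else
        ln -s "$target" "$link"
    fi
}
# Usage:
# relink_info_dir /opt/glite/var/info /opt/edg/var/info
```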

More generally speaking, this page reports up-to-date CREAM-CE specific issues and fixes/workarounds, so you may find it very interesting reading.

Post-configuration setup and tuning

After configuration you will also need (or want) to perform the following steps.

Software areas: ownership and privileges

In order to properly set up the software areas, you need to change permissions as suggested during yaim execution: the YAIM log is written to /opt/glite/yaim/log/yaimlog. For the eumed software area, you would need to run
   chown -R sgmeumed001:sgmeumed $VO_EUMED_SW_DIR
   chmod -R g+w $VO_EUMED_SW_DIR
and act similarly for the other VOs your site supports.
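
For sites supporting several VOs, the commands can be generated rather than typed. The sketch below only prints them, so that you can compare with what yaim suggested in its log before running anything. It assumes that every VO follows the same naming seen for eumed (account sgm<vo>001, group sgm<vo>, variable VO_<VO>_SW_DIR), which may well not hold for your VOs, in particular for VO names containing dots or dashes.

```shell
# Print (not run) the ownership/permission commands for each VO name given.
# Assumes the sgm<vo>001 / sgm<vo> / VO_<VO>_SW_DIR naming seen for eumed.
sw_area_commands() {
    for vo in "$@"; do
        upper=$(printf '%s' "$vo" | tr '[:lower:]' '[:upper:]')
        printf 'chown -R sgm%s001:sgm%s "$VO_%s_SW_DIR"\n' "$vo" "$vo" "$upper"
        printf 'chmod -R g+w "$VO_%s_SW_DIR"\n' "$upper"
    done
}
# Usage:
# sw_area_commands eumed
```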

SSH: negotiate simpler algorithm

Set up ssh so that it negotiates a simpler algorithm for encryption: consider putting these instructions in your system-wide ssh configuration file /etc/ssh/ssh_config

Torque/Maui setup

  • make sure the standard scheduler is not running. The CE installs Maui as your batch scheduler, so you need to switch off the built-in Torque one. Run qmgr and issue the command:
    set server scheduling = false
  • make sure the default queue points to an existing queue; if not, change it appropriately. Run qmgr and issue the command:
    set server default_queue = local
    where I assumed you also created a local queue as per the next point.
  • create other queues as needed, for your local users. For instance, you can run qmgr and issue the commands:
       create queue local
       s q local queue_type = Execution
       s q local resources_max.cput = 48:00:00
       s q local resources_max.walltime = 72:00:00
       s q local enabled=true
       s q local started=true 
  • instruct PBS/Torque to keep a record of recently completed jobs for a reasonable amount of time, to make CREAM-CE daemons happy. Do this by issuing the command:
     qmgr -c "set server keep_completed = 120"
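
The qmgr settings above can be collected in one place, for instance to re-apply them after a reinstallation. This is a sketch: torque_setup_commands is a hypothetical name, the limits are just the example values used above, and the queue is created before it is made the default. The function only prints the directives; qmgr reads directives from standard input, so the output can be piped into it once reviewed.

```shell
# Print the qmgr directives from the steps above for one local queue.
# $1 = queue name (default "local"); limits are the example values above.
torque_setup_commands() {
    q=${1:-local}
    cat <<EOF
set server scheduling = false
create queue $q
set queue $q queue_type = Execution
set queue $q resources_max.cput = 48:00:00
set queue $q resources_max.walltime = 72:00:00
set queue $q enabled = true
set queue $q started = true
set server default_queue = $q
set server keep_completed = 120
EOF
}
# Review the output first, then feed it to qmgr on the Torque server:
# torque_setup_commands local | qmgr
```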

VOMS certificates installation

Make sure package ig-vomscerts-all has been installed (it was not, in my test)

Possible LDAP problem with CREAM-CE

  • when installing a CREAM-CE not acting as site-BDII, I found it impossible to connect to its LDAP server. Symptom: netstat -anp | grep 2170 would show slapd running on The simple fix is:
    • make sure variable BDII_HOST is defined in /opt/bdii/etc/bdii.conf. In my case it was not present, and I set it to BDII_HOST = *
    • restart bdii: /etc/rc.d/init.d/bdii restart
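
A quick way to see whether BDII_HOST is defined at all is sketched below; bdii_host_setting is a made-up helper name, and it simply prints the matching line of the configuration file or the word "unset".

```shell
# Print the BDII_HOST setting from a bdii.conf style file, or "unset".
bdii_host_setting() {
    grep -E '^[[:space:]]*BDII_HOST[[:space:]]*=' "$1" || echo unset
}
# Usage:
# bdii_host_setting /opt/bdii/etc/bdii.conf
```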

Batch jobs using local disk as working area

You may have a huge local disk on your servers: it would be a pity not to use it, right? One solution could be to make a huge /tmp area and instruct your users about its existence, but a much more transparent way to accomplish the same result is to instruct the batch system such that GRID batch jobs will use it as their initial working directory. Note that the recipe below is not 100% tested; your suggestions and corrections are more than welcome.

There is, at present, no way to configure this behaviour from site-info.def. You may need to repeat the CE configuration part after each re-configuration of the CE.

On the WNs, do the following once and for all:

  • let's assume that the "directory with a lot of space" will be called /data on any WN
  • make sure anybody can create subdirs
        chmod 777 /data ; chmod +t /data
    and check with ls -ld /data
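
If you have many WNs, the step can be wrapped in a small function and run on each of them by whatever means you normally use. This is just a sketch with a made-up name; mode 1777 is equivalent to the two chmod calls above.

```shell
# Create DIR if needed and make it world-writable with the sticky bit set,
# equivalent to "chmod 777 DIR ; chmod +t DIR".
make_scratch_dir() {
    mkdir -p "$1" && chmod 1777 "$1"
}
# Usage:
# make_scratch_dir /data && ls -ld /data    # expect drwxrwxrwt
```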

On the CE:

  • edit file /opt/glite/etc/blah.config, and add/change the parameter blah_wn_temporary_home_dir so that it points to the directory chosen above
  • that's all: for each GRID job (not for jobs submitted locally using qsub!) a unique subdirectory under such path will be created at job start and removed at job completion.
Note that blah_wn_temporary_home_dir can contain any variable which exists in the generic user's environment, hence the following (may) work:

TopBDII : synchronizing configuration file

Until recently, it was possible to make topBDII point to an external configuration file by setting the BDII_HTTP_URL variable in site-info.def. At some point this behaviour was broken, and currently topBDII only uses, by default, the official EGEE/EGI sources.

To restore the previous behaviour, perform the following steps:

  • edit file /etc/glite/glite-info-update-endpoints.conf so that it reads more or less like the following (of course, change the value for manual_file as appropriate):
       EGI  = FALSE
       OSG = FALSE
       manual = True
       manual_file = /opt/glite/etc/eumed-bdii.conf
       output_file = /opt/glite/etc/gip/top-urls.conf 
       cache_dir = /var/cache/glite/glite-info-update-endpoints
  • get the script and place it under /usr/local/bin
  • create file /etc/cron.d/getTopBdiiConf with content similar to:
       35 */3 * * * root /usr/local/bin/ /opt/glite/etc/eumed-bdii.conf 2000 
    the first argument of the script is the URL from which to download the configuration file (the old BDII_HTTP_URL), the second argument is the destination file (this must be set to the same value as manual_file above), and the third argument is the minimum allowed length of the downloaded file (to avoid overwriting a working configuration with a file containing: HTTP error: 404, File not found)
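
To illustrate the logic just described, here is a sketch of what such a download script could look like. It is NOT the script referred to above: the function names are hypothetical, the use of curl is my assumption, and the three arguments simply follow the description (URL, destination file, minimum length in bytes).

```shell
# Install a downloaded file only if it is at least MIN bytes long.
# $1 = downloaded temp file, $2 = destination, $3 = minimum size in bytes.
install_if_long_enough() {
    size=$(wc -c < "$1" | tr -d ' ')
    if [ "$size" -ge "$3" ]; then
        # Copy the content, so an existing destination keeps its permissions.
        cat "$1" > "$2" && rm -f "$1"
    else
        echo "refusing to install: only $size bytes" >&2
        rm -f "$1"
        return 1
    fi
}

# $1 = URL, $2 = destination, $3 = minimum size; uses curl (an assumption).
fetch_top_bdii_conf() {
    tmpfile=$(mktemp) || return 1
    if curl -fsS -o "$tmpfile" "$1"; then
        install_if_long_enough "$tmpfile" "$2" "$3"
    else
        rm -f "$tmpfile"
        return 1
    fi
}
```

The size check is what protects a working configuration: a short error page is discarded and the previous destination file is left untouched.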

-- FulvioGaleazzi - 2010-11-17

Topic revision: r7 - 2011-12-20 - 15:59:37 - FulvioGaleazzi