Sites installation and setup tips: configuration tuning
Possible service-specific issues during configuration
Note on updating services
Before updating services, please make sure:
- you have declared a downtime for your site
- you have carefully read the official documentation, for example the EGEE gLite updates page: specifically, gLite3.2 updates 15, 16 and 17 need a little trickery to solve package dependency problems.
Generic issues
Note that LDAP is extremely picky about syntax errors. If you have mistakes or typos in your
site-info.def, they will most probably result in parts of the LDAP tree not being published. For example, I saw a case where a lot of information about a CE was not published because of a single typo: I had
CE_SF00 = 150O (note that the last character is the letter 'O' rather than the digit zero) whereas the correct string would have read
CE_SF00 = 1500. If you suspect this kind of problem, do the following:
- stop bdii
- edit file
/opt/bdii/etc/bdii.conf and set BDII_LOG_LEVEL = DEBUG
- start bdii and check the logfile
/var/log/bdii/bdii-update.log: it should not print anything abnormal, at least until the first occurrence of the Sleeping for 120 seconds line
- when done, remember to set the log level back to
BDII_LOG_LEVEL = ERROR
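The log check in the steps above can be partly scripted. What follows is a minimal sketch, not part of the BDII tooling: the helper name log_before_first_sleep is made up for illustration, and the only assumption about the log format is the "Sleeping for ... seconds" message quoted above.

```shell
# Hypothetical helper (not shipped with BDII): print every line of the
# update log that precedes the first "Sleeping for N seconds" message,
# so that only the first update cycle has to be read.
log_before_first_sleep() {
    # $1: path to the log file, e.g. /var/log/bdii/bdii-update.log
    awk '/Sleeping for [0-9]+ seconds/ { exit } { print }' "$1"
}
```

A possible use is log_before_first_sleep /var/log/bdii/bdii-update.log | less, reading the output for anything that looks abnormal.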
BDII installation
As stated above, starting from August 2010 it is no longer possible for the site BDII and the CE to share the same hostname. You can of course create a virtual machine on your CE, and configure the virtual machine as the site BDII. Note that, in this case, you will need to define a new hostname and IP address in your DNS for your site BDII.
Note that if you decide to configure the site BDII using
ig_yaim, you need to remove the file
/services/glite-bdii_site; otherwise your modifications, which I assume are contained in
/services/ig-bdii_site, will not be taken into account.
DPM SE installation
At the time of writing (2010-07-19) there may be problems with configuring the DPM: in my tests I observed that the LDAP description is sometimes incomplete, so that the SE does not appear in the output of
lcg-infosites. You can check whether your SE is affected by the problem by looking at the value of
GlueSEName: if it looks like the following
[root@gridsrv3-4 siteinfo]# ldapsearch -x -H ldap://<your_SE_name>:2170 -b mds-vo-name=resource,o=grid | grep GlueSEName
GlueSEName: <your_site_name> DPM server
you will have to follow the procedure described below.
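The check on GlueSEName can be wrapped in a small filter. This is only a sketch: the function name check_gluesename and the three labels it prints are invented here, and the only thing it relies on is the "DPM server" ending shown in the sample output above.

```shell
# Hypothetical helper: read ldapsearch output on standard input and report
# whether GlueSEName shows the incomplete form described above.
check_gluesename() {
    awk '
        /^GlueSEName:/ {
            found = 1
            if ($0 ~ /DPM server$/) affected = 1
        }
        END {
            if (!found)        print "missing"
            else if (affected) print "affected"
            else               print "ok"
        }'
}
```

It would be used by piping the ldapsearch command shown above into it, i.e. ldapsearch -x -H ldap://<your_SE_name>:2170 -b mds-vo-name=resource,o=grid | check_gluesename.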
I have opened a
Savannah ticket for this. I suspect this problem is introduced by some new package like
perl-dpm which supersedes the packages
DPM-interfaces and
DPM-interfaces2.
UPDATE as of 2010-07-21: the problem is actually due to
lcg-infosites, see
bug 57787. A fix is ready and will hopefully go into production soon. However, the recipe below makes LDAP publish
additional information, so I believe it is still worth applying.
UPDATE as of 2010-07-23: partial correction to the above... The fix will make
lcg-infosites happy but cause Gstat to mildly complain. You thus have two options:
- ask users to add the prefix
VO: when running lcg-infosites or lcg-info (e.g., lcg-infosites --vo VO:eumed se)
- apply the following fix, and revert it as soon as
lcg-infosites gets officially patched
UPDATE as of 2010-08-25: fixed by gLite3.2 update 16; see
the EGEE gLite updates page for further details. Applying the fix described below should no longer be needed.
After some investigation I found the following workaround will make both LDAP and lcg-infosites happy:
- open file
/opt/glite/yaim/functions/config_gip_dpm, find the line where dpm-listspaces is called and add the --legacy switch: in the end, the line should read more or less like
/opt/lcg/bin/dpm-listspaces --legacy --gip --protocols --basedir <some_directory> --site <your_site_name>
- run yaim (or ig_yaim, according to what you are using): you can run just the single function
config_gip_dpm like the following
ig_yaim -r -s ig-site-info.def -n ig_SE_dpm_mysql -f config_gip_dpm
- check with
ldapsearch that the value for GlueSEName is set similar to:
[root@gridsrv3-4 siteinfo]# ldapsearch -x -H ldap://gridsrv3-4.dir.garr.it:2170 -b mds-vo-name=resource,o=grid | grep GlueSEName
GlueSEName: GARR-01-DIR:srm
(notice the :srm suffix, whereas the value ended in DPM server earlier)
- if your SE is still not shown by
lcg-infosites, try stopping the bdii service, waiting 5 minutes and then starting it again (do NOT run service bdii restart, but rather stop, wait some time and then start)
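The stop, wait, start sequence from the last step can be written as a small function. This is a sketch only: the function name is invented, and it assumes that the bdii init script is reachable through the service command, as in the text above.

```shell
# Sketch: stop the BDII, wait, then start it again, instead of using
# "service bdii restart". The default wait of 300 seconds mirrors the
# "5 minutes" suggested above; pass another value as the first argument.
bdii_stop_wait_start() {
    wait_seconds=${1:-300}
    service bdii stop
    sleep "$wait_seconds"
    service bdii start
}
```

Calling bdii_stop_wait_start with no argument waits the full five minutes; bdii_stop_wait_start 600 would wait ten.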
CREAM-CE installation
If you followed the "generic" instructions, yaim configuration should complete without any error. Then check the following:
- tomcat is running
- there are no broken links under either
/usr/share/tomcat5/common/lib/ or /usr/share/tomcat5/server/lib/, apart from 3 links for jakarta-commons-logging (these will be fixed in a future gLite release)
- for quite a long time now, there has been no need to set up the blparser separately after running yaim. Indeed, while yaim was running you may have noticed a message saying
BLPARSER_WITH_UPDATER_NOTIFIER = true. Blparser configuration is not required
- follow the instructions to verify CREAM-CE functionality
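The broken-link check in the list above can be done with find. The sketch below uses an invented function name; it relies on the standard behaviour that, with -L, find follows symbolic links, so any entry still reported as a link is one whose target is missing.

```shell
# List symbolic links under a directory whose target does not exist.
# With -L, find follows links, so anything still of type "l" is broken.
list_broken_links() {
    # $1: directory to inspect
    find -L "$1" -type l
}
```

For example, running list_broken_links /usr/share/tomcat5/common/lib/ and list_broken_links /usr/share/tomcat5/server/lib/ should print nothing except the jakarta-commons-logging links mentioned above.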
maxOutputSandboxSize error: this error showed up at all sites where the CE was configured to use the French language: the value of
MaxOutputSandboxSize is passed by the WMS as 1.00E08 but is received by CREAM as 1,00E08 (note the ',' rather than the '.'). While I personally discourage this kind of "localization" on machines which act as servers, there is a way to survive without reinstalling the machine. This kind of problem should be handled automatically by the tomcat libraries, but for some reason it is not. The CREAM developers will make their code a bit more robust.
Pending this, the solution is:
- edit file
$CATALINA_HOME/conf/tomcat5.conf (should be /usr/share/tomcat5/conf/tomcat5.conf), and un-comment the LANG line. You should have something like
# You can change your tomcat locale here
#LANG=en_US
and you need to change it to:
# You can change your tomcat locale here
LANG=en_US
- restart tomcat
/etc/rc.d/init.d/tomcat5 restart
This change will affect all tomcat-based applications on your host: since CREAM is most probably the only such application, there should not be any side effects.
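The edit to tomcat5.conf can also be made non-interactively. The following is a sketch under the assumption that the file contains a commented-out LANG line like the one quoted above; the function name is invented for illustration.

```shell
# Sketch: un-comment the LANG line in a tomcat5.conf-style file, keeping a
# backup copy with a .bak suffix next to it.
uncomment_tomcat_lang() {
    # $1: path to the configuration file
    sed -i.bak 's/^#[[:space:]]*LANG=/LANG=/' "$1"
}
```

It could be called as uncomment_tomcat_lang /usr/share/tomcat5/conf/tomcat5.conf, followed by the tomcat restart shown above. Ordinary comment lines are left untouched because only lines starting with #LANG= are rewritten.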
Unable to publish tag on the CE: I have
opened a bug about this problem, which seems to be a duplicate of
this other bug. The fix should go into production "soon". The problem is that yaim creates the directory tree in the wrong place, under
/opt/glite rather than
/opt/edg. The fix is
rmdir /opt/edg/var/info/
ln -s /opt/glite/var/info /opt/edg/var/info
More generally speaking,
this page lists up-to-date CREAM-CE specific issues and fixes/workarounds, so you may find it well worth reading.
Post-configuration setup and tuning
After configuration you will also need (or want) to perform the following steps.
Software areas: ownership and privileges
In order to properly set up the software areas, you need to change permissions as suggested during
yaim execution: the YAIM log is appended to
/opt/glite/yaim/log/yaimlog. For the
eumed software area, you would need to run
chown -R sgmeumed001.sgmeumed $VO_EUMED_SW_DIR
chmod -R g+w $VO_EUMED_SW_DIR
and act similarly for the other VOs your site supports.
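For sites supporting several VOs, the two commands can be generated in a loop. The sketch below only prints the commands so they can be reviewed before being run. It ASSUMES that every VO follows the same pattern as eumed, i.e. an account sgm<vo>001 in group sgm<vo> and a software area stored in VO_<VO>_SW_DIR; check this against your own site-info.def. It only handles simple VO names made of letters and digits, and the function name is invented.

```shell
# Sketch: print (not run) the ownership and permission commands for a list
# of VOs. ASSUMPTION: VO <vo> has account sgm<vo>001 in group sgm<vo> and
# its software area is held in the variable VO_<VO>_SW_DIR.
print_sw_area_commands() {
    for vo in "$@"; do
        upper=$(printf '%s' "$vo" | tr '[:lower:]' '[:upper:]')
        # Indirect lookup of VO_<VO>_SW_DIR; empty if the variable is unset.
        eval "dir=\${VO_${upper}_SW_DIR:-}"
        if [ -z "$dir" ]; then
            echo "# VO_${upper}_SW_DIR is not set, skipping $vo"
            continue
        fi
        # user:group is the POSIX syntax for the owner argument of chown.
        echo "chown -R sgm${vo}001:sgm${vo} $dir"
        echo "chmod -R g+w $dir"
    done
}
```

After sourcing the site configuration so that the VO_*_SW_DIR variables are defined, print_sw_area_commands eumed prints the two commands for eumed; the output can be piped to sh once it looks right.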
SSH: negotiate simpler algorithm
Set up ssh so that it negotiates a simpler algorithm for encryption: consider putting these instructions in your
system-wide ssh configuration file
/etc/ssh/ssh_config
Torque/Maui setup
- make sure the standard scheduler is not running. The CE installs Maui as your batch scheduler, so you need to switch off the built-in Torque one. Run
qmgr and issue the command: set server scheduling = false
- make sure the default queue points to an existing queue; if not, change it appropriately. Run
qmgr and issue the command: set server default_queue = local
where I assumed you also created a local queue as per the next point.
- create other queues as needed, for your local users. For instance, you can run
qmgr and issue the commands:
create queue local
s q local queue_type = Execution
s q local resources_max.cput = 48:00:00
s q local resources_max.walltime = 72:00:00
s q local enabled=true
s q local started=true
- instruct PBS/Torque to keep a record of recently completed jobs for a reasonable amount of time, to make the CREAM-CE daemons happy. Do this by issuing the command:
qmgr -c "set server keep_completed = 120"
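The qmgr steps above can be collected into one batch, since qmgr also reads directives from standard input. The sketch below uses an invented function name and simply reuses the example values from the list (queue name local, the two resource limits, keep_completed = 120); adapt them to your site. The queue is created before it is made the default.

```shell
# Sketch: emit the qmgr directives from the steps above in one batch.
# The queue name "local" and the limits are the example values used above.
torque_setup_directives() {
    cat <<'EOF'
set server scheduling = false
create queue local
set queue local queue_type = Execution
set queue local resources_max.cput = 48:00:00
set queue local resources_max.walltime = 72:00:00
set queue local enabled = true
set queue local started = true
set server default_queue = local
set server keep_completed = 120
EOF
}
```

Review the output of torque_setup_directives first, then apply it with torque_setup_directives | qmgr. Note that create queue reports an error if the queue already exists.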
VOMS certificates installation
Make sure package
ig-vomscerts-all has been installed (it was not, in my test)
Possible LDAP problem with CREAM-CE
- when installing a CREAM-CE not acting as site-BDII, I found it impossible to connect to its LDAP server. Symptom:
netstat -anp | grep 2170 would show slapd listening only on 127.0.0.1. The simple fix is:
- make sure variable
BDII_HOST is defined in /opt/bdii/etc/bdii.conf. In my case it was not present, and I set it to BDII_HOST = *
- restart bdii:
/etc/rc.d/init.d/bdii restart
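Whether BDII_HOST is already defined can be checked before editing the file. This is a minimal sketch with an invented function name; it assumes the NAME = value layout used for bdii.conf in the text above and ignores commented-out lines.

```shell
# Sketch: succeed if BDII_HOST is assigned in a bdii.conf-style file.
# Commented-out lines are ignored; spaces around "=" are tolerated.
bdii_host_is_set() {
    # $1: path to the configuration file
    grep -Eq '^[[:space:]]*BDII_HOST[[:space:]]*=' "$1"
}
```

A possible use is bdii_host_is_set /opt/bdii/etc/bdii.conf || echo "BDII_HOST is missing", run before and after the edit.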
Batch jobs using local disk as working area
You may have a huge local disk on your servers: it would be a pity not to use it, right? One solution could be to make a huge
/tmp area and tell your users about its existence, but a much more transparent way to accomplish the same result is to instruct the batch system so that GRID batch jobs use it as their initial working directory. Note that the recipe below is not 100% tested: your suggestions and corrections are more than welcome.
There is, at present, no way to configure this behaviour from
site-info.def, so you may need to repeat the CE part of the configuration after each re-configuration of the CE.
On the WNs, do the following once and for all:
On the CE:
Note that
blah_wn_temporary_home_dir can contain any variable which exists in the generic user's environment, hence the following (may) work:
blah_wn_temporary_home_dir='/data/$USER'
TopBDII : synchronizing configuration file
Until recently, it was possible to make topBDII point to an external configuration file by setting the
BDII_HTTP_URL variable in
site-info.def. At some point this behaviour was broken, and currently topBDII by default only considers official EGEE/EGI sources.
To restore the previous behaviour, perform the following steps:
- edit file
/etc/glite/glite-info-update-endpoints.conf so that it reads more or less like the following (of course, change the value of manual_file as appropriate):
[configuration]
EGI = FALSE
OSG = FALSE
manual = True
manual_file = /opt/glite/etc/eumed-bdii.conf
output_file = /opt/glite/etc/gip/top-urls.conf
cache_dir = /var/cache/glite/glite-info-update-endpoints
- get the getConfigFile.sh script and place it under
/usr/local/bin
- create file
/etc/cron.d/getTopBdiiConf with content similar to:
35 */3 * * * root /usr/local/bin/getConfigFile.sh http://www.eumedgrid.eu/conf/eumed-bdii.conf /opt/glite/etc/eumed-bdii.conf 2000
The first argument of the script is the URL to download the configuration file from (the old BDII_HTTP_URL); the second argument is the destination file (this must be set to the same value as manual_file above); the third argument is the minimum allowed length of the downloaded file (to avoid overwriting a working configuration with a file containing: HTTP error: 404, File not found)
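To illustrate why the third argument matters, here is a hypothetical sketch of the download-then-check logic. It is NOT the actual getConfigFile.sh, whose content is not reproduced here: the function names are invented, the use of curl is an assumption, and so is the interpretation of the minimum length as a byte count.

```shell
# Hypothetical sketch, NOT the actual getConfigFile.sh: overwrite the
# destination only if the candidate file is at least min_length bytes long.
# Treating the length as a byte count is an assumption.
install_if_long_enough() {
    candidate=$1; destination=$2; min_length=$3
    size=$(wc -c < "$candidate" | tr -d ' ')
    if [ "$size" -ge "$min_length" ]; then
        # Write through the existing file so its permissions are kept.
        cat "$candidate" > "$destination" && rm -f "$candidate"
    else
        echo "file too short ($size < $min_length), keeping old copy" >&2
        rm -f "$candidate"
        return 1
    fi
}

# Download into a temporary file first, then apply the length check.
fetch_config() {
    url=$1; destination=$2; min_length=$3
    tmp=$(mktemp) || return 1
    if curl -fsS -o "$tmp" "$url"; then
        install_if_long_enough "$tmp" "$destination" "$min_length"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

With these definitions, fetch_config http://www.eumedgrid.eu/conf/eumed-bdii.conf /opt/glite/etc/eumed-bdii.conf 2000 would mirror the cron entry above. Downloading to a temporary file means that a failed or truncated transfer never touches the working configuration.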
--
FulvioGaleazzi - 2010-11-17