OK, I am now on the openmpi-1.9a1r27954 tarball.
In order to build OMPI and compile apps on this machine I must
1) edit the xe6 platform file to use --disable-shared/--enable-static (site-specific)
2) edit the xe6 platform file to provide a full path to the alps headers
because the logic in orte_check_alps.m4 for default values is wrong
3) edit the xe6 platform file to remove with_devel_headers=yes because
--with-devel-headers breaks "make install"
4) edit configure (!!!) to allow ras_alps_CPPFLAGS (and other vars) to get
set at configure time
5) edit orte/mca/ras/alps/ras_alps_component.c and/or
orte/mca/ras/alps/ras-alps-command.sh with the proper path to apstat
(perhaps only one needs to be edited?)
Item (1) is due to site differences, and is not an OMPI bug.
The other 4 have all been reported in one form or another on this list.
Now, the *next* bug is the following:
> $ ./INSTALL/bin/mpirun -mca ras_base_verbose 1 -mca orte_debug_verbose 1 -np 2 ./ring_c 2>&1 | tee -a log
> [nid00704:21984] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> [nid00704:21984] ras:alps:allocate: parser_ini
> [nid00704:21984] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> [nid00704:21984] ras:alps:allocate: Skipping ALPS configuration file: "/etc/alps.conf" (No such file or directory).
> [nid00704:21984] ras:alps:allocate: Could not locate ALPS scheduler file.
> [nid00704:21984] [[8668,0],0] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/ras/base/ras_base_allocate.c at line 178
My best guess is that this is related to something Ralph said in
http://www.open-mpi.org/community/lists/devel/2013/01/11989.php:
> I'm currently tracking down a problem on the Cray XE6 - it appears that
> recent OS release changed the way alps stores allocation info :-(
Looking at the debug output prior to the error, and examining the system, I
made the following 1-line addition:
--- openmpi-1.9a1r27954/orte/mca/ras/alps/ras_alps_module.c~ 2013-01-28 23:54:31.443749000 -0800
+++ openmpi-1.9a1r27954/orte/mca/ras/alps/ras_alps_module.c 2013-01-28 23:54:34.770766635 -0800
@@ -74,6 +74,7 @@ static int parser_separated_columns(char
 static const orte_ras_alps_sysconfig_t sysconfigs[] = {
     {"/etc/sysconfig/alps", "ALPS_SHARED_DIR_PATH", parser_ini},
     {"/etc/alps.conf" , "sharedDir" , parser_separated_columns},
+    {"/etc/opt/cray/alps/alps.conf", "sharedDir" , parser_separated_columns},
     /* must be last element */
     {NULL , NULL , NULL}
 };
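
To illustrate why one extra table entry is enough: the debug output suggests the candidate files are tried in table order until one yields the shared directory. Below is a minimal standalone sketch of that pattern, not the actual ORTE code. The names candidate_t and first_readable_config are hypothetical, the parser callback is omitted, and the sketch only checks readability, whereas the real module also runs the parser and, judging by the log, moves on to the next entry when that does not produce a result.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for orte_ras_alps_sysconfig_t (parser callback omitted). */
typedef struct {
    const char *path;   /* candidate configuration file */
    const char *key;    /* key that names the ALPS shared directory */
} candidate_t;

/*
 * Return the first entry whose file is readable, or NULL if none is.
 * The table must end with a {NULL, NULL} sentinel, mirroring the
 * "must be last element" convention in the patched table.
 */
const candidate_t *first_readable_config(const candidate_t *table)
{
    assert(table != NULL);
    for (const candidate_t *c = table; c->path != NULL; ++c) {
        if (access(c->path, R_OK) == 0) {
            return c;   /* first readable candidate wins */
        }
    }
    return NULL;        /* corresponds to "Could not locate ALPS scheduler file" */
}
```

With a table holding the three paths from the patch, a system that has only /etc/opt/cray/alps/alps.conf would fall through the first two entries and stop at the third.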
That appears to work for locating the allocation:
> $ ./INSTALL/bin/mpirun -mca ras_base_verbose 1 -mca orte_debug_verbose 1 -np 2 ./ring_c 2>&1 | tee -a log
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> [nid00320:22990] ras:alps:allocate: parser_ini
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> [nid00320:22990] ras:alps:allocate: Skipping ALPS configuration file: "/etc/alps.conf" (No such file or directory).
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/opt/cray/alps/alps.conf"
> [nid00320:22990] ras:alps:allocate: parser_separated_columns
> [nid00320:22990] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> [nid00320:22990] ras:alps:allocate: begin processing appinfo file
> [nid00320:22990] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> [nid00320:22990] ras:alps:allocate: 3 entries in file
> [nid00320:22990] ras:alps:allocate: read data for resId 26 - myId 41
> [nid00320:22990] ras:alps:allocate: read data for resId 26 - myId 41
> [nid00320:22990] ras:alps:allocate: read data for resId 41 - myId 41
> [nid00320:22990] ras:alps:allocate: success
But wait, where is the application output?
Did anything even run?
I honestly don't know where to go from here.
Please let me know what I can do to help move forward on any of these
issues.
-Paul
--
Paul H. Hargrove PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900