Open MPI Development Mailing List Archives

Subject: [OMPI devel] Open MPI on Cray XC30 status
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2013-01-29 03:04:36


OK, I am now on the openmpi-1.9a1r27954 tarball.
To build OMPI and compile apps on this machine I must:

1) edit the xe6 platform file to use --disable-shared/--enable-static
(site-specific)

2) edit the xe6 platform file to provide a full path to the alps headers,
because the default-value logic in orte_check_alps.m4 is wrong

3) edit the xe6 platform file to remove with_devel_headers=yes because
--with-devel-headers breaks "make install"

4) edit configure (!!!) to allow ras_alps_CPPFLAGS (and other vars) to get
set at configure time

5) edit orte/mca/ras/alps/ras_alps_component.c and/or
orte/mca/ras/alps/ras-alps-command.sh with the proper path to apstat
(perhaps only one needs to be edited?)

Item (1) is due to site differences, and is not an OMPI bug.
The other four have all been reported in one form or another on this list.

Now, the *next* bug is the following:

> $ ./INSTALL/bin/mpirun -mca ras_base_verbose 1 -mca orte_debug_verbose 1 -np 2 ./ring_c 2>&1 | tee -a log
> [nid00704:21984] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> [nid00704:21984] ras:alps:allocate: parser_ini
> [nid00704:21984] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> [nid00704:21984] ras:alps:allocate: Skipping ALPS configuration file: "/etc/alps.conf" (No such file or directory).
> [nid00704:21984] ras:alps:allocate: Could not locate ALPS scheduler file.
> [nid00704:21984] [[8668,0],0] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/ras/base/ras_base_allocate.c at line 178

My best guess is that this is related to something Ralph said in
http://www.open-mpi.org/community/lists/devel/2013/01/11989.php

> I'm currently tracking down a problem on the Cray XE6 - it appears that
> recent OS release changed the way alps stores allocation info :-(

After looking at the debug output prior to the error and examining the
system, I made the following one-line addition:

--- openmpi-1.9a1r27954/orte/mca/ras/alps/ras_alps_module.c~ 2013-01-28 23:54:31.443749000 -0800
+++ openmpi-1.9a1r27954/orte/mca/ras/alps/ras_alps_module.c 2013-01-28 23:54:34.770766635 -0800
@@ -74,6 +74,7 @@ static int parser_separated_columns(char
 static const orte_ras_alps_sysconfig_t sysconfigs[] = {
     {"/etc/sysconfig/alps", "ALPS_SHARED_DIR_PATH", parser_ini},
     {"/etc/alps.conf" , "sharedDir" , parser_separated_columns},
+    {"/etc/opt/cray/alps/alps.conf", "sharedDir" , parser_separated_columns},
     /* must be last element */
     {NULL , NULL , NULL}
 };
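
For readers unfamiliar with this kind of table, here is a minimal sketch of
the lookup pattern it implies: walk a NULL-terminated list of candidates and
stop at the first one whose parser yields a value. This is not the code in
ras_alps_module.c. The names candidate_t, parser_fn, parse_never, parse_fixed
and locate_shared_dir are hypothetical, the parser signature is an
assumption, and the two stub parsers stand in for real file parsing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical parser signature: returns 0 and fills `out` on success. */
typedef int (*parser_fn)(const char *path, const char *key,
                         char *out, size_t outlen);

/* Hypothetical stand-in for one row of the sysconfigs[] table. */
typedef struct {
    const char *path;   /* configuration file to try */
    const char *key;    /* key expected to name the shared directory */
    parser_fn   parse;  /* parser to apply to that file */
} candidate_t;

/* Stub: behaves like a file that is missing or does not contain the key. */
static int parse_never(const char *path, const char *key,
                       char *out, size_t outlen)
{
    (void)path; (void)key; (void)out; (void)outlen;
    return -1;
}

/* Stub: pretends the key was found and reports a placeholder value. */
static int parse_fixed(const char *path, const char *key,
                       char *out, size_t outlen)
{
    (void)path; (void)key;
    int n = snprintf(out, outlen, "%s", "/placeholder/shared");
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}

/* Try each row in order; return the index of the first row whose parser
 * succeeds, or -1 if the sentinel row is reached without a success. */
static int locate_shared_dir(const candidate_t *table,
                             char *out, size_t outlen)
{
    assert(table != NULL && out != NULL && outlen > 0);
    for (int i = 0; table[i].path != NULL; ++i) {
        if (table[i].parse(table[i].path, table[i].key, out, outlen) == 0) {
            return i;
        }
    }
    return -1;
}
```

With this shape, supporting a new location only requires a new row ahead of
the sentinel, which is all the patch above does.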

That appears to work for locating the allocation:

> $ ./INSTALL/bin/mpirun -mca ras_base_verbose 1 -mca orte_debug_verbose 1 -np 2 ./ring_c 2>&1 | tee -a log
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> [nid00320:22990] ras:alps:allocate: parser_ini
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> [nid00320:22990] ras:alps:allocate: Skipping ALPS configuration file: "/etc/alps.conf" (No such file or directory).
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/opt/cray/alps/alps.conf"
> [nid00320:22990] ras:alps:allocate: parser_separated_columns
> [nid00320:22990] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> [nid00320:22990] ras:alps:allocate: begin processing appinfo file
> [nid00320:22990] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> [nid00320:22990] ras:alps:allocate: 3 entries in file
> [nid00320:22990] ras:alps:allocate: read data for resId 26 - myId 41
> [nid00320:22990] ras:alps:allocate: read data for resId 26 - myId 41
> [nid00320:22990] ras:alps:allocate: read data for resId 41 - myId 41
> [nid00320:22990] ras:alps:allocate: success
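
The "resId ... - myId ..." lines in the log above suggest that each appinfo
entry is compared against the caller's own reservation ID. A minimal sketch
of that comparison follows, under stated assumptions: appinfo_entry_t and
count_matching are hypothetical names, and the real appinfo record layout is
not shown in this message, so the struct carries only the reservation ID.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: only the reservation ID is modelled here. */
typedef struct {
    unsigned long res_id;
} appinfo_entry_t;

/* Count the entries whose reservation ID equals my_id. */
static size_t count_matching(const appinfo_entry_t *entries, size_t n,
                             unsigned long my_id)
{
    assert(entries != NULL || n == 0);
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i) {
        if (entries[i].res_id == my_id) {
            ++hits;
        }
    }
    return hits;
}
```

Applied to the three entries in the log (resId 26, 26, 41) with myId 41,
this comparison matches exactly one entry.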

But wait, where is the application output?
Did anything even run?
I honestly don't know where to go from here.

Please let me know what I can do to help move forward on any of these
issues.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900