Open MPI User's Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2006-03-23 18:05:05


This is now fixed in the trunk as of r9393 and should show up in the next nightly build, I think. We had inadvertently imposed a maximum on the number of app_contexts we could accept through the app_file interface; that limit has been removed.

Thanks for catching it!

Ralph


Ravi Manumachu wrote:
Hi Brian,

I have installed OpenMPI-1.1a1r9260 on my SunOS machines, and it has solved
the problems. However, there is one more issue that I found in my testing
and failed to report. It affects Linux machines too.

My host file is

hosts.txt
---------
csultra06
csultra02
csultra05
csultra08

My app file is 

mpiinit_appfile
---------------
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit

My application program is

mpiinit.c
---------

#include <stdio.h>   /* printf */
#include <mpi.h>

int main(int argc, char** argv)
{
    int rc, me;
    char pname[MPI_MAX_PROCESSOR_NAME];
    int plen;

    MPI_Init(
       &argc,
       &argv
    );

    rc = MPI_Comm_rank(
            MPI_COMM_WORLD,
            &me
    );

    if (rc != MPI_SUCCESS)
    {
       return rc;
    }

    MPI_Get_processor_name(
       pname,
       &plen
    );

    printf("%s:Hello world from %d\n", pname, me);

    MPI_Finalize();

    return 0;
}

Compilation is successful

csultra06$ mpicc -o mpiinit mpiinit.c

However, mpirun prints only 6 lines of output instead of 8.

csultra06$ mpirun --hostfile hosts.txt --app mpiinit_appfile
csultra02:Hello world from 5
csultra06:Hello world from 0
csultra06:Hello world from 4
csultra02:Hello world from 1
csultra08:Hello world from 3
csultra05:Hello world from 2

The following two lines are never printed:

csultra05:Hello world from 6
csultra08:Hello world from 7

I observed this behavior on my Linux cluster too.

I have attached the log from the "-d" option for your debugging purposes.

Regards,
Ravi.

----- Original Message -----
From: Brian Barrett <brbarret@open-mpi.org>
Date: Monday, March 13, 2006 7:56 pm
Subject: Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9; problems on heterogeneous cluster
To: Open MPI Users <users@open-mpi.org>

  
Hi Ravi -

With the help of another Open MPI user, I spent the weekend finding a
couple of issues with Open MPI on Solaris.  I believe you are running
into the same problems.  We're in the process of certifying the changes
for release as part of 1.0.2, but it's Monday morning and the release
manager hasn't gotten them into the release branch just yet.  Could you
give the nightly tarball from our development trunk a try and let us
know if it solves your problems on Solaris?  You probably want last
night's 1.1a1r9260 release.

    http://www.open-mpi.org/nightly/trunk/

Thanks,

Brian


On Mar 12, 2006, at 11:23 PM, Ravi Manumachu wrote:

    
 Hi Brian,

 Thank you for your help. I have attached all the files you have asked
 for in a tar file.

 Please find attached the 'config.log' and 'libmpi.la' for my Solaris
 installation.
    
 The output from 'mpicc -showme' is

 sunos$ mpicc -showme
 gcc -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include
 -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include/openmpi/ompi
 -L/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib -lmpi
 -lorte -lopal -lnsl -lsocket -lthread -laio -lm -lnsl -lsocket -lthread -ldl

 There are serious issues even when running on Solaris machines only.

 I am using the host file and app file shown below. Both machines run
 SunOS and are similar.

 hosts.txt
 ---------
 csultra01 slots=1
 csultra02 slots=1

 mpiinit_appfile
 ---------------
 -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
 -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos

 Running mpirun without the -d option hangs:

 csultra01$ mpirun --hostfile hosts.txt --app mpiinit_appfile
 hangs

 Running mpirun with the -d option dumps core; the output is in the
 attached file "mpirun_output_d_option.txt". The core is also attached.

 Running on only one host does not work either. The output from mpirun
 with the "-d" option for that scenario is in the attached file
 "mpirun_output_d_option_one_host.txt".

 I have also attached the list of packages installed on my Solaris
 machine in "pkginfo.txt".

 I hope these will help you to resolve the issue.

 Regards,
 Ravi.

      
----- Original Message -----
From: Brian Barrett <brbarret@open-mpi.org>
Date: Friday, March 10, 2006 7:09 pm
Subject: Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9; problems on heterogeneous cluster
To: Open MPI Users <users@open-mpi.org>

        
On Mar 10, 2006, at 12:09 AM, Ravi Manumachu wrote:

          
I am facing problems running OpenMPI-1.0.1 on a heterogeneous cluster.

I have a Linux machine and a SunOS machine in this cluster.

linux$ uname -a
Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004 i686 i686 i386 GNU/Linux

sunos$ uname -a
SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10
            
Unfortunately, this will not work with Open MPI at present.  Open MPI
1.0.x does not have any support for running across platforms with
different endianness.  Open MPI 1.1.x has much better support for such
situations, but is far from complete, as the MPI datatype engine does
not properly fix up endian issues.  We're working on the issue, but
cannot give a timetable for completion.

Also note that (while not a problem here) Open MPI does not support
running in a mixed 32 bit / 64 bit environment.  All processes must be
32 bit or all must be 64 bit, not a mix.

          
$ mpirun --hostfile hosts.txt --app mpiinit_appfile
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos: fatal: relocation error: file /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0: symbol nanosleep: referenced symbol not found
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos: fatal: relocation error: file /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0: symbol nanosleep: referenced symbol not found

I have fixed this by passing the "-lrt" option to the linker.

You shouldn't have to do this...  Could you send me the config.log
file that configure generated for Open MPI, the installed
$prefix/lib/libmpi.la file, and the output of mpicc -showme?

          
sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt

However when I run this again, I get the error:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start as expected.
[pg1cluster01:19858] ERROR: There may be more information available from
[pg1cluster01:19858] ERROR: the remote shell (see above).
[pg1cluster01:19858] ERROR: The daemon exited unexpectedly with status 255.
2 processes killed (possibly by Open MPI)
            
Both of these are quite unexpected.  It looks like there is something
wrong with your Solaris build.  Can you run on *just* the Solaris
machine?  We only have limited resources for testing on Solaris, but
have not run into this issue before.  What happens if you run mpirun on
just the Solaris machine with the -d option to mpirun?

          
Sometimes I get the error:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with errno=28
[csultra01:06256] mca_mpool_sm_init: unable to create shared memory mapping
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned value -2 instead of OMPI_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
            
This looks like you got far enough along that you ran into our
endianness issues, so this is about the best case you can hope for in
your configuration.  The ftruncate error worries me, however.  But I
think this is another symptom of something wrong with your Sun Sparc
build.

Brian

-- 
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/


_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

          
<OpenMPI-1.0.1-SunOS-5.9.tar.gz>
        