Open MPI User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-03-10 08:28:18


On Mar 10, 2006, at 12:09 AM, Ravi Manumachu wrote:

> I am facing problems running OpenMPI-1.0.1 on a heterogeneous cluster.
>
> I have a Linux machine and a SunOS machine in this cluster.
>
> linux$ uname -a
> Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004 i686 i686 i386 GNU/Linux
>
> sunos$ uname -a
> SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10

Unfortunately, this will not work with Open MPI at present. Open MPI
1.0.x has no support for running across platforms with different
endianness. Open MPI 1.1.x has much better support for such
situations, but it is far from complete, as the MPI datatype engine
does not properly fix up endian issues. We're working on the issue,
but cannot give a timetable for completion.

Note also that (while not a problem here) Open MPI does not support
running in a mixed 32-bit / 64-bit environment. All processes must
be either 32-bit or 64-bit, not a mix.

> $ mpirun --hostfile hosts.txt --app mpiinit_appfile
> ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
> fatal: relocation error: file
> /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
> symbol nanosleep: referenced symbol not found
> ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
> fatal: relocation error: file
> /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
> symbol nanosleep: referenced symbol not found
>
> I have fixed this by compiling with "-lrt" option to the linker.

You shouldn't have to do this... Could you send me the config.log
file that configure generated for Open MPI, the installed
$prefix/lib/libmpi.la file, and the output of mpicc -showme?

> sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt
>
> However when I run this again, I get the error:
>
> $ mpirun --hostfile hosts.txt --app mpiinit_appfile
> [pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start as expected.
> [pg1cluster01:19858] ERROR: There may be more information available from
> [pg1cluster01:19858] ERROR: the remote shell (see above).
> [pg1cluster01:19858] ERROR: The daemon exited unexpectedly with status 255.
> 2 processes killed (possibly by Open MPI)

Both of these are quite unexpected. It looks like there is something
wrong with your Solaris build. Can you run on *just* the Solaris
machine? We have only limited resources for testing on Solaris, but
have not run into this issue before. What happens if you run mpirun
with the -d option on just the Solaris machine?

> Sometimes I get the error.
>
> $ mpirun --hostfile hosts.txt --app mpiinit_appfile
> [csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with errno=28
> [csultra01:06256] mca_mpool_sm_init: unable to create shared memory mapping
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned value -2 instead of OMPI_SUCCESS
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)

This looks like you got far enough along that you ran into our
endianness issues, so this is about the best case you can hope for in
your configuration. The ftruncate error worries me, however, though I
think it is another symptom of something wrong with your Sun SPARC
build.

Brian

-- 
   Brian Barrett
   Open MPI developer
   http://www.open-mpi.org/