
Open MPI User's Mailing List Archives


From: Ravi Manumachu (manumachu.reddy_at_[hidden])
Date: 2006-03-12 23:23:31


 
 Hi Brian,
 
 Thank you for your help. I have attached all the files you have asked
 for in a tar file.
 
 Please find attached the 'config.log' and 'libmpi.la' for my Solaris
 installation.
 
 The output from 'mpicc -showme' is
 
 sunos$ mpicc -showme
 gcc -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include
     -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include/openmpi/ompi
     -L/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib
     -lmpi -lorte -lopal -lnsl -lsocket -lthread -laio -lm -lnsl -lsocket -lthread -ldl
 
 There are serious issues when running on just the Solaris machines.
 
 I am using the host file and app file shown below. Both machines run
 SunOS and are similar.
 
 hosts.txt
 ---------
 csultra01 slots=1
 csultra02 slots=1
 
 mpiinit_appfile
 ---------------
 -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
 -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
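 
 For context, the test binary mpiinit_sunos is built from mpiinit.c,
 which is essentially a minimal MPI_Init/MPI_Finalize check. The exact
 source is not reproduced here, so the sketch below is only an
 approximation (the rank/size calls and the printf are assumptions):
 
     /* Minimal sketch of an MPI init test (assumed contents of mpiinit.c). */
     #include <stdio.h>
     #include <mpi.h>
 
     int main(int argc, char *argv[])
     {
         int rank = 0, size = 0;
 
         /* Initialize the MPI runtime; this is where the failures occur. */
         MPI_Init(&argc, &argv);
 
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);
 
         printf("Hello from rank %d of %d\n", rank, size);
 
         MPI_Finalize();
         return 0;
     }
 
 On the Solaris host it is compiled with the wrapper compiler, e.g.
 "mpicc -o mpiinit_sunos mpiinit.c -lrt", as shown in the quoted
 message below, and launched through the host file and app file above.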
 
 Running mpirun without the -d option hangs:
 
 csultra01$ mpirun --hostfile hosts.txt --app mpiinit_appfile
 hangs
 
 Running mpirun with the -d option dumps core. The output is in the
 attached file "mpirun_output_d_option.txt"; the core file is also
 attached.
 
 Running on just one host does not work either. The output from mpirun
 with the -d option for this scenario is attached as
 "mpirun_output_d_option_one_host.txt".
 
 I have also attached the list of packages installed on my Solaris
 machine in "pkginfo.txt".
 
 I hope these will help you to resolve the issue.
 
 Regards,
 Ravi.
 
> ----- Original Message -----
> From: Brian Barrett <brbarret_at_[hidden]>
> Date: Friday, March 10, 2006 7:09 pm
> Subject: Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9;
> problems on heterogeneous cluster
> To: Open MPI Users <users_at_[hidden]>
>
> > On Mar 10, 2006, at 12:09 AM, Ravi Manumachu wrote:
> >
> > > I am facing problems running OpenMPI-1.0.1 on a heterogeneous cluster.
> > >
> > > I have a Linux machine and a SunOS machine in this cluster.
> > >
> > > linux$ uname -a
> > > Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004 i686 i686 i386 GNU/Linux
> > >
> > > sunos$ uname -a
> > > SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10
> >
> > Unfortunately, this will not work with Open MPI at present. Open MPI
> > 1.0.x does not have any support for running across platforms with
> > different endianness. Open MPI 1.1.x has much better support for
> > such situations, but is far from complete, as the MPI datatype engine
> > does not properly fix up endian issues. We're working on the issue,
> > but can not give a timetable for completion.
> >
> > Also note that (while not a problem here) Open MPI also does not
> > support running in a mixed 32 bit / 64 bit environment. All
> > processes must be 32 or 64 bit, but not a mix.
> >
> > > $ mpirun --hostfile hosts.txt --app mpiinit_appfile
> > > ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
> > > fatal: relocation error: file
> > > /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
> > > symbol nanosleep: referenced symbol not found
> > > ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
> > > fatal: relocation error: file
> > > /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
> > > symbol nanosleep: referenced symbol not found
> > >
> > > I have fixed this by compiling with "-lrt" option to the linker.
> >
> > You shouldn't have to do this... Could you send me the config.log
> > file from configure for Open MPI, the installed $prefix/lib/libmpi.la
> > file, and the output of mpicc -showme?
> >
> > > sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt
> > >
> > > However when I run this again, I get the error:
> > >
> > > $ mpirun --hostfile hosts.txt --app mpiinit_appfile
> > > [pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start
> > > as expected.
> > > [pg1cluster01:19858] ERROR: There may be more information available from
> > > [pg1cluster01:19858] ERROR: the remote shell (see above).
> > > [pg1cluster01:19858] ERROR: The daemon exited unexpectedly with
> > > status 255.
> > > 2 processes killed (possibly by Open MPI)
> >
> > Both of these are quite unexpected. It looks like there is something
> > wrong with your Solaris build. Can you run on *just* the Solaris
> > machine? We only have limited resources for testing on Solaris, but
> > have not run into this issue before. What happens if you run mpirun
> > on just the Solaris machine with the -d option to mpirun?
> >
> > > Sometimes I get the error.
> > >
> > > $ mpirun --hostfile hosts.txt --app mpiinit_appfile
> > > [csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with errno=28
> > > [csultra01:06256] mca_mpool_sm_init: unable to create shared memory mapping
> > > --------------------------------------------------------------------------
> > > It looks like MPI_INIT failed for some reason; your parallel process is
> > > likely to abort. There are many reasons that a parallel process can
> > > fail during MPI_INIT; some of which are due to configuration or environment
> > > problems. This failure appears to be an internal failure; here's some
> > > additional information (which may only be relevant to an Open MPI
> > > developer):
> > >
> > > PML add procs failed
> > > --> Returned value -2 instead of OMPI_SUCCESS
> > > --------------------------------------------------------------------------
> > > *** An error occurred in MPI_Init
> > > *** before MPI was initialized
> > > *** MPI_ERRORS_ARE_FATAL (goodbye)
> >
> > This looks like you got far enough along that you ran into our
> > endianness issues, so this is about the best case you can hope for in
> > your configuration. The ftruncate error worries me, however. But I
> > think this is another symptom of something wrong with your Sun Sparc
> > build.
> >
> > Brian
> >
> > --
> > Brian Barrett
> > Open MPI developer
> > http://www.open-mpi.org/
> >
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>