Open MPI User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-03-13 09:14:29


Hi Ravi -

With the help of another Open MPI user, I spent the weekend finding a
couple of issues with Open MPI on Solaris. I believe you are running
into the same problems. We're in the process of certifying the
changes for release as part of 1.0.2, but it's Monday morning and the
release manager hasn't gotten them into the release branch just yet.
Could you give the nightly tarball from our development trunk a try
and let us know if it solves your problems on Solaris? You probably
want last night's 1.1a1r9260 release.

     http://www.open-mpi.org/nightly/trunk/
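
Something along these lines should do for fetching and building the
nightly tarball; the exact tarball file name and the install prefix below
are just placeholders, so substitute whatever you actually download and
wherever you normally install:

     wget http://www.open-mpi.org/nightly/trunk/openmpi-1.1a1r9260.tar.gz
     gunzip -c openmpi-1.1a1r9260.tar.gz | tar xf -
     cd openmpi-1.1a1r9260
     ./configure --prefix=$HOME/OpenMPI/openmpi-1.1a1r9260
     make all install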

Thanks,

Brian

On Mar 12, 2006, at 11:23 PM, Ravi Manumachu wrote:

>
> Hi Brian,
>
> Thank you for your help. I have attached all the files you have asked
> for in a tar file.
>
> Please find attached the 'config.log' and 'libmpi.la' for my Solaris
> installation.
>
> The output from 'mpicc -showme' is
>
> sunos$ mpicc -showme
> gcc -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include
> -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include/openmpi/ompi
> -L/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib -lmpi
> -lorte -lopal -lnsl -lsocket -lthread -laio -lm -lnsl -lsocket -lthread -ldl
>
> There are serious issues when running on just Solaris machines.
>
> I am using the host file and app file shown below. Both machines run
> SunOS and are similarly configured.
>
> hosts.txt
> ---------
> csultra01 slots=1
> csultra02 slots=1
>
> mpiinit_appfile
> ---------------
> -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
> -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
>
> Running mpirun without the -d option hangs:
>
> csultra01$ mpirun --hostfile hosts.txt --app mpiinit_appfile
> (hangs)
>
> Running mpirun with the -d option dumps core; the output is in the
> attached file "mpirun_output_d_option.txt". The core file is also
> attached.
>
> Running on just one host also does not work. The output from mpirun
> with the "-d" option for this scenario is attached in the file
> "mpirun_output_d_option_one_host.txt".
>
> I have also attached the list of packages installed on my Solaris
> machine in "pkginfo.txt".
>
> I hope these will help you to resolve the issue.
>
> Regards,
> Ravi.
>
>> ----- Original Message -----
>> From: Brian Barrett <brbarret_at_[hidden]>
>> Date: Friday, March 10, 2006 7:09 pm
>> Subject: Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9;
>> problems on heterogeneous cluster
>> To: Open MPI Users <users_at_[hidden]>
>>
>>> On Mar 10, 2006, at 12:09 AM, Ravi Manumachu wrote:
>>>
>>>> I am facing problems running OpenMPI-1.0.1 on a heterogeneous
>>>> cluster. I have a Linux machine and a SunOS machine in this cluster.
>>>>
>>>> linux$ uname -a
>>>> Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT
>>>> 2004 i686 i686 i386 GNU/Linux
>>>>
>>>> sunos$ uname -a
>>>> SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10
>>>
>>> Unfortunately, this will not work with Open MPI at present. Open MPI
>>> 1.0.x does not have any support for running across platforms with
>>> different endianness. Open MPI 1.1.x has much better support for
>>> such situations, but is far from complete, as the MPI datatype engine
>>> does not properly fix up endian issues. We're working on the issue,
>>> but cannot give a timetable for completion.
>>>
>>> Also note that (while not a problem here) Open MPI also does not
>>> support running in a mixed 32 bit / 64 bit environment. All
>>> processes must be 32 or 64 bit, but not a mix.
>>>
>>>> $ mpirun --hostfile hosts.txt --app mpiinit_appfile
>>>> ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
>>>> fatal: relocation error: file
>>>> /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
>>>> symbol nanosleep: referenced symbol not found
>>>> ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
>>>> fatal: relocation error: file
>>>> /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
>>>> symbol nanosleep: referenced symbol not found
>>>>
>>>> I have fixed this by compiling with the "-lrt" option to the linker.
>>>
>>> You shouldn't have to do this... Could you send me the config.log
>>> file from configure for Open MPI, the installed $prefix/lib/libmpi.la
>>> file, and the output of mpicc -showme?
>>>
>>>> sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt
>>>>
>>>> However, when I run this again, I get the error:
>>>>
>>>> $ mpirun --hostfile hosts.txt --app mpiinit_appfile
>>>> [pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start
>>>> as expected.
>>>> [pg1cluster01:19858] ERROR: There may be more information available
>>>> from
>>>> [pg1cluster01:19858] ERROR: the remote shell (see above).
>>>> [pg1cluster01:19858] ERROR: The daemon exited unexpectedly with
>>>> status 255.
>>>> 2 processes killed (possibly by Open MPI)
>>>
>>> Both of these are quite unexpected. It looks like there is something
>>> wrong with your Solaris build. Can you run on *just* the Solaris
>>> machine? We only have limited resources for testing on Solaris, but
>>> have not run into this issue before. What happens if you run mpirun
>>> on just the Solaris machine with the -d option to mpirun?
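>>>
>>> For example, something along these lines, reusing the binary path from
>>> your appfile:
>>>
>>>    csultra01$ mpirun -d -np 2 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos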
>>>
>>>> Sometimes I get the error.
>>>>
>>>> $ mpirun --hostfile hosts.txt --app mpiinit_appfile
>>>> [csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with
>>>> errno=28
>>>> [csultra01:06256] mca_mpool_sm_init: unable to create shared memory
>>>> mapping
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>>   PML add procs failed
>>>>   --> Returned value -2 instead of OMPI_SUCCESS
>>>> --------------------------------------------------------------------------
>>>> *** An error occurred in MPI_Init
>>>> *** before MPI was initialized
>>>> *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>
>>> This looks like you got far enough along that you ran into our
>>> endianness issues, so this is about the best case you can hope for
>>> in your configuration. The ftruncate error worries me, however. But
>>> I think this is another symptom of something wrong with your Sun
>>> Sparc build.
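>>>
>>> (A guess on my side: errno 28 is ENOSPC on both Solaris and Linux, so
>>> it may also be worth checking that the filesystem holding the shared
>>> memory backing file, typically the Open MPI session directory under
>>> /tmp, is not full, e.g.:
>>>
>>>    csultra01$ df -k /tmp
>>>
>>> but the build is still the prime suspect.)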
>>>
>>> Brian
>>>
>>> --
>>> Brian Barrett
>>> Open MPI developer
>>> http://www.open-mpi.org/
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> <OpenMPI-1.0.1-SunOS-5.9.tar.gz>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
   Brian Barrett
   Open MPI developer
   http://www.open-mpi.org/