
Open MPI User's Mailing List Archives


From: Ravi Manumachu (manumachu.reddy_at_[hidden])
Date: 2006-03-16 01:32:15


Hi Brian,

I have installed OpenMPI-1.1a1r9260 on my SunOS machines, and it has
solved the problems. However, there is one more issue that I found in
my testing and failed to report earlier. It affects Linux machines too.

My host file is

hosts.txt
---------
csultra06
csultra02
csultra05
csultra08

My app file is

mpiinit_appfile
---------------
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
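
For reference, this appfile launches 8 processes across the 4 hosts in
hosts.txt, so each host should end up running two ranks. A hostfile
that states this mapping explicitly, using the same slots= syntax as
the hosts.txt in my earlier mail quoted below, would look roughly like
this (a sketch only, with a hypothetical file name):

hosts_slots.txt
---------------
csultra06 slots=2
csultra02 slots=2
csultra05 slots=2
csultra08 slots=2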

My application program is

mpiinit.c
---------

#include <stdio.h>   /* needed for printf */
#include <mpi.h>

int main(int argc, char** argv)
{
    int rc, me;
    char pname[MPI_MAX_PROCESSOR_NAME];
    int plen;

    MPI_Init(&argc, &argv);

    rc = MPI_Comm_rank(MPI_COMM_WORLD, &me);
    if (rc != MPI_SUCCESS)
    {
        return rc;
    }

    MPI_Get_processor_name(pname, &plen);

    printf("%s:Hello world from %d\n", pname, me);

    MPI_Finalize();

    return 0;
}
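
One aside on the test program (an addition here, not part of the
original report): stdout from remote ranks is buffered and forwarded
back by the runtime, so flushing it before MPI_Finalize helps rule out
lost buffered output when lines go missing. A minimal sketch of that
change, placed just before shutdown:

    /* Hypothetical debugging aid: flush any buffered output so it is
       forwarded before the process tears down MPI and exits. */
    printf("%s:Hello world from %d\n", pname, me);
    fflush(stdout);

    MPI_Finalize();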

Compilation is successful

csultra06$ mpicc -o mpiinit mpiinit.c

However, mpirun prints just 6 statements instead of 8:

csultra06$ mpirun --hostfile hosts.txt --app mpiinit_appfile
csultra02:Hello world from 5
csultra06:Hello world from 0
csultra06:Hello world from 4
csultra02:Hello world from 1
csultra08:Hello world from 3
csultra05:Hello world from 2

The following two statements are not printed:

csultra05:Hello world from 6
csultra08:Hello world from 7

I observed the same behavior on my Linux cluster.
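
As a possible cross-check (a hedged suggestion only, reusing flags that
already appear in this thread), the same 8 ranks could be launched
without the app file:

csultra06$ mpirun --hostfile hosts.txt -np 8 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit

If all 8 lines appear that way, the problem would seem specific to the
--app code path rather than to the hosts themselves.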

I have attached the log from running with the "-d" option for your
debugging purposes.

Regards,
Ravi.

----- Original Message -----
From: Brian Barrett <brbarret_at_[hidden]>
Date: Monday, March 13, 2006 7:56 pm
Subject: Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9;
problems on heterogeneous cluster
To: Open MPI Users <users_at_[hidden]>

> Hi Ravi -
>
> With the help of another Open MPI user, I spent the weekend finding
> a couple of issues with Open MPI on Solaris. I believe you are
> running into the same problems. We're in the process of certifying
> the changes for release as part of 1.0.2, but it's Monday morning
> and the release manager hasn't gotten them into the release branch
> just yet. Could you give the nightly tarball from our development
> trunk a try and let us know if it solves your problems on Solaris?
> You probably want last night's 1.1a1r9260 release.
>
> http://www.open-mpi.org/nightly/trunk/
>
> Thanks,
>
> Brian
>
>
> On Mar 12, 2006, at 11:23 PM, Ravi Manumachu wrote:
>
> >
> > Hi Brian,
> >
> > Thank you for your help. I have attached all the files you have
> > asked for in a tar file.
> >
> > Please find attached the 'config.log' and 'libmpi.la' for my
> > Solaris installation.
> >
> > The output from 'mpicc -showme' is
> >
> > sunos$ mpicc -showme
> > gcc -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include
> > -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include/openmpi/ompi
> > -L/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib -lmpi
> > -lorte -lopal -lnsl -lsocket -lthread -laio -lm -lnsl -lsocket -lthread -ldl
> >
> > There are serious issues when running on just Solaris machines.
> >
> > I am using the host file and app file shown below. Both the
> > machines are SunOS and are similar.
> >
> > hosts.txt
> > ---------
> > csultra01 slots=1
> > csultra02 slots=1
> >
> > mpiinit_appfile
> > ---------------
> > -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
> > -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
> >
> > Running mpirun without -d option hangs.
> >
> > csultra01$ mpirun --hostfile hosts.txt --app mpiinit_appfile
> > hangs
> >
> > Running mpirun with -d option dumps core with output in the file
> > "mpirun_output_d_option.txt", which is attached. The core is also
> > attached.
> > Running on just one host also does not work. The output from
> > mpirun using the "-d" option for this scenario is attached in the
> > file "mpirun_output_d_option_one_host.txt".
> >
> > I have also attached the list of packages installed on my Solaris
> > machine in "pkginfo.txt".
> >
> > I hope these will help you to resolve the issue.
> >
> > Regards,
> > Ravi.
> >
> >> ----- Original Message -----
> >> From: Brian Barrett <brbarret_at_[hidden]>
> >> Date: Friday, March 10, 2006 7:09 pm
> >> Subject: Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9;
> >> problems on heterogeneous cluster
> >> To: Open MPI Users <users_at_[hidden]>
> >>
> >>> On Mar 10, 2006, at 12:09 AM, Ravi Manumachu wrote:
> >>>
> >>>> I am facing problems running OpenMPI-1.0.1 on a heterogeneous
> >>>> cluster.
> >>>> I have a Linux machine and a SunOS machine in this cluster.
> >>>>
> >>>> linux$ uname -a
> >>>> Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT
> >>>> 2004 i686 i686 i386 GNU/Linux
> >>>>
> >>>> sunos$ uname -a
> >>>> SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10
> >>>
> >>> Unfortunately, this will not work with Open MPI at present. Open
> >>> MPI 1.0.x does not have any support for running across platforms
> >>> with different endianness. Open MPI 1.1.x has much better support
> >>> for such situations, but is far from complete, as the MPI datatype
> >>> engine does not properly fix up endian issues. We're working on
> >>> the issue, but can not give a timetable for completion.
> >>>
> >>> Also note that (while not a problem here) Open MPI also does not
> >>> support running in a mixed 32 bit / 64 bit environment. All
> >>> processes must be 32 or 64 bit, but not a mix.
> >>>
> >>>> $ mpirun --hostfile hosts.txt --app mpiinit_appfile
> >>>> ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
> >>>> fatal: relocation error: file
> >>>> /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
> >>>> symbol nanosleep: referenced symbol not found
> >>>> ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
> >>>> fatal: relocation error: file
> >>>> /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
> >>>> symbol nanosleep: referenced symbol not found
> >>>>
> >>>> I have fixed this by compiling with "-lrt" option to the linker.
> >>>
> >>> You shouldn't have to do this... Could you send me the config.log
> >>> file from configuring Open MPI, the installed $prefix/lib/libmpi.la
> >>> file, and the output of mpicc -showme?
> >>>
> >>>> sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt
> >>>>
> >>>> However when I run this again, I get the error:
> >>>>
> >>>> $ mpirun --hostfile hosts.txt --app mpiinit_appfile
> >>>> [pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start
> >>>> as expected.
> >>>> [pg1cluster01:19858] ERROR: There may be more information available from
> >>>> [pg1cluster01:19858] ERROR: the remote shell (see above).
> >>>> [pg1cluster01:19858] ERROR: The daemon exited unexpectedly with
> >>>> status 255.
> >>>> 2 processes killed (possibly by Open MPI)
> >>>
> >>> Both of these are quite unexpected. It looks like there is
> >>> something wrong with your Solaris build. Can you run on *just*
> >>> the Solaris machine? We only have limited resources for testing
> >>> on Solaris, but have not run into this issue before. What happens
> >>> if you run mpirun on just the Solaris machine with the -d option
> >>> to mpirun?
> >>>
> >>>> Sometimes I get the error.
> >>>>
> >>>> $ mpirun --hostfile hosts.txt --app mpiinit_appfile
> >>>> [csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with
> >>>> errno=28
> >>>> [csultra01:06256] mca_mpool_sm_init: unable to create shared memory
> >>>> mapping
> >>>> --------------------------------------------------------------------------
> >>>> It looks like MPI_INIT failed for some reason; your parallel
> >>>> process is likely to abort. There are many reasons that a parallel
> >>>> process can fail during MPI_INIT; some of which are due to
> >>>> configuration or environment problems. This failure appears to be
> >>>> an internal failure; here's some additional information (which may
> >>>> only be relevant to an Open MPI developer):
> >>>>
> >>>> PML add procs failed
> >>>> --> Returned value -2 instead of OMPI_SUCCESS
> >>>> --------------------------------------------------------------------------
> >>>> *** An error occurred in MPI_Init
> >>>> *** before MPI was initialized
> >>>> *** MPI_ERRORS_ARE_FATAL (goodbye)
> >>>
> >>> This looks like you got far enough along that you ran into our
> >>> endianness issues, so this is about the best case you can hope for
> >>> in your configuration. The ftruncate error worries me, however.
> >>> But I think this is another symptom of something wrong with your
> >>> Sun Sparc build.
> >>>
> >>> Brian
> >>>
> >>> --
> >>> Brian Barrett
> >>> Open MPI developer
> >>> http://www.open-mpi.org/
> >>>
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> users_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>
> >> <OpenMPI-1.0.1-SunOS-5.9.tar.gz>
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Brian Barrett
> Open MPI developer
> http://www.open-mpi.org/
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>