Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problems with mpiJava in openmpi-1.9a1r27362
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-09-26 10:53:45


Hmmm...well, this is indeed confusing. I see the following in your attached
output:

[sunpc4.informatik.hs-fulda.de][[4083,1],2][../../../../../openmpi-1.9a1r27362/ompi/mca/btl/sctp/btl_sctp_proc.c:143:mca_btl_sctp_proc_create]
mca_base_modex_recv: failed with return value=-13
[rs0.informatik.hs-fulda.de][[4083,1],3][../../../../../openmpi-1.9a1r27362/ompi/mca/btl/sctp/btl_sctp_proc.c:143:mca_btl_sctp_proc_create]
mca_base_modex_recv: failed with return value=-13
[rs0.informatik.hs-fulda.de][[4083,1],3][../../../../../openmpi-1.9a1r27362/ompi/mca/btl/sctp/btl_sctp_proc.c:143:mca_btl_sctp_proc_create]
mca_base_modex_recv: failed with return value=-13
[rs0.informatik.hs-fulda.de][[4083,1],3][../../../../../openmpi-1.9a1r27362/ompi/mca/btl/sctp/btl_sctp_proc.c:143:mca_btl_sctp_proc_create]
mca_base_modex_recv: failed with return value=-13

This implies that at least some of the processes started and got all the
way into MPI_Init. You should probably exclude the sctp BTL as it's not
necessarily working - just add -mca btl ^sctp to the cmd line.
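
For example, something along these lines (reusing the hosts from your test
run) should pick up every BTL except sctp:

  mpiexec -mca btl ^sctp -np 4 -host linpc4,sunpc4,rs0 \
    java -cp $HOME/mpi_classfiles HelloMainWithBarrier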

Does this work if you leave linpc out of it? I'm wondering if this is the
heterogeneous problem again. Are you sure that the /usr/local... OMPI
library on that machine is the Linux x86_64 version, and not the Solaris
one (e.g., if /usr/local was NFS mounted)?
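
One quick way to check (assuming the install really is under the prefix you
configured) is to run "file" against the library on linpc4, e.g.:

  file /usr/local/openmpi-1.9_64_cc/lib64/libmpi.so

On the Linux box that should report a 64-bit x86-64 ELF shared object; if it
shows up as a SPARC or Solaris binary instead, the wrong install is being
picked up.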

On Wed, Sep 26, 2012 at 7:30 AM, Siegmar Gross <
Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> > I'm on the road the rest of this week, but can look at this when I return
> > next week. It looks like something unrelated to the Java bindings failed to
> > properly initialize - at a guess, I'd suspect that you are missing the
> > LD_LIBRARY_PATH setting so none of the OMPI libs were found.
>
> Perhaps the output of my environment program is helpful in that case.
> I attached my environment.
>
> mpiexec -np 4 -host linpc4,sunpc4,rs0 environ_mpi \
> >& env_linpc_sunpc_sparc.txt
>
> Thank you very much for your help in advance.
>
>
> Kind regards
>
> Siegmar
>
>
> > On Wed, Sep 26, 2012 at 5:42 AM, Siegmar Gross <
> > Siegmar.Gross_at_[hidden]> wrote:
> >
> > > Hi,
> > >
> > > yesterday I installed openmpi-1.9a1r27362 on Solaris and Linux and
> > > I have a problem with mpiJava on Linux (openSUSE-Linux 12.1, x86_64).
> > >
> > >
> > > linpc4 mpi_classfiles 104 javac HelloMainWithoutMPI.java
> > > linpc4 mpi_classfiles 105 mpijavac HelloMainWithBarrier.java
> > > linpc4 mpi_classfiles 106 mpijavac -showme
> > > /usr/local/jdk1.7.0_07-64/bin/javac \
> > > -cp ...:.:/usr/local/openmpi-1.9_64_cc/lib64/mpi.jar
> > >
> > >
> > > It works with Java without MPI.
> > >
> > > linpc4 mpi_classfiles 107 mpiexec java -cp $HOME/mpi_classfiles \
> > > HelloMainWithoutMPI
> > > Hello from linpc4.informatik.hs-fulda.de/193.174.26.225
> > >
> > >
> > > It breaks with Java and MPI.
> > >
> > > linpc4 mpi_classfiles 108 mpiexec java -cp $HOME/mpi_classfiles \
> > > HelloMainWithBarrier
> > >
> > > --------------------------------------------------------------------------
> > > It looks like opal_init failed for some reason; your parallel process is
> > > likely to abort. There are many reasons that a parallel process can
> > > fail during opal_init; some of which are due to configuration or
> > > environment problems. This failure appears to be an internal failure;
> > > here's some additional information (which may only be relevant to an
> > > Open MPI developer):
> > >
> > > mca_base_open failed
> > > --> Returned value -2 instead of OPAL_SUCCESS
> > >
> > > --------------------------------------------------------------------------
> > >
> > > --------------------------------------------------------------------------
> > > It looks like orte_init failed for some reason; your parallel process is
> > > likely to abort. There are many reasons that a parallel process can
> > > fail during orte_init; some of which are due to configuration or
> > > environment problems. This failure appears to be an internal failure;
> > > here's some additional information (which may only be relevant to an
> > > Open MPI developer):
> > >
> > > opal_init failed
> > > --> Returned value Out of resource (-2) instead of ORTE_SUCCESS
> > >
> > > --------------------------------------------------------------------------
> > >
> > > --------------------------------------------------------------------------
> > > It looks like MPI_INIT failed for some reason; your parallel process is
> > > likely to abort. There are many reasons that a parallel process can
> > > fail during MPI_INIT; some of which are due to configuration or
> > > environment problems. This failure appears to be an internal failure;
> > > here's some additional information (which may only be relevant to an
> > > Open MPI developer):
> > >
> > > ompi_mpi_init: orte_init failed
> > > --> Returned "Out of resource" (-2) instead of "Success" (0)
> > >
> > > --------------------------------------------------------------------------
> > > *** An error occurred in MPI_Init
> > > *** on a NULL communicator
> > > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > > *** and potentially your MPI job)
> > > [linpc4:15332] Local abort before MPI_INIT completed successfully; not
> > > able to aggregate error messages, and not able to guarantee that all
> > > other processes were killed!
> > > -------------------------------------------------------
> > > Primary job terminated normally, but 1 process returned
> > > a non-zero exit code.. Per user-direction, the job has been aborted.
> > > -------------------------------------------------------
> > >
> > > --------------------------------------------------------------------------
> > > mpiexec detected that one or more processes exited with non-zero status,
> > > thus causing the job to be terminated. The first process to do so was:
> > >
> > > Process name: [[58875,1],0]
> > > Exit code: 1
> > >
> > > --------------------------------------------------------------------------
> > >
> > >
> > > I configured with the following command.
> > >
> > > ../openmpi-1.9a1r27362/configure --prefix=/usr/local/openmpi-1.9_64_cc \
> > > --libdir=/usr/local/openmpi-1.9_64_cc/lib64 \
> > > --with-jdk-bindir=/usr/local/jdk1.7.0_07-64/bin \
> > > --with-jdk-headers=/usr/local/jdk1.7.0_07-64/include \
> > > JAVA_HOME=/usr/local/jdk1.7.0_07-64 \
> > > LDFLAGS="-m64" \
> > > CC="cc" CXX="CC" FC="f95" \
> > > CFLAGS="-m64" CXXFLAGS="-m64 -library=stlport4" FCFLAGS="-m64" \
> > > CPP="cpp" CXXCPP="cpp" \
> > > CPPFLAGS="" CXXCPPFLAGS="" \
> > > C_INCL_PATH="" C_INCLUDE_PATH="" CPLUS_INCLUDE_PATH="" \
> > > OBJC_INCLUDE_PATH="" OPENMPI_HOME="" \
> > > --enable-cxx-exceptions \
> > > --enable-mpi-java \
> > > --enable-heterogeneous \
> > > --enable-opal-multi-threads \
> > > --enable-mpi-thread-multiple \
> > > --with-threads=posix \
> > > --with-hwloc=internal \
> > > --without-verbs \
> > > --without-udapl \
> > > --with-wrapper-cflags=-m64 \
> > > --enable-debug \
> > > |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
> > >
> > >
> > > It works fine on Solaris machines as long as the hosts belong to the
> > > same kind (Sparc or x86_64).
> > >
> > > tyr mpi_classfiles 194 mpiexec -host sunpc0,sunpc1,sunpc4 \
> > > java -cp $HOME/mpi_classfiles HelloMainWithBarrier
> > > Process 1 of 3 running on sunpc1
> > > Process 2 of 3 running on sunpc4.informatik.hs-fulda.de
> > > Process 0 of 3 running on sunpc0
> > >
> > > sunpc4 fd1026 107 mpiexec -host tyr,rs0,rs1 \
> > > java -cp $HOME/mpi_classfiles HelloMainWithBarrier
> > > Process 1 of 3 running on rs0.informatik.hs-fulda.de
> > > Process 2 of 3 running on rs1.informatik.hs-fulda.de
> > > Process 0 of 3 running on tyr.informatik.hs-fulda.de
> > >
> > >
> > > It breaks if the hosts belong to both kinds of machines.
> > >
> > > sunpc4 fd1026 106 mpiexec -host tyr,rs0,sunpc1 \
> > > java -cp $HOME/mpi_classfiles HelloMainWithBarrier
> > > [rs0.informatik.hs-fulda.de:7718] *** An error occurred in MPI_Comm_dup
> > > [rs0.informatik.hs-fulda.de:7718] *** reported by process [565116929,1]
> > > [rs0.informatik.hs-fulda.de:7718] *** on communicator MPI_COMM_WORLD
> > > [rs0.informatik.hs-fulda.de:7718] *** MPI_ERR_INTERN: internal error
> > > [rs0.informatik.hs-fulda.de:7718] *** MPI_ERRORS_ARE_FATAL (processes
> > > in this communicator will now abort,
> > > [rs0.informatik.hs-fulda.de:7718] *** and potentially your MPI job)
> > > [sunpc4.informatik.hs-fulda.de:07900] 1 more process has sent help
> > > message help-mpi-errors.txt / mpi_errors_are_fatal
> > > [sunpc4.informatik.hs-fulda.de:07900] Set MCA parameter
> > > "orte_base_help_aggregate" to 0 to see all help / error messages
> > >
> > >
> > > Please let me know if I can provide anything else to track these errors.
> > > Thank you very much for any help in advance.
> > >
> > >
> > > Kind regards
> > >
> > > Siegmar
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
>
> [sunpc4.informatik.hs-fulda.de][[4083,1],2][../../../../../openmpi-1.9a1r27362/ompi/mca/btl/sctp/btl_sctp_proc.c:143:mca_btl_sctp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [rs0.informatik.hs-fulda.de][[4083,1],3][../../../../../openmpi-1.9a1r27362/ompi/mca/btl/sctp/btl_sctp_proc.c:143:mca_btl_sctp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [rs0.informatik.hs-fulda.de][[4083,1],3][../../../../../openmpi-1.9a1r27362/ompi/mca/btl/sctp/btl_sctp_proc.c:143:mca_btl_sctp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [rs0.informatik.hs-fulda.de][[4083,1],3][../../../../../openmpi-1.9a1r27362/ompi/mca/btl/sctp/btl_sctp_proc.c:143:mca_btl_sctp_proc_create]
> mca_base_modex_recv: failed with return value=-13
>
>
> Now 3 slave tasks are sending their environment.
>
> Environment from task 1:
> message type: 3
> msg length: 3911 characters
> message:
> hostname: linpc4
> operating system: Linux
> release: 3.1.9-1.4-desktop
> processor: x86_64
> PATH
> /usr/local/eclipse-3.6.1
> /usr/local/NetBeans-4.0/bin
> /usr/local/jdk1.7.0_07-64/bin
> /usr/local/apache-ant-1.6.2/bin
> /usr/local/icc-9.1/idb/bin
> /usr/local/icc-9.1/cc/bin
> /usr/local/icc-9.1/fc/bin
> /usr/local/gcc-4.7.1/bin
> /opt/solstudio12.3/bin
> /usr/local/bin
> /usr/local/ssl/bin
> /usr/local/pgsql/bin
> /bin
> /usr/bin
> /usr/X11R6/bin
> /usr/local/teTeX-1.0.7/bin/i586-pc-linux-gnu
> /usr/local/bluej-2.1.2
> /usr/local/openmpi-1.9_64_cc/bin
> /home/fd1026/Linux/x86_64/bin
> .
> /usr/sbin
> LD_LIBRARY_PATH_32
> /usr/lib
> /usr/local/jdk1.7.0_07-64/jre/lib/i386
> /usr/local/gcc-4.7.1/lib
>
> /usr/local/gcc-4.7.1/libexec/gcc/x86_64-unknown-linux-gnu/4.7.1/32
>
> /usr/local/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/32
> /usr/local/lib
> /usr/local/ssl/lib
> /lib
> /usr/lib
> /usr/X11R6/lib
> /usr/local/openmpi-1.9_64_cc/lib
> /home/fd1026/Linux/x86_64/lib
> LD_LIBRARY_PATH_64
> /usr/lib64
> /usr/local/jdk1.7.0_07-64/jre/lib/amd64
> /usr/local/gcc-4.7.1/lib64
>
> /usr/local/gcc-4.7.1/libexec/gcc/x86_64-unknown-linux-gnu/4.7.1
>
> /usr/local/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1
> /usr/local/lib64
> /usr/local/ssl/lib64
> /usr/lib64
> /usr/X11R6/lib64
> /usr/local/openmpi-1.9_64_cc/lib64
> /home/fd1026/Linux/x86_64/lib64
> LD_LIBRARY_PATH
> /usr/lib
> /usr/local/jdk1.7.0_07-64/jre/lib/i386
> /usr/local/gcc-4.7.1/lib
>
> /usr/local/gcc-4.7.1/libexec/gcc/x86_64-unknown-linux-gnu/4.7.1/32
>
> /usr/local/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/32
> /usr/local/lib
> /usr/local/ssl/lib
> /lib
> /usr/lib
> /usr/X11R6/lib
> /usr/local/openmpi-1.9_64_cc/lib
> /usr/lib64
> /usr/local/jdk1.7.0_07-64/jre/lib/amd64
> /usr/local/gcc-4.7.1/lib64
>
> /usr/local/gcc-4.7.1/libexec/gcc/x86_64-unknown-linux-gnu/4.7.1
>
> /usr/local/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1
> /usr/local/lib64
> /usr/local/ssl/lib64
> /usr/lib64
> /usr/X11R6/lib64
> /usr/local/openmpi-1.9_64_cc/lib64
> /home/fd1026/Linux/x86_64/lib64
> CLASSPATH
> /usr/local/junit4.10
> /usr/local/junit4.10/junit-4.10.jar
> //usr/local/jdk1.7.0_07-64/j3d/lib/ext/j3dcore.jar
> //usr/local/jdk1.7.0_07-64/j3d/lib/ext/j3dutils.jar
> //usr/local/jdk1.7.0_07-64/j3d/lib/ext/vecmath.jar
> /usr/local/javacc-5.0/javacc.jar
> .
>
> Environment from task 2:
> message type: 3
> msg length: 4196 characters
> message:
> hostname: sunpc4.informatik.hs-fulda.de
> operating system: SunOS
> release: 5.10
> processor: i86pc
> PATH
> /usr/local/eclipse-3.6.1
> /usr/local/NetBeans-4.0/bin
> /usr/local/jdk1.7.0_07/bin/amd64
> /usr/local/apache-ant-1.6.2/bin
> /usr/local/gcc-4.7.1/bin
> /opt/solstudio12.3/bin
> /usr/local/bin
> /usr/local/ssl/bin
> /usr/local/pgsql/bin
> /usr/bin
> /usr/openwin/bin
> /usr/dt/bin
> /usr/ccs/bin
> /usr/sfw/bin
> /opt/sfw/bin
> /usr/ucb
> /usr/lib/lp/postscript
> /usr/local/teTeX-1.0.7/bin/i386-pc-solaris2.10
> /usr/local/bluej-2.1.2
> /usr/local/openmpi-1.9_64_cc/bin
> /home/fd1026/SunOS/x86_64/bin
> .
> /usr/sbin
> LD_LIBRARY_PATH_32
> /usr/lib
> /usr/local/jdk1.7.0_07/jre/lib/i386
> /usr/local/gcc-4.7.1/lib
>
> /usr/local/gcc-4.7.1/lib/gcc/i386-pc-solaris2.10/4.7.1
> /usr/local/lib
> /usr/local/ssl/lib
> /usr/local/oracle
> /usr/local/pgsql/lib
> /usr/lib
> /usr/openwin/lib
> /usr/openwin/server/lib
> /usr/dt/lib
> /usr/X11R6/lib
> /usr/ccs/lib
> /usr/sfw/lib
> /opt/sfw/lib
> /usr/ucblib
> /usr/local/openmpi-1.9_64_cc/lib
> /home/fd1026/SunOS/x86_64/lib
> LD_LIBRARY_PATH_64
> /usr/lib/amd64
> /usr/local/jdk1.7.0_07/jre/lib/amd64
> /usr/local/gcc-4.7.1/lib/amd64
>
> /usr/local/gcc-4.7.1/lib/gcc/i386-pc-solaris2.10/4.7.1/amd64
> /usr/local/lib/amd64
> /usr/local/ssl/lib/amd64
> /usr/local/lib64
> /usr/lib/amd64
> /usr/openwin/lib/amd64
> /usr/openwin/server/lib/amd64
> /usr/dt/lib/amd64
> /usr/X11R6/lib/amd64
> /usr/ccs/lib/amd64
> /usr/sfw/lib/amd64
> /opt/sfw/lib/amd64
> /usr/ucblib/amd64
> /usr/local/openmpi-1.9_64_cc/lib64
> /home/fd1026/SunOS/x86_64/lib64
> LD_LIBRARY_PATH
> /usr/lib/amd64
> /usr/local/jdk1.7.0_07/jre/lib/amd64
> /usr/local/gcc-4.7.1/lib/amd64
>
> /usr/local/gcc-4.7.1/lib/gcc/i386-pc-solaris2.10/4.7.1/amd64
> /usr/local/lib/amd64
> /usr/local/ssl/lib/amd64
> /usr/local/lib64
> /usr/lib/amd64
> /usr/openwin/lib/amd64
> /usr/openwin/server/lib/amd64
> /usr/dt/lib/amd64
> /usr/X11R6/lib/amd64
> /usr/ccs/lib/amd64
> /usr/sfw/lib/amd64
> /opt/sfw/lib/amd64
> /usr/ucblib/amd64
> /usr/local/openmpi-1.9_64_cc/lib64
> /home/fd1026/SunOS/x86_64/lib64
> CLASSPATH
> /usr/local/junit4.10
> /usr/local/junit4.10/junit-4.10.jar
> //usr/local/jdk1.7.0_07/j3d/lib/ext/j3dcore.jar
> //usr/local/jdk1.7.0_07/j3d/lib/ext/j3dutils.jar
> //usr/local/jdk1.7.0_07/j3d/lib/ext/vecmath.jar
> /usr/local/javacc-5.0/javacc.jar
> .
>
> Environment from task 3:
> message type: 3
> msg length: 4394 characters
> message:
> hostname: rs0.informatik.hs-fulda.de
> operating system: SunOS
> release: 5.10
> processor: sun4u
> PATH
> /usr/local/eclipse-3.6.1
> /usr/local/NetBeans-4.0/bin
> /usr/local/jdk1.7.0_07/bin/sparcv9
> /usr/local/apache-ant-1.6.2/bin
> /usr/local/gcc-4.7.1/bin
> /opt/solstudio12.3/bin
> /usr/local/bin
> /usr/local/ssl/bin
> /usr/local/pgsql/bin
> /usr/bin
> /usr/openwin/bin
> /usr/dt/bin
> /usr/ccs/bin
> /usr/sfw/bin
> /opt/sfw/bin
> /usr/ucb
> /usr/xpg4/bin
> /usr/local/teTeX-1.0.7/bin/sparc-sun-solaris2.10
> /usr/local/bluej-2.1.2
> /usr/local/openmpi-1.9_64_cc/bin
> /home/fd1026/SunOS/sparc/bin
> .
> /usr/sbin
> LD_LIBRARY_PATH_32
> /usr/lib
> /usr/local/jdk1.7.0_07/jre/lib/sparc
> /usr/local/gcc-4.7.1/lib
>
> /usr/local/gcc-4.7.1/lib/gcc/sparc-sun-solaris2.10/4.7.1
> /usr/local/lib
> /usr/local/ssl/lib
> /usr/local/oracle
> /usr/local/pgsql/lib
> /lib
> /usr/lib
> /usr/openwin/lib
> /usr/dt/lib
> /usr/X11R6/lib
> /usr/ccs/lib
> /usr/sfw/lib
> /opt/sfw/lib
> /usr/ucblib
> /usr/local/openmpi-1.9_64_cc/lib
> /home/fd1026/SunOS/sparc/lib
> LD_LIBRARY_PATH_64
> /usr/lib/sparcv9
> /usr/local/jdk1.7.0_07/jre/lib/sparcv9
> /usr/local/gcc-4.7.1/lib/sparcv9
>
> /usr/local/gcc-4.7.1/lib/gcc/sparc-sun-solaris2.10/4.7.1/sparcv9
> /usr/local/lib/sparcv9
> /usr/local/ssl/lib/sparcv9
> /usr/local/lib64
> /usr/local/oracle/sparcv9
> /usr/local/pgsql/lib/sparcv9
> /lib/sparcv9
> /usr/lib/sparcv9
> /usr/openwin/lib/sparcv9
> /usr/dt/lib/sparcv9
> /usr/X11R6/lib/sparcv9
> /usr/ccs/lib/sparcv9
> /usr/sfw/lib/sparcv9
> /opt/sfw/lib/sparcv9
> /usr/ucblib/sparcv9
> /usr/local/openmpi-1.9_64_cc/lib64
> /home/fd1026/SunOS/sparc/lib64
> LD_LIBRARY_PATH
> /usr/lib/sparcv9
> /usr/local/jdk1.7.0_07/jre/lib/sparcv9
> /usr/local/gcc-4.7.1/lib/sparcv9
>
> /usr/local/gcc-4.7.1/lib/gcc/sparc-sun-solaris2.10/4.7.1/sparcv9
> /usr/local/lib/sparcv9
> /usr/local/ssl/lib/sparcv9
> /usr/local/lib64
> /usr/local/oracle/sparcv9
> /usr/local/pgsql/lib/sparcv9
> /lib/sparcv9
> /usr/lib/sparcv9
> /usr/openwin/lib/sparcv9
> /usr/dt/lib/sparcv9
> /usr/X11R6/lib/sparcv9
> /usr/ccs/lib/sparcv9
> /usr/sfw/lib/sparcv9
> /opt/sfw/lib/sparcv9
> /usr/ucblib/sparcv9
> /usr/local/openmpi-1.9_64_cc/lib64
> /home/fd1026/SunOS/sparc/lib
> CLASSPATH
> /usr/local/junit4.10
> /usr/local/junit4.10/junit-4.10.jar
> //usr/local/jdk1.7.0_07/j3d/lib/ext/j3dcore.jar
> //usr/local/jdk1.7.0_07/j3d/lib/ext/j3dutils.jar
> //usr/local/jdk1.7.0_07/j3d/lib/ext/vecmath.jar
> /usr/local/javacc-5.0/javacc.jar
> .
>
>
>