Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] internal error with mpiJava in openmpi-1.9a1r27380
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-10-11 22:27:43


Like I said, I haven't tried any of that, so I have no idea if or how it would
work. I don't have access to any heterogeneous system, and we don't see such
setups very often at all, so it is quite possible that the heterogeneous
support really isn't there.

I'll look at some of the Java-specific issues later.

On Thu, Oct 11, 2012 at 12:51 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> > I haven't tried heterogeneous apps on the Java code yet; it could well not
> > work. At the least, I would expect you need to compile your Java app
> > against the corresponding OMPI install on each architecture, and ensure the
> > right one gets run on each node. Even though it's a Java app, the classes
> > need to get linked against the proper OMPI code for that node.
> >
> > As for Linux-only operation: it works fine for me. Did you remember to (a)
> > build mpiexec on those Linux machines (as opposed to using the Solaris
> > version), and (b) recompile your app against that OMPI installation?
>
> I didn't know that the class files are different, but creating separate
> class files on each machine doesn't change anything. I use a small shell
> script to create all necessary files on all machines.
>
>
> tyr java 118 make_classfiles
> =========== rs0 ===========
> ...
> mpijavac -d /home/fd1026/SunOS/sparc/mpi_classfiles MsgSendRecvMain.java
> mpijavac -d /home/fd1026/SunOS/sparc/mpi_classfiles ColumnSendRecvMain.java
> mpijavac -d /home/fd1026/SunOS/sparc/mpi_classfiles ColumnScatterMain.java
> mpijavac -d /home/fd1026/SunOS/sparc/mpi_classfiles EnvironVarMain.java
> =========== sunpc1 ===========
> ...
> mpijavac -d /home/fd1026/SunOS/x86_64/mpi_classfiles MsgSendRecvMain.java
> mpijavac -d /home/fd1026/SunOS/x86_64/mpi_classfiles ColumnSendRecvMain.java
> mpijavac -d /home/fd1026/SunOS/x86_64/mpi_classfiles ColumnScatterMain.java
> mpijavac -d /home/fd1026/SunOS/x86_64/mpi_classfiles EnvironVarMain.java
> =========== linpc1 ===========
> ...
> mpijavac -d /home/fd1026/Linux/x86_64/mpi_classfiles MsgSendRecvMain.java
> mpijavac -d /home/fd1026/Linux/x86_64/mpi_classfiles ColumnSendRecvMain.java
> mpijavac -d /home/fd1026/Linux/x86_64/mpi_classfiles ColumnScatterMain.java
> mpijavac -d /home/fd1026/Linux/x86_64/mpi_classfiles EnvironVarMain.java
>
>
> Every machine should now find its classfiles.
>
> tyr java 119 mpiexec -host sunpc0,linpc0,rs0 java EnvironVarMain
>
> Operating system: SunOS Processor architecture: x86_64
> CLASSPATH: ...:.:/home/fd1026/SunOS/x86_64/mpi_classfiles
>
> Operating system: Linux Processor architecture: x86_64
> CLASSPATH: ...:.:/home/fd1026/Linux/x86_64/mpi_classfiles
>
> Operating system: SunOS Processor architecture: sparc
> CLASSPATH: ...:.:/home/fd1026/SunOS/sparc/mpi_classfiles
>
>
>
> tyr java 120 mpiexec -host sunpc0,linpc0,rs0 java MsgSendRecvMain
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> mca_base_open failed
> --> Returned value -2 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> ...
>
>
>
> tyr java 121 mpiexec -host sunpc0,rs0 java MsgSendRecvMain
> [rs0.informatik.hs-fulda.de:13671] *** An error occurred in MPI_Comm_dup
> [rs0.informatik.hs-fulda.de:13671] *** reported by process [1077346305,1]
> [rs0.informatik.hs-fulda.de:13671] *** on communicator MPI_COMM_WORLD
> [rs0.informatik.hs-fulda.de:13671] *** MPI_ERR_INTERN: internal error
> [rs0.informatik.hs-fulda.de:13671] *** MPI_ERRORS_ARE_FATAL (processes
> in this communicator will now abort,
> [rs0.informatik.hs-fulda.de:13671] *** and potentially your MPI job)
>
>
>
> I get an error even when I log in on a Linux machine before I run the
> command.
>
> linpc0 fd1026 99 mpiexec -host linpc0,linpc1 java MsgSendRecvMain
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> mca_base_open failed
> --> Returned value -2 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> ...
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [linpc1:3004] Local abort before MPI_INIT completed successfully;
> not able to aggregate error messages, and not able to guarantee that
> all other processes were killed!
> ...
>
>
> linpc0 fd1026 99 mpijavac -showme
> /usr/local/jdk1.7.0_07-64/bin/javac -cp ...:.:/home/fd1026/Linux/x86_64/mpi_classfiles:/usr/local/openmpi-1.9_64_cc/lib64/mpi.jar
>
>
> By the way, the class files are identical for all architectures. Are you
> sure that they should be different? When I run "strings" on them, I don't
> find any absolute path names.
>
> tyr java 133 diff ~/SunOS/sparc/mpi_classfiles/MsgSendRecvMain.class \
> ~/SunOS/x86_64/mpi_classfiles/MsgSendRecvMain.class
> tyr java 134 diff ~/SunOS/sparc/mpi_classfiles/MsgSendRecvMain.class \
> ~/Linux/x86_64/mpi_classfiles/MsgSendRecvMain.class
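The empty `diff` output above means the class files are byte-for-byte identical. That is expected: `javac` emits platform-independent bytecode, so any architecture dependence in a Java MPI binding presumably lives in the native (JNI) library that `mpi.jar` loads at run time, not in the application's `.class` files. A minimal Java sketch of the same byte-for-byte check (the class name `ClassFileCompare` is hypothetical and not part of Open MPI; it only uses Java 7 APIs, matching the JDK shown above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: compares two compiled class files byte for byte,
// which is the same check that `diff` performs on them.
public class ClassFileCompare {

    // Returns true if both files have exactly the same contents.
    static boolean sameBytes(Path a, Path b) throws IOException {
        return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ClassFileCompare <file1.class> <file2.class>");
            System.exit(2);
        }
        Path a = Paths.get(args[0]);
        Path b = Paths.get(args[1]);
        System.out.println(sameBytes(a, b) ? "identical" : "different");
    }
}
```

Usage, e.g.: `java ClassFileCompare ~/SunOS/sparc/mpi_classfiles/MsgSendRecvMain.class ~/Linux/x86_64/mpi_classfiles/MsgSendRecvMain.class`.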
>
>
>
> Can I provide more information to help track down the problem on my Linux
> systems? I suppose I have to wait until you support heterogeneous systems,
> but it would be nice if Java applications ran on each of my homogeneous
> systems. The strange thing is that this time it works on my Solaris
> systems but not on Linux.
>
> Do you know whether my problem with Datatype.Vector (described in one of
> my other emails) is also an Open MPI problem? Do you use the extent of
> the base type instead of the extent of the derived data type when I use
> a derived data type in a scatter/gather operation or in an operation with
> "count" greater than one?
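For reference, the MPI standard's rule (as opposed to Open MPI's actual behavior, which is the open question here) is that consecutive items in an operation with count > 1, and consecutive chunks in a scatter/gather, are placed one extent of the *derived* type apart. For a vector type describing one column of a row-major matrix, that extent covers almost the whole matrix, so the second column is never reached; the usual remedy in the C API is to shrink the extent with `MPI_Type_create_resized`. A minimal plain-Java sketch of that arithmetic, with no MPI calls (the class and method names and the 4 x 3 matrix are hypothetical, and the extent formula assumes a positive stride and a base type with lower bound 0):

```java
// Plain-Java sketch of MPI extent arithmetic (no MPI library is used).
// All sizes and offsets are in units of base-type elements.
public class VectorExtentSketch {

    // Extent of a vector type built from `count` blocks of `blocklength`
    // elements whose starts are `stride` elements apart (stride > 0, lb = 0):
    // from the first element of the first block to one past the last block.
    static int vectorExtent(int count, int blocklength, int stride) {
        return (count - 1) * stride + blocklength;
    }

    // Element offset where process i's chunk starts in a scatter that sends
    // `sendcount` items of a type with the given extent to each process.
    static int chunkOffset(int i, int sendcount, int extent) {
        return i * sendcount * extent;
    }

    public static void main(String[] args) {
        int rows = 4, cols = 3;                      // hypothetical row-major matrix
        int colExtent = vectorExtent(rows, 1, cols); // one column: rows blocks of 1, stride cols
        System.out.println("column type extent: " + colExtent + " elements");
        for (int i = 0; i < cols; i++) {
            // Column i really starts at element i, but the scatter uses i * extent.
            System.out.println("process " + i + ": chunk starts at element "
                    + chunkOffset(i, 1, colExtent) + ", column " + i
                    + " starts at element " + i);
        }
    }
}
```

With `rows = 4` and `cols = 3`, the column type's extent is `(4 - 1) * 3 + 1 = 10` elements, so process 1's chunk would start at element 10 of a 12-element matrix instead of at element 1. Resizing the column type's extent to one element would make the chunk offsets line up with the column starts.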
>
>
> Kind regards
>
> Siegmar
>
>
>
> > On Wed, Oct 10, 2012 at 5:42 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
> >
> > > Hi,
> > >
> > > I have built openmpi-1.9a1r27380 with Java support and implemented
> > > a small program that sends some kind of "hello" with Send/Recv.
> > >
> > > tyr java 164 make
> > > mpijavac -d /home/fd1026/mpi_classfiles MsgSendRecvMain.java
> > > ...
> > >
> > > Everything works fine if I use Solaris 10 x86_64.
> > >
> > > tyr java 165 mpiexec -np 3 -host sunpc0,sunpc1 \
> > > java -cp $HOME/mpi_classfiles MsgSendRecvMain
> > >
> > > Now 2 processes are sending greetings.
> > >
> > > Greetings from process 2:
> > > message tag: 3
> > > message length: 6
> > > message: sunpc1
> > >
> > > Greetings from process 1:
> > > message tag: 3
> > > message length: 6
> > > message: sunpc0
> > >
> > >
> > > Everything works fine if I use Solaris 10 SPARC.
> > >
> > > tyr java 166 mpiexec -np 3 -host rs0,rs1 \
> > > java -cp $HOME/mpi_classfiles MsgSendRecvMain
> > >
> > > Now 2 processes are sending greetings.
> > >
> > > Greetings from process 2:
> > > message tag: 3
> > > message length: 26
> > > message: rs1.informatik.hs-fulda.de
> > >
> > > Greetings from process 1:
> > > message tag: 3
> > > message length: 26
> > > message: rs0.informatik.hs-fulda.de
> > >
> > >
> > > The program breaks if I use both systems.
> > >
> > > tyr java 167 mpiexec -np 3 -host rs0,sunpc0 \
> > > java -cp $HOME/mpi_classfiles MsgSendRecvMain
> > > [rs0.informatik.hs-fulda.de:9621] *** An error occurred in MPI_Comm_dup
> > > [rs0.informatik.hs-fulda.de:9621] *** reported by process [1976500225,0]
> > > [rs0.informatik.hs-fulda.de:9621] *** on communicator MPI_COMM_WORLD
> > > [rs0.informatik.hs-fulda.de:9621] *** MPI_ERR_INTERN: internal error
> > > [rs0.informatik.hs-fulda.de:9621] *** MPI_ERRORS_ARE_FATAL (processes
> > > in this communicator will now abort,
> > > [rs0.informatik.hs-fulda.de:9621] *** and potentially your MPI job)
> > > [tyr.informatik.hs-fulda.de:22491] 1 more process has sent help message
> > > help-mpi-errors.txt / mpi_errors_are_fatal
> > > [tyr.informatik.hs-fulda.de:22491] Set MCA parameter
> > > "orte_base_help_aggregate" to 0 to see all help / error messages
> > >
> > >
> > > The program breaks if I use Linux x86_64.
> > >
> > > tyr java 168 mpiexec -np 3 -host linpc0,linpc1 \
> > > java -cp $HOME/mpi_classfiles MsgSendRecvMain
> > >
> > > --------------------------------------------------------------------------
> > > It looks like opal_init failed for some reason; your parallel process is
> > > likely to abort. There are many reasons that a parallel process can
> > > fail during opal_init; some of which are due to configuration or
> > > environment problems. This failure appears to be an internal failure;
> > > here's some additional information (which may only be relevant to an
> > > Open MPI developer):
> > >
> > > mca_base_open failed
> > > --> Returned value -2 instead of OPAL_SUCCESS
> > > --------------------------------------------------------------------------
> > > ...
> > >
> > > *** An error occurred in MPI_Init
> > > *** on a NULL communicator
> > > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > > *** and potentially your MPI job)
> > > [linpc0:20277] Local abort before MPI_INIT completed successfully;
> > > not able to aggregate error messages, and not able to guarantee that
> > > all other processes were killed!
> > > ...
> > >
> > >
> > > Please let me know if you need more information to track the problem.
> > > Thank you very much for any help in advance.
> > >
> > >
> > > Kind regards
> > >
> > > Siegmar
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
>
>