Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] internal error with mpiJava in openmpi-1.9a1r27380
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2012-10-11 03:51:46


Hi,

> I haven't tried heterogeneous apps on the Java code yet - could well not
> work. At the least, I would expect you need to compile your Java app
> against the corresponding OMPI install on each architecture, and ensure the
> right one gets run on each node. Even though it's a Java app, the classes
> need to get linked against the proper OMPI code for that node.
>
> As for Linux-only operation: it works fine for me. Did you remember to (a)
> build mpiexec on those linux machines (as opposed to using the Solaris
> version), and (b) recompile your app against that OMPI installation?

I didn't know that the classfiles are different, but creating separate
classfiles for each architecture doesn't change anything. I use a small
shell script to create all necessary files on all machines.

tyr java 118 make_classfiles
=========== rs0 ===========
...
mpijavac -d /home/fd1026/SunOS/sparc/mpi_classfiles MsgSendRecvMain.java
mpijavac -d /home/fd1026/SunOS/sparc/mpi_classfiles ColumnSendRecvMain.java
mpijavac -d /home/fd1026/SunOS/sparc/mpi_classfiles ColumnScatterMain.java
mpijavac -d /home/fd1026/SunOS/sparc/mpi_classfiles EnvironVarMain.java
=========== sunpc1 ===========
...
mpijavac -d /home/fd1026/SunOS/x86_64/mpi_classfiles MsgSendRecvMain.java
mpijavac -d /home/fd1026/SunOS/x86_64/mpi_classfiles ColumnSendRecvMain.java
mpijavac -d /home/fd1026/SunOS/x86_64/mpi_classfiles ColumnScatterMain.java
mpijavac -d /home/fd1026/SunOS/x86_64/mpi_classfiles EnvironVarMain.java
=========== linpc1 ===========
...
mpijavac -d /home/fd1026/Linux/x86_64/mpi_classfiles MsgSendRecvMain.java
mpijavac -d /home/fd1026/Linux/x86_64/mpi_classfiles ColumnSendRecvMain.java
mpijavac -d /home/fd1026/Linux/x86_64/mpi_classfiles ColumnScatterMain.java
mpijavac -d /home/fd1026/Linux/x86_64/mpi_classfiles EnvironVarMain.java

Every machine should now find its classfiles.

tyr java 119 mpiexec -host sunpc0,linpc0,rs0 java EnvironVarMain

Operating system: SunOS Processor architecture: x86_64
  CLASSPATH: ...:.:/home/fd1026/SunOS/x86_64/mpi_classfiles

Operating system: Linux Processor architecture: x86_64
  CLASSPATH: ...:.:/home/fd1026/Linux/x86_64/mpi_classfiles

Operating system: SunOS Processor architecture: sparc
  CLASSPATH: ...:.:/home/fd1026/SunOS/sparc/mpi_classfiles
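
For readers who want to reproduce a check like the one above, here is a
minimal sketch of such a program. The source of EnvironVarMain is not shown
in this thread, so the class name EnvironSketch and its describe() method
are hypothetical, and the MPI calls that the real program presumably makes
are left out. The JVM's "os.arch" value may also be spelled differently
from the output above (for example "amd64" on some x86_64 systems).

```java
// Hypothetical stand-in for EnvironVarMain; the original source is not part
// of this thread. No MPI calls are made here.
class EnvironSketch {

    // Builds a two-line report: operating system, processor architecture,
    // and the CLASSPATH environment variable as seen by this JVM.
    static String describe() {
        String os = System.getProperty("os.name");
        String arch = System.getProperty("os.arch");
        String classPath = System.getenv("CLASSPATH");
        return "Operating system: " + os
            + " Processor architecture: " + arch
            + "\n  CLASSPATH: " + (classPath == null ? "(not set)" : classPath);
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running one copy per node, as in the mpiexec call above, would show whether
each node picks up the classfile directory built for its own platform.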

tyr java 120 mpiexec -host sunpc0,linpc0,rs0 java MsgSendRecvMain
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
...

tyr java 121 mpiexec -host sunpc0,rs0 java MsgSendRecvMain
[rs0.informatik.hs-fulda.de:13671] *** An error occurred in MPI_Comm_dup
[rs0.informatik.hs-fulda.de:13671] *** reported by process [1077346305,1]
[rs0.informatik.hs-fulda.de:13671] *** on communicator MPI_COMM_WORLD
[rs0.informatik.hs-fulda.de:13671] *** MPI_ERR_INTERN: internal error
[rs0.informatik.hs-fulda.de:13671] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[rs0.informatik.hs-fulda.de:13671] *** and potentially your MPI job)

I get an error even when I log in on a Linux machine before running
the command.

linpc0 fd1026 99 mpiexec -host linpc0,linpc1 java MsgSendRecvMain
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
...
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[linpc1:3004] Local abort before MPI_INIT completed successfully; not able to
aggregate error messages, and not able to guarantee that all other processes
were killed!
...

linpc0 fd1026 99 mpijavac -showme
/usr/local/jdk1.7.0_07-64/bin/javac -cp ...:.:/home/fd1026/Linux/x86_64/mpi_classfiles:/usr/local/openmpi-1.9_64_cc/lib64/mpi.jar

By the way, the classfiles are identical for all architectures. Are you
sure that they should be different? I don't find any absolute path names
in the files when I use "strings".

tyr java 133 diff ~/SunOS/sparc/mpi_classfiles/MsgSendRecvMain.class \
  ~/SunOS/x86_64/mpi_classfiles/MsgSendRecvMain.class
tyr java 134 diff ~/SunOS/sparc/mpi_classfiles/MsgSendRecvMain.class \
 ~/Linux/x86_64/mpi_classfiles/MsgSendRecvMain.class
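
The two diff commands above print nothing, which means the files are
byte-identical. The same check can be sketched in Java, assuming only the
standard java.nio.file API that is available since JDK 7. The class name
ClassFileCompare and its sameBytes() method are hypothetical names chosen
for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: reports whether two files have identical contents.
class ClassFileCompare {

    // Reads both files completely and compares them byte for byte.
    static boolean sameBytes(Path first, Path second) throws IOException {
        return Arrays.equals(Files.readAllBytes(first), Files.readAllBytes(second));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ClassFileCompare <file1> <file2>");
            System.exit(2);
        }
        boolean same = sameBytes(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(same ? "identical" : "different");
    }
}
```

Called with the two MsgSendRecvMain.class paths from the diff commands, it
would print "identical" whenever diff stays silent.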

Can I provide more information to help track down the problem on my Linux
systems? I assume that I have to wait until you support heterogeneous
systems, but it would be nice if Java applications ran on each of my
homogeneous systems. The strange thing is that it works on my Solaris
systems but not on Linux this time.

Do you know whether my problem with Datatype.Vector (described in one of
my other emails) is a problem in Open MPI as well? When I use a derived
data type in a scatter/gather operation, or in an operation with "count"
greater than one, do you use the extent of the base type rather than the
extent of the derived data type?
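
The arithmetic behind this question can be sketched in plain Java without
any MPI calls. In the MPI standard, a vector type built from "count" blocks
of "blocklength" elements with a given "stride" has an extent of
((count - 1) * stride + blocklength) base elements, assuming a positive
stride and no alignment padding, and consecutive items in an operation with
"count" greater than one are placed one extent of the derived type apart.
The class name, method names, and the 4 x 6 matrix below are hypothetical,
and the sketch does not show what the Java bindings in this build do.

```java
// Plain-Java arithmetic only; no MPI calls are made here.
class VectorExtentSketch {

    // Extent of a vector type measured in base elements, assuming a
    // positive stride and no extra alignment padding.
    static int vectorExtentInElements(int count, int blockLength, int stride) {
        return (count - 1) * stride + blockLength;
    }

    // Byte offset at which item number `index` starts when consecutive
    // items are spaced `extentInElements` base elements apart.
    static int itemOffsetInBytes(int index, int extentInElements, int baseSizeInBytes) {
        return index * extentInElements * baseSizeInBytes;
    }

    public static void main(String[] args) {
        // Hypothetical example: one column of a 4 x 6 row-major matrix of doubles.
        int count = 4, blockLength = 1, stride = 6, doubleSize = 8;
        int derived = vectorExtentInElements(count, blockLength, stride);
        System.out.println("derived extent in elements: " + derived);
        System.out.println("item 1 offset with derived extent: "
            + itemOffsetInBytes(1, derived, doubleSize));
        System.out.println("item 1 offset with base extent: "
            + itemOffsetInBytes(1, 1, doubleSize));
    }
}
```

For this example the derived extent is 19 elements, so the second item
starts at byte 152 if the derived extent is used and at byte 8 if only the
base extent is used. A mismatch of this kind would show up as items read
from the wrong positions.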

Kind regards

Siegmar

> On Wed, Oct 10, 2012 at 5:42 AM, Siegmar Gross <
> Siegmar.Gross_at_[hidden]> wrote:
>
> > Hi,
> >
> > I have built openmpi-1.9a1r27380 with Java support and implemented
> > a small program that sends some kind of "hello" with Send/Recv.
> >
> > tyr java 164 make
> > mpijavac -d /home/fd1026/mpi_classfiles MsgSendRecvMain.java
> > ...
> >
> > Everything works fine, if I use Solaris 10 x86_64.
> >
> > tyr java 165 mpiexec -np 3 -host sunpc0,sunpc1 \
> > java -cp $HOME/mpi_classfiles MsgSendRecvMain
> >
> > Now 2 processes are sending greetings.
> >
> > Greetings from process 2:
> > message tag: 3
> > message length: 6
> > message: sunpc1
> >
> > Greetings from process 1:
> > message tag: 3
> > message length: 6
> > message: sunpc0
> >
> >
> > Everything works fine, if I use Solaris 10 Sparc.
> >
> > tyr java 166 mpiexec -np 3 -host rs0,rs1 \
> > java -cp $HOME/mpi_classfiles MsgSendRecvMain
> >
> > Now 2 processes are sending greetings.
> >
> > Greetings from process 2:
> > message tag: 3
> > message length: 26
> > message: rs1.informatik.hs-fulda.de
> >
> > Greetings from process 1:
> > message tag: 3
> > message length: 26
> > message: rs0.informatik.hs-fulda.de
> >
> >
> > The program breaks, if I use both systems.
> >
> > tyr java 167 mpiexec -np 3 -host rs0,sunpc0 \
> > java -cp $HOME/mpi_classfiles MsgSendRecvMain
> > [rs0.informatik.hs-fulda.de:9621] *** An error occurred in MPI_Comm_dup
> > [rs0.informatik.hs-fulda.de:9621] *** reported by process [1976500225,0]
> > [rs0.informatik.hs-fulda.de:9621] *** on communicator MPI_COMM_WORLD
> > [rs0.informatik.hs-fulda.de:9621] *** MPI_ERR_INTERN: internal error
> > [rs0.informatik.hs-fulda.de:9621] *** MPI_ERRORS_ARE_FATAL (processes
> > in this communicator will now abort,
> > [rs0.informatik.hs-fulda.de:9621] *** and potentially your MPI job)
> > [tyr.informatik.hs-fulda.de:22491] 1 more process has sent help message
> > help-mpi-errors.txt / mpi_errors_are_fatal
> > [tyr.informatik.hs-fulda.de:22491] Set MCA parameter
> > "orte_base_help_aggregate" to 0 to see all help / error messages
> >
> >
> > The program breaks, if I use Linux x86_64.
> >
> > tyr java 168 mpiexec -np 3 -host linpc0,linpc1 \
> > java -cp $HOME/mpi_classfiles MsgSendRecvMain
> > --------------------------------------------------------------------------
> > It looks like opal_init failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during opal_init; some of which are due to configuration or
> > environment problems. This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> > mca_base_open failed
> > --> Returned value -2 instead of OPAL_SUCCESS
> > --------------------------------------------------------------------------
> > ...
> >
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [linpc0:20277] Local abort before MPI_INIT completed successfully;
> > not able to aggregate error messages, and not able to guarantee that
> > all other processes were killed!
> > ...
> >
> >
> > Please let me know if you need more information to track the problem.
> > Thank you very much for any help in advance.
> >
> >
> > Kind regards
> >
> > Siegmar
> >