
Open MPI User's Mailing List Archives


Subject: [OMPI users] broadcasting basic data items in Java
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2012-12-21 04:41:28


Hi

I'm still using "Open MPI: 1.9a1r27668" and Java 1.7.0_07. Today I
implemented a few programs to broadcast int, int[], double, or
double[] values. All four programs compile without problems, which
means that "Object buf" as a parameter of "MPI.COMM_WORLD.Bcast"
isn't a problem for basic datatypes. Unfortunately, I only get the
expected result for arrays of a basic datatype.

Process 1 doesn't receive the int value (both processes run on
Solaris 10 Sparc):

tyr java 159 mpiexec -np 2 java BcastIntMain
Process 1 running on tyr.informatik.hs-fulda.de.
  intValue: 0
Process 0 running on tyr.informatik.hs-fulda.de.
  intValue: 1234567
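The program sources aren't attached, so this is only a guess at the cause: passing an int where "Object buf" is expected autoboxes it into an immutable Integer, and because Java passes references by value, the callee has no way to write a received value back into the caller's variable. With an int[] the callee can mutate the elements in place, which would explain why only the array versions work. A minimal plain-Java sketch (no MPI involved; "fakeBcast" is a hypothetical stand-in for the native binding):

```java
public class BufDemo {
    // Hypothetical stand-in for a binding that writes a received value
    // into 'buf'. It can only do so when 'buf' is a mutable container.
    static void fakeBcast(Object buf) {
        if (buf instanceof int[]) {
            ((int[]) buf)[0] = 1234567;   // mutates the caller's array in place
        }
        // If 'buf' is an autoboxed Integer, it is immutable: there is no
        // way to change the caller's int through this reference.
    }

    public static void main(String[] args) {
        int scalar = 0;
        fakeBcast(scalar);                // autoboxed to Integer; 'scalar' stays 0
        int[] array = new int[1];
        fakeBcast(array);                 // element updated in place
        System.out.println("scalar: " + scalar + "  array[0]: " + array[0]);
        // prints: scalar: 0  array[0]: 1234567
    }
}
```

If this is indeed the issue, wrapping every scalar in a one-element array before the Bcast call would be the workaround.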

Process 1 receives all values from an int array.

tyr java 160 mpiexec -np 2 java BcastIntArrayMain
Process 0 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 1234567 intValues[1]: 7654321
Process 1 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 1234567 intValues[1]: 7654321

The program breaks if I use one little-endian and one big-endian
machine.

tyr java 161 mpiexec -np 2 -host sunpc0,tyr java BcastIntMain
[tyr:7657] *** An error occurred in MPI_Comm_dup
[tyr:7657] *** reported by process [3150053377,1]
[tyr:7657] *** on communicator MPI_COMM_WORLD
[tyr:7657] *** MPI_ERR_INTERN: internal error
[tyr:7657] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[tyr:7657] *** and potentially your MPI job)
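The two hosts in that run differ in byte order (Sparc is big-endian, x86 little-endian). The abort in MPI_Comm_dup reports an internal error rather than a data-conversion problem, but for reference, the representations a heterogeneous run would have to reconcile can be shown with java.nio (plain Java, no MPI; the host labels are only my assumption about which machine is which):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        int value = 1234567;  // the value broadcast in the test programs

        // The same int, stored in the two byte orders of the machines involved.
        ByteBuffer big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        big.putInt(value);
        little.putInt(value);

        System.out.printf("big-endian (Sparc):  %02x %02x %02x %02x%n",
                big.get(0) & 0xff, big.get(1) & 0xff,
                big.get(2) & 0xff, big.get(3) & 0xff);
        System.out.printf("little-endian (x86): %02x %02x %02x %02x%n",
                little.get(0) & 0xff, little.get(1) & 0xff,
                little.get(2) & 0xff, little.get(3) & 0xff);
        // big-endian (Sparc):  00 12 d6 87
        // little-endian (x86): 87 d6 12 00
    }
}
```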

The program works if I use two "Solaris 10 x86_64" machines.

tyr java 163 mpiexec -np 2 -host sunpc0,sunpc1 java BcastIntArrayMain
Process 1 running on sunpc1.
  intValues[0]: 1234567 intValues[1]: 7654321
Process 0 running on sunpc0.
  intValues[0]: 1234567 intValues[1]: 7654321

The program breaks if I use two Linux x86_64 machines (openSUSE 12.1).

linpc1 etc 101 mpiexec -np 2 -host linpc0,linpc1 java BcastIntArrayMain
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS
...
  ompi_mpi_init: orte_init failed
  --> Returned "Out of resource" (-2) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[(null):10586] Local abort before MPI_INIT completed successfully; not able to
aggregate error messages, and not able to guarantee that all other processes
were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[16706,1],1]
  Exit code: 1
--------------------------------------------------------------------------

I use a valid environment on all machines. The problem also occurs
when I compile and run the program directly on the Linux system.

linpc1 java 101 mpijavac BcastIntMain.java
linpc1 java 102 mpiexec -np 2 -host linpc0,linpc1 java -cp `pwd` BcastIntMain
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS

I get the same errors for the programs with double values. Does anybody
have any suggestions on how to solve these problems? Thank you very much
in advance for any help.

Kind regards

Siegmar