Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
From: Tom Bryan (tombry_at_[hidden])
Date: 2012-02-08 16:52:11


On 2/6/12 5:10 PM, "Reuti" <reuti_at_[hidden]> wrote:

> Am 06.02.2012 um 22:28 schrieb Tom Bryan:
>
>> On 2/6/12 8:14 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>
>>>> If I need MPI_THREAD_MULTIPLE, and openmpi is compiled with thread support,
>>>> it's not clear to me whether MPI::Init_Thread() and
>>>> MPI::Init_Thread(MPI::THREAD_MULTIPLE) would give me the same behavior
>>>> from Open MPI.
>>>
>>> If you need thread support, you will need MPI::Init_Thread and it needs one
>>> argument (or three).
>>
>> Sorry, typo on my side. I meant to compare
>> MPI::Init_thread(MPI::THREAD_MULTIPLE) and MPI::Init(). I think that your
>> first reply mentioned replacing MPI::Init_thread by MPI::Init.
>
> Yes, if you don't need threads, I don't see any reason why it should add
> anything to the environment that you could make use of.

Got it. Unfortunately, we *definitely* need THREAD_MULTIPLE in our case.

>>> Yes, this should work across multiple machines. And it's using
>>> `qrsh -inherit ...` so it's failing somewhere in Open MPI - is it
>>> working with 1.4.4?
>>
>> I'm not sure. We no longer have our 1.4 test environment, so I'm in the
>> process of building that now. I'll let you know once I have a chance to run
>> that experiment.

You said that both of these cases worked for you in 1.4. Were you running a
modified version that did not use THREAD_MULTIPLE? I ask because I'm
getting worse errors in 1.4. I'm using the same code that was working (in
some cases) with 1.5.4.

I built 1.4.4 with (among other options)
--with-threads=posix --enable-mpi-threads
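
One way to double-check that a build actually picked up these options is to
look at the "Thread support" line printed by ompi_info. The following is only
a minimal sketch: the helper name has_mpi_threads is hypothetical, and it
assumes the line contains "mpi: yes" (older releases) or
"MPI_THREAD_MULTIPLE: yes" (newer releases) when MPI-level thread support is
enabled. The exact wording may differ between Open MPI versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `ompi_info` output on stdin and succeeds
# only if the "Thread support" line reports MPI-level thread support.
# Assumption: that line contains "mpi: yes" or "MPI_THREAD_MULTIPLE: yes".
has_mpi_threads() {
    grep -i 'thread support' | grep -Eiq 'mpi: yes|MPI_THREAD_MULTIPLE: yes'
}

# Only query ompi_info if it is actually on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_mpi_threads; then
        echo "MPI thread support: enabled"
    else
        echo "MPI thread support: not enabled"
    fi
else
    echo "ompi_info not found in PATH"
fi
```

If this reports that thread support is not enabled, the configure options
probably did not take effect for the installation that is first in the PATH.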

I rebuilt my code against 1.4.4.

When I run my test "e" from before, which is basically just
mpiexec -np 1 ./mpitest
I get the following in the output file for the job.
 
Calling init_thread
[vxr-lnx-11.cisco.com:64618] [[32207,1],0] ORTE_ERROR_LOG: Data unpack would
read past end of buffer in file util/nidmap.c at line 398
[vxr-lnx-11.cisco.com:64618] [[32207,1],0] ORTE_ERROR_LOG: Data unpack would
read past end of buffer in file base/ess_base_nidmap.c at line 62
[vxr-lnx-11.cisco.com:64618] [[32207,1],0] ORTE_ERROR_LOG: Data unpack would
read past end of buffer in file ess_env_module.c at line 173
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_build_nidmap failed
  --> Returned value Data unpack would read past end of buffer (-26) instead
of ORTE_SUCCESS
--------------------------------------------------------------------------
[vxr-lnx-11.cisco.com:64618] [[32207,1],0] ORTE_ERROR_LOG: Data unpack would
read past end of buffer in file runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Data unpack would read past end of buffer (-26) instead
of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Data unpack would read past end of buffer" (-26) instead of
"Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[vxr-lnx-11.cisco.com:64618] Abort before MPI_INIT completed successfully;
not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 64618 on
node vxr-lnx-11.cisco.com exiting improperly. There are two reasons this
could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------