Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Duplicated modex issue.
From: Victor Kocheganov (victor.kocheganov_at_[hidden])
Date: 2012-12-21 00:28:32


Actually, if I reuse ids across identical calls like this:

...
'modex' block;
'modex' block;
'modex' block;
...

or

...
'barrier' block;
'barrier' block;
'barrier' block;
...

there is no hang. The hang only occurs when the reuse follows a use of a
different collective id, in the pattern I described in my first message:

...
'modex' block;
'barrier' block;
'modex' block; <- hangs
...

or in this way:

...
'barrier' block;
'modex' block;
'barrier' block; <- hangs
...

If I use a different collective id when calling modex (1, 2, ..., rather
than 0 == orte_process_info.peer_modex), that also won't work, unfortunately.
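
For reference, that variant looked roughly like this (a sketch; the literal
value 1 is just a placeholder I picked, not a predefined ORTE constant, and
every process used the same value):

coll = OBJ_NEW(orte_grpcomm_collective_t);
coll->id = 1; /* placeholder instead of orte_process_info.peer_modex */
if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(coll))) {
    error = "orte_grpcomm_modex failed";
    goto error;
}
/* wait for modex to complete */
while (coll->active) {
    opal_progress(); /* block in progress pending events */
}
OBJ_RELEASE(coll);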

On Thu, Dec 20, 2012 at 10:39 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Yeah, that won't work. The ids cannot be reused, so you'd have to assign
> a different one in each case.
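>
> Something like this (just a sketch of the pattern; some_other_agreed_id
> stands for whatever second id all of the procs agree on - it is not a
> real ORTE symbol):
>
> coll = OBJ_NEW(orte_grpcomm_collective_t);
> coll->id = orte_process_info.peer_modex;  /* this id is used here, once */
> orte_grpcomm.modex(coll);
> while (coll->active) {
>     opal_progress();
> }
> OBJ_RELEASE(coll);
>
> coll = OBJ_NEW(orte_grpcomm_collective_t);
> coll->id = some_other_agreed_id;          /* a fresh id for the next call */
> orte_grpcomm.modex(coll);
> while (coll->active) {
>     opal_progress();
> }
> OBJ_RELEASE(coll);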
>
> On Dec 20, 2012, at 9:12 AM, Victor Kocheganov <
> victor.kocheganov_at_[hidden]> wrote:
>
> In every 'modex' block I set coll->id = orte_process_info.peer_modex, and
> in every 'barrier' block I set coll->id = orte_process_info.peer_init_barrier.
>
> P.S. In general (as I wrote in my first message), I use the term 'modex'
> for the following code:
> coll = OBJ_NEW(orte_grpcomm_collective_t);
> coll->id = orte_process_info.peer_modex;
> if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(coll))) {
>     error = "orte_grpcomm_modex failed";
>     goto error;
> }
> /* wait for modex to complete - this may be moved anywhere in mpi_init
>  * so long as it occurs prior to calling a function that needs
>  * the modex info!
>  */
> while (coll->active) {
>     opal_progress(); /* block in progress pending events */
> }
> OBJ_RELEASE(coll);
>
> and 'barrier' for this:
>
> coll = OBJ_NEW(orte_grpcomm_collective_t);
> coll->id = orte_process_info.peer_init_barrier;
> if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll))) {
>     error = "orte_grpcomm_barrier failed";
>     goto error;
> }
> /* wait for barrier to complete */
> while (coll->active) {
>     opal_progress(); /* block in progress pending events */
> }
> OBJ_RELEASE(coll);
>
> On Thu, Dec 20, 2012 at 8:57 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>>
>> On Dec 20, 2012, at 8:29 AM, Victor Kocheganov <
>> victor.kocheganov_at_[hidden]> wrote:
>>
>> Thanks for fast answer, Ralph.
>>
>> In my example I use different collective objects. I mean that in every
>> mentioned block I call *coll = OBJ_NEW(orte_grpcomm_collective_t);* and
>> *OBJ_RELEASE(coll);*, so all the grpcomm operations use unique
>> collective objects.
>>
>>
>> How are the procs getting the collective id for those new calls? They all
>> have to match.
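>>
>> For example (a sketch; the id value 5 is hypothetical), an added barrier
>> completes only if every proc runs the matching block with the same id:
>>
>> coll = OBJ_NEW(orte_grpcomm_collective_t);
>> coll->id = 5;   /* must be 5 on every participating proc */
>> orte_grpcomm.barrier(coll);
>> while (coll->active) {
>>     opal_progress();
>> }
>> OBJ_RELEASE(coll);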
>>
>>
>>
>> On Thu, Dec 20, 2012 at 7:48 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Absolutely it will hang, as the collective object passed into any grpcomm
>>> operation (modex or barrier) is only allowed to be used once - any attempt
>>> to reuse it will fail.
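>>>
>>> That is, a pattern like this (sketch only) is the problem:
>>>
>>> coll = OBJ_NEW(orte_grpcomm_collective_t);
>>> coll->id = orte_process_info.peer_modex;
>>> orte_grpcomm.modex(coll);    /* first use of this object - fine */
>>> while (coll->active) {
>>>     opal_progress();
>>> }
>>> orte_grpcomm.modex(coll);    /* second use of the same object - will fail */
>>> OBJ_RELEASE(coll);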
>>>
>>>
>>> On Dec 20, 2012, at 6:57 AM, Victor Kocheganov <
>>> victor.kocheganov_at_[hidden]> wrote:
>>>
>>> Hi.
>>>
>>> I have an issue with understanding the *ompi_mpi_init()* logic. Could you
>>> please tell me if you have any guesses about the following behavior.
>>>
>>> If I understand right, there is a block in the *ompi_mpi_init()* function
>>> for exchanging proc information between processes (denote this block
>>> 'modex'):
>>>
>>> coll = OBJ_NEW(orte_grpcomm_collective_t);
>>> coll->id = orte_process_info.peer_modex;
>>> if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(coll))) {
>>>     error = "orte_grpcomm_modex failed";
>>>     goto error;
>>> }
>>> /* wait for modex to complete - this may be moved anywhere in mpi_init
>>>  * so long as it occurs prior to calling a function that needs
>>>  * the modex info!
>>>  */
>>> while (coll->active) {
>>>     opal_progress(); /* block in progress pending events */
>>> }
>>> OBJ_RELEASE(coll);
>>>
>>> and several instructions after this there is a block for process
>>> synchronization (denote this block 'barrier'):
>>>
>>> coll = OBJ_NEW(orte_grpcomm_collective_t);
>>> coll->id = orte_process_info.peer_init_barrier;
>>> if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll))) {
>>>     error = "orte_grpcomm_barrier failed";
>>>     goto error;
>>> }
>>> /* wait for barrier to complete */
>>> while (coll->active) {
>>>     opal_progress(); /* block in progress pending events */
>>> }
>>> OBJ_RELEASE(coll);
>>>
>>> So, initially *ompi_mpi_init()* has the following structure:
>>>
>>> ...
>>> 'modex' block;
>>> ...
>>> 'barrier' block;
>>> ...
>>>
>>> I made several experiments with this code, and the following one is of
>>> interest: if I add a sequence of two additional blocks, 'barrier' and
>>> 'modex', right after the 'modex' block, then *ompi_mpi_init()* hangs in
>>> *opal_progress()* of the last 'modex' block.
>>>
>>> ...
>>> 'modex' block;
>>> 'barrier' block;
>>> 'modex' block; <- hangs
>>> ...
>>> 'barrier' block;
>>> ...
>>>
>>> Thanks,
>>> Victor Kocheganov.