Thanks for the help. Everything works as you said.
Don't know how many times I can repeat it, but I'll try again: you are not allowed to reuse a collective id. If it happens to work, it's by accident.

If you want to implement multiple modex/barrier operations, they each need to have their own unique collective id.

On Dec 20, 2012, at 9:28 PM, Victor Kocheganov <victor.kocheganov@itseez.com> wrote:

Actually, if I reuse ids in equivalent calls like this:

    ...
    'modex' block;
    'modex' block;
    'modex' block;
    ...

or

    ...
    'barrier' block;
    'barrier' block;
    'barrier' block;
    ...

there is no hang. The hang only occurs if this "reuse" follows the use of another collective id, in the way I wrote in the first letter:

    ...
    'modex' block;
    'barrier' block;
    'modex' block; <- hangs
    ...

or in this way:

    ...
    'barrier' block;
    'modex' block;
    'barrier' block; <- hangs
    ...

If I use a different collective id when calling modex (1, 2, ..., but not 0 == orte_process_info.peer_modex), that doesn't work either, unfortunately.

On Thu, Dec 20, 2012 at 10:39 PM, Ralph Castain <rhc@open-mpi.org> wrote:
Yeah, that won't work. The ids cannot be reused, so you'd have to assign a different one in each case.

On Dec 20, 2012, at 9:12 AM, Victor Kocheganov <victor.kocheganov@itseez.com> wrote:

In every 'modex' block I use the id coll->id = orte_process_info.peer_modex; and in every 'barrier' block I use the id coll->id = orte_process_info.peer_init_barrier;.

P.S. In general (as I wrote in the first letter), I use the term 'modex' for the following code:

    coll = OBJ_NEW(orte_grpcomm_collective_t);
    coll->id = orte_process_info.peer_modex;
    if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(coll))) {
        error = "orte_grpcomm_modex failed";
        goto error;
    }
    /* wait for modex to complete - this may be moved anywhere in mpi_init
     * so long as it occurs prior to calling a function that needs
     * the modex info!
     */
    while (coll->active) {
        opal_progress(); /* block in progress pending events */
    }
    OBJ_RELEASE(coll);

and 'barrier' for this:

    coll = OBJ_NEW(orte_grpcomm_collective_t);
    coll->id = orte_process_info.peer_init_barrier;
    if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll))) {
        error = "orte_grpcomm_barrier failed";
        goto error;
    }
    /* wait for barrier to complete */
    while (coll->active) {
        opal_progress(); /* block in progress pending events */
    }
    OBJ_RELEASE(coll);

On Thu, Dec 20, 2012 at 8:57 PM, Ralph Castain <rhc@open-mpi.org> wrote:
How are the procs getting the collective id for those new calls? They all have to match.

On Dec 20, 2012, at 8:29 AM, Victor Kocheganov <victor.kocheganov@itseez.com> wrote:

Thanks for the fast answer, Ralph.

In my example I use different collective objects. I mean that in every mentioned block I call coll = OBJ_NEW(orte_grpcomm_collective_t); and OBJ_RELEASE(coll);, so each grpcomm operation uses a unique collective object.

On Thu, Dec 20, 2012 at 7:48 PM, Ralph Castain <rhc@open-mpi.org> wrote:

Absolutely it will hang, as the collective object passed into any grpcomm operation (modex or barrier) is only allowed to be used once - any attempt to reuse it will fail.

On Dec 20, 2012, at 6:57 AM, Victor Kocheganov <victor.kocheganov@itseez.com> wrote:

Hi.

I have trouble understanding the ompi_mpi_init() logic. Could you please tell me if you have any guesses about the following behavior?

If I understand correctly, there is a block in the ompi_mpi_init() function for exchanging procs information between processes (denote this block 'modex'):

    coll = OBJ_NEW(orte_grpcomm_collective_t);
    coll->id = orte_process_info.peer_modex;
    if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(coll))) {
        error = "orte_grpcomm_modex failed";
        goto error;
    }
    /* wait for modex to complete - this may be moved anywhere in mpi_init
     * so long as it occurs prior to calling a function that needs
     * the modex info!
     */
    while (coll->active) {
        opal_progress(); /* block in progress pending events */
    }
    OBJ_RELEASE(coll);

and several instructions after this there is a block for process synchronization (denote this block 'barrier'):

    coll = OBJ_NEW(orte_grpcomm_collective_t);
    coll->id = orte_process_info.peer_init_barrier;
    if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll))) {
        error = "orte_grpcomm_barrier failed";
        goto error;
    }
    /* wait for barrier to complete */
    while (coll->active) {
        opal_progress(); /* block in progress pending events */
    }
    OBJ_RELEASE(coll);

So, initially ompi_mpi_init() has the following structure:

    ...
    'modex' block;
    ...
    'barrier' block;
    ...

I made several experiments with this code, and the following one is of interest: if I add a sequence of two additional blocks, 'barrier' and 'modex', right after the 'modex' block, then ompi_mpi_init() hangs in opal_progress() of the last 'modex' block.

    ...
    'modex' block;
    'barrier' block;
    'modex' block; <- hangs
    ...
    'barrier' block;
    ...

Thanks,
Victor Kocheganov.
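
The rule stated in this thread (a collective id may be used for only one modex/barrier operation) can be checked locally before a call is issued. The following is a minimal sketch of such a check, not ORTE code: coll_id_guard_t, coll_id_guard_init(), coll_id_guard_claim() and MAX_COLL_IDS are hypothetical names introduced here for illustration. It only detects reuse inside one process and does not show how the procs obtain matching new ids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of ORTE: remembers which collective ids
 * this process has already passed to a modex/barrier operation. */
#define MAX_COLL_IDS 64

typedef struct {
    int used[MAX_COLL_IDS]; /* ids claimed so far */
    size_t count;           /* number of valid entries in used[] */
} coll_id_guard_t;

/* Start with an empty set of claimed ids. */
static void coll_id_guard_init(coll_id_guard_t *g)
{
    g->count = 0;
}

/* Record `id` as used and return true if it has not been claimed before.
 * Return false if the id was already claimed or the table is full. */
static bool coll_id_guard_claim(coll_id_guard_t *g, int id)
{
    for (size_t i = 0; i < g->count; i++) {
        if (g->used[i] == id) {
            return false; /* reuse detected */
        }
    }
    if (g->count == MAX_COLL_IDS) {
        return false; /* no room left to track another id */
    }
    g->used[g->count++] = id;
    return true;
}
```

Under these assumptions, one would call coll_id_guard_claim(&guard, coll->id) just before each orte_grpcomm.modex(coll) or orte_grpcomm.barrier(coll) and treat a false return as an error, so that an accidental reuse shows up as an immediate local failure instead of a hang in opal_progress().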
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel