>After doing that, the MPI_Init procedure calls grpcomm.modex to distribute the data across all procs in the job. Unfortunately, being a collective, all procs must participate. In your case, you'll have to find a different way to do it. Upon receipt, each proc updates its own modex db to include the new info.
>Look in orte/mca/grpcomm/bad/grpcomm_bad_module.c at the modex function and follow that code thru the grpcomm/base functions to see how the modex info is retrieved, passed, and decoded on the far end.
I will take a look at this, Ralph, and let you know how it goes.

Today, while I was looking at the code with a colleague, he suggested that I try to catch the error when sending data through the btl_tcp_endpoint, more precisely in mca_btl_tcp_frag_send, at the point where we write to the socket fd. I've tried this, but when a process moves and tries to send a message, or another process tries to send a message to it, I cannot catch the failure in mca_btl_tcp_frag_send, and I don't know why. I expected the write to fail as soon as someone tries to send. Is there any other place where this error is caught?

If I do it this way, I suppose I can reset connections on demand. What do you think, is it a good idea? And once I detect the failure, I would update the modex db of that process from that point. Is that OK?
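
To make the question concrete, here is a minimal standalone sketch of the kind of check I have in mind. It is not Open MPI code: classify_write, send_status_t and demo_closed_peer are hypothetical names I made up for illustration, and the real mca_btl_tcp_frag_send may handle errno differently. One thing that may explain what I am seeing: on a TCP socket, write() only copies the data into the kernel send buffer, so the first write after the peer has gone away can still succeed, and EPIPE or ECONNRESET is typically reported only on a later write or read. The demo uses a UNIX-domain socketpair, where the error shows up immediately.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for illustration; not part of Open MPI. */
typedef enum {
    SEND_OK,        /* the kernel accepted some bytes */
    SEND_RETRY,     /* EINTR / EAGAIN / EWOULDBLOCK: try again later */
    SEND_PEER_GONE, /* EPIPE / ECONNRESET: the connection is dead */
    SEND_ERROR      /* any other errno */
} send_status_t;

/* Write once to fd and classify the outcome.
 * Assumes SIGPIPE is ignored, so a dead peer yields EPIPE
 * instead of killing the process. */
send_status_t classify_write(int fd, const void *buf, size_t len, ssize_t *sent)
{
    ssize_t n = write(fd, buf, len);
    if (n >= 0) {
        *sent = n;
        return SEND_OK;
    }
    *sent = 0;
    if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) {
        return SEND_RETRY;
    }
    if (errno == EPIPE || errno == ECONNRESET) {
        /* This is where a reconnect could be triggered. */
        return SEND_PEER_GONE;
    }
    return SEND_ERROR;
}

/* Demo: close one end of a UNIX-domain socketpair, then write to the other. */
send_status_t demo_closed_peer(void)
{
    int sv[2];
    ssize_t sent = 0;
    send_status_t st;

    signal(SIGPIPE, SIG_IGN);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return SEND_ERROR;
    }
    close(sv[1]);                            /* the peer goes away */
    st = classify_write(sv[0], "x", 1, &sent);
    close(sv[0]);
    return st;
}
```

If the TCP case really behaves as described, catching the error only in the send path may not be enough, since the failure could also surface in the receive handler of the endpoint.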