Looking deeper, I believe we may have a race condition in our code. Sadly, the error message itself is irrelevant, but it still causes the code to abort.

It can also be triggered by race conditions in the application, but ultimately it is something we need to clean up on our side.


On Jun 27, 2011, at 9:29 AM, Rodrigo Oliveira wrote:

Hi there.
I am developing a server/client application using Open MPI 1.5.3. At one point in the server code I open a port to receive connections from a client. After that, I call MPI_Comm_accept on the server side and MPI_Comm_connect on the client side. Sometimes I get an ORTE_ERROR_LOG, as shown below.
before accept in host hydra9 port name = 4108386304.0;tcp://150.164.3.204:48761;tcp://192.168.63.9:48761+4108386305.0tcp://150.164.3.204:49211;tcp://192.168.63.9:49211:300                                             
[hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_allgather.c at line 220              
[hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 116                  
[hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 608                       
[hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file dpm_orte.c at line 379                                 
MPI 2 C++ exception throwing is disabled, MPI::mpi_errno has the error code                                           
after accept in host hydra9 error code = 17                                                                           
MPI 2 C++ exception throwing is disabled, MPI::mpi_errno has the error code
The mpi_errno is 17, and I could not find a clear explanation of this error. It occurs sporadically: sometimes the application works, sometimes it does not.

Any ideas?

Thanks
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users