That is bizarre. Afraid you have me stumped here - I can't think why an action in the perl script would trigger an action in OMPI. If your OMPI proc doesn't in any way read the info in "log" (using your example), does it still have a problem? In other words, if the perl script still executes a system command, but the OMPI proc doesn't interact with it in any way, does the problem persist?

What I'm searching for is the connection. If your OMPI proc reads the results of that system command, then it's possible that something in your app is corrupting memory during the read operation - e.g., you are reading in more data than you have allocated memory for.
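
To illustrate the kind of overrun I mean, here is a minimal sketch of a bounded read in C. The name read_bounded and the buffer handling are placeholders of mine, not anything taken from your code, and I am assuming you read from a plain file descriptor such as a connected socket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read at most cap - 1 bytes from fd into buf and
 * NUL-terminate the result, so the caller's buffer can never be overrun.
 * Returns the number of bytes stored, or -1 on a read error. */
ssize_t read_bounded(int fd, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);

    size_t used = 0;
    while (used < cap - 1) {
        /* Never ask for more than the space that is left. */
        ssize_t n = read(fd, buf + used, cap - 1 - used);
        if (n == 0)
            break;              /* peer closed the connection (EOF) */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;
        }
        used += (size_t)n;
    }
    buf[used] = '\0';
    return (ssize_t)used;
}
```

Passing `sizeof buf` as the capacity keeps the read inside the buffer no matter how much the Perl side sends. The loop runs until the buffer is full or the peer closes, so for a message-oriented protocol you would stop at your own delimiter or length field instead.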

I'm using TCP on 1.4.1 (it's actually IPoIB).
OpenIB is compiled in.
Note that these nodes are containers running in OpenVZ where IB is not available. There may be some SDP running in system-level routines on the VH, but this is unlikely.
OpenIB is not available to the VMs. They happily get TCP services from the VH.
In any case, the problem still occurs if I use: --mca btl tcp,self

I have traced the perl code and observed that OpenMPI gets confused whenever the perl program itself executes a system command, for example:
`command 2>&1 1> log`;

This probably narrows it down (I hope)

Which network transport are you using, and what version of Open MPI are you using?  Do you have OpenFabrics support compiled into your Open MPI installation?

If you're just using TCP and/or shared memory, I can't immediately think of a reason why this wouldn't work, but there may be a subtle interaction in there somewhere that causes badness (e.g., memory corruption).
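
One experiment that might help isolate such an interaction is to start the helper with fork()/execv() instead of system(), closing every descriptor above stderr in the child so that it cannot touch any socket inherited from the MPI process. This is only a diagnostic sketch, not a known fix: spawn_without_inherited_fds is a name I made up, the 4096 cap on the descriptor scan is an arbitrary assumption, and you would substitute your own program path and arguments.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] as a child process without a shell.
 * Returns the child's pid in the parent, or -1 if fork() fails. */
pid_t spawn_without_inherited_fds(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid != 0)
        return pid;             /* parent (pid > 0) or fork failure (-1) */

    /* Child: drop everything except stdin, stdout and stderr.
     * Assumption: the scan is capped at 4096 to keep it cheap. */
    long max_fd = sysconf(_SC_OPEN_MAX);
    if (max_fd < 0 || max_fd > 4096)
        max_fd = 4096;
    for (int fd = 3; fd < max_fd; fd++)
        close(fd);

    execv(argv[0], argv);
    _exit(127);                 /* only reached if execv() failed */
}
```

The parent does not block here, which matches the trailing `&` in the original command line, but it should eventually call waitpid() on the returned pid so the child does not linger as a zombie. If the error disappears with this launcher, inherited descriptors are a likely suspect; if it persists, they can probably be ruled out.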

> I have a section in my code running in rank 0 that must start a perl program that it then connects to via a TCP socket.
> The initialisation section is shown here:
>    sprintf(buf, "%s/ -p %d &", PATH,port);
>    int i = system(buf);
>    printf("system returned %d\n", i);
> Some time after I run this code, while waiting for the data from the perl program, the error below occurs:
> qplan connection
> DCsession_fetch: waiting for Mcode data...
> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 105
> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/plm_base_proxy.c at line 86
> It seems that the Linux system() call is breaking OpenMPI internal connections. Removing the system() call and executing the perl code externally fixes the problem, but I can't go into production like that as it's a security issue.
> Any ideas?
> (environment: OpenMPI 1.4.1 on kernel Linux dc1 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
