Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2007-01-08 21:45:31


On Jan 8, 2007, at 9:34 PM, Reese Faucette wrote:

>> Right, that's the maximum number of open MX channels, i.e. processes
>> that can run on the node using MX. With MX (1.2.0c I think), I get
>> weird messages if I run a second mpirun quickly after the first one
>> failed. The myrinet guys, I'm quite sure, can explain why and how.
>> Somehow, when an application segfaults while the MX port is open,
>> things are not cleaned up right away. It takes a few seconds (not more
>> than one minute) to have everything running correctly after that.
>
> Supposedly I am a "myrinet guy" ;-) Yeah, the endpoint cleanup stuff
> could take a few seconds after an ungraceful exit. But, if you're
> getting some behavior that looks like you ought not be getting,
> please let us know!

I think what I get makes sense. If I loop in a script starting
mpiruns and one of the runs segfaults, the next one is usually unable
to open the MX endpoints. That happens only if I run 4 processes per
node, where 4 is the number of instances reported by mx_info. If I
put a sleep of 30 seconds between my runs, then everything runs just
fine.
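
For illustration only, the workaround could be sketched roughly as below. This
is not the actual script used here: the function name run_with_cooldown and
its parameters are hypothetical, and it assumes a failed run shows up as a
nonzero exit status from mpirun. It pauses only after a failed run instead of
between every run.

```python
import subprocess
import time


def run_with_cooldown(commands, cooldown_s=30.0,
                      run=subprocess.call, sleep=time.sleep):
    """Run each command in order, pausing after any run that fails.

    `run` and `sleep` are injectable so the loop can be exercised
    without actually launching mpirun (hypothetical helper, not part
    of Open MPI or MX).
    """
    exit_codes = []
    for index, cmd in enumerate(commands):
        code = run(cmd)
        exit_codes.append(code)
        # Assumption: a nonzero exit status means the run died and the
        # MX endpoints may not have been released yet, so wait before
        # starting the next job. No wait is needed after the last one.
        if code != 0 and index < len(commands) - 1:
            sleep(cooldown_s)
    return exit_codes
```

Each entry in commands would be an argument list such as
["mpirun", "-np", "4", "./my_app"], where ./my_app is a placeholder for the
real application. The 30 second default mirrors the sleep mentioned above; a
shorter value may be enough, since the cleanup seems to take only a few
seconds.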

   george.

> -reese
> Myricom, Inc.