On Jan 8, 2007, at 9:34 PM, Reese Faucette wrote:
>> Right, that's the maximum number of open MX channels, i.e. processes
>> that can run on the node using MX. With MX (1.2.0c I think), I get
>> weird messages if I run a second mpirun quickly after the first one
>> failed. The myrinet guys, I'm quite sure, can explain why and how.
>> Somehow, when an application segfaults while the MX port is open,
>> things are not cleaned up right away. It takes a few seconds (not more
>> than one minute) to have everything running correctly after that.
> Supposedly I am a "myrinet guy" ;-) Yeah, the endpoint cleanup stuff
> could take a few seconds after an ungraceful exit. But, if you're
> getting some behavior that looks like you ought not be getting, please
> let us know!
I think what I get makes sense. If I loop in a script starting
mpiruns and one of the runs segfaults, the next one is usually unable
to open the MX endpoints. That happens only if I run 4 processes per
node, where 4 is the number of instances reported by mx_info. If I
put a sleep of 30 seconds between my runs, then everything runs just
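
The workaround above can be sketched as a small driver script. This is
only an illustration under stated assumptions: the function name
run_with_cooldown and the example job list are hypothetical, the script
assumes mpirun returns a nonzero exit status when a process crashes, and
it pauses only after a failed run rather than between every run. The
30-second delay is the one mentioned above.

```python
import subprocess
import time


def run_with_cooldown(commands, cooldown_s=30,
                      runner=subprocess.call, sleeper=time.sleep):
    """Run each command in order, pausing after any run that fails.

    `runner` and `sleeper` are injectable so the loop can be exercised
    without launching real jobs. Returns the list of exit codes.
    """
    exit_codes = []
    for cmd in commands:
        code = runner(cmd)  # subprocess.call returns the exit status
        exit_codes.append(code)
        if code != 0:
            # Assumption: a nonzero status means the run died uncleanly,
            # so wait for the MX endpoints to be released before the
            # next run tries to open them.
            sleeper(cooldown_s)
    return exit_codes


# Hypothetical usage, 4 processes on one node:
#   jobs = [["mpirun", "-np", "4", "./my_app"]] * 3
#   run_with_cooldown(jobs)
```

Pausing only after a failure keeps a loop of healthy runs fast; a fixed
sleep between every run, as described above, is the simpler alternative.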
> Myricom, Inc.