On Jun 2, 2010, at 12:42 PM, George Bosilca wrote:
> > 1. In this case, the openib BTL was not finalized, so there was a stub still there listening on the RDMACM CPC. When another process tried to connect to X's RDMACM CPC port, Bad Things happened (because it was only half set up) and we segv'ed.
> > Obviously, this should be fixed. "Fixed" in this case probably means closing down the RDMACM CPC listening port. But then that leads to another form of Badness.
> I wonder how this is possible. If a process X fails to connect to Y, how can Y succeed in connecting to X? Please enlighten me ...
It doesn't. Process X segvs after it goes into the RDMACM CPC accept code (because the openib BTL was only half set up).
> > 2. If the openib BTL cleanly shuts down and is *not* still listening on its modex-advertised RDMACM CPC contact port, then if some other process tries to contact process X via the modex info, it'll fail. This will then be judged to be a fatal error. Failover in the BML will simply have delayed the job abort until someone tries to contact X via the openib BTL.
> Isn't there any kind of timeout mechanism in the RDMACM CPC? If there is one and the connection fails, then the PML will automatically try to use the next available BTL, so it will eventually fail over to TCP (if available).
Yes, there is a timeout. I forget offhand what we do if the timeout occurs. We probably report the connect failure in the "normal" way, but I don't know that for sure.
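For illustration only, here is a minimal sketch of the "try the next transport when a connect fails or times out" behavior George describes. Everything in it is an assumption made up for the example: struct transport, connect_with_failover(), and the fake_*_connect() stand-ins are hypothetical names, not Open MPI's actual BTL/BML/PML structures or code paths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical connect callback: returns 0 on success, nonzero on
 * failure or timeout. Not Open MPI's real interface. */
typedef int (*connect_fn)(int peer);

/* Hypothetical transport entry, listed in priority order. */
struct transport {
    const char *name;
    connect_fn  try_connect;
};

/* Try each transport in order. Return the index of the first one that
 * connects to the peer, or -1 if every transport fails. */
static int connect_with_failover(const struct transport *t, size_t n, int peer)
{
    assert(t != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (t[i].try_connect(peer) == 0) {
            return (int)i;
        }
        /* Connect failed or timed out: fall through to the next one. */
    }
    return -1;
}

/* Stand-ins for real connection attempts (assumptions for the demo). */
static int fake_openib_connect(int peer) { (void)peer; return -1; } /* simulated timeout */
static int fake_tcp_connect(int peer)    { (void)peer; return 0;  } /* simulated success */
```

With a list of { "openib", fake_openib_connect } followed by { "tcp", fake_tcp_connect }, the call returns index 1 (the TCP entry). Note that the caller only gets there after the first attempt has returned, so the cost of failing over is bounded below by however long the failed connect takes to report.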
> > I think that the majority of this discussion about the BML failure (or not) behavior assumed that *all* processes had the same failure (at least: *I* assumed this). But if only *some* of the processes fail a given BTL add_procs, we have a problem because we're beyond the point of deciding who can connect to whom. Shutting down a single BTL module at that point will create an inconsistency of the distributed data.
> We did assume that at least the errors are symmetric, i.e. if A fails to connect to B then B will fail when trying to connect to A. However, if there are other BTLs, the connection is supposed to move smoothly over to some other BTL. As an example, in the MX BTL, if two nodes have MX support but do not share the same mapper, add_procs will silently fail.
This sounds dodgy and icky. We have to wait for a connect timeout to fail over to the next BTL?
How long is the typical/default TCP timeout?
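To make the symmetry assumption discussed above concrete, here is a small sketch that counts the pairs where one side believes it can reach the other over a given transport but not vice versa. The flat reach[] matrix and count_asymmetric_pairs() are hypothetical and exist only for this example; nothing like this is claimed to exist in the code base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* reach[i * n + j] is true if process i believes it can reach process j
 * over one particular transport (hypothetical bookkeeping).
 * Returns the number of unordered pairs (i, j) whose two views disagree. */
static size_t count_asymmetric_pairs(const bool *reach, size_t n)
{
    assert(reach != NULL);
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            if (reach[i * n + j] != reach[j * n + i]) {
                ++bad;  /* i and j disagree about reachability */
            }
        }
    }
    return bad;
}
```

A result of zero corresponds to the symmetric case, where both sides of a pair fail (or succeed) together. A nonzero result corresponds to the case where only some processes failed add_procs for that transport.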