
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Fault Tolerant with openib
From: Guilherme V (list.vilela_at_[hidden])
Date: 2011-09-27 09:39:41


Do you know if there is another patch available so that my application can
handle the failure of one node instead of MPI killing the job? This is very
important for me: I have a big cluster and I can't stop my job every time
there is a problem with just one node.

Regards

On Fri, Sep 23, 2011 at 4:34 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> On Sep 23, 2011, at 1:21 PM, Guilherme V wrote:
>
> I'm using version 1.4.3, and I forgot to mention that I made a change at
> line 792 of orterun.c:
>
> if (ORTE_JOB_STATE_TERMINATED != exit_state) {
>     exit(0); /* patch */
>
>
> I don't see how that change can keep your job running - we should still
> have terminated it. All this does is suppress the error reporting.
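
To make the point about the patch concrete, here is a minimal toy model of a launcher's exit path. It is only a sketch under stated assumptions, not the actual orterun.c code: the enum, the struct, and the function finish_job() are hypothetical names invented for illustration, and the assumption that the remaining processes are already terminated before the exit path runs is taken from the description above. It shows that adding exit(0) in the abnormal-state branch changes only the status reported to the shell.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the launcher's job states.
 * These are NOT the real ORTE definitions. */
typedef enum {
    JOB_TERMINATED_NORMALLY,
    JOB_ABORTED
} job_state_t;

/* What an outside observer sees once the launcher has finished. */
typedef struct {
    int procs_terminated; /* 1 if the remaining processes were killed */
    int exit_status;      /* status the launcher returns to the shell */
} job_outcome_t;

/*
 * Toy model of the exit path (hypothetical function).
 * patched != 0 mimics adding exit(0) inside the abnormal-state branch.
 */
job_outcome_t finish_job(job_state_t state, int patched)
{
    job_outcome_t out;

    /* Assumption: the job has already been torn down before this point,
     * so the patch cannot keep any process alive. */
    out.procs_terminated = 1;

    if (state != JOB_TERMINATED_NORMALLY) {
        if (patched) {
            /* The patch: leave quietly with status 0, hiding the error. */
            out.exit_status = 0;
        } else {
            /* Unpatched behaviour in this model: report the failure. */
            fprintf(stderr, "job aborted\n");
            out.exit_status = 1;
        }
    } else {
        out.exit_status = 0;
    }
    return out;
}
```

Calling finish_job(JOB_ABORTED, 1) and finish_job(JOB_ABORTED, 0) in this model gives the same procs_terminated value and differs only in exit_status, which is the behaviour described above: the error report is suppressed, but the job is still gone.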
>
> Regardless, openib will cause the process to fail under the described
> circumstances, which should cause OMPI to terminate all running procs. I'm
> not sure what you are doing with tcp, but it could be that there are
> alternative paths available - e.g., you have multiple NICs and remove one
> cable, but the other paths remain viable.
>
> Regards
>
>
> > What version of OMPI are you using? The job should terminate in either
> > case - what did you do to keep it running after node failure with tcp?
>
> > On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:
> >> Hi,
> >> I want to know if anybody is having problems with fault-tolerant jobs
> >> using InfiniBand. When I run my job with tcp and anything happens to one
> >> node, my job keeps running, but if I switch my job to InfiniBand and
> >> anything happens to the InfiniBand link (e.g., cable problems), my job
> >> fails.
> >>
> >> Does anybody know if something different needs to be done when using
> >> openib instead of tcp?
> >>
> >> Below is an example of the message I'm receiving from MPI.
> >>
> >> Regards,
> >> Guilherme
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users