On Sep 23, 2011, at 1:21 PM, Guilherme V wrote:

I'm using version 1.4.3, and I forgot to mention that I made a change at line 792 of orterun.c:

    if (ORTE_JOB_STATE_TERMINATED != exit_state) {
        exit(0); /* patch */

I don't see how that change can keep your job running - we should still have terminated it. All this does is suppress the error reporting.
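To illustrate the point, here is a minimal sketch in C. It is not orterun's code: the `job_state` enum, the `patched` flag, and `launcher_exit_status` are hypothetical names made up for this example, and the child process only mimics a launcher's exit path. It shows that forcing `exit(0)` in the failure branch changes the status the caller observes, not whether the process terminates.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for a launcher's job states; not ORTE's real values. */
enum job_state { JOB_TERMINATED = 0, JOB_ABORTED = 1 };

/*
 * Fork a child that mimics a launcher's exit path and return the exit
 * status the caller (e.g., a shell or batch script) would observe.
 * Returns -1 if fork/waitpid fails or the child did not exit normally.
 */
int launcher_exit_status(enum job_state state, int patched)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }

    if (pid == 0) {
        /* Child: plays the role of the launcher at shutdown. */
        if (state != JOB_TERMINATED) {
            if (patched) {
                _exit(0);   /* the patch: the failure is hidden from the caller */
            }
            _exit(1);       /* unpatched: the failure is reported */
        }
        _exit(0);           /* normal termination */
    }

    /* Parent: the child terminates in every case; only the status differs. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `launcher_exit_status(JOB_ABORTED, 0)` yields 1 and `launcher_exit_status(JOB_ABORTED, 1)` yields 0. In both cases the child has exited by the time `waitpid` returns, so the patch by itself cannot be what keeps a job alive.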

Regardless, openib will cause the process to fail under the described circumstances, which should cause OMPI to terminate all running procs. I'm not sure what you are doing with tcp, but it could be that there are alternative paths available - e.g., you have multiple NICs and remove one cable, but the other paths remain viable.


> What version of OMPI are you using? The job should terminate in either case - what did you do to keep it running after node failure with tcp?

> On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:

>> Hi,
>> I want to know if anybody is having problems with fault-tolerant jobs over InfiniBand. When I run my job with tcp and anything happens to one node, my job keeps running, but if I switch the job to InfiniBand and anything happens to the InfiniBand link (e.g., a cable problem), my job fails.
>> Does anybody know if something different needs to be done when using openib instead of tcp?
>> Below is an example of the message I'm receiving from MPI.
>> Regards,
>> Guilherme
