Do you know if there is another patch available so that my application can handle the failure of one node instead of MPI killing the job? This is very important to me: I have a big cluster, and I can't stop my job every time there is a problem with just one node.
On Sep 23, 2011, at 1:21 PM, Guilherme V wrote:

> I'm using version 1.4.3, and I forgot to mention that I made a change in orterun.c at line 792:
>
> if (ORTE_JOB_STATE_TERMINATED != exit_state) {
>     exit(0); /* patch */

I don't see how that change can keep your job running - we should still have terminated it. All this does is suppress the error reporting.

Regardless, openib will cause the process to fail under the described circumstances, which should cause OMPI to terminate all running procs. I'm not sure what you are doing with tcp, but it could be that there are alternative paths available - e.g., you have multiple NICs and remove one cable, but the other paths remain viable.
Regards
> What version of OMPI are you using? The job should terminate in either case - what did you do to keep it running after node failure with tcp?
> On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:
>> Hi,
>> I want to know if anybody is having problems with fault-tolerant jobs using InfiniBand. When I run my job with tcp and anything happens to one node, my job keeps running, but when I switch my job to InfiniBand and anything happens to the InfiniBand link (e.g., cable problems), my job fails.
>>
>> Does anybody know if there is something different that needs to be done when using openib instead of tcp?
>>
>> Below is an example of the message I'm receiving from MPI.
>>
>> Regards,
>> Guilherme
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users