Open MPI User's Mailing List Archives

Subject: [OMPI users] How to make a job abort when one host dies?
From: Oskar Enoksson (enok_at_[hidden])
Date: 2009-08-11 07:07:54


I searched the FAQ and Google but couldn't find a solution to this
problem.

My problem is that when one MPI execution host dies or the network
connection goes down, the job is not aborted. Instead, the remaining
processes continue to eat 100% CPU indefinitely. How can I make jobs
abort in these cases?

I use Open MPI 1.3.2. We have a Myrinet network and I use mtl/mx for MPI
communication. We also use Grid Engine 6.2u3. The output from the
running job (included below) indicates that the remaining processes
detect a timeout trying to communicate with the (dead) host
cl120.foi.se. But why do they not terminate after this failure?

Thanks.

Max retransmit retries reached (1000) for message
        type (1): send_small
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=1, endpoint=1, seqnum=0x2b8f
        matched_val: 0x0004000d_fffffff4
        slength=48, xfer_length=48
        seg: 0x7fffe11ff830,48
        caller: 0xdb

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (1): send_small
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=116, endpoint=1, seqnum=0x3726
        matched_val: 0x00040001_fffffff4
        slength=48, xfer_length=48
        seg: 0x7ffff124b7b0,48
        caller: 0x9b

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (1): send_small
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=1, endpoint=0, seqnum=0x1048
        matched_val: 0x00040006_fffffff4
        slength=48, xfer_length=48
        seg: 0x7fffc6470eb0,48
        caller: 0x70

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (1): send_small
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=1, endpoint=1, seqnum=0xd53
        matched_val: 0x00040007_fffffff4
        slength=48, xfer_length=48
        seg: 0x1f54360,48
        caller: 0xda

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (1): send_small
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=116, endpoint=0, seqnum=0x376c
        matched_val: 0x00040000_fffffff4
        slength=48, xfer_length=48
        seg: 0x82ec040,48
        caller: 0x12

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (1): send_small
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=1, endpoint=0, seqnum=0x2746
        matched_val: 0x0004000c_fffffff4
        slength=48, xfer_length=48
        seg: 0x1116f410,48
        caller: 0x30

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (1): send_small
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=1, endpoint=1, seqnum=0x18de
        matched_val: 0x00250001_fffffff4
        slength=104, xfer_length=104
        seg: 0x181c3100,104
        caller: 0x18

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (2): send_medium
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=116, endpoint=0, seqnum=0x3361
        matched_val: 0x0004000f_00000010
        slength=7168, xfer_length=7168
        seg: 0x23e8a838,7168
        caller: 0x7e

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (2): send_medium
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=116, endpoint=1, seqnum=0x3361
        matched_val: 0x0004000f_00000010
        slength=560, xfer_length=560
        seg: 0x23ec9fe0,560
        caller: 0x2d

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (2): send_medium
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=1, endpoint=1, seqnum=0x3361
        matched_val: 0x0004000c_0000000d
        slength=840, xfer_length=840
        seg: 0x1a471a90,840
        caller: 0xf9

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (3): send_large
        state (0x0):
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=1, endpoint=1, seqnum=0xad1
        matched_val: 0x00040006_00000007
        slength=133504, xfer_length=79352
        seg: 0x1b0daae0,133504
        local_rdma_id: 6e
        caller: 0xe6

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (2): send_medium
        state (0x14): buffered dead
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=116, endpoint=0, seqnum=0x3361
        matched_val: 0x00040001_00000002
        slength=5992, xfer_length=5992
        seg: 0x1b136890,5992
        caller: 0x9f

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
Max retransmit retries reached (1000) for message
        type (3): send_large
        state (0x0):
        requeued: 1000 (timeout=501000ms)
        dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
        partner: peer_index=1, endpoint=0, seqnum=0xad1
        matched_val: 0x00040007_00000008
        slength=134400, xfer_length=134400
        seg: 0xb1d5600,134400
        local_rdma_id: 82
        caller: 0xc4

Was trying to contact
        00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
(cl120.foi.se:0) disconnected
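
One possible workaround, independent of whatever Open MPI itself does, would
be a wrapper that watches the job's output for the "disconnected" messages
shown above and then kills the job, for example with Grid Engine's qdel.
Below is a minimal Python sketch of the detection half only. It assumes the
message layout is exactly as in the output above; the name dead_peers is a
hypothetical helper, not part of Open MPI or MX.

```python
import re

# Assumption: MX reports a lost peer in the two-line form seen in the log:
#   Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
#   (cl120.foi.se:0) disconnected
# \s+ also matches the line break between the two lines.
DEAD_PEER = re.compile(
    r"Aborted \d+ send requests due to remote peer\s+"
    r"(?P<mac>[0-9a-f]{2}(?::[0-9a-f]{2}){5})\s+"
    r"\((?P<host>[^:()\s]+):\d+\)\s+disconnected"
)


def dead_peers(log_text):
    """Return the set of host names that the log reports as disconnected."""
    return {m.group("host") for m in DEAD_PEER.finditer(log_text)}


if __name__ == "__main__":
    sample = (
        "Aborted 2 send requests due to remote peer 00:60:dd:49:78:59\n"
        "(cl120.foi.se:0) disconnected\n"
    )
    # A wrapper could kill the job once this set is non-empty.
    print(dead_peers(sample))
```

Run on the output above, this would report only cl120.foi.se. How the wrapper
then terminates the job (qdel, killing mpirun, or something else) is left
open here.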