Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to make a job abort when one host dies?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-08-17 14:43:42


George / Myricom --

Does the MX MTL abort if it gets a "disconnected" error back from
libmyriexpress?

On Aug 11, 2009, at 7:07 AM, Oskar Enoksson wrote:

> I searched the FAQ and Google but couldn't come up with a solution to
> this problem.
>
> My problem is that when one MPI execution host dies or the network
> connection goes down, the job is not aborted. Instead, the remaining
> processes continue to eat 100% CPU indefinitely. How can I make jobs
> abort in these cases?
>
> I use Open MPI 1.3.2. We have a Myrinet network and I use mtl/mx for MPI
> communication. We also use Grid Engine 6.2u3. The output from the running
> job indicates that the remaining processes detect a timeout trying to
> communicate with the (dead) host cl120.foi.se. But why do they not
> terminate after this failure?
>
> Thanks.
>
> Max retransmit retries reached (1000) for message
> type (1): send_small
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=1, endpoint=1, seqnum=0x2b8f
> matched_val: 0x0004000d_fffffff4
> slength=48, xfer_length=48
> seg: 0x7fffe11ff830,48
> caller: 0xdb
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (1): send_small
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=116, endpoint=1, seqnum=0x3726
> matched_val: 0x00040001_fffffff4
> slength=48, xfer_length=48
> seg: 0x7ffff124b7b0,48
> caller: 0x9b
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (1): send_small
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=1, endpoint=0, seqnum=0x1048
> matched_val: 0x00040006_fffffff4
> slength=48, xfer_length=48
> seg: 0x7fffc6470eb0,48
> caller: 0x70
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (1): send_small
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=1, endpoint=1, seqnum=0xd53
> matched_val: 0x00040007_fffffff4
> slength=48, xfer_length=48
> seg: 0x1f54360,48
> caller: 0xda
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (1): send_small
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=116, endpoint=0, seqnum=0x376c
> matched_val: 0x00040000_fffffff4
> slength=48, xfer_length=48
> seg: 0x82ec040,48
> caller: 0x12
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (1): send_small
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=1, endpoint=0, seqnum=0x2746
> matched_val: 0x0004000c_fffffff4
> slength=48, xfer_length=48
> seg: 0x1116f410,48
> caller: 0x30
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (1): send_small
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=1, endpoint=1, seqnum=0x18de
> matched_val: 0x00250001_fffffff4
> slength=104, xfer_length=104
> seg: 0x181c3100,104
> caller: 0x18
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (2): send_medium
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=116, endpoint=0, seqnum=0x3361
> matched_val: 0x0004000f_00000010
> slength=7168, xfer_length=7168
> seg: 0x23e8a838,7168
> caller: 0x7e
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (2): send_medium
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=116, endpoint=1, seqnum=0x3361
> matched_val: 0x0004000f_00000010
> slength=560, xfer_length=560
> seg: 0x23ec9fe0,560
> caller: 0x2d
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (2): send_medium
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=1, endpoint=1, seqnum=0x3361
> matched_val: 0x0004000c_0000000d
> slength=840, xfer_length=840
> seg: 0x1a471a90,840
> caller: 0xf9
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (3): send_large
> state (0x0):
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=1, endpoint=1, seqnum=0xad1
> matched_val: 0x00040006_00000007
> slength=133504, xfer_length=79352
> seg: 0x1b0daae0,133504
> local_rdma_id: 6e
> caller: 0xe6
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (2): send_medium
> state (0x14): buffered dead
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=116, endpoint=0, seqnum=0x3361
> matched_val: 0x00040001_00000002
> slength=5992, xfer_length=5992
> seg: 0x1b136890,5992
> caller: 0x9f
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
> Max retransmit retries reached (1000) for message
> type (3): send_large
> state (0x0):
> requeued: 1000 (timeout=501000ms)
> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
> partner: peer_index=1, endpoint=0, seqnum=0xad1
> matched_val: 0x00040007_00000008
> slength=134400, xfer_length=134400
> seg: 0xb1d5600,134400
> local_rdma_id: 82
> caller: 0xc4
>
> Was trying to contact
> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
> (cl120.foi.se:0) disconnected
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
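
One possible workaround, independent of whether the MX MTL itself aborts, is to watch the job output for the libmyriexpress "disconnected" messages quoted above and end the job from outside. The following is a minimal Python sketch under that assumption. It is not part of Open MPI or MX; the function names and the abort threshold are made up for illustration, and it only recognizes the message format that appears in the quoted log.

```python
import re
from collections import Counter

# Matches the message seen in the quoted log, for example:
#   Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
#   (cl120.foi.se:0) disconnected
# The format is taken from that log only; other MX versions may differ.
DISCONNECT_RE = re.compile(
    r"Aborted (\d+) send requests? due to remote peer "
    r"([0-9a-fA-F:]{17}) \(([^)]+)\) disconnected"
)


def find_disconnected_peers(log_text):
    """Return a Counter mapping peer name -> total aborted send requests."""
    # Drop mail quote markers ("> ") and rejoin lines that were wrapped.
    lines = [re.sub(r"^\s*>\s?", "", line) for line in log_text.splitlines()]
    flat = " ".join(" ".join(lines).split())

    aborted = Counter()
    for count, _mac, peer in DISCONNECT_RE.findall(flat):
        aborted[peer] += int(count)
    return aborted


def should_abort(log_text, threshold=1):
    """Hypothetical policy: abort once any peer reaches the threshold."""
    peers = find_disconnected_peers(log_text)
    return any(total >= threshold for total in peers.values())
```

A wrapper around mpirun could feed the job's output to should_abort and, when it returns True, kill mpirun or remove the job with Grid Engine's qdel. That part is left out here because it depends on the site setup.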