Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to make a job abort when one host dies?
From: Scott Atchley (atchley_at_[hidden])
Date: 2009-08-17 16:22:04


On Aug 17, 2009, at 2:43 PM, Jeff Squyres wrote:

> George / Myricom --
>
> Does the MX MTL abort if it gets a "disconnected" error back from
> libmyriexpress?

Short answer: yes.

Long answer:

The messages below indicate that these processes were all trying to
send to cl120. It did not ack their messages after 1000 resend
attempts (one every 0.5 seconds), which is 500 seconds, or about
8.3 minutes.
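
For anyone who wants to check the arithmetic, here is a small sketch in
Python. The constant names are only illustrative; the values are the
ones quoted above. Note that the log itself reports timeout=501000ms,
so the figure MX prints is about one second longer than this estimate.

```python
# Retry budget described above: 1000 resends, one every 0.5 s.
RETRIES = 1000
INTERVAL_S = 0.5

total_s = RETRIES * INTERVAL_S   # seconds before MX gives up on the peer
total_min = total_s / 60         # the same budget in minutes

print(f"{total_s:.0f} s = {total_min:.1f} min")   # prints "500 s = 8.3 min"
```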

The messages also indicate that the message was a send_small, which
means it was 128 bytes or less. MX has MPI-like semantics and allows
a send to complete once the message has been either buffered or
delivered. In this case, it was buffered and OMPI was most likely able
to complete it successfully. The message could not be delivered,
however, and its timeout caused MX to fail all future sends to that
host. On the next mx_isend(), OMPI will detect a failure.

Since OMPI does not detect the failure, my guess is that the processes
have not tried to send to that host again. They then end up waiting
forever.

They can change MX's behavior so that it does not complete a send
until the receiver has acked it by exporting:

MX_ZOMBIE_SEND=0

This will hurt benchmark performance, but real application performance
should not be affected.
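
To make the difference between the two completion modes concrete, here
is a toy model in Python. It is only a sketch of the behavior described
above, not the MX API: the ToyLink class and its fields are invented
for illustration, and the roughly 500 second retry window is collapsed
into a single step.

```python
class ToyLink:
    """Toy stand-in for the path to one peer. Hypothetical, not MX code."""

    def __init__(self, peer_alive, wait_for_ack):
        self.peer_alive = peer_alive      # does the peer ever ack?
        self.wait_for_ack = wait_for_ack  # True models MX_ZOMBIE_SEND=0
        self.peer_failed = False          # set once a delivery has timed out

    def send(self, payload):
        """Return the completion status the sender sees: 'ok' or 'error'."""
        if self.peer_failed:
            return "error"                # every later send to this peer fails
        if self.wait_for_ack:
            if not self.peer_alive:
                self.peer_failed = True   # timeout is reported on this send
                return "error"
            return "ok"
        # Buffered mode: completion is reported before delivery is known.
        if not self.peer_alive:
            self.peer_failed = True       # the timeout happens later, unseen
        return "ok"


# Default behavior: the first send "succeeds", the failure shows up later.
buffered = ToyLink(peer_alive=False, wait_for_ack=False)
first = buffered.send(b"x" * 48)
second = buffered.send(b"x" * 48)

# Ack-required behavior: the failure surfaces on the same send.
acked = ToyLink(peer_alive=False, wait_for_ack=True)
only = acked.send(b"x" * 48)

print(first, second, only)   # prints "ok error error"
```

In the buffered case a process that never sends to the dead host again
never sees the error, which matches the hang described above. Note also
that the variable has to be set in the environment of every rank; with
Open MPI, mpirun's -x option (for example, -x MX_ZOMBIE_SEND=0) exports
it to the launched processes.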

The question is, however, why is cl120 not acking messages? What is
the application? What MPI calls does this application use?

Scott

> On Aug 11, 2009, at 7:07 AM, Oskar Enoksson wrote:
>
>> I searched the FAQ and google but couldn't come up with a solution to
>> this problem.
>>
>> My problem is that when one MPI execution host dies or the network
>> connection goes down the job is not aborted. Instead the remaining
>> processes continue to eat 100% CPU indefinitely. How can I make jobs
>> abort in these cases?
>>
>> I use OpenMPI 1.3.2. We have a myrinet network and I use mtl/mx for
>> mpi
>> communication. We also use gridengine 6.2u3. The output from the
>> running
>> job indicates that the remaining processes detect a timeout trying to
>> communicate with the (dead) host cl120.foi.se. But why do they not
>> terminate after this failure?
>>
>> Thanks.
>>
>> Max retransmit retries reached (1000) for message
>> type (1): send_small
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=1, endpoint=1, seqnum=0x2b8f
>> matched_val: 0x0004000d_fffffff4
>> slength=48, xfer_length=48
>> seg: 0x7fffe11ff830,48
>> caller: 0xdb
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
>> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (1): send_small
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=116, endpoint=1, seqnum=0x3726
>> matched_val: 0x00040001_fffffff4
>> slength=48, xfer_length=48
>> seg: 0x7ffff124b7b0,48
>> caller: 0x9b
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
>> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (1): send_small
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=1, endpoint=0, seqnum=0x1048
>> matched_val: 0x00040006_fffffff4
>> slength=48, xfer_length=48
>> seg: 0x7fffc6470eb0,48
>> caller: 0x70
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
>> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (1): send_small
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=1, endpoint=1, seqnum=0xd53
>> matched_val: 0x00040007_fffffff4
>> slength=48, xfer_length=48
>> seg: 0x1f54360,48
>> caller: 0xda
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
>> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (1): send_small
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=116, endpoint=0, seqnum=0x376c
>> matched_val: 0x00040000_fffffff4
>> slength=48, xfer_length=48
>> seg: 0x82ec040,48
>> caller: 0x12
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
>> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (1): send_small
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=1, endpoint=0, seqnum=0x2746
>> matched_val: 0x0004000c_fffffff4
>> slength=48, xfer_length=48
>> seg: 0x1116f410,48
>> caller: 0x30
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
>> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (1): send_small
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=1, endpoint=1, seqnum=0x18de
>> matched_val: 0x00250001_fffffff4
>> slength=104, xfer_length=104
>> seg: 0x181c3100,104
>> caller: 0x18
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
>> Aborted 2 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (2): send_medium
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=116, endpoint=0, seqnum=0x3361
>> matched_val: 0x0004000f_00000010
>> slength=7168, xfer_length=7168
>> seg: 0x23e8a838,7168
>> caller: 0x7e
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
>> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (2): send_medium
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=116, endpoint=1, seqnum=0x3361
>> matched_val: 0x0004000f_00000010
>> slength=560, xfer_length=560
>> seg: 0x23ec9fe0,560
>> caller: 0x2d
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
>> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (2): send_medium
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=1, endpoint=1, seqnum=0x3361
>> matched_val: 0x0004000c_0000000d
>> slength=840, xfer_length=840
>> seg: 0x1a471a90,840
>> caller: 0xf9
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
>> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (3): send_large
>> state (0x0):
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=1, endpoint=1, seqnum=0xad1
>> matched_val: 0x00040006_00000007
>> slength=133504, xfer_length=79352
>> seg: 0x1b0daae0,133504
>> local_rdma_id: 6e
>> caller: 0xe6
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/1
>> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (2): send_medium
>> state (0x14): buffered dead
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=116, endpoint=0, seqnum=0x3361
>> matched_val: 0x00040001_00000002
>> slength=5992, xfer_length=5992
>> seg: 0x1b136890,5992
>> caller: 0x9f
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
>> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>> Max retransmit retries reached (1000) for message
>> type (3): send_large
>> state (0x0):
>> requeued: 1000 (timeout=501000ms)
>> dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
>> partner: peer_index=1, endpoint=0, seqnum=0xad1
>> matched_val: 0x00040007_00000008
>> slength=134400, xfer_length=134400
>> seg: 0xb1d5600,134400
>> local_rdma_id: 82
>> caller: 0xc4
>>
>> Was trying to contact
>> 00:60:dd:49:78:59 (cl120.foi.se:0)/0
>> Aborted 1 send requests due to remote peer 00:60:dd:49:78:59
>> (cl120.foi.se:0) disconnected
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]