Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-03-18 16:11:21


Okay, fixed and cmr'd to you

On Mar 18, 2014, at 11:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:

>
> On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell) <dgoodell_at_[hidden]> wrote:
>
>> Ralph,
>>
>> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to HEAD):
>>
>> ----8<----
>> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
>> --------------------------------------------------------------------------
>> The user-provided time limit for job execution has been
>> reached:
>>
>> MPIEXEC_TIMEOUT: 8 seconds
>>
>> The job will now be aborted. Please check your code and/or
>> adjust/remove the job execution time limit (as specified
>> by MPIEXEC_TIMEOUT in your environment).
>>
>> --------------------------------------------------------------------------
>> srun: error: mpi015: task 0: Killed
>> srun: Terminating job step 689585.2
>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 16]
>> [savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] mca_oob_tcp_peer_send_handler: unable to send header
>>
>> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate
>>
>> ^C
>> ----8<----
>>
>> Each "^C" above is a ctrl-c that I typed after letting an arbitrary amount of time pass (several minutes before the first two, <5s before the third).
>>
>> Where "sleeper" is just an MPI program that does:
>>
>> ----8<----
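>> #include <mpi.h>
>> #include <unistd.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int wrank, wsize;
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>>
>>     /* Sleep forever; only the MPIEXEC_TIMEOUT abort can end the job. */
>>     while (1) {
>>         sleep(60);
>>     }
>>
>>     MPI_Finalize(); /* never reached */
>>     return 0;
>> }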
>> ----8<----
>>
>> It happens under slurm and SSH. If I launch on localhost (no --host/--hostfile option, no slurm, etc.) then it exits just fine. The example output I gave above used the "usnic" BTL, but "tcp" has identical behavior.
>>
>> This worked fine in v1.7.4. I've bisected the change in behavior down to r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>>
>> Should I file a ticket?
>>
>
> Crud - no, I'll take a look in a little bit
>
>
>> -Dave
>>
>