Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-03-18 14:00:35


On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell) <dgoodell_at_[hidden]> wrote:

> Ralph,
>
> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to HEAD):
>
> ----8<----
> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
> --------------------------------------------------------------------------
> The user-provided time limit for job execution has been
> reached:
>
> MPIEXEC_TIMEOUT: 8 seconds
>
> The job will now be aborted. Please check your code and/or
> adjust/remove the job execution time limit (as specified
> by MPIEXEC_TIMEOUT in your environment).
>
> --------------------------------------------------------------------------
> srun: error: mpi015: task 0: Killed
> srun: Terminating job step 689585.2
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 16]
> [savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] mca_oob_tcp_peer_send_handler: unable to send header
>
> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate
>
> ^C
> ----8<----
>
> Where each "^C" is a ctrl-c typed after an arbitrary amount of time had passed (several minutes for the first two, <5s for the third).
>
> Where "sleeper" is just an MPI program that does:
>
> ----8<----
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
> MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>
> while (1) {
>     sleep(60);
> }
>
> MPI_Finalize();
> ----8<----
>
> It happens under slurm and SSH. If I launch on localhost (no --host/--hostfile option, no slurm, etc.) then it exits just fine. The example output I gave above used the "usnic" BTL, but "tcp" has identical behavior.
>
> This worked fine in v1.7.4. I've bisected the change in behavior down to r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>
> Should I file a ticket?
>

Crud - no, I'll take a look in a little bit

> -Dave
>