Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-03-18 16:27:46


This seems to be working, but I think we now have a process group problem -- I think we need to call setpgid() right after the fork so that the child gets its own process group. Otherwise, when we kill the group, we might end up killing much more than just the one MPI process (including the orted and/or the orted's parent!).

Ping me on IM -- I'm testing this idea and it seems to work properly.
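
To illustrate the idea, here is a minimal sketch of the fork-then-setpgid() pattern. It is not taken from the Open MPI source; the helper names (spawn_in_own_group, kill_group, run_demo) are hypothetical, and the child simply calls pause() as a stand-in for an MPI process. Both parent and child call setpgid() so that the new group exists regardless of which side runs first.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child and put it in its own process group,
 * so that its pgid equals its pid. Returns the child's pid, or -1. */
static pid_t spawn_in_own_group(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: leave the parent's process group before doing any work. */
        setpgid(0, 0);
        for (;;) {
            pause();            /* stand-in for the real application */
        }
    }
    /* Parent: make the same call to close the race with the child.
     * EACCES would mean the child already exec'ed, after its own setpgid(). */
    if (setpgid(pid, pid) < 0 && errno != EACCES) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }
    return pid;
}

/* Hypothetical helper: signal every process in the group pgid. */
static int kill_group(pid_t pgid, int sig)
{
    /* kill(0, ...) would hit our own group and kill(-1, ...) nearly
     * everything we are allowed to signal, so refuse those values. */
    assert(pgid > 1);
    return kill(-pgid, sig);
}

/* Returns 0 if the child ended up in a separate group and killing that
 * group terminated the child without touching the caller. */
int run_demo(void)
{
    pid_t my_group = getpgrp();
    pid_t child = spawn_in_own_group();
    int status = 0;

    if (child < 0) {
        return 1;
    }
    if (getpgid(child) != child || getpgid(child) == my_group) {
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
        return 1;
    }
    if (kill_group(child, SIGKILL) != 0) {
        return 1;
    }
    if (waitpid(child, &status, 0) != child) {
        return 1;
    }
    if (!WIFSIGNALED(status) || WTERMSIG(status) != SIGKILL) {
        return 1;
    }
    /* Still running here, so the group kill did not reach the caller. */
    return getpgrp() == my_group ? 0 : 1;
}
```

Without the setpgid() calls, the child would inherit the caller's process group, and a kill() aimed at that group would also signal the caller and anything else sharing the group.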

On Mar 18, 2014, at 4:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Okay, fixed and cmr'd to you
>
>
> On Mar 18, 2014, at 11:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>>
>> On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell) <dgoodell_at_[hidden]> wrote:
>>
>>> Ralph,
>>>
>>> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to HEAD):
>>>
>>> ----8<----
>>> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
>>> --------------------------------------------------------------------------
>>> The user-provided time limit for job execution has been
>>> reached:
>>>
>>> MPIEXEC_TIMEOUT: 8 seconds
>>>
>>> The job will now be aborted. Please check your code and/or
>>> adjust/remove the job execution time limit (as specified
>>> by MPIEXEC_TIMEOUT in your environment).
>>>
>>> --------------------------------------------------------------------------
>>> srun: error: mpi015: task 0: Killed
>>> srun: Terminating job step 689585.2
>>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>>> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 16]
>>> [savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] mca_oob_tcp_peer_send_handler: unable to send header
>>>
>>> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate
>>>
>>> ^C
>>> ----8<----
>>>
>>> Where each "^C" is a ctrl-c typed after an arbitrary amount of time was allowed to pass (several minutes for the first two, <5s for the third).
>>>
>>> Where "sleeper" is just an MPI program that does:
>>>
>>> ----8<----
>>> MPI_Init(&argc, &argv);
>>> MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
>>> MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>>>
>>> while (1) {
>>>     sleep(60);
>>> }
>>>
>>> MPI_Finalize();
>>> ----8<----
>>>
>>> It happens under slurm and SSH. If I launch on localhost (no --host/--hostfile option, no slurm, etc.) then it exits just fine. The example output I gave above used the "usnic" BTL, but "tcp" has identical behavior.
>>>
>>> This worked fine in v1.7.4. I've bisected the change in behavior down to r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>>>
>>> Should I file a ticket?
>>>
>>
>> Crud - no, I'll take a look in a little bit
>>
>>
>>> -Dave
>>>
>>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14367.php

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/