
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] r23023 change to trunk causes problems with exit value
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-04-27 12:02:33


I didn't run that specific test, but I did run a test that calls MPI_Abort. I did find a bug this morning (reported by Sam), though, that was causing the state of remote procs to be incorrectly reported.

Try with r23048 or higher.

On Apr 27, 2010, at 9:15 AM, Rolf vandeVaart wrote:

> Ralph, did you get a chance to run the ibm/final test to see if these changes fixed the problem? I just rebuilt the trunk and tried it and I still get an exit status of 0 back. I will run it again to make sure I have not made a mistake.
>
> Rolf
>
> On 04/26/10 23:43, Ralph Castain wrote:
>>
>> Okay, this should finally be fixed. See the commit message for r23045 for an explanation.
>>
>> It really wasn't anything in the cited changeset that caused the problem. The root cause is that $#@$ abort file we dropped in the session dir to indicate you called MPI_Abort rather than trying to clean up thoroughly. It's been biting us in the butt for years - finally removed it.
>>
>>
>> On Apr 26, 2010, at 12:58 PM, Rolf vandeVaart wrote:
>>
>>> The ibm/final test does not call MPI_Abort directly. It is calling MPI_Barrier after MPI_Finalize is called, which is a no-no. This is detected and eventually the library calls ompi_mpi_abort(). This is very similar to MPI_Abort() which ultimately calls ompi_mpi_abort as well. So, I guess I am saying for all intents and purposes, it calls MPI_Abort.
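As a hedged illustration (a minimal sketch, not the actual ibm/final source), the erroneous pattern the test exercises looks roughly like this:

```c
/* Minimal sketch of the barrier-after-finalize error the ibm/final
 * test exercises (assumption: this is NOT the actual test source).
 * Calling any MPI function after MPI_Finalize is erroneous per the
 * MPI standard; Open MPI detects this and ends up in ompi_mpi_abort(),
 * so mpirun should exit with a non-zero status. */
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    /* Erroneous: MPI has already been finalized. */
    MPI_Barrier(MPI_COMM_WORLD);
    return 0;  /* never reached; the library aborts the job */
}
```

This requires an MPI installation to build and run (e.g. `mpicc`), so it is shown for illustration only.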
>>>
>>> Rolf
>>>
>>> On 04/26/10 14:41, Ralph Castain wrote:
>>>>
>>>> I'll try to keep it in mind as I continue the errmgr work. I gather these tests all call MPI_Abort?
>>>>
>>>>
>>>> On Apr 26, 2010, at 12:31 PM, Rolf vandeVaart wrote:
>>>>
>>>>
>>>>> With our MTT testing we have noticed a problem that has cropped up in the trunk. There are some tests that are supposed to return a non-zero status because they are getting errors, but are instead returning 0. This problem does not exist in r23022 but does exist in r23023.
>>>>>
>>>>> One can use the ibm/final test to reproduce the problem. An example of a passing case followed by a failing case is shown below.
>>>>>
>>>>> Ralph, do you want me to open a ticket on this, or do you just want to take a look? I am asking you since you made the r23023 commit.
>>>>>
>>>>> Rolf
>>>>>
>>>>>
>>>>> TRUNK VERSION r23022:
>>>>> [rolfv_at_burl-ct-x2200-6 environment]$ mpirun -np 1 -mca btl sm,self final
>>>>> **************************************************************************
>>>>> This test should generate a message about MPI is either not initialized or
>>>>> has already been finialized.
>>>>> ERRORS ARE EXPECTED AND NORMAL IN THIS PROGRAM!!
>>>>> **************************************************************************
>>>>> *** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
>>>>> *** This is disallowed by the MPI standard.
>>>>> *** Your MPI job will now abort.
>>>>> [burl-ct-x2200-6:6072] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> [rolfv_at_burl-ct-x2200-6 environment]$ echo $status
>>>>> 1
>>>>> [rolfv_at_burl-ct-x2200-6 environment]$
>>>>>
>>>>>
>>>>> TRUNK VERSION r23023:
>>>>> [rolfv_at_burl-ct-x2200-6 environment]$ mpirun -np 1 -mca btl sm,self final
>>>>> **************************************************************************
>>>>> This test should generate a message about MPI is either not initialized or
>>>>> has already been finialized.
>>>>> ERRORS ARE EXPECTED AND NORMAL IN THIS PROGRAM!!
>>>>> **************************************************************************
>>>>> *** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
>>>>> *** This is disallowed by the MPI standard.
>>>>> *** Your MPI job will now abort.
>>>>> [burl-ct-x2200-6:4089] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!
>>>>> [rolfv_at_burl-ct-x2200-6 environment]$ echo $status
>>>>> 0
>>>>> [rolfv_at_burl-ct-x2200-6 environment]$
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>