Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] TotalView Memory debugging and OpenMPI
From: Peter Thompson (peter.thompson_at_[hidden])
Date: 2011-05-16 14:45:50


Hi Ralph,

We've had a number of user complaints about this. Since it looks on
the face of it like a debugger issue, it may not have made its
way back here. Is your objection that the patch basically aborts if it
gets a bad value? I could understand that being a concern. Of
course, it aborts under TotalView now if we attempt to move forward without
this patch.

I've passed your comment back to the engineer, along with my suspicion
that the concern is the abort. If you have other objections, let me know.
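
For anyone following along, here is a minimal standalone sketch of the putenv() behavior that the comment in the patch quoted below describes. It is not the orterun code: the helper names (export_no_copy, export_with_copy, putenv_demo) and the PUTENV_DEMO_* variables are made up for illustration, and it assumes a POSIX system where putenv() stores the caller's pointer rather than copying the string.

```c
#define _XOPEN_SOURCE 700   /* for putenv, strdup, unsetenv */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: hand the caller's buffer straight to putenv(),
 * as the unpatched loop does. The environment now points INTO entry. */
int export_no_copy(char *entry)
{
    assert(entry != NULL && strchr(entry, '=') != NULL);
    return putenv(entry) == 0 ? 0 : -1;
}

/* Hypothetical helper mirroring the patch: give putenv() a private copy.
 * The copy is deliberately never freed while the variable is set,
 * because the environment keeps pointing at it. */
int export_with_copy(const char *entry)
{
    char *copy;

    assert(entry != NULL && strchr(entry, '=') != NULL);
    copy = strdup(entry);
    if (NULL == copy) {
        return -1;          /* out of memory; the patch returns an OPAL error here */
    }
    if (putenv(copy) != 0) {
        free(copy);
        return -1;
    }
    return 0;
}

/* Returns 0 if the aliased variable follows later writes to the caller's
 * buffer while the copied variable does not. */
int putenv_demo(void)
{
    char aliased[] = "PUTENV_DEMO_A=first";
    char copied[]  = "PUTENV_DEMO_B=first";
    const char *a;
    const char *b;
    int ok;

    if (export_no_copy(aliased) != 0 || export_with_copy(copied) != 0) {
        return -1;
    }

    /* Overwrite the value part of both caller-owned buffers (same length). */
    memcpy(strchr(aliased, '=') + 1, "XXXXX", 5);
    memcpy(strchr(copied, '=') + 1, "XXXXX", 5);

    a = getenv("PUTENV_DEMO_A");
    b = getenv("PUTENV_DEMO_B");
    ok = a != NULL && b != NULL
         && strcmp(a, "XXXXX") == 0     /* environment followed the buffer */
         && strcmp(b, "first") == 0;    /* private copy was unaffected */
    if (ok) {
        printf("no copy: %s, with copy: %s\n", a, b);
    }

    /* Drop the aliased entry before its stack buffer goes out of scope,
     * otherwise environ would be left with a dangling pointer. */
    unsetenv("PUTENV_DEMO_A");
    return ok ? 0 : 1;
}
```

Calling putenv_demo() from a main() shows the difference: the variable exported without a copy changes when the caller's buffer changes, which is the same aliasing that turns into a dangling pointer once the buffer is freed. The variable exported through the strdup() copy keeps its value.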

Cheers,
PeterT

Ralph Castain wrote:
> That would be a problem, I fear. We need to push those envars into the environment.
>
> Is there some particular problem causing what you see? We have no other reports of this issue, and orterun has had that code forever.
>
>
>
> Sent from my iPad
>
> On May 11, 2011, at 2:05 PM, Peter Thompson <peter.thompson_at_[hidden]> wrote:
>
>
>> We've gotten a few reports of problems with memory debugging when using OpenMPI under TotalView. Usually, TotalView will attach to the processes started after an MPI_Init. However, in the case where memory debugging is enabled, things seemed to run away or fail. My analysis showed that we had a number of core files left over from the attempt, and all were mpirun (or orterun) cores. It seemed to be a regression on our part, since testing seemed to indicate this worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it to engineering. After giving our engineer a brief tutorial on how to build a debug version of OpenMPI, he found what appears to be a problem in the code for orterun.c. He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being the versions he's tested with so far. He doesn't subscribe to this list that I know of, so I offered to pass this by the group. Of course, I'm not sure if this is exactly the right place to submit patches, but I'm sure you'd tell me where to put it if I'm in the wrong here. It's a short patch, so I'll cut and paste it, and attach as well, since cut and paste can do weird things to formatting.
>>
>> Credit goes to Ariel Burton for this patch. Of course he used TotalView to find this ;-) It shows up if you do 'mpirun -tv -np 4 ./foo' or 'totalview mpirun -a -np 4 ./foo'.
>>
>> Cheers,
>> PeterT
>>
>>
>> more ~/patches/anbs-patch
>> *** orte/tools/orterun/orterun.c 2010-04-13 13:30:34.000000000 -0400
>> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c 2011-05-09 20:28:16.588183000 -0400
>> ***************
>> *** 1578,1588 ****
>>       }
>>       if (NULL != env) {
>>           size1 = opal_argv_count(env);
>>           for (j = 0; j < size1; ++j) {
>> !             putenv(env[j]);
>>           }
>>       }
>>       /* All done */
>> --- 1578,1600 ----
>>       }
>>       if (NULL != env) {
>>           size1 = opal_argv_count(env);
>>           for (j = 0; j < size1; ++j) {
>> !             /* Use-after-Free error possible here. putenv does not copy
>> !                the string passed to it, and instead stores only the pointer.
>> !                env[j] may be freed later, in which case the pointer
>> !                in environ will now be left dangling into a deallocated
>> !                region.
>> !                So we make a copy of the variable.
>> !             */
>> !             char *s = strdup(env[j]);
>> !
>> !             if (NULL == s) {
>> !                 return OPAL_ERR_OUT_OF_RESOURCE;
>> !             }
>> !             putenv(s);
>>           }
>>       }
>>       /* All done */
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>