Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] OMPI 1.3 branch
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-05-14 13:09:41


On Thu, May 14, 2009 at 10:47 AM, Terry Dontje <Terry.Dontje_at_[hidden]> wrote:

> Ralph Castain wrote:
>
>> Hi folks
>>
>> I encourage everyone to please look at their MTT outputs. As we prepare
>> to roll the 1.3.3 release, I am seeing a lot of problems on the branch:
>>
>> 1. timeouts, coming in two forms: (a) MPI_Abort hanging, and (b)
>> collectives hanging (this is mostly on Solaris)
>>
> Can you clarify, or send me a link, as to why you believe (b) is mostly
> Solaris? Looking at last night's Sun MTT 1.3 nightly runs, I see 47
> timeouts on Linux and 24 timeouts on Solaris. That doesn't constitute
> "mostly Solaris" to me. Also, how are you determining that these
> timeouts are collective-based? I have a theory they are, but I don't
> have a clear smoking gun yet.

I looked at this MTT report, which showed hangs in a whole bunch of
collective tests:

http://www.open-mpi.org/mtt/index.php?limit=&wrap=&trial=&enable_drilldowns=&yaxis_scale=&xaxis_scale=&hide_subtitle=&split_graphs=&remote_go=&do_cookies=&phase=test_run&text_start_timestamp=2009-05-13+15%3A15%3A25+-+2009-05-14+15%3A15%3A25&text_platform_hardware=^x86_64%24&show_platform_hardware=show&text_os_name=^Linux%24&show_os_name=show&text_mpi_name=^ompi-nightly-v1.3%24&show_mpi_name=show&text_mpi_version=^1.3.3a1r21173%24&show_mpi_version=show&text_suite_name=all&show_suite_name=show&text_test_name=all&show_test_name=hide&text_np=all&show_np=show&text_full_command=&show_full_command=show&text_http_username=^sun%24&show_http_username=show&text_local_username=all&show_local_username=hide&text_platform_name=^burl-ct-v20z-10%24&show_platform_name=show&click=Detail&phase=test_run&test_result=_rt&text_os_version=&show_os_version=&text_platform_type=&show_platform_type=&text_hostname=&show_hostname=&text_compiler_name=&show_compiler_name=&text_compiler_version=&show_compiler_version=&text_vpath_mode=&show_vpath_mode=&text_endian=&show_endian=&text_bitness=&show_bitness=&text_configure_arguments=&text_exit_value=&show_exit_value=&text_exit_signal=&show_exit_signal=&text_duration=&show_duration=&text_client_serial=&show_client_serial=&text_result_message=&text_result_stdout=&text_result_stderr=&text_environment=&text_description=&text_launcher=&show_launcher=&text_resource_mgr=&show_resource_mgr=&text_network=&show_network=&text_parameters=&show_parameters=&lastgo=summary

When I look at the hangs on other systems, they are in non-collective tests.
I'm not sure what that really means, though - it was just an observation
based on this one set of tests.
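
For anyone who wants to try reproducing outside of MTT, something of this
general shape should exercise the same path as those tests. This is just a
minimal sketch of my own - not the actual test source - and the iteration
count, datatype, and reduction op are arbitrary choices:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, i, in, out = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      /* hammer the collective; the hangs are intermittent, so it
       * presumably needs many iterations before anything shows up */
      for (i = 0; i < 10000; i++) {
          in = rank + i;
          MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      }
      if (rank == 0)
          printf("completed, last result %d\n", out);
      MPI_Finalize();
      return 0;
  }

Running it at the same np values as the MTT runs (e.g. mpirun -np 4
./allreduce_loop) would be the closest match.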

>
>
> I've been looking at some collective hangs and segv's. These seem to
> happen across different platforms and OSes (Linux and Solaris). I've
> been finding them really hard to reproduce: I ran MPI_Allreduce_loc_c on
> three clusters for two days without a hang or segv. I am really concerned
> about whether we'll even be able to get this to fail with debugging on.
> I have not been able to get a core, or time with a hung run, in order to
> get more information.
>
>> 2. segfaults - mostly on sif, but occasionally elsewhere
>>
>> 3. daemon failed to report back - this was only on sif
>>
>> We will need to correct many of these for the release - unless these
>> prove to be due to trivial errors, I don't see how we will be ready to
>> roll release candidates next week.
>>
>> So let's please start taking a look at these?!
>>
> I've actually been looking at ours, though I have not been very vocal.
> I was hoping to get more info on our timeouts before requesting help.

No problem - I wasn't pointing a finger at anyone in particular. I just
wanted to highlight that the branch is not in great shape, given that we
had talked on the telecon about trying to do a release next week.
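
On your point about not being able to catch a hung run: one option might be
to build a watchdog right into the suspect tests, so that a hang turns into
a core file instead of an MTT timeout. A rough sketch of what I mean - the
timeout value is arbitrary, and backtrace() here is glibc-specific (on
Solaris, printstack() would be the rough analogue) and not strictly
async-signal-safe, but it's usually good enough for this:

  #include <execinfo.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* SIGALRM handler: dump this rank's stack to stderr and abort so
   * the kernel writes a core file (assuming ulimit -c allows one) */
  static void watchdog(int sig)
  {
      void *frames[64];
      int n;
      (void)sig;  /* unused */
      n = backtrace(frames, 64);
      backtrace_symbols_fd(frames, n, STDERR_FILENO);
      abort();
  }

  /* around the suspect collective:
   *     signal(SIGALRM, watchdog);
   *     alarm(600);            -- give it 10 minutes
   *     MPI_Allreduce(...);
   *     alarm(0);              -- disarm on success
   */

That way we would at least get stacks from the stuck ranks instead of MTT
simply killing the job.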

>> Ralph