On Thu, May 14, 2009 at 10:47 AM, Terry Dontje <Terry.Dontje@sun.com> wrote:
Ralph Castain wrote:
Hi folks

I encourage people to please look at their MTT outputs. As we are preparing to roll the 1.3.3 release, I am seeing a lot of problems on the branch:

1. timeouts, coming in two forms: (a) MPI_Abort hanging, and (b) collectives hanging (this is mostly on Solaris)

Can you clarify, or send me a link showing, what makes you believe (b) is mostly Solaris? Looking at Sun's MTT 1.3 nightly runs from last night, I see 47 timeouts on Linux and 24 timeouts on Solaris. That doesn't constitute "mostly Solaris" to me. Also, how are you determining that these timeouts are collective-based? I have a theory that they are, but I don't have a clear smoking gun yet.

I looked at this MTT report, which showed it hanging in a whole bunch of collective tests:

http://www.open-mpi.org/mtt/index.php?limit=&wrap=&trial=&enable_drilldowns=&yaxis_scale=&xaxis_scale=&hide_subtitle=&split_graphs=&remote_go=&do_cookies=&phase=test_run&text_start_timestamp=2009-05-13+15%3A15%3A25+-+2009-05-14+15%3A15%3A25&text_platform_hardware=^x86_64%24&show_platform_hardware=show&text_os_name=^Linux%24&show_os_name=show&text_mpi_name=^ompi-nightly-v1.3%24&show_mpi_name=show&text_mpi_version=^1.3.3a1r21173%24&show_mpi_version=show&text_suite_name=all&show_suite_name=show&text_test_name=all&show_test_name=hide&text_np=all&show_np=show&text_full_command=&show_full_command=show&text_http_username=^sun%24&show_http_username=show&text_local_username=all&show_local_username=hide&text_platform_name=^burl-ct-v20z-10%24&show_platform_name=show&click=Detail&phase=test_run&test_result=_rt&text_os_version=&show_os_version=&text_platform_type=&show_platform_type=&text_hostname=&show_hostname=&text_compiler_name=&show_compiler_name=&text_compiler_version=&show_compiler_version=&text_vpath_mode=&show_vpath_mode=&text_endian=&show_endian=&text_bitness=&show_bitness=&text_configure_arguments=&text_exit_value=&show_exit_value=&text_exit_signal=&show_exit_signal=&text_duration=&show_duration=&text_client_serial=&show_client_serial=&text_result_message=&text_result_stdout=&text_result_stderr=&text_environment=&text_description=&text_launcher=&show_launcher=&text_resource_mgr=&show_resource_mgr=&text_network=&show_network=&text_parameters=&show_parameters=&lastgo=summary

When I look at the hangs on other systems, they are in non-collective tests. I'm not sure what that really means, though - it was just an observation based on this one set of tests.
 


I've been looking at some collective hangs and segvs. These seem to happen across different platforms and OSes (Linux and Solaris). I've been finding them really hard to reproduce: I ran MPI_Allreduce_loc_c on three clusters for 2 days without a hang or segv. I am really concerned about whether we'll even be able to get this to fail with debugging on.
I have not been able to get a core file, or time on a hung run, to gather more information.

2. segfaults - mostly on sif, but occasionally elsewhere

3. daemon failed to report back - this was only on sif

We will need to correct many of these for the release - unless they prove to be due to trivial errors, I don't see how we will be ready to roll release candidates next week.

So let's please start taking a look at these?!

I've actually been looking at ours though I have not been extremely vocal.  I was hoping to get more info on our timeouts before requesting help.

No problem - I wasn't pointing a finger at anyone in particular. Just wanted to highlight that the branch is not in great shape since we had talked on the telecon about trying to do a release next week.



Ralph

------------------------------------------------------------------------

_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
 
