
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] 1.3 test failures
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-10-31 09:37:05


Ralph, Rolf, and I talked about this issue on the phone this morning.
We're pretty sure it's an overflow caused by the large number of procs
being run. LANL is going to try rebuilding and running the tests with
-DLARGE_CLUSTER and see what happens. Rolf thinks he has run the Intel C
tests at up to 1k procs, so hopefully that will be sufficient.
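
For reference, -DLARGE_CLUSTER is a compile-time define, so it has to end
up on the compile line when the tests are rebuilt. Something like the
following should do it; whether the suite's makefiles take a plain CFLAGS
override or some suite-specific variable is a guess on my part:

   # rebuild the Intel C tests with the large-cluster define;
   # CFLAGS is passed on the make command line so it overrides the makefile
   make clean
   make CFLAGS="-DLARGE_CLUSTER"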

On Oct 30, 2008, at 7:01 PM, Ralph Castain wrote:

> Hi folks
>
> We aren't running a full MTT here (which is why I'm reporting these
> results to the list instead of into the MTT database), but we are
> running a subset of tests on the 1.3 beta and hitting a consistent
> set of errors involving five tests. For reference, all of these tests
> pass on 1.2.6 but fail on 1.2.8 in the same way they fail on the 1.3
> beta, so it appears that something systematic may have entered the
> code base and made its way into the 1.2 series as well.
>
> The tests are:
> MPI_Pack_user_type
> MPI_Type_hindexed_blklen
> MPI_Type_vector_stride
> MPI_Cart_get_c
> MPI_Graph_neighbors_c
>
> The tests are running under slurm on RHEL5, with 16 Opteron cores per
> node plus IB. The results below are from 40 nodes at 16ppn.
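>
> For concreteness, the tests are driven through the suite's makefile
> (hence the "make[1]" lines in the output below); run by hand, a single
> test at this scale amounts to roughly the following (the command line
> is shown for illustration only, not copied from our scripts):
>
>    # inside the 40-node slurm allocation (e.g. from "salloc -N 40"),
>    # 16 slots per node gives 640 ranks total
>    mpirun -np 640 ./MPI_Pack_user_type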
>
> Any thoughts would be appreciated. Meantime, we are trying different
> ppn to see if that has an impact.
>
> Thanks
> Ralph
>
> Here is what we see:
>
>>> MPITEST error (585): 1 errors in buffer (17, 5) len 1024 commsize 214 commtype -14 extent 64 root 194
>>> MPITEST error (591): Received buffer overflow, Expected buffer[65536]: -197, Actual buffer[65536]: 59
>>> MPITEST error (591): 1 errors in buffer (17, 5) len 1024 commsize 214 commtype -14 extent 64 root 196
>>> MPITEST_results: MPI_Pack_user_type 60480 tests FAILED (of 21076704)
>>>
>>> MPITEST error (597): Received buffer overflow, Expected buffer[16384]: -199, Actual buffer[16384]: 57
>>> MPITEST error (597): 1 errors in buffer (17, 5) len 16 commsize 214 commtype -14 extent 64 root 198
>>> MPITEST error (585): Received buffer overflow, Expected buffer[16384]: -195, Actual buffer[16384]: 61
>>> MPITEST error (585): 1 errors in buffer (17, 5) len 16 commsize 214 commtype -14 extent 64 root 194
>>> MPITEST_results: MPI_Type_hindexed_blklen 60480 tests FAILED (of 21076704)
>>>
>>>
>>> MPITEST error (597): Received buffer overflow, Expected buffer[65536]: -199, Actual buffer[65536]: 57
>>> MPITEST error (597): 1 errors in buffer (17, 5) len 512 commsize 214 commtype -14 extent 64 root 198
>>> MPITEST error (615): Received buffer overflow, Expected buffer[65536]: -205, Actual buffer[65536]: 51
>>> MPITEST error (615): 1 errors in buffer (17, 5) len 512 commsize 214 commtype -14 extent 64 root 204
>>> MPITEST_results: MPI_Type_vector_stride 60480 tests FAILED (of 21076704)
>>>
>>> [lob097:32556] *** Process received signal ***
>>> mpirun noticed that job rank 0 with PID 32556 on node lob097 exited on signal 11 (Segmentation fault).
>>> 639 additional processes aborted (not shown)
>>> make[1]: *** [MPI_Cart_get_c] Error 139
>>>
>>>
>>> MPITEST fatal error (568): MPI_ERR_COMM: invalid communicator
>>> MPITEST fatal error (572): MPI_ERR_COMM: invalid communicator
>>> MPITEST fatal error (574): MPI_ERR_COMM: invalid communicator
>>> mpirun noticed that job rank 37 with PID 32074 on node lob099 exited on signal 1 (Hangup).
>>> 18 additional processes aborted (not shown)
>>> make[1]: *** [MPI_Graph_neighbors_c] Error 1
>
> Here is how the different versions are built:
>
> 1.2.6 and 1.2.8
>>> oob_tcp_connect_timeout=600
>>> pml_ob1_use_early_completion=0
>>> mca_component_show_load_errors=0
>>> btl_openib_ib_retry_count=7
>>> btl_openib_ib_timeout=31
>>> mpi_keep_peer_hostnames=1
>>>
>>>
>>> RPMBUILD parameters
>>> setenv CPPFLAGS -I/opt/panfs/include
>>> setenv CFLAGS -I/opt/panfs/include
>>>
>>> rpmbuild -bb ./SPECS/loboopenmpi128.spec \
>>> --with gcc \
>>> --with root=/opt/OpenMPI \
>>> --with shared \
>>> --with openib \
>>> --with slurm \
>>> --without pty_support \
>>> --without dlopen \
>>> --with io_romio_flags=--with-file-system=ufs+nfs+panfs
>>>
>
> 1.3beta
>
>>> # Basic behavior to smooth startup
>>> mca_component_show_load_errors = 0
>>> orte_abort_timeout = 10
>>> opal_set_max_sys_limits = 1
>>>
>>> ## Protect the shared file systems
>>> orte_no_session_dirs = /panfs,/scratch,/users,/usr/projects
>>> orte_tmpdir_base = /tmp
>>>
>>> ## Require an allocation to run - protects the frontend
>>> ## from inadvertent job executions
>>> orte_allocation_required = 1
>>>
>>> ## Add the interface for out-of-band communication
>>> ## and set it up
>>> oob_tcp_if_include=ib0
>>> oob_tcp_peer_retries = 10
>>> oob_tcp_disable_family = IPv6
>>> oob_tcp_listen_mode = listen_thread
>>> oob_tcp_sndbuf = 32768
>>> oob_tcp_rcvbuf = 32768
>>>
>>> ## Define the MPI interconnects
>>> btl = sm,openib,self
>>>
>>> ## Setup OpenIB
>>> btl_openib_want_fork_support = 0
>>> btl_openib_cpc_include = oob
>>> #btl_openib_receive_queues = P,128,256,64,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32
>>>
>>> ## Enable cpu affinity
>>> mpi_paffinity_alone = 1
>>>
>>> ## Setup MPI options
>>> mpi_show_handle_leaks = 0
>>> mpi_warn_on_fork = 1
>
>>> enable_dlopen=no
>>> with_openib=/opt/ofed
>>> with_openib_libdir=/opt/ofed/lib64
>>> enable_mem_debug=no
>>> enable_mem_profile=no
>>> enable_debug_symbols=no
>>> enable_binaries=yes
>>> with_devel_headers=yes
>>> enable_heterogeneous=yes
>>> enable_debug=no
>>> enable_shared=yes
>>> enable_static=yes
>>> with_slurm=yes
>>> enable_memchecker=no
>>> enable_ipv6=no
>>> enable_mpi_f77=yes
>>> enable_mpi_f90=yes
>>> enable_mpi_cxx=yes
>>> enable_mpi_cxx_seek=yes
>>> enable_cxx_exceptions=yes
>>> enable_mca_no_build=pml-dr,pml-crcp2,crcp,filem
>>> with_io_romio_flags=--with-file-system=ufs+nfs+panfs
>>> with_threads=posix
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems