Open MPI Development Mailing List Archives


Subject: [OMPI devel] 1.3 test failures
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-30 19:01:29


Hi folks

We aren't running a full MTT here (which is why I'm reporting these
results to the list instead of into the MTT database), but we are
running a subset of tests on the 1.3 beta and hitting a consistent set
of errors in five tests. For reference, all of these tests pass on
1.2.6 but fail in an identical way on 1.2.8, so it appears that
something systematic may have entered the code base and made its way
into the 1.2 series as well.

The tests are:
MPI_Pack_user_type
MPI_Type_hindexed_blklen
MPI_Type_vector_stride
MPI_Cart_get_c
MPI_Graph_neighbors_c
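
For context, the first three of these exercise user-defined datatypes
(packing a user type, hindexed block lengths, strided vectors), and their
"errors in buffer" output comes from verifying the received data element
by element. A rough sketch of that kind of pattern, purely for
illustration (this is not the MPITEST source; the block count, stride,
and use of MPI_Bcast are my own assumptions), would be:

#include <mpi.h>
#include <stdio.h>

#define NBLK   16   /* number of blocks in the vector type (made up for this sketch) */
#define STRIDE  4   /* stride between block starts, in ints (made up) */

int main(int argc, char **argv)
{
    int rank, i, errs = 0;
    int buf[NBLK * STRIDE];
    MPI_Datatype vec;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one int per block, STRIDE ints between block starts */
    MPI_Type_vector(NBLK, 1, STRIDE, MPI_INT, &vec);
    MPI_Type_commit(&vec);

    /* root fills the strided slots; everything else is pre-poisoned */
    for (i = 0; i < NBLK * STRIDE; i++)
        buf[i] = (rank == 0 && i % STRIDE == 0) ? i : -1;

    MPI_Bcast(buf, 1, vec, 0, MPI_COMM_WORLD);

    /* verify only the elements the vector type actually touches */
    for (i = 0; i < NBLK * STRIDE; i += STRIDE) {
        if (buf[i] != i) {
            errs++;
            printf("rank %d: buf[%d] = %d, expected %d\n", rank, i, buf[i], i);
        }
    }

    MPI_Type_free(&vec);
    MPI_Finalize();
    return errs ? 1 : 0;
}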

The tests run under SLURM on RHEL5, on nodes with 16 Opteron cores each
plus IB. The results below are from 40 nodes at 16 ppn.

Any thoughts would be appreciated. In the meantime, we are trying
different ppn counts to see if that has an impact.

Thanks
Ralph

Here is what we see:

>> MPITEST error (585): 1 errors in buffer (17, 5) len 1024 commsize 214 commtype -14 extent 64 root 194
>> MPITEST error (591): Received buffer overflow, Expected buffer[65536]: -197, Actual buffer[65536]: 59
>> MPITEST error (591): 1 errors in buffer (17, 5) len 1024 commsize 214 commtype -14 extent 64 root 196
>> MPITEST_results: MPI_Pack_user_type 60480 tests FAILED (of 21076704)
>>
>> MPITEST error (597): Received buffer overflow, Expected buffer[16384]: -199, Actual buffer[16384]: 57
>> MPITEST error (597): 1 errors in buffer (17, 5) len 16 commsize 214 commtype -14 extent 64 root 198
>> MPITEST error (585): Received buffer overflow, Expected buffer[16384]: -195, Actual buffer[16384]: 61
>> MPITEST error (585): 1 errors in buffer (17, 5) len 16 commsize 214 commtype -14 extent 64 root 194
>> MPITEST_results: MPI_Type_hindexed_blklen 60480 tests FAILED (of 21076704)
>>
>>
>> MPITEST error (597): Received buffer overflow, Expected buffer[65536]: -199, Actual buffer[65536]: 57
>> MPITEST error (597): 1 errors in buffer (17, 5) len 512 commsize 214 commtype -14 extent 64 root 198
>> MPITEST error (615): Received buffer overflow, Expected buffer[65536]: -205, Actual buffer[65536]: 51
>> MPITEST error (615): 1 errors in buffer (17, 5) len 512 commsize 214 commtype -14 extent 64 root 204
>> MPITEST_results: MPI_Type_vector_stride 60480 tests FAILED (of 21076704)
>>
>> [lob097:32556] *** Process received signal ***
>> mpirun noticed that job rank 0 with PID 32556 on node lob097 exited on signal 11 (Segmentation fault).
>> 639 additional processes aborted (not shown)
>> make[1]: *** [MPI_Cart_get_c] Error 139
>>
>>
>> MPITEST fatal error (568): MPI_ERR_COMM: invalid communicator
>> MPITEST fatal error (572): MPI_ERR_COMM: invalid communicator
>> MPITEST fatal error (574): MPI_ERR_COMM: invalid communicator
>> mpirun noticed that job rank 37 with PID 32074 on node lob099 exited on signal 1 (Hangup).
>> 18 additional processes aborted (not shown)
>> make[1]: *** [MPI_Graph_neighbors_c] Error 1
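
The last two tests involve communicator topologies rather than datatypes.
Again purely for illustration (a sketch under my own assumptions about the
decomposition, not the MPITEST source), the calls MPI_Cart_get_c revolves
around look roughly like:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size, rank;
    int dims[2] = {0, 0}, periods[2] = {1, 0}, coords[2];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* let MPI pick a 2-D factorization of the job size */
    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    if (cart != MPI_COMM_NULL) {
        /* read back the layout and this rank's coordinates */
        MPI_Cart_get(cart, 2, dims, periods, coords);
        if (rank == 0)
            printf("cart is %d x %d, rank 0 at (%d,%d)\n",
                   dims[0], dims[1], coords[0], coords[1]);
        MPI_Comm_free(&cart);
    }

    MPI_Finalize();
    return 0;
}

MPI_Graph_neighbors_c does the analogous thing with the graph-topology
calls (MPI_Graph_create / MPI_Graph_neighbors).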

Here is how the different versions are built:

1.2.6 and 1.2.8
>> oob_tcp_connect_timeout=600
>> pml_ob1_use_early_completion=0
>> mca_component_show_load_errors=0
>> btl_openib_ib_retry_count=7
>> btl_openib_ib_timeout=31
>> mpi_keep_peer_hostnames=1
>>
>>
>> RPMBUILD parameters
>> setenv CPPFLAGS -I/opt/panfs/include
>> setenv CFLAGS -I/opt/panfs/include
>>
>> rpmbuild -bb ./SPECS/loboopenmpi128.spec \
>> --with gcc \
>> --with root=/opt/OpenMPI \
>> --with shared \
>> --with openib \
>> --with slurm \
>> --without pty_support \
>> --without dlopen \
>> --with io_romio_flags=--with-file-system=ufs+nfs+panfs
>>

1.3beta

>> # Basic behavior to smooth startup
>> mca_component_show_load_errors = 0
>> orte_abort_timeout = 10
>> opal_set_max_sys_limits = 1
>>
>> ## Protect the shared file systems
>> orte_no_session_dirs = /panfs,/scratch,/users,/usr/projects
>> orte_tmpdir_base = /tmp
>>
>> ## Require an allocation to run - protects the frontend
>> ## from inadvertent job executions
>> orte_allocation_required = 1
>>
>> ## Add the interface for out-of-band communication
>> ## and set it up
>> oob_tcp_if_include=ib0
>> oob_tcp_peer_retries = 10
>> oob_tcp_disable_family = IPv6
>> oob_tcp_listen_mode = listen_thread
>> oob_tcp_sndbuf = 32768
>> oob_tcp_rcvbuf = 32768
>>
>> ## Define the MPI interconnects
>> btl = sm,openib,self
>>
>> ## Setup OpenIB
>> btl_openib_want_fork_support = 0
>> btl_openib_cpc_include = oob
>> #btl_openib_receive_queues = P,128,256,64,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32
>>
>> ## Enable cpu affinity
>> mpi_paffinity_alone = 1
>>
>> ## Setup MPI options
>> mpi_show_handle_leaks = 0
>> mpi_warn_on_fork = 1

>> enable_dlopen=no
>> with_openib=/opt/ofed
>> with_openib_libdir=/opt/ofed/lib64
>> enable_mem_debug=no
>> enable_mem_profile=no
>> enable_debug_symbols=no
>> enable_binaries=yes
>> with_devel_headers=yes
>> enable_heterogeneous=yes
>> enable_debug=no
>> enable_shared=yes
>> enable_static=yes
>> with_slurm=yes
>> enable_memchecker=no
>> enable_ipv6=no
>> enable_mpi_f77=yes
>> enable_mpi_f90=yes
>> enable_mpi_cxx=yes
>> enable_mpi_cxx_seek=yes
>> enable_cxx_exceptions=yes
>> enable_mca_no_build=pml-dr,pml-crcp2,crcp,filem
>> with_io_romio_flags=--with-file-system=ufs+nfs+panfs
>> with_threads=posix