On Wednesday, May 07, 2014 5:23 AM, devel [devel-bounces_at_[hidden]] on behalf of Gilles Gouaillardet [gilles.gouaillardet_at_[hidden]] wrote:
> To: Open MPI Developers
> Subject: [OMPI devel] scif btl side effects
> Dear OpenMPI Folks,
> i noticed some crashes when running OpenMPI (both latest v1.8 and trunk
> from svn) on a single linux system where a MIC is available.
> /* strictly speaking, MIC hardware is not needed: libscif.so, mic kernel
> module and accessible /dev/mic/* are enough */
> the attached test_scif program can be used to reproduce this issue.
> /* this is an oversimplified version of collective/bcast_struct.c from
> the ibm test suite,
> it is currently failing on the bend-rsh cluster at intel */
> this program fails silently
> (MPI_Recv receives truncated data without issuing any warning)
> i investigated and, basically, here is what i found:
> MPI_Send will split the message into two fragments. the first fragment
> is sent via the vader btl
> and the second fragment is sent with the scif btl.
> the program succeeds if the scif btl is disabled (mpirun --mca btl
> interestingly, i found that
> mpirun -host localhost -np 2 --mca btl scif,self ./test_scif
> does produce correct results with ompi v1.8 r31309 (a crash might happen
> in MPI_Finalize)
> and it produces incorrect results with ompi v1.8 r31671 and trunk (r31667)
> imho :
> a) the scif btl could/should be automatically disabled if no MIC is
> detected on a host
> b) the scif btl could/should not be used to communicate between two
> cores of the host
Also, no. SCIF can give excellent performance for host-to-host communication because it provides an xpmem-like interface for sharing pages. I have not seen any issues when running just the scif btl on its own, so I suspect the issue is elsewhere.
> (e.g. it should be used *only* when at least one peer is on the MIC)
> c) that being said, that should work so there is a bug
> d) there is a regression in v1.8 and a bug that might have been always here
This is probably not a regression. The SCIF btl has been part of the 1.7 series for some time. The nightly MTT runs are probably missing one of the cases that cause this problem. Hopefully we can get this fixed before 1.8.2.
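
For anyone trying to reproduce this without the attachment, here is a minimal sketch of how a test could detect the silent truncation described above. This is not the attached test_scif program. The helper names fill_pattern and first_mismatch are hypothetical, and the pattern itself is an arbitrary choice. The idea is only that the sender fills the buffer with a position-dependent pattern and the receiver, after poisoning its buffer, looks for the first byte that does not match.

```c
#include <stddef.h>

/* Hypothetical helper: fill buf with a position-dependent pattern so that
 * missing or misplaced bytes are detectable on the receiving side. */
static void fill_pattern(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        buf[i] = (unsigned char)((i * 31u + 7u) & 0xffu);
    }
}

/* Hypothetical helper: return the offset of the first byte that does not
 * match the pattern, or -1 if the whole buffer is intact. */
static long first_mismatch(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != (unsigned char)((i * 31u + 7u) & 0xffu)) {
            return (long)i;
        }
    }
    return -1;
}
```

In an MPI test, the sender would call fill_pattern before MPI_Send, and the receiver would fill its buffer with a poison value, call MPI_Recv, and then call first_mismatch. If the second fragment never arrives, the returned offset should land roughly at the fragment boundary. Checking MPI_Get_count on the returned status is a complementary test, since it reports how many elements MPI believes it delivered.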