Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Eloi Gaudry (eg_at_[hidden])
Date: 2010-07-13 11:28:08


Hi,

I've found that "--mca coll_tuned_bcast_algorithm 1" allows switching to the basic linear algorithm.
However, whichever algorithm is used, the segmentation fault remains.
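
(If I read the tuned collective component correctly, forcing a given algorithm also requires enabling dynamic rules, so the flags I pass are along these lines; the process count and trailing arguments are placeholders:

path_to_openmpi/bin/mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1 -np $NPROCESS --hostfile host.list [...]
)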

Could anyone give some advice on how to diagnose the issue I'm facing?

Regards,
Eloi

On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> Hi,
>
> I'm focusing on the MPI_Bcast routine that seems to randomly segfault when
> using the openib btl. I'd like to know if there is any way to make OpenMPI
> switch to a different algorithm than the one selected by default for
> MPI_Bcast.
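>
> (For reference, I believe the list of available broadcast algorithms and
> their ids can be printed with something like:
>
> path_to_openmpi/bin/ompi_info --mca coll_tuned_use_dynamic_rules 1 --param coll tuned
>
> though I don't know which algorithm is selected by default for our
> message sizes.)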
>
> Thanks for your help,
> Eloi
>
> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> > Hi,
> >
> > I'm observing a random segmentation fault during an internode parallel
> > computation involving the openib btl and OpenMPI-1.4.2 (the same issue
> > can be observed with OpenMPI-1.3.3).
> >
> > mpirun (Open MPI) 1.4.2
> > Report bugs to http://www.open-mpi.org/community/help/
> > [pbn08:02624] *** Process received signal ***
> > [pbn08:02624] Signal: Segmentation fault (11)
> > [pbn08:02624] Signal code: Address not mapped (1)
> > [pbn08:02624] Failing at address: (nil)
> > [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> > [pbn08:02624] *** End of error message ***
> > sh: line 1: 2624 Segmentation fault
> >
> > /share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/bin/actranpy_mp
> > '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> > '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> > '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200'
> > '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
> >
> > If I choose not to use the openib btl (by using --mca btl self,sm,tcp on
> > the command line, for instance), I don't encounter any problem and the
> > parallel computation runs flawlessly.
> >
> > I would like some help to be able to:
> > - diagnose the issue I'm facing with the openib btl (see also the note
> > after this list)
> > - understand why this issue is observed only when using the openib btl
> > and not when using self,sm,tcp
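> >
> > (If it helps with the diagnosis, I can enable core dumps on the compute
> > nodes before launching, e.g. by running "ulimit -c unlimited" in the
> > shell that starts mpirun, and post a gdb backtrace from the resulting
> > core file ("gdb <executable> <corefile>", then "bt").)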
> >
> > Any help would be very much appreciated.
> >
> > The output of ompi_info and of the OpenMPI configure script is attached
> > to this email, along with some information on the InfiniBand drivers.
> >
> > Here is the command line used when launching a parallel computation
> > using infiniband:
> >
> > path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> > btl openib,sm,self,tcp --display-map --verbose --version --mca
> > mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >
> > and the command line used if not using infiniband:
> >
> > path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> > btl self,sm,tcp --display-map --verbose --version --mca
> > mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >
> > Thanks,
> > Eloi
>