Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Eloi Gaudry (eg_at_[hidden])
Date: 2010-07-15 11:18:36


hi edgar,

thanks for the tips, i'm going to try this option as well. the segmentation fault i'm observing has indeed always happened during a collective communication...
does this option basically switch all collective communications to basic mode ?

sorry for my ignorance, but what's an NCA ?

thanks,
éloi
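
As a rough sketch of how the runs suggested in this thread could be lined up side by side, the small dry-run script below only prints the mpirun command lines instead of launching anything. The MCA options are the ones quoted in the messages below; `NP`, `APP`, the variant names and the two helper functions are placeholders made up for illustration and are not part of Open MPI.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the mpirun variants discussed
# in this thread. NP and APP are placeholder values (assumptions).
NP="${NP:-4}"
APP="${APP:-./mytest}"

# Map a hypothetical variant name to the MCA options quoted in the thread.
variant_flags() {
    case "$1" in
        default) echo "" ;;
        basic)   echo "--mca coll basic" ;;
        linear)  echo "--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1" ;;
        noib)    echo "--mca btl self,sm,tcp" ;;
        *)       echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

# Assemble the full command line for one variant.
print_cmd() {
    flags=$(variant_flags "$1") || return 1
    # ${flags:+ $flags} adds a leading space only when flags is non-empty
    echo "mpirun -np $NP${flags:+ $flags} $APP"
}

for v in default basic linear noib; do
    print_cmd "$v"
done
```

Printing the commands first makes it easy to check the option spelling before submitting a job. Site-specific options such as --hostfile are left out on purpose and would still have to be added by hand.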

On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> you could try first to use the algorithms in the basic module, e.g.
>
> mpirun -np x --mca coll basic ./mytest
>
> and see whether this makes a difference. I used to observe sometimes a
> (similar ?) problem in the openib btl triggered from the tuned
> collective component, in cases where the ofed libraries were installed
> but no NCA was found on a node. It used to work however with the basic
> component.
>
> Thanks
> Edgar
>
> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> > hi Rolf,
> >
> > unfortunately, i couldn't get rid of that annoying segmentation fault
> > when selecting another bcast algorithm. i'm now going to replace
> > MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and
> > see if that helps.
> >
> > regards,
> > éloi
> >
> > On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> >> Hi Rolf,
> >>
> >> thanks for your input. You're right, I missed the
> >> coll_tuned_use_dynamic_rules option.
> >>
> >> I'll check whether the segmentation fault disappears when using the basic
> >> bcast linear algorithm with the command line you provided.
> >>
> >> Regards,
> >> Eloi
> >>
> >> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> >>> Hi Eloi:
> >>> To select the different bcast algorithms, you need to add an extra mca
> >>> parameter that tells the library to use dynamic selection.
> >>> --mca coll_tuned_use_dynamic_rules 1
> >>>
> >>> One way to make sure you are typing this in correctly is to use it with
> >>> ompi_info. Do the following:
> >>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> >>>
> >>> You should see lots of output with all the different algorithms that
> >>> can be selected for the various collectives.
> >>> Therefore, you need this:
> >>>
> >>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> >>>
> >>> Rolf
> >>>
> >>> On 07/13/10 11:28, Eloi Gaudry wrote:
> >>>> Hi,
> >>>>
> >>>> I've found that "--mca coll_tuned_bcast_algorithm 1" allows switching
> >>>> to the basic linear algorithm. Anyway, whatever the algorithm used, the
> >>>> segmentation fault remains.
> >>>>
> >>>> Could anyone give some advice on ways to diagnose the issue I'm
> >>>> facing ?
> >>>>
> >>>> Regards,
> >>>> Eloi
> >>>>
> >>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm focusing on the MPI_Bcast routine that seems to randomly segfault
> >>>>> when using the openib btl. I'd like to know if there is any way to
> >>>>> make OpenMPI switch to a different algorithm than the default one
> >>>>> being selected for MPI_Bcast.
> >>>>>
> >>>>> Thanks for your help,
> >>>>> Eloi
> >>>>>
> >>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm observing a random segmentation fault during an internode
> >>>>>> parallel computation involving the openib btl and OpenMPI-1.4.2 (the
> >>>>>> same issue can be observed with OpenMPI-1.3.3).
> >>>>>>
> >>>>>> mpirun (Open MPI) 1.4.2
> >>>>>> Report bugs to http://www.open-mpi.org/community/help/
> >>>>>> [pbn08:02624] *** Process received signal ***
> >>>>>> [pbn08:02624] Signal: Segmentation fault (11)
> >>>>>> [pbn08:02624] Signal code: Address not mapped (1)
> >>>>>> [pbn08:02624] Failing at address: (nil)
> >>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> >>>>>> [pbn08:02624] *** End of error message ***
> >>>>>> sh: line 1: 2624 Segmentation fault
> >>>>>>
> >>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
> >>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> >>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> >>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> >>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> >>>>>> '--parallel=domain'
> >>>>>>
> >>>>>> If I choose not to use the openib btl (by using --mca btl
> >>>>>> self,sm,tcp on the command line, for instance), I don't encounter
> >>>>>> any problem and the parallel computation runs flawlessly.
> >>>>>>
> >>>>>> I would like to get some help to:
> >>>>>> - diagnose the issue I'm facing with the openib btl
> >>>>>> - understand why this issue is observed only when using the openib
> >>>>>> btl and not when using self,sm,tcp
> >>>>>>
> >>>>>> Any help would be very much appreciated.
> >>>>>>
> >>>>>> The outputs of ompi_info and the configure scripts of OpenMPI are
> >>>>>> attached to this email, along with some information on the
> >>>>>> infiniband drivers.
> >>>>>>
> >>>>>> Here is the command line used when launching a parallel computation
> >>>>>> using infiniband:
> >>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> >>>>>> --mca btl openib,sm,self,tcp --display-map --verbose --version
> >>>>>> --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>>>>>
> >>>>>> and the command line used if not using infiniband:
> >>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> >>>>>> --mca btl self,sm,tcp --display-map --verbose --version
> >>>>>> --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Eloi
> >>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> users_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959