Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Eloi Gaudry (eg_at_[hidden])
Date: 2010-07-16 10:01:02


Hi Edgar,

The only difference I could observe was that the segmentation fault sometimes appeared later during the parallel computation.

I'm running out of ideas here. I wish I could use "--mca coll tuned" with "--mca btl self,sm,tcp" so that I could check that the issue is not somehow limited to the tuned collective routines.

Thanks,
Eloi

On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > hi edgar,
> >
> > thanks for the tips, I'm going to try this option as well. the
> > segmentation fault i'm observing has indeed always happened during a
> > collective communication... it basically switches all collective
> > communication to basic mode, right ?
> >
> > sorry for my ignorance, but what's a NCA ?
>
> sorry, I meant to type HCA (InfiniBand networking card)
>
> Thanks
> Edgar
>
> > thanks,
> > éloi
> >
> > On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> >> you could try first to use the algorithms in the basic module, e.g.
> >>
> >> mpirun -np x --mca coll basic ./mytest
> >>
> >> and see whether this makes a difference. I sometimes used to observe a
> >> (similar ?) problem in the openib btl, triggered from the tuned
> >> collective component, in cases where the ofed libraries were installed
> >> but no NCA was found on a node. It used to work with the basic
> >> component, however.
> >>
> >> Thanks
> >> Edgar
> >>
> >> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> >>> hi Rolf,
> >>>
> >>> unfortunately, i couldn't get rid of that annoying segmentation fault
> >>> when selecting another bcast algorithm. i'm now going to replace
> >>> MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and
> >>> see if that helps.
> >>>
> >>> regards,
> >>> éloi
> >>>
> >>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> >>>> Hi Rolf,
> >>>>
> >>>> thanks for your input. You're right, I missed the
> >>>> coll_tuned_use_dynamic_rules option.
> >>>>
> >>>> I'll check whether the segmentation fault disappears when using the
> >>>> basic linear bcast algorithm with the proper command line you provided.
> >>>>
> >>>> Regards,
> >>>> Eloi
> >>>>
> >>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> >>>>> Hi Eloi:
> >>>>> To select the different bcast algorithms, you need to add an extra
> >>>>> mca parameter that tells the library to use dynamic selection.
> >>>>> --mca coll_tuned_use_dynamic_rules 1
> >>>>>
> >>>>> One way to make sure you are typing this in correctly is to use it
> >>>>> with ompi_info. Do the following:
> >>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> >>>>>
> >>>>> You should see lots of output with all the different algorithms that
> >>>>> can be selected for the various collectives.
> >>>>> Therefore, you need this:
> >>>>>
> >>>>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> >>>>>
> >>>>> Rolf
> >>>>>
> >>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've found that "--mca coll_tuned_bcast_algorithm 1" allows me to
> >>>>>> switch to the basic linear algorithm. However, whatever the
> >>>>>> algorithm used, the segmentation fault remains.
> >>>>>>
> >>>>>> Could anyone give some advice on ways to diagnose the issue I'm
> >>>>>> facing ?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Eloi
> >>>>>>
> >>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm focusing on the MPI_Bcast routine that seems to randomly
> >>>>>>> segfault when using the openib btl. I'd like to know if there is
> >>>>>>> any way to make OpenMPI switch to a different algorithm than the
> >>>>>>> default one being selected for MPI_Bcast.
> >>>>>>>
> >>>>>>> Thanks for your help,
> >>>>>>> Eloi
> >>>>>>>
> >>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I'm observing a random segmentation fault during an internode
> >>>>>>>> parallel computation involving the openib btl and OpenMPI-1.4.2
> >>>>>>>> (the same issue can be observed with OpenMPI-1.3.3).
> >>>>>>>>
> >>>>>>>> mpirun (Open MPI) 1.4.2
> >>>>>>>> Report bugs to http://www.open-mpi.org/community/help/
> >>>>>>>> [pbn08:02624] *** Process received signal ***
> >>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
> >>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
> >>>>>>>> [pbn08:02624] Failing at address: (nil)
> >>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> >>>>>>>> [pbn08:02624] *** End of error message ***
> >>>>>>>> sh: line 1: 2624 Segmentation fault
> >>>>>>>>
> >>>>>>>> /share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/bin/actranpy_mp
> >>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> >>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> >>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> >>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> >>>>>>>> '--parallel=domain'
> >>>>>>>>
> >>>>>>>> If I choose not to use the openib btl (by using --mca btl
> >>>>>>>> self,sm,tcp on the command line, for instance), I don't encounter
> >>>>>>>> any problem and the parallel computation runs flawlessly.
> >>>>>>>>
> >>>>>>>> I would like some help to:
> >>>>>>>> - diagnose the issue I'm facing with the openib btl
> >>>>>>>> - understand why this issue is observed only when using the openib
> >>>>>>>> btl and not when using self,sm,tcp
> >>>>>>>>
> >>>>>>>> Any help would be very much appreciated.
> >>>>>>>>
> >>>>>>>> The outputs of ompi_info and the configure scripts of OpenMPI are
> >>>>>>>> attached to this email, along with some information on the
> >>>>>>>> infiniband drivers.
> >>>>>>>>
> >>>>>>>> Here is the command line used when launching a parallel
> >>>>>>>> computation using infiniband:
> >>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> >>>>>>>> --mca btl openib,sm,self,tcp --display-map --verbose --version --mca
> >>>>>>>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>>>>>>>
> >>>>>>>> and the command line used if not using infiniband:
> >>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> >>>>>>>> --mca btl self,sm,tcp --display-map --verbose --version --mca
> >>>>>>>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Eloi
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> users mailing list
> >>>>>> users_at_[hidden]
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959
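
The suggestion in the thread (enable dynamic rules, then pick a bcast algorithm while keeping the openib btl) can be sketched as a small dry-run script. This is only an illustration and was not posted by anyone in the thread. It prints the mpirun command lines instead of running them. The MPIRUN, NPROCESS, HOSTFILE and APP values are placeholders, and the script name used below is hypothetical. Only algorithm id 1 (basic linear) is confirmed in the thread; any other id is an assumption to check against the ompi_info output mentioned above.

```shell
#!/bin/sh
# Dry run: print (do not execute) one mpirun command line per bcast
# algorithm id given on the command line.
# All four values below are placeholders, not settings from the thread.
MPIRUN=${MPIRUN:-mpirun}
NPROCESS=${NPROCESS:-4}
HOSTFILE=${HOSTFILE:-host.list}
APP=${APP:-./mytest}

bcast_cmd() {
    # $1: value passed to coll_tuned_bcast_algorithm.
    # coll_tuned_use_dynamic_rules 1 is required for that value to be honoured,
    # as pointed out in the thread.
    printf '%s -np %s --hostfile %s --mca btl openib,sm,self,tcp --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm %s %s\n' \
        "$MPIRUN" "$NPROCESS" "$HOSTFILE" "$1" "$APP"
}

# Print one candidate command line per requested algorithm id.
for alg in "$@"; do
    bcast_cmd "$alg"
done
```

Saved as, say, try_bcast.sh, running "sh try_bcast.sh 1" prints a single command line that can be reviewed and then pasted into a job script. Printing rather than executing keeps the sketch safe to try on a login node.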