Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Edgar Gabriel (gabriel_at_[hidden])
Date: 2010-07-15 10:20:54


You could first try the algorithms in the basic module, e.g.

mpirun -np x --mca coll basic ./mytest

and see whether this makes a difference. I have sometimes observed a
(similar?) problem in the openib btl, triggered from the tuned
collective component, in cases where the OFED libraries were installed
but no HCA was found on a node. It worked with the basic component,
however.

Thanks
Edgar
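
For reference, the test variants discussed in this thread can be generated
side by side. The following is a minimal Python sketch, not part of the
original messages. The MCA flags are copied from the thread; the variant
labels, the helper names build_mpirun and as_shell, and the process count
are hypothetical choices made for illustration only.

```python
import shlex

# Flag sets taken from the messages in this thread. The dictionary keys are
# hypothetical labels chosen for this sketch.
VARIANTS = {
    "coll_basic": ["--mca", "coll", "basic"],
    "tuned_bcast_linear": [
        "--mca", "coll_tuned_use_dynamic_rules", "1",
        "--mca", "coll_tuned_bcast_algorithm", "1",
    ],
    "no_openib": ["--mca", "btl", "self,sm,tcp"],
}


def build_mpirun(variant, nprocs, program, mpirun="mpirun"):
    """Return the argv list for one test variant."""
    return [mpirun, "-np", str(nprocs)] + VARIANTS[variant] + [program]


def as_shell(argv):
    """Render an argv list as a copy-pastable shell command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Print one command line per variant so the runs can be compared.
    for name in VARIANTS:
        print(f"{name}: {as_shell(build_mpirun(name, 4, './mytest'))}")
```

Running each printed command in turn shows whether the crash follows the
collective component, the bcast algorithm, or the openib btl itself.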

On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> hi Rolf,
>
> unfortunately, i couldn't get rid of that annoying segmentation fault when selecting another bcast algorithm.
> i'm now going to replace MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and see if that helps.
>
> regards,
> éloi
>
>
> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
>> Hi Rolf,
>>
>> thanks for your input. You're right, I missed the
>> coll_tuned_use_dynamic_rules option.
>>
>> I'll check whether the segmentation fault disappears when using the
>> basic linear bcast algorithm with the command line you provided.
>>
>> Regards,
>> Eloi
>>
>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
>>> Hi Eloi:
>>> To select the different bcast algorithms, you need to add an extra mca
>>> parameter that tells the library to use dynamic selection.
>>> --mca coll_tuned_use_dynamic_rules 1
>>>
>>> One way to make sure you are typing this in correctly is to use it with
>>> ompi_info. Do the following:
>>> ompi_info --mca coll_tuned_use_dynamic_rules 1 --param coll tuned
>>>
>>> You should see lots of output with all the different algorithms that can
>>> be selected for the various collectives.
>>> Therefore, you need this:
>>>
>>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
>>>
>>> Rolf
>>>
>>> On 07/13/10 11:28, Eloi Gaudry wrote:
>>>> Hi,
>>>>
>>>> I've found that "--mca coll_tuned_bcast_algorithm 1" allowed me to
>>>> switch to the basic linear algorithm. However, whatever algorithm is
>>>> used, the segmentation fault remains.
>>>>
>>>> Could anyone give some advice on ways to diagnose the issue I'm
>>>> facing?
>>>>
>>>> Regards,
>>>> Eloi
>>>>
>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
>>>>> Hi,
>>>>>
>>>>> I'm focusing on the MPI_Bcast routine that seems to randomly segfault
>>>>> when using the openib btl. I'd like to know if there is any way to
>>>>> make OpenMPI switch to a different algorithm than the default one
>>>>> being selected for MPI_Bcast.
>>>>>
>>>>> Thanks for your help,
>>>>> Eloi
>>>>>
>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm observing a random segmentation fault during an internode
>>>>>> parallel computation involving the openib btl and OpenMPI-1.4.2 (the
>>>>>> same issue can be observed with OpenMPI-1.3.3).
>>>>>>
>>>>>> mpirun (Open MPI) 1.4.2
>>>>>> Report bugs to http://www.open-mpi.org/community/help/
>>>>>> [pbn08:02624] *** Process received signal ***
>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
>>>>>> [pbn08:02624] Failing at address: (nil)
>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
>>>>>> [pbn08:02624] *** End of error message ***
>>>>>> sh: line 1: 2624 Segmentation fault
>>>>>>
>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
>>>>>> '--parallel=domain'
>>>>>>
>>>>>> If I choose not to use the openib btl (by using --mca btl self,sm,tcp
>>>>>> on the command line, for instance), I don't encounter any problem and
>>>>>> the parallel computation runs flawlessly.
>>>>>>
>>>>>> I would like some help to:
>>>>>> - diagnose the issue I'm facing with the openib btl
>>>>>> - understand why this issue is observed only when using the openib
>>>>>> btl and not when using self,sm,tcp
>>>>>>
>>>>>> Any help would be very much appreciated.
>>>>>>
>>>>>> The output of ompi_info and the configure scripts of OpenMPI are
>>>>>> attached to this email, along with some information on the
>>>>>> infiniband drivers.
>>>>>>
>>>>>> Here is the command line used when launching a parallel computation
>>>>>> using infiniband:
>>>>>>   path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
>>>>>>     --mca btl openib,sm,self,tcp --display-map --verbose --version
>>>>>>     --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
>>>>>>
>>>>>> and the command line used if not using infiniband:
>>>>>>   path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
>>>>>>     --mca btl self,sm,tcp --display-map --verbose --version
>>>>>>     --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
>>>>>>
>>>>>> Thanks,
>>>>>> Eloi
>>>>
>
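
The naive broadcast mentioned in the thread (replacing MPI_Bcast with
MPI_Send and MPI_Recv) could look roughly like the sketch below. It is not
taken from the original messages. It assumes Python with mpi4py, although
the thread does not say which language bindings the application uses, and
the function name naive_bcast is hypothetical. The root sends the object to
every other rank in turn, and every other rank posts one matching receive.

```python
def naive_bcast(comm, obj, root=0, tag=0):
    """Linear broadcast built only from point-to-point send/recv.

    `comm` is expected to behave like an mpi4py communicator
    (Get_rank, Get_size, send, recv).
    """
    rank = comm.Get_rank()
    size = comm.Get_size()
    if rank == root:
        # Root sends one copy to each other rank, one after another.
        for dest in range(size):
            if dest != root:
                comm.send(obj, dest=dest, tag=tag)
        return obj
    # Every non-root rank receives exactly one message from the root.
    return comm.recv(source=root, tag=tag)


if __name__ == "__main__":
    try:
        from mpi4py import MPI  # assumption: mpi4py is installed
    except ImportError:
        print("mpi4py not installed; nothing to run")
    else:
        world = MPI.COMM_WORLD
        data = {"step": 1} if world.Get_rank() == 0 else None
        data = naive_bcast(world, data, root=0)
        print(f"rank {world.Get_rank()} received {data}")
```

If the crash disappears with such a replacement, that points at the
collective code path rather than at point-to-point traffic over openib; if
it persists, the collective component is less likely to be the cause.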