Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Myricom MX2G Segmentation fault on OMPI 1.6
From: Yong Qin (yong.qin_at_[hidden])
Date: 2012-06-11 19:17:35


Hi Aurelien,

Thanks for the explanation, but I'm not following it. As I mentioned,
there's no MX device on the test machine, so ompi should not find one
in the first place. I'm also not able to locate an ob1 MTL. There is
an ob1 PML, but I don't understand how that's going to affect the mx
BTL.
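
To make the PML/MTL/BTL distinction concrete, here is a minimal sketch of what could be tried next. It relies on two things that are not stated in this thread: that the ob1 PML drives BTLs while MTLs are driven by the cm PML, and that ompi_info prints one "MCA <framework>: <component>" line per component. APP, NP, and mx_components are placeholder names for this sketch, and whether either variant avoids the segmentation fault is untested.

```shell
#!/bin/sh
# Hypothetical placeholders for this sketch (not taken from the build).
APP=osu_bw
NP=2

# Keep only mx entries of the btl and mtl frameworks from ompi_info output.
# Assumes the usual "MCA <framework>: <component> (...)" line layout.
mx_components() {
    grep -E 'MCA (btl|mtl): mx( |$)'
}

# Variant A: force the ob1 PML (which drives BTLs) and exclude the mx BTL.
CMD_A="mpirun -np $NP --mca pml ob1 --mca btl ^mx $APP"
# Variant B: leave PML selection alone, exclude mx from both frameworks.
CMD_B="mpirun -np $NP --mca mtl ^mx --mca btl ^mx $APP"

# Dry run: print the command lines so they can be inspected first.
echo "$CMD_A"
echo "$CMD_B"

# Query the installation only when Open MPI is actually present.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | mx_components || echo "no mx BTL/MTL components built"
fi
```

If the filter lists an mx entry under both btl and mtl, then excluding only the BTL (as in the second run quoted below) would leave the mx MTL eligible for selection, which might explain why mx_open_endpoint is still called.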

Thanks,

Yong Qin

On Mon, Jun 11, 2012 at 3:57 PM, Aurélien Bouteiller
<bouteill_at_[hidden]> wrote:
> Hi,
>
> If some mx devices are found, the logic is not only to use the mx BTL but also to use the mx MTL. You can try to disable this with --mca mtl ob1.
>
> Aurelien
>
> On June 11, 2012, at 18:24, Yong Qin wrote:
>
>> Hi,
>>
>> We are migrating to Open MPI 1.6, but since 1.6 dropped support for
>> the Myricom GM driver, we have to switch to the MX driver. We have the
>> Myricom MX2G 1.2.16 driver installed. However, upon testing the new
>> build of Open MPI on a node without the actual Myrinet device, we are
>> getting the following segmentation fault.
>>
>> <---->
>> [yqin_at_n0007.scs00 ~]$ mpirun -np 2  -np 2 osu_bw
>> [n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> --------------------------------------------------------------------------
>> [[32626,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: Myrinet/MX
>>  Host: n0007.scs00
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> [n0007:03074] *** Process received signal ***
>> [n0007:03074] Signal: Segmentation fault (11)
>> [n0007:03074] Signal code: Invalid permissions (2)
>> [n0007:03074] Failing at address: 0x2b9112128130
>> [n0007:03075] *** Process received signal ***
>> [n0007:03075] Signal: Segmentation fault (11)
>> [n0007:03075] Signal code: Invalid permissions (2)
>> [n0007:03075] Failing at address: 0x2b041c9f1130
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> [n0007.scs00:03073] 1 more process has sent help message
>> help-mpi-btl-base.txt / btl:no-nics
>> [n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
>> to see all help / error messages
>> <---->
>>
>> Excluding the MX BTL does not get us any further.
>>
>> <---->
>> [yqin_at_n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
>> [n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007:03453] *** Process received signal ***
>> [n0007:03453] Signal: Segmentation fault (11)
>> [n0007:03453] Signal code: Address not mapped (1)
>> [n0007:03453] Failing at address: 0x2b3c1fe73130
>> [n0007:03454] *** Process received signal ***
>> [n0007:03454] Signal: Segmentation fault (11)
>> [n0007:03454] Signal code: Address not mapped (1)
>> [n0007:03454] Failing at address: 0x2b2431bf0130
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> <---->
>>
>> If we use only designated BTLs such as SM and SELF, the binary runs but
>> still gets a segmentation fault towards the end.
>>
>> <---->
>> [yqin_at_n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
>> [n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> # OSU MPI Bandwidth Test v3.3
>> # Size        Bandwidth (MB/s)
>> 1                         2.54
>> 2                         5.22
>> 4                        10.92
>> 8                        21.61
>> 16                       43.89
>> 32                       62.19
>> 64                      121.95
>> 128                     212.28
>> 256                     337.52
>> 512                     516.67
>> 1024                    701.29
>> 2048                    845.69
>> 4096                    836.45
>> 8192                    934.31
>> 16384                  1035.53
>> 32768                  1186.90
>> 65536                  1390.41
>> 131072                 1519.14
>> 262144                 1562.96
>> 524288                 1596.78
>> 1048576                1611.48
>> 2097152                1616.09
>> 4194304                1620.47
>> [n0007:03461] *** Process received signal ***
>> [n0007:03460] *** Process received signal ***
>> [n0007:03460] Signal: Segmentation fault (11)
>> [n0007:03460] Signal code: Address not mapped (1)
>> [n0007:03460] Failing at address: 0x2acac044d130
>> [n0007:03461] Signal: Segmentation fault (11)
>> [n0007:03461] Signal code: Address not mapped (1)
>> [n0007:03461] Failing at address: 0x2b8bc4121130
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> <---->
>>
>>
>> Can anybody shed some light here? It looks like ompi is trying to open
>> the MX device no matter what. This is on a fresh build of Open MPI 1.6
>> with the "--with-mx --with-openib" options. We didn't have such an issue
>> with the old GM BTL.
>>
>> Thanks,
>>
>> Yong Qin
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> * Dr. Aurélien Bouteiller
> * Researcher at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 309b
> * Knoxville, TN 37996
> * 865 974 9375
>