Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Myricom MX2G Segmentation fault on OMPI 1.6
From: Aurélien Bouteiller (bouteill_at_[hidden])
Date: 2012-06-11 18:57:01


Hi,

If some mx devices are found, the selection logic uses not only the mx BTL but also the mx MTL, so excluding the BTL alone is not enough. You can try to disable the MTL path by forcing the ob1 PML with --mca pml ob1.
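
As a rough sketch of what that could look like on the command line (untested against this MX build, so treat it as a starting point rather than a confirmed fix): the script below only prints two candidate invocations. The variable names CMD_PML and CMD_MTL are made up for illustration, and osu_bw is assumed to be on the PATH as in the quoted report.

```shell
#!/bin/sh
# Sketch only: prints the candidate command lines instead of running them.
# Assumption: osu_bw is on PATH, as in the quoted report.

# Option 1: force the BTL-based ob1 PML, so the MTL-based path is not selected.
CMD_PML="mpirun -np 2 --mca pml ob1 --mca btl sm,self osu_bw"

# Option 2: keep automatic PML selection but exclude the mx MTL and mx BTL.
CMD_MTL="mpirun -np 2 --mca mtl ^mx --mca btl ^mx osu_bw"

echo "$CMD_PML"
echo "$CMD_MTL"
```

The same MCA setting can also be supplied through the environment, for example by exporting OMPI_MCA_pml=ob1 before calling mpirun. Whether either option also avoids the crash at finalize on this system would need to be checked.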

Aurelien

On 11 June 2012 at 18:24, Yong Qin wrote:

> Hi,
>
> We are migrating to Open MPI 1.6, but since 1.6 dropped support for
> the Myricom GM driver, we have to switch to the MX driver. We have the
> Myricom MX2G 1.2.16 driver installed. However, when testing the new
> build of Open MPI on a node without an actual Myrinet device, we
> get the following segmentation fault.
>
> <---->
> [yqin_at_n0007.scs00 ~]$ mpirun -np 2 -np 2 osu_bw
> [n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> --------------------------------------------------------------------------
> [[32626,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: Myrinet/MX
> Host: n0007.scs00
>
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> [n0007:03074] *** Process received signal ***
> [n0007:03074] Signal: Segmentation fault (11)
> [n0007:03074] Signal code: Invalid permissions (2)
> [n0007:03074] Failing at address: 0x2b9112128130
> [n0007:03075] *** Process received signal ***
> [n0007:03075] Signal: Segmentation fault (11)
> [n0007:03075] Signal code: Invalid permissions (2)
> [n0007:03075] Failing at address: 0x2b041c9f1130
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [n0007.scs00:03073] 1 more process has sent help message
> help-mpi-btl-base.txt / btl:no-nics
> [n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> <---->
>
> Excluding the MX BTL does not get us any further.
>
> <---->
> [yqin_at_n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
> [n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007:03453] *** Process received signal ***
> [n0007:03453] Signal: Segmentation fault (11)
> [n0007:03453] Signal code: Address not mapped (1)
> [n0007:03453] Failing at address: 0x2b3c1fe73130
> [n0007:03454] *** Process received signal ***
> [n0007:03454] Signal: Segmentation fault (11)
> [n0007:03454] Signal code: Address not mapped (1)
> [n0007:03454] Failing at address: 0x2b2431bf0130
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> <---->
>
> If we use only designated BTLs such as SM and SELF, the binary runs but
> still gets a segmentation fault towards the end.
>
> <---->
> [yqin_at_n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
> [n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> # OSU MPI Bandwidth Test v3.3
> # Size Bandwidth (MB/s)
> 1 2.54
> 2 5.22
> 4 10.92
> 8 21.61
> 16 43.89
> 32 62.19
> 64 121.95
> 128 212.28
> 256 337.52
> 512 516.67
> 1024 701.29
> 2048 845.69
> 4096 836.45
> 8192 934.31
> 16384 1035.53
> 32768 1186.90
> 65536 1390.41
> 131072 1519.14
> 262144 1562.96
> 524288 1596.78
> 1048576 1611.48
> 2097152 1616.09
> 4194304 1620.47
> [n0007:03461] *** Process received signal ***
> [n0007:03460] *** Process received signal ***
> [n0007:03460] Signal: Segmentation fault (11)
> [n0007:03460] Signal code: Address not mapped (1)
> [n0007:03460] Failing at address: 0x2acac044d130
> [n0007:03461] Signal: Segmentation fault (11)
> [n0007:03461] Signal code: Address not mapped (1)
> [n0007:03461] Failing at address: 0x2b8bc4121130
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> <---->
>
>
> Can anybody shed some light here? It looks like Open MPI is trying to open
> the MX device no matter what. This is on a fresh build of Open MPI 1.6
> configured with "--with-mx --with-openib". We didn't have such an issue
> with the old GM BTL.
>
> Thanks,
>
> Yong Qin

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375