Open MPI User's Mailing List Archives

Subject: [OMPI users] Myricom MX2G Segmentation fault on OMPI 1.6
From: Yong Qin (yong.qin_at_[hidden])
Date: 2012-06-11 18:24:32


Hi,

We are migrating to Open MPI 1.6, but since 1.6 dropped support for
the Myricom GM driver, we have to switch to the MX driver. We have the
Myricom MX2G 1.2.16 driver installed. However, when testing the new
build of Open MPI on a node without an actual Myrinet device, we get
the following segmentation fault.

<---->
[yqin@n0007.scs00 ~]$ mpirun -np 2 -np 2 osu_bw
[n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
entry in /dev.)
[n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
entry in /dev.)
--------------------------------------------------------------------------
[[32626,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: Myrinet/MX
  Host: n0007.scs00

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
[n0007:03074] *** Process received signal ***
[n0007:03074] Signal: Segmentation fault (11)
[n0007:03074] Signal code: Invalid permissions (2)
[n0007:03074] Failing at address: 0x2b9112128130
[n0007:03075] *** Process received signal ***
[n0007:03075] Signal: Segmentation fault (11)
[n0007:03075] Signal code: Invalid permissions (2)
[n0007:03075] Failing at address: 0x2b041c9f1130
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[n0007.scs00:03073] 1 more process has sent help message
help-mpi-btl-base.txt / btl:no-nics
[n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages
<---->

Excluding the MX BTL does not get us any further.

<---->
[yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
[n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
entry in /dev.)
[n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
entry in /dev.)
[n0007:03453] *** Process received signal ***
[n0007:03453] Signal: Segmentation fault (11)
[n0007:03453] Signal code: Address not mapped (1)
[n0007:03453] Failing at address: 0x2b3c1fe73130
[n0007:03454] *** Process received signal ***
[n0007:03454] Signal: Segmentation fault (11)
[n0007:03454] Signal code: Address not mapped (1)
[n0007:03454] Failing at address: 0x2b2431bf0130
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
<---->

If we restrict the run to designated BTLs such as SM and SELF, the
binary runs but still hits a segmentation fault at the end.

<---->
[yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
[n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
entry in /dev.)
[n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
entry in /dev.)
# OSU MPI Bandwidth Test v3.3
# Size Bandwidth (MB/s)
1 2.54
2 5.22
4 10.92
8 21.61
16 43.89
32 62.19
64 121.95
128 212.28
256 337.52
512 516.67
1024 701.29
2048 845.69
4096 836.45
8192 934.31
16384 1035.53
32768 1186.90
65536 1390.41
131072 1519.14
262144 1562.96
524288 1596.78
1048576 1611.48
2097152 1616.09
4194304 1620.47
[n0007:03461] *** Process received signal ***
[n0007:03460] *** Process received signal ***
[n0007:03460] Signal: Segmentation fault (11)
[n0007:03460] Signal code: Address not mapped (1)
[n0007:03460] Failing at address: 0x2acac044d130
[n0007:03461] Signal: Segmentation fault (11)
[n0007:03461] Signal code: Address not mapped (1)
[n0007:03461] Failing at address: 0x2b8bc4121130
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
<---->
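
One detail in the traces above: all six failing addresses end in the
same digits. The sketch below computes each address's offset within its
page. It assumes 4 KiB pages (typical on x86-64 Linux, not confirmed for
this node), and page_offsets is a hypothetical helper name, not an
existing tool.

```shell
# Failing addresses copied verbatim from the three runs above.
addrs="0x2b9112128130 0x2b041c9f1130 0x2b3c1fe73130 0x2b2431bf0130 0x2acac044d130 0x2b8bc4121130"

# Hypothetical helper: print the offset of each address within its page.
# Assumption: 4 KiB pages, so the offset is the low 12 bits.
page_offsets() {
  for a in "$@"; do
    printf '0x%03x\n' $(( $a & 0xfff ))
  done
}

# Collapse identical offsets so any variation stands out.
page_offsets $addrs | sort -u
```

This prints a single line, 0x130, so every crash faults at the same
offset within a page. That would be consistent with one faulting
location mapped at a different base address in each process, although
it does not prove it.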

Can anybody shed some light here? It looks like Open MPI tries to open
the MX device no matter which BTLs are selected. This is a fresh build
of Open MPI 1.6 configured with the "--with-mx --with-openib" options.
We did not have this issue with the old GM BTL.
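
A possible next check, sketched under an assumption that has not been
verified on this build: --with-mx can build MX components in more than
one framework (an MTL as well as the BTL), so "-mca btl ^mx" alone may
leave another MX component free to call mx_open_endpoint. The helper
mx_exclude_cmd below is hypothetical and only prints a command line.
ompi_info and the "-mca <framework> ^<component>" syntax are standard
Open MPI usage.

```shell
# Hypothetical helper (not part of Open MPI): print an mpirun command line
# that excludes the "mx" component from each framework named in $1.
mx_exclude_cmd() {
  frameworks=$1; shift
  cmd="mpirun -np 2"
  for fw in $frameworks; do
    cmd="$cmd -mca $fw ^mx"
  done
  echo "$cmd $*"
}

# List the MX components this build contains, if ompi_info is on PATH.
if command -v ompi_info >/dev/null 2>&1; then
  ompi_info | grep -i mx || true
fi

# Print the command that excludes MX from both the BTL and MTL frameworks.
mx_exclude_cmd "btl mtl" osu_bw
```

The last line prints "mpirun -np 2 -mca btl ^mx -mca mtl ^mx osu_bw".
If that command still segfaults, the MX MTL is presumably not the
culprit.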

Thanks,

Yong Qin