
Open MPI Development Mailing List Archives


Subject: [OMPI devel] system call failed during shared memory initialization with openmpi-1.8a1r31254
From: tmishima_at_[hidden]
Date: 2014-03-28 05:45:31


Hi all,

I hit the error shown below with openmpi-1.8a1r31254.
I never saw it with openmpi-1.7.5.
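
For reference, ./demos/myprog is essentially an MPI hello-world (as the
output below shows). A minimal equivalent in plain C, just to illustrate
what the job does (not the exact source):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        /* Initialize MPI and find out rank and total process count */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Hello world from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }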

The message implies it's related to the vader BTL, and I can stop
it by excluding vader from the btl list with -mca btl ^vader.
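
For example, the same job with vader excluded (and the map/binding
reports dropped) runs here without the error:

    mpirun -np 16 -host node03,node04 -map-by numa:pe=4 -bind-to core \
           -mca btl ^vader ./demos/myprog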

Could someone fix this problem?

Tetsuya

[mishima_at_manage openmpi]$ mpirun -np 16 -host node03,node04 -map-by numa:pe=4 -display-map -report-bindings -bind-to core ./demos/myprog
 Data for JOB [17579,1] offset 0

 ======================== JOB MAP ========================

 Data for node: node03 Num slots: 1 Max slots: 0 Num procs: 8
        Process OMPI jobid: [17579,1] App: 0 Process rank: 0
        Process OMPI jobid: [17579,1] App: 0 Process rank: 1
        Process OMPI jobid: [17579,1] App: 0 Process rank: 2
        Process OMPI jobid: [17579,1] App: 0 Process rank: 3
        Process OMPI jobid: [17579,1] App: 0 Process rank: 4
        Process OMPI jobid: [17579,1] App: 0 Process rank: 5
        Process OMPI jobid: [17579,1] App: 0 Process rank: 6
        Process OMPI jobid: [17579,1] App: 0 Process rank: 7

 Data for node: node04 Num slots: 1 Max slots: 0 Num procs: 8
        Process OMPI jobid: [17579,1] App: 0 Process rank: 8
        Process OMPI jobid: [17579,1] App: 0 Process rank: 9
        Process OMPI jobid: [17579,1] App: 0 Process rank: 10
        Process OMPI jobid: [17579,1] App: 0 Process rank: 11
        Process OMPI jobid: [17579,1] App: 0 Process rank: 12
        Process OMPI jobid: [17579,1] App: 0 Process rank: 13
        Process OMPI jobid: [17579,1] App: 0 Process rank: 14
        Process OMPI jobid: [17579,1] App: 0 Process rank: 15

 =============================================================
[node03.cluster:23025] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
[node03.cluster:23025] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
[node03.cluster:23025] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
[node03.cluster:23025] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
[node03.cluster:23025] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:29332] MCW rank 10 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
[node04.cluster:29332] MCW rank 11 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
[node04.cluster:29332] MCW rank 12 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
[node04.cluster:29332] MCW rank 13 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
[node04.cluster:29332] MCW rank 14 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
[node04.cluster:29332] MCW rank 15 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
[node04.cluster:29332] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:29332] MCW rank 9 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
[node03.cluster:23025] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
[node03.cluster:23025] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
[node03.cluster:23025] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
Hello world from process 0 of 16
Hello world from process 5 of 16
Hello world from process 2 of 16
Hello world from process 4 of 16
Hello world from process 1 of 16
Hello world from process 7 of 16
Hello world from process 3 of 16
Hello world from process 6 of 16
Hello world from process 10 of 16
Hello world from process 9 of 16
Hello world from process 8 of 16
Hello world from process 13 of 16
Hello world from process 12 of 16
Hello world from process 11 of 16
Hello world from process 14 of 16
Hello world from process 15 of 16
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host: node03.cluster
  System call: unlink(2) /tmp/openmpi-sessions-mishima_at_node03_0/17579/1/vader_segment.node03.0
  Error: No such file or directory (errno 2)
--------------------------------------------------------------------------