Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-11-13 01:13:11


I can try, but I have no way of testing Torque any more, so all I can do is a code review. If you can build with --enable-debug and add -mca plm_base_verbose 5 to your command line, I'd appreciate seeing the output.

On Nov 12, 2013, at 9:58 PM, tmishima_at_[hidden] wrote:

>
>
> Hi Ralph,
>
> Thank you for your quick response.
>
> I'd like to report one more regression in the Torque support of
> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> has problems", which I reported a few days ago.
>
> The script below does not work with openmpi-1.7.4a1r29646,
> although it worked with openmpi-1.7.3 as I told you before.
>
> #!/bin/sh
> #PBS -l nodes=node08:ppn=8
> export OMP_NUM_THREADS=1
> cd $PBS_O_WORKDIR
> cp $PBS_NODEFILE pbs_hosts
> NPROCS=`wc -l < pbs_hosts`
> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>
> If I drop "-machinefile pbs_hosts -np ${NPROCS} ", then it works fine.
> Since this happens without requesting the lama mapper, I guess the
> problem is not in lama itself. Anyway, please look into this issue as well.
>
> Regards,
> Tetsuya Mishima
>
>> Done - thanks!
>>
>> On Nov 12, 2013, at 7:35 PM, tmishima_at_[hidden] wrote:
>>
>>>
>>>
>>> Dear openmpi developers,
>>>
>>> I got a segmentation fault in a trial use of openmpi-1.7.4a1r29646 built by
>>> PGI13.10 as shown below:
>>>
>>> [mishima_at_manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>> [manage:23082] *** Process received signal ***
>>> [manage:23082] Signal: Segmentation fault (11)
>>> [manage:23082] Signal code: Address not mapped (1)
>>> [manage:23082] Failing at address: 0x34
>>> [manage:23082] *** End of error message ***
>>> Segmentation fault (core dumped)
>>>
>>> [mishima_at_manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>> ...
>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0 0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>> 631 peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>> (gdb) where
>>> #0 0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>> #1 0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>> #2 0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>> #3 0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
>>> #4 0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>> #5 0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>> #6 0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>> (gdb) quit
>>>
>>>
>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c appears to be unnecessary, and it
>>> causes the segfault.
>>>
>>> 624 /* lookup the corresponding process */
>>> 625 peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>> 626 if (NULL == peer) {
>>> 627 ui64 = (uint64_t*)(&peer->name);
>>> 628 opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
>>> 629 "%s mca_oob_tcp_recv_connect: connection from new peer",
>>> 630 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>> 631 peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>> 632 peer->mod = mod;
>>> 633 peer->name = hdr->origin;
>>> 634 peer->state = MCA_OOB_TCP_ACCEPTING;
>>> 635 ui64 = (uint64_t*)(&peer->name);
>>> 636 if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>> 637 OBJ_RELEASE(peer);
>>> 638 return;
>>> 639 }
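
The excerpt above uses peer inside the NULL == peer branch before it is allocated. As a rough illustration of the lookup-or-create order the report is asking for, here is a minimal standalone sketch. It is not the Open MPI code: proc_name_t, peer_t, module_t, name_key, peer_lookup and peer_lookup_or_create are hypothetical stand-ins, and a linked list is assumed in place of the real hash table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the ORTE types; not the real definitions. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} proc_name_t;

typedef struct peer {
    proc_name_t name;
    int state;
    struct peer *next;
} peer_t;

typedef struct {
    peer_t *peers; /* linked list assumed here in place of a hash table */
} module_t;

enum { PEER_ACCEPTING = 1 };

/* Pack a process name into a 64-bit key by copying its bytes. */
uint64_t name_key(const proc_name_t *name) {
    uint64_t key;
    assert(sizeof *name == sizeof key);
    memcpy(&key, name, sizeof key);
    return key;
}

/* Return the known peer with this name, or NULL if there is none. */
peer_t *peer_lookup(module_t *mod, const proc_name_t *name) {
    uint64_t key = name_key(name);
    for (peer_t *p = mod->peers; p != NULL; p = p->next) {
        if (name_key(&p->name) == key) {
            return p;
        }
    }
    return NULL;
}

/* Look the peer up; create and register it only if it is unknown. */
peer_t *peer_lookup_or_create(module_t *mod, const proc_name_t *origin) {
    peer_t *peer = peer_lookup(mod, origin);
    if (peer == NULL) {
        /* peer is not used in this branch until it has been allocated. */
        peer = calloc(1, sizeof *peer);
        if (peer == NULL) {
            return NULL; /* allocation failed */
        }
        peer->name = *origin;
        peer->state = PEER_ACCEPTING;
        peer->next = mod->peers;
        mod->peers = peer;
    }
    return peer;
}
```

Calling peer_lookup_or_create twice with the same name returns the same pointer, and the key is only derived from a name that actually exists, either the incoming origin or an already registered peer.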
>>>
>>>
>>> Please fix this mistake in the next release.
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users