
Subject: Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-11-13 01:13:11


I can try, but I no longer have any way of testing Torque - so all I can do is a code review. If you can build with --enable-debug and add -mca plm_base_verbose 5 to your command line, I'd appreciate seeing the output.
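For reference, the requested steps could look roughly like the following sketch. The install prefix, job size, and program name are placeholders; only --enable-debug and plm_base_verbose 5 come from the message above.

```shell
# Rebuild Open MPI with debugging enabled (prefix is an assumption).
./configure --prefix=$HOME/openmpi-debug --enable-debug
make -j4 all install

# Re-run the failing job with verbose PLM output captured to a log.
mpirun -mca plm_base_verbose 5 -np 8 ./Myprog 2>&1 | tee plm-debug.log
```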

On Nov 12, 2013, at 9:58 PM, tmishima_at_[hidden] wrote:

>
>
> Hi Ralph,
>
> Thank you for your quick response.
>
> I'd like to report one more regression in the Torque support of
> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> has problems", which I reported a few days ago.
>
> The script below does not work with openmpi-1.7.4a1r29646,
> although it worked with openmpi-1.7.3 as I told you before.
>
> #!/bin/sh
> #PBS -l nodes=node08:ppn=8
> export OMP_NUM_THREADS=1
> cd $PBS_O_WORKDIR
> cp $PBS_NODEFILE pbs_hosts
> NPROCS=`wc -l < pbs_hosts`
> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>
> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> Since this happens without requesting LAMA, I guess the problem is not
> in LAMA itself. Anyway, please look into this issue as well.
>
> Regards,
> Tetsuya Mishima
>
>> Done - thanks!
>>
>> On Nov 12, 2013, at 7:35 PM, tmishima_at_[hidden] wrote:
>>
>>>
>>>
>>> Dear openmpi developers,
>>>
>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646
>>> built by PGI 13.10, as shown below:
>>>
>>> [mishima_at_manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2
>>> -report-bindings mPre
>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>> [manage:23082] *** Process received signal ***
>>> [manage:23082] Signal: Segmentation fault (11)
>>> [manage:23082] Signal code: Address not mapped (1)
>>> [manage:23082] Failing at address: 0x34
>>> [manage:23082] *** End of error message ***
>>> Segmentation fault (core dumped)
>>>
>>> [mishima_at_manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>> ...
>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>> (gdb) where
>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>> (gdb) quit
>>>
>>>
>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c dereferences peer immediately
>>> after the NULL check has confirmed it is NULL; that stray statement
>>> causes the segfault and should simply be removed.
>>>
>>> 624         /* lookup the corresponding process */
>>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>> 626         if (NULL == peer) {
>>> 627             ui64 = (uint64_t*)(&peer->name);
>>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
>>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
>>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>> 632             peer->mod = mod;
>>> 633             peer->name = hdr->origin;
>>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
>>> 635             ui64 = (uint64_t*)(&peer->name);
>>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>> 637                 OBJ_RELEASE(peer);
>>> 638                 return;
>>> 639             }
>>>
>>>
>>> Please fix this mistake in the next release.
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>