
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] segfault on host not found error.
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-04-01 06:09:30


Yes, it seems to be fixed.
Thanks.

On Mon, Mar 31, 2008 at 9:17 PM, Ralph H Castain <rhc_at_[hidden]> wrote:

> I am unable to replicate the segfault. However, I was able to get the job
> to hang. I fixed that behavior with r18044.
>
> Perhaps you can test this again and let me know what you see. A gdb stack
> trace would be more helpful.
>
> Thanks
> Ralph
>
>
>
> On 3/31/08 5:13 AM, "Lenny Verkhovsky" <lennyb_at_[hidden]> wrote:
>
> >
> > I accidentally ran a job with a hostfile in which one of the hosts was not
> > properly mounted. As a result, I got an error and a segfault.
> >
> >
> > /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun -np 29 -hostfile hostfile ./mpi_p01 -t lt
> > bash: /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/orted: No such file or directory
> > --------------------------------------------------------------------------
> > A daemon (pid 9753) died unexpectedly with status 127 while attempting
> > to launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun was unable to start the specified application as it encountered an error.
> > More information may be available above.
> > --------------------------------------------------------------------------
> > [witch1:09745] *** Process received signal ***
> > [witch1:09745] Signal: Segmentation fault (11)
> > [witch1:09745] Signal code: Address not mapped (1)
> > [witch1:09745] Failing at address: 0x3c
> > [witch1:09745] [ 0] /lib64/libpthread.so.0 [0x2aff223ebc10]
> > [witch1:09745] [ 1] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0 [0x2aff21cdfe21]
> > [witch1:09745] [ 2] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_rml_oob.so [0x2aff22c398f1]
> > [witch1:09745] [ 3] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_oob_tcp.so [0x2aff22d426ee]
> > [witch1:09745] [ 4] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_oob_tcp.so [0x2aff22d433fb]
> > [witch1:09745] [ 5] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_oob_tcp.so [0x2aff22d4485b]
> > [witch1:09745] [ 6] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0 [0x2aff21e1242b]
> > [witch1:09745] [ 7] /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun [0x403203]
> > [witch1:09745] [ 8] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0 [0x2aff21e1242b]
> > [witch1:09745] [ 9] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0(opal_progress+0x8b) [0x2aff21e060cb]
> > [witch1:09745] [10] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0(orte_trigger_event+0x20) [0x2aff21cc6940]
> > [witch1:09745] [11] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0(orte_wakeup+0x2d) [0x2aff21cc776d]
> > [witch1:09745] [12] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_plm_rsh.so [0x2aff22b34756]
> > [witch1:09745] [13] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0 [0x2aff21cc6ea7]
> > [witch1:09745] [14] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0 [0x2aff21e1242b]
> > [witch1:09745] [15] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0(opal_progress+0x8b) [0x2aff21e060cb]
> > [witch1:09745] [16] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0xad) [0x2aff21ce068d]
> > [witch1:09745] [17] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_plm_rsh.so [0x2aff22b34e5e]
> > [witch1:09745] [18] /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun [0x402e13]
> > [witch1:09745] [19] /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun [0x402873]
> > [witch1:09745] [20] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aff22512154]
> > [witch1:09745] [21] /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun [0x4027c9]
> > [witch1:09745] *** End of error message ***
> > Segmentation fault (core dumped)
> >
> >
> > Best Regards,
> > Lenny.
> >
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel