Open MPI Development Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2006-11-12 21:33:59


I'll fix the case in attr_create_predefined_callback - we should initialize
the rank variable first to be safe.
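As a rough illustration of that kind of fix (not the actual code in attribute_predefined.c; the struct and function names here are hypothetical), giving the variable a defined value before the lookup loop means any later test on it no longer depends on uninitialised memory:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a key/value pair delivered to a callback. */
struct kv {
    const char *key;
    int value;
};

/* Returns the value stored under "rank", or -1 if no such key arrived. */
int find_rank(const struct kv *vals, size_t n)
{
    int rank = -1; /* initialized first, so later comparisons are well defined */

    for (size_t i = 0; i < n; ++i) {
        if (strcmp(vals[i].key, "rank") == 0) {
            rank = vals[i].value;
        }
    }
    return rank;
}
```

Without the initializer, a message that carries no "rank" entry would leave the variable indeterminate, which is the pattern valgrind reports as a conditional jump on an uninitialised value.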

For your other question, run configure with "--without-memory-manager".

Ralph

On 11/12/06 10:52 AM, "Adrian Knoth" <adi_at_[hidden]> wrote:

> Hi,
>
> I'm currently tracing a segfault in mpi_init which is caused
> by ompi/runtime/ompi_mpi_init.c:569
>
> ret = MCA_PML_CALL(add_procs(procs, nprocs));
> free(procs);
>
> In most cases, no segfault occurs and everything works fine,
> but with some special combinations of machines, I can trigger
> the bug.
>
> If I choose a working configuration and increase the number
> of IPv6 addresses on one of the machines, the segfault occurs.
>
> It cannot be triggered by adding IPv4 addresses, and disabling
> IPv6 completely solves the problem.
>
> The debugger shows that free internally calls mem2chunk.
> The working configuration has a chunksize of 16 (bytes?),
> the failing one has $BIGNUM, thus causing the segfault.
> (trying to free unallocated memory)
>
> I think these long IPv6 addresses overwrite a buffer (or at
> least some memory which is allocated inside OMPI's memory
> pool, thus delaying the segfault).
>
> There are two issues found by valgrind, but I wanted to
> check the "normal" valgrind output first. With the nightly
> snapshot 1.2b1r12555, I got the following "errors":
>
> ==8948== Conditional jump or move depends on uninitialised value(s)
> ==8948== at 0x1B92884D: ompi_attr_create_predefined_callback (attribute_predefined.c:374)
> ==8948== by 0x1BC869B8: orte_gpr_proxy_deliver_notify_msg (gpr_proxy_deliver_notify_msg.c:144)
> ==8948== by 0x1B9FEDF7: mca_oob_xcast (oob_base_xcast.c:147)
> ==8948== by 0x1B947E49: ompi_mpi_init (ompi_mpi_init.c:542)
> ==8948== by 0x1B97D657: MPI_Init (pinit.c:71)
> ==8948== by 0x8048846: main (in /home/racl/adi/ompi/trunk/test/vm/ring)
>
> and
>
> ==8948== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==8948== at 0x1BBCD5E8: (within /lib/tls/libc-2.3.2.so)
> ==8948== by 0x1BD873C1: mca_btl_tcp_frag_send (btl_tcp_frag.c:104)
> ==8948== by 0x1BD87133: mca_btl_tcp_endpoint_send_handler (btl_tcp_endpoint.c:689)
> ==8948== by 0x1BA48AD3: opal_event_process_active (event.c:463)
> ==8948== by 0x1BA48E11: opal_event_base_loop (event.c:600)
> ==8948== by 0x1BA48BE3: opal_event_loop (event.c:514)
> ==8948== by 0x1BA4211D: opal_progress (opal_progress.c:259)
> ==8948== by 0x1BD59D24: opal_condition_wait (condition.h:81)
> ==8948== by 0x1BD5AD00: mca_pml_ob1_send (pml_ob1_isend.c:128)
> ==8948== by 0x1B985CD9: MPI_Send (psend.c:63)
> ==8948== by 0x80488B6: main (in /home/racl/adi/ompi/trunk/test/vm/ring)
> ==8948== Address 0x80FEECE is not stack'd, malloc'd or (recently) free'd
>
>
> Should I worry about these two?
>
> The segfault itself is probably related to this output:
>
> ==3324== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==3324== at 0x1BBB45E8: (within /lib/tls/libc-2.3.2.so)
> ==3324== by 0x1BC57191: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:234)
> ==3324== by 0x1BC58658: mca_oob_tcp_peer_send (oob_tcp_peer.c:194)
> ==3324== by 0x1BC5E873: mca_oob_tcp_send (oob_tcp_send.c:152)
> ==3324== by 0x1B9FEC92: mca_oob_send_packed (oob_base_send.c:78)
> ==3324== by 0x1BC6CE92: orte_gpr_proxy_exec_compound_cmd (gpr_proxy_compound_cmd.c:117)
> ==3324== by 0x1B94503A: ompi_mpi_init (ompi_mpi_init.c:523)
> ==3324== by 0x1B97AE7F: MPI_Init (pinit.c:71)
> ==3324== by 0x8048846: main (in /home/racl/adi/ompi/trunk/test/vm/ring)
> ==3324== Address 0x822BF11 is not stack'd, malloc'd or (recently) free'd
>
> But I still have to look closer.
>
> Is there a way to disable OMPI's ptmalloc2 and use the
> system's free/malloc? (hopefully causing the segfault right where
> the bad write happens, probably a memcpy with the wrong size)
>
> Or are there other ways to debug such an issue?
>
> TIA
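
The "memcpy with wrong size" suspicion above can be sketched as a bounds-checked copy. This is only an illustration under assumptions: copy_bounded is a hypothetical helper, not an Open MPI function, and the real overflow site has not been identified in this thread. The point is that a destination sized for a 4-byte IPv4 address cannot hold a 16-byte IPv6 address, so the length has to be checked against the capacity before copying:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy len bytes from src into dst, which holds cap bytes.
 * Returns 0 on success, -1 if an argument is NULL or the data would not fit. */
int copy_bounded(void *dst, size_t cap, const void *src, size_t len)
{
    if (dst == NULL || src == NULL || len > cap) {
        return -1; /* refuse instead of writing past the end of dst */
    }
    memcpy(dst, src, len);
    return 0;
}
```

An unchecked memcpy in the same situation would silently overwrite whatever follows the buffer, for example the allocator's chunk header, and the crash would only show up later in free.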