Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segmentation fault in mca_pml_ob1.so
From: Grzegorz Maj (maju3_at_[hidden])
Date: 2010-12-07 10:52:57


I recompiled MPI with -g, but it didn't solve the problem. Two things have
changed: buf in PMPI_Recv no longer has a value of 0, and the backtrace in
gdb shows more functions (e.g. mca_pml_ob1_recv_frag_callback_put as #1).

As you recommended, I will try to walk up the stack, but this code is not
easy for me to follow.
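
In case it's useful, what I have in mind is just the standard gdb frame
navigation, with sendreq taken from frame #0 of the backtrace below (a
sketch, nothing Open MPI specific):
-----
(gdb) frame 0
(gdb) print sendreq
(gdb) print *sendreq
(gdb) frame 1
(gdb) info locals
(gdb) up
-----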

This is the backtrace I got with -g:
-----
Program received signal SIGSEGV, Segmentation fault.
0x00007f1f1a11e4eb in mca_pml_ob1_send_request_put (sendreq=0x1437b00,
btl=0xdae850, hdr=0xeb4870) at pml_ob1_sendreq.c:1231
1231 pml_ob1_sendreq.c: No such file or directory.
 in pml_ob1_sendreq.c
(gdb) bt
#0 0x00007f1f1a11e4eb in mca_pml_ob1_send_request_put (sendreq=0x1437b00,
btl=0xdae850, hdr=0xeb4870) at pml_ob1_sendreq.c:1231
#1 0x00007f1f1a1124de in mca_pml_ob1_recv_frag_callback_put (btl=0xdae850,
tag=72 'H', des=0x7f1f1ff6bb00, cbdata=0x0) at pml_ob1_recvfrag.c:361
#2 0x00007f1f19660e0f in mca_btl_tcp_endpoint_recv_handler (sd=24, flags=2,
user=0xe2ab40) at btl_tcp_endpoint.c:718
#3 0x00007f1f1d74aa5b in event_process_active (base=0xd82af0) at
event.c:651
#4 0x00007f1f1d74b087 in opal_event_base_loop (base=0xd82af0, flags=2) at
event.c:823
#5 0x00007f1f1d74ac76 in opal_event_loop (flags=2) at event.c:730
#6 0x00007f1f1d73a360 in opal_progress () at runtime/opal_progress.c:189
#7 0x00007f1f1a10c0af in opal_condition_wait (c=0x7f1f1df3a5c0,
m=0x7f1f1df3a620) at ../../../../opal/threads/condition.h:99
#8 0x00007f1f1a10bef1 in ompi_request_wait_completion (req=0xe1eb00) at
../../../../ompi/request/request.h:375
#9 0x00007f1f1a10bdb5 in mca_pml_ob1_recv (addr=0x7f1f1a083080, count=1,
datatype=0xeb3da0, src=-1, tag=0, comm=0xeb0cd0, status=0xe43f00) at
pml_ob1_irecv.c:104
#10 0x00007f1f1dc9e324 in PMPI_Recv (buf=0x7f1f1a083080, count=1,
type=0xeb3da0, source=-1, tag=0, comm=0xeb0cd0, status=0xe43f00) at
precv.c:75
#11 0x000000000049cc43 in BI_Srecv ()
#12 0x000000000049c555 in BI_IdringBR ()
#13 0x0000000000495ba1 in ilp64_Cdgebr2d ()
#14 0x000000000047ffa0 in Cdgebr2d ()
#15 0x00007f1f1f99c8e1 in PB_CInV2 () from
/home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
#16 0x00007f1f1f9c489c in PB_CpgemmAB () from
/home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
#17 0x00007f1f1fa748fd in pdgemm_ () from
/home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
-----

Thanks,
Grzegorz Maj

2010/12/7 Terry Dontje <terry.dontje_at_[hidden]>

> I am not sure this has anything to do with your problem, but looking at the
> stack entry for PMPI_Recv I noticed that buf has a value of 0. Shouldn't
> that be an address?
>
> Does your code fail if the MPI library is built with -g? If it does fail
> the same way, the next step I would take is to walk up the stack and try to
> figure out where the sendreq address is coming from, because according to
> the original stack that is presumably the address that is not mapped.
>
> --td
>
>
> On 12/07/2010 08:29 AM, Grzegorz Maj wrote:
>
> Some update on this issue. I've attached gdb to the crashing
> application and I got:
>
> -----
> Program received signal SIGSEGV, Segmentation fault.
> mca_pml_ob1_send_request_put (sendreq=0x130c480, btl=0xc49850,
> hdr=0xd10e60) at pml_ob1_sendreq.c:1231
> 1231 pml_ob1_sendreq.c: No such file or directory.
> in pml_ob1_sendreq.c
> (gdb) bt
> #0 mca_pml_ob1_send_request_put (sendreq=0x130c480, btl=0xc49850,
> hdr=0xd10e60) at pml_ob1_sendreq.c:1231
> #1 0x00007fc55bf31693 in mca_btl_tcp_endpoint_recv_handler (sd=<value
> optimized out>, flags=<value optimized out>, user=<value optimized
> out>) at btl_tcp_endpoint.c:718
> #2 0x00007fc55fff7de4 in event_process_active (base=0xc1daf0,
> flags=2) at event.c:651
> #3 opal_event_base_loop (base=0xc1daf0, flags=2) at event.c:823
> #4 0x00007fc55ffe9ff1 in opal_progress () at runtime/opal_progress.c:189
> #5 0x00007fc55c9d7115 in opal_condition_wait (addr=<value optimized
> out>, count=<value optimized out>, datatype=<value optimized out>,
> src=<value optimized out>, tag=<value optimized out>,
> comm=<value optimized out>, status=0xcc6100) at
> ../../../../opal/threads/condition.h:99
> #6 ompi_request_wait_completion (addr=<value optimized out>,
> count=<value optimized out>, datatype=<value optimized out>,
> src=<value optimized out>, tag=<value optimized out>,
> comm=<value optimized out>, status=0xcc6100) at
> ../../../../ompi/request/request.h:375
> #7 mca_pml_ob1_recv (addr=<value optimized out>, count=<value
> optimized out>, datatype=<value optimized out>, src=<value optimized
> out>, tag=<value optimized out>, comm=<value optimized out>,
> status=0xcc6100) at pml_ob1_irecv.c:104
> #8 0x00007fc560511260 in PMPI_Recv (buf=0x0, count=12884048,
> type=0xd10410, source=-1, tag=0, comm=0xd0daa0, status=0xcc6100) at
> precv.c:75
> #9 0x000000000049cc43 in BI_Srecv ()
> #10 0x000000000049c555 in BI_IdringBR ()
> #11 0x0000000000495ba1 in ilp64_Cdgebr2d ()
> #12 0x000000000047ffa0 in Cdgebr2d ()
> #13 0x00007fc5621da8e1 in PB_CInV2 () from
> /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
> #14 0x00007fc56220289c in PB_CpgemmAB () from
> /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
> #15 0x00007fc5622b28fd in pdgemm_ () from
> /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
> -----
>
> So it looks like the line responsible for the segmentation fault is:
> mca_bml_base_endpoint_t *bml_endpoint = sendreq->req_endpoint;
>
> I repeated it several times; it always crashes on the same line.
>
> I have no idea what to do with this. Again, any help would be appreciated.
>
> Thanks,
> Grzegorz Maj
>
>
>
> 2010/12/6 Grzegorz Maj <maju3_at_[hidden]>:
>
> Hi,
> I'm using MKL ScaLAPACK in my project. Recently, I tried to run my
> application on a new set of nodes. Unfortunately, when I try to execute
> more than about 20 processes, I get a segmentation fault.
>
> [compn7:03552] *** Process received signal ***
> [compn7:03552] Signal: Segmentation fault (11)
> [compn7:03552] Signal code: Address not mapped (1)
> [compn7:03552] Failing at address: 0x20b2e68
> [compn7:03552] [ 0] /lib64/libpthread.so.0(+0xf3c0) [0x7f46e0fc33c0]
> [compn7:03552] [ 1]
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0xd577)
> [0x7f46dd093577]
> [compn7:03552] [ 2]
> /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x5b4c)
> [0x7f46dc5edb4c]
> [compn7:03552] [ 3]
> /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(+0x1dbe8) [0x7f46e0679be8]
> [compn7:03552] [ 4]
> /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(opal_progress+0xa1)
> [0x7f46e066dbf1]
> [compn7:03552] [ 5]
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x5945)
> [0x7f46dd08b945]
> [compn7:03552] [ 6]
> /home/gmaj/lib/openmpi/lib/libmpi.so.0(MPI_Send+0x6a) [0x7f46e0b4f10a]
> [compn7:03552] [ 7] /home/gmaj/matrix/matrix(BI_Ssend+0x21) [0x49cc11]
> [compn7:03552] [ 8] /home/gmaj/matrix/matrix(BI_IdringBR+0x79) [0x49c579]
> [compn7:03552] [ 9] /home/gmaj/matrix/matrix(ilp64_Cdgebr2d+0x221) [0x495bb1]
> [compn7:03552] [10] /home/gmaj/matrix/matrix(Cdgebr2d+0xd0) [0x47ffb0]
> [compn7:03552] [11]
> /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so(PB_CInV2+0x1304)
> [0x7f46e27f5124]
> [compn7:03552] *** End of error message ***
>
> This error appears during some ScaLAPACK computation. My processes do
> some MPI communication before the error appears.
>
> I found out that by modifying the btl_tcp_eager_limit and
> btl_tcp_max_send_size parameters I can run more processes - the smaller
> those values are, the more processes I can run. Unfortunately, by this
> method I've only managed to run up to 30 processes, which is still far
> too small.
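>
> For reference, I have been lowering them on the mpirun command line,
> roughly like this (the values and binary path are only illustrative):
> -----
> mpirun -np 30 --mca btl_tcp_eager_limit 16384 \
>        --mca btl_tcp_max_send_size 16384 ./matrix
> -----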
>
> A clue may be what valgrind reports:
>
> ==3894== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==3894== at 0x82D009B: writev (in /lib64/libc-2.12.90.so)
> ==3894== by 0xBA2136D: mca_btl_tcp_frag_send (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
> ==3894== by 0xBA203D0: mca_btl_tcp_endpoint_send (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
> ==3894== by 0xB003583: mca_pml_ob1_send_request_start_rdma (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
> ==3894== by 0xAFFA7C9: mca_pml_ob1_send (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
> ==3894== by 0x6D4B109: PMPI_Send (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894== by 0x49CC10: BI_Ssend (in /home/gmaj/matrix/matrix)
> ==3894== by 0x49C578: BI_IdringBR (in /home/gmaj/matrix/matrix)
> ==3894== by 0x495BB0: ilp64_Cdgebr2d (in /home/gmaj/matrix/matrix)
> ==3894== by 0x47FFAF: Cdgebr2d (in /home/gmaj/matrix/matrix)
> ==3894== by 0x51B38E0: PB_CInV2 (in
> /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
> ==3894== by 0x51DB89B: PB_CpgemmAB (in
> /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
> ==3894== Address 0xadecdce is 461,886 bytes inside a block of size
> 527,544 alloc'd
> ==3894== at 0x4C2615D: malloc (vg_replace_malloc.c:195)
> ==3894== by 0x6D0BBA3: ompi_free_list_grow (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894== by 0xBA1E1A4: mca_btl_tcp_component_init (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
> ==3894== by 0x6D5C909: mca_btl_base_select (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894== by 0xB40E950: mca_bml_r2_component_init (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_bml_r2.so)
> ==3894== by 0x6D5C07E: mca_bml_base_init (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894== by 0xAFF8A0E: mca_pml_ob1_component_init (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
> ==3894== by 0x6D663B2: mca_pml_base_select (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894== by 0x6D25D20: ompi_mpi_init (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894== by 0x6D45987: PMPI_Init_thread (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894== by 0x42490A: MPI::Init_thread(int&, char**&, int)
> (functions_inln.h:150)
> ==3894== by 0x41F483: main (matrix.cpp:83)
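>
> For reference, this kind of per-rank report comes from running each
> process under valgrind through mpirun, roughly like this (a sketch; the
> process count and valgrind options are illustrative):
> -----
> mpirun -np 20 valgrind --track-origins=yes ./matrix
> -----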
>
> I've tried to configure Open MPI with the --without-memory-manager option,
> but it didn't help.
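>
> For completeness, the configure invocation was roughly of this form (a
> sketch; the install prefix matches the paths in the backtraces, and other
> options are omitted):
> -----
> ./configure --prefix=/home/gmaj/lib/openmpi --without-memory-manager
> make && make install
> -----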
>
> I can successfully run exactly the same application on other machines,
> even with more than 800 nodes.
>
> Does anyone have any idea how to further debug this issue? Any help
> would be appreciated.
>
> Thanks,
> Grzegorz Maj
>
>
>
>
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>