Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] scaling issue beyond 1024 processes
From: CB (cbalways_at_[hidden])
Date: 2011-08-09 14:02:25


Hi Ralph,

Yes, you are right. Those nodes were still pointing to an old version.
I'll check the installation on all nodes and try to run it again.

Thanks,
- Chansup
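
A minimal sketch of how the installation check across nodes could be automated. It assumes that `mpirun --version` prints a banner containing "(Open MPI) <version>", as in the "(Open MPI) 1.4.3" string quoted below, and that passwordless ssh to every node is available. The helper names (`parse_version`, `group_hosts_by_version`, `collect_banners`) are hypothetical and not part of Open MPI.

```python
import re
import subprocess
from collections import defaultdict

# Assumed banner format, e.g. "mpirun (Open MPI) 1.4.3".
VERSION_RE = re.compile(r"\(Open MPI\)\s+(\S+)")


def parse_version(banner):
    """Return the version string found in a banner, or None if absent."""
    match = VERSION_RE.search(banner)
    return match.group(1) if match else None


def group_hosts_by_version(banners):
    """Map each parsed version (None if unparsed) to the sorted hosts reporting it."""
    groups = defaultdict(list)
    for host, banner in banners.items():
        groups[parse_version(banner)].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


def collect_banners(hosts, timeout=30):
    """Run `mpirun --version` on each host over ssh (assumes passwordless ssh)."""
    banners = {}
    for host in hosts:
        result = subprocess.run(
            ["ssh", host, "mpirun", "--version"],
            capture_output=True, text=True, timeout=timeout,
        )
        # Keep both streams, since the banner may be written to either one.
        banners[host] = result.stdout + result.stderr
    return banners
```

Calling `group_hosts_by_version(collect_banners(hosts))` with the job's host list should return a single key when every node resolves the same installation. More than one key, or a `None` key, points at the nodes worth inspecting. The check runs over non-interactive ssh, so it probably sees the same PATH that the rsh/ssh launcher uses on each node, which may differ from a login shell's PATH.
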

On Tue, Aug 9, 2011 at 1:48 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> That error makes no sense - line 335 is just a variable declaration. Are you
> sure you are not picking up a different version on that node?
>
>
> On Aug 9, 2011, at 11:37 AM, CB wrote:
>
> > Hi,
> >
> > Currently I'm having trouble scaling an MPI job beyond a certain limit.
> > I'm running an MPI hello example to test beyond 1024 processes. It worked
> > fine with 1024 processes, but with 2048 processes it failed with the
> > error below. The file descriptor limit for each process (65536) should
> > be high enough.
> >
> > I would appreciate any suggestions.
> > I'm running (Open MPI) 1.4.3
> >
> > [x-01-06-a:25989] [[37568,0],69] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/odls_base_default_fns.c at line 335
> > [x-01-06-b:09532] [[37568,0],74] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/odls_base_default_fns.c at line 335
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > [x-03-20-b:23316] *** Process received signal ***
> > [x-03-20-b:23316] Signal: Segmentation fault (11)
> > [x-03-20-b:23316] Signal code: Address not mapped (1)
> > [x-03-20-b:23316] Failing at address: 0x6c
> > [x-03-20-b:23316] [ 0] /lib64/libpthread.so.0 [0x310860ee90]
> > [x-03-20-b:23316] [ 1] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x230) [0x7f0dbe0c5010]
> > [x-03-20-b:23316] [ 2] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> > [x-03-20-b:23316] [ 3] mpirun [0x403bbe]
> > [x-03-20-b:23316] [ 4] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> > [x-03-20-b:23316] [ 5] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f0dbde50e49]
> > [x-03-20-b:23316] [ 6] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_trigger_event+0x42) [0x7f0dbe0a7ca2]
> > [x-03-20-b:23316] [ 7] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x22d) [0x7f0dbe0c500d]
> > [x-03-20-b:23316] [ 8] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> > [x-03-20-b:23316] [ 9] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f0dbde50e49]
> > [x-03-20-b:23316] [10] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x23d) [0x7f0dbe0c5ddd]
> > [x-03-20-b:23316] [11] /usr/local/MPI/openmpi-1.4.3/lib/openmpi/mca_plm_rsh.so [0x7f0dbd41d679]
> > [x-03-20-b:23316] [12] mpirun [0x40373f]
> > [x-03-20-b:23316] [13] mpirun [0x402a1c]
> > [x-03-20-b:23316] [14] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3107e1ea2d]
> > [x-03-20-b:23316] [15] mpirun [0x402939]
> > [x-03-20-b:23316] *** End of error message ***
> > [x-01-06-a:25989] [[37568,0],69]-[[37568,0],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> > [x-01-06-b:09532] [[37568,0],74]-[[37568,0],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> > ./sge_jsb.sh: line 9: 23316 Segmentation fault (core dumped) mpirun -np $NSLOTS ./hello_openmpi.exe
> >
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users