Subject: Re: [OMPI users] scaling issue beyond 1024 processes
From: CB (cbalways_at_[hidden])
Date: 2011-08-09 14:02:25


Hi Ralph,

Yes, you are right. Those nodes were still pointing to an old version.
I'll check the installation on all nodes and try to run it again.

Thanks,
- Chansup
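[Editorial note: a minimal sketch, not from the original thread, of one way to confirm which Open MPI library each node's processes actually load. The file name, output format, and the `-npernode 1` launch flag (an Open MPI mpirun option) are illustrative assumptions.]

    /* which_mpi.c - rough per-node check for mismatched Open MPI installs.
     * Each rank resolves the shared object that provides MPI_Init and prints
     * it together with its hostname; a node silently picking up an old
     * install shows a different path.
     * Build: mpicc -o which_mpi which_mpi.c -ldl
     * Run:   mpirun -npernode 1 ./which_mpi     (one rank per node)      */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];
        Dl_info info;
        void *sym;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        /* Look up the real MPI_Init definition (normally in libmpi.so) and
         * ask the dynamic linker which file it came from. */
        sym = dlsym(RTLD_DEFAULT, "MPI_Init");
        if (sym && dladdr(sym, &info) && info.dli_fname)
            printf("rank %d on %s uses %s\n", rank, host, info.dli_fname);
        else
            printf("rank %d on %s: could not resolve MPI library path\n",
                   rank, host);

        MPI_Finalize();
        return 0;
    }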

On Tue, Aug 9, 2011 at 1:48 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> That error makes no sense; line 335 is just a variable declaration. Are you
> sure you are not picking up a different version on that node?
>
>
> On Aug 9, 2011, at 11:37 AM, CB wrote:
>
> > Hi,
> >
> > Currently I'm having trouble scaling an MPI job beyond a certain limit.
> > So I'm running an MPI hello example to test beyond 1024 processes, but it
> > failed with the following error at 2048 processes.
> > It worked fine with 1024 processes. I have a sufficient file descriptor
> > limit (65536) defined for each process.
> >
> > I would appreciate any suggestions.
> > I'm running Open MPI 1.4.3.
> >
> > [x-01-06-a:25989] [[37568,0],69] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/odls_base_default_fns.c at line 335
> > [x-01-06-b:09532] [[37568,0],74] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/odls_base_default_fns.c at line 335
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > [x-03-20-b:23316] *** Process received signal ***
> > [x-03-20-b:23316] Signal: Segmentation fault (11)
> > [x-03-20-b:23316] Signal code: Address not mapped (1)
> > [x-03-20-b:23316] Failing at address: 0x6c
> > [x-03-20-b:23316] [ 0] /lib64/libpthread.so.0 [0x310860ee90]
> > [x-03-20-b:23316] [ 1] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x230) [0x7f0dbe0c5010]
> > [x-03-20-b:23316] [ 2] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> > [x-03-20-b:23316] [ 3] mpirun [0x403bbe]
> > [x-03-20-b:23316] [ 4] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> > [x-03-20-b:23316] [ 5] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f0dbde50e49]
> > [x-03-20-b:23316] [ 6] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_trigger_event+0x42) [0x7f0dbe0a7ca2]
> > [x-03-20-b:23316] [ 7] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x22d) [0x7f0dbe0c500d]
> > [x-03-20-b:23316] [ 8] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> > [x-03-20-b:23316] [ 9] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f0dbde50e49]
> > [x-03-20-b:23316] [10] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x23d) [0x7f0dbe0c5ddd]
> > [x-03-20-b:23316] [11] /usr/local/MPI/openmpi-1.4.3/lib/openmpi/mca_plm_rsh.so [0x7f0dbd41d679]
> > [x-03-20-b:23316] [12] mpirun [0x40373f]
> > [x-03-20-b:23316] [13] mpirun [0x402a1c]
> > [x-03-20-b:23316] [14] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3107e1ea2d]
> > [x-03-20-b:23316] [15] mpirun [0x402939]
> > [x-03-20-b:23316] *** End of error message ***
> > [x-01-06-a:25989] [[37568,0],69]-[[37568,0],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> > [x-01-06-b:09532] [[37568,0],74]-[[37568,0],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> > ./sge_jsb.sh: line 9: 23316 Segmentation fault (core dumped) mpirun -np $NSLOTS ./hello_openmpi.exe
> >
> >
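[Editorial note: the source of hello_openmpi.exe is not included in the thread. A minimal MPI hello-world used for this kind of launch-scaling smoke test typically looks like the sketch below (names are illustrative). Such a program exercises only startup and teardown, which is consistent with the failure above appearing in mpirun/ORTE rather than in application code.]

    /* hello_openmpi.c - illustrative stand-in for the hello test above.
     * Build: mpicc -o hello_openmpi.exe hello_openmpi.c
     * Run:   mpirun -np $NSLOTS ./hello_openmpi.exe   (as in sge_jsb.sh)  */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* One line per rank; at 2048 processes this only tests the launcher. */
        printf("Hello from rank %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }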
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>