Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] scaling issue beyond 1024 processes
From: CB (cbalways_at_[hidden])
Date: 2011-08-10 16:09:36


Now I was able to run the MPI hello world example with up to 3096 processes
across 129 nodes (24 cores per node).
However, it seems to get stuck at 3097 processes.

Any suggestions for troubleshooting?
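
For reference, here is a minimal sketch of the kind of hello-world program being
launched in this thread. The actual hello_openmpi.exe source is not posted, so
the code below is an assumption rather than the original program; the added
RLIMIT_NOFILE report is only there to tie in with the 65536 file descriptor
limit mentioned in the quoted message further down.

/* hello_openmpi.c -- hypothetical sketch; not the original hello_openmpi.exe source */
#include <stdio.h>
#include <mpi.h>
#include <sys/resource.h>   /* getrlimit(), to report the per-process fd limit */

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    struct rlimit rl;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* Report the soft file descriptor limit alongside the greeting, since the
       quoted message below mentions a 65536 descriptor limit per process. */
    getrlimit(RLIMIT_NOFILE, &rl);
    printf("Hello from rank %d of %d on %s (RLIMIT_NOFILE soft=%ld)\n",
           rank, size, name, (long)rl.rlim_cur);

    MPI_Finalize();
    return 0;
}

Assuming it is built with the Open MPI wrapper compiler (mpicc hello_openmpi.c
-o hello_openmpi.exe), it would be launched the same way as shown in the log
below, e.g. mpirun -np $NSLOTS ./hello_openmpi.exe.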

Thanks,
- Chansup

On Tue, Aug 9, 2011 at 2:02 PM, CB <cbalways_at_[hidden]> wrote:

> Hi Ralph,
>
> Yes, you are right. Those nodes were still pointing to an old version.
> I'll check the installation on all nodes and try to run it again.
>
> Thanks,
> - Chansup
>
>
> On Tue, Aug 9, 2011 at 1:48 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> That error makes no sense - line 335 is just a variable declaration. Sure
>> you are not picking up a different version on that node?
>>
>>
>> On Aug 9, 2011, at 11:37 AM, CB wrote:
>>
>> > Hi,
>> >
>> > Currently I'm having trouble scaling an MPI job beyond a certain limit.
>> > So I'm running an MPI hello world example to test beyond 1024 processes,
>> but it failed with the following error at 2048 processes.
>> > It worked fine with 1024 processes. I have a sufficient file descriptor
>> limit (65536) defined for each process.
>> >
>> > I would appreciate any suggestions.
>> > I'm running Open MPI 1.4.3.
>> >
>> > [x-01-06-a:25989] [[37568,0],69] ORTE_ERROR_LOG: Data unpack had
>> inadequate space in file base/odls_base_default_fns.c at line 335
>> > [x-01-06-b:09532] [[37568,0],74] ORTE_ERROR_LOG: Data unpack had
>> inadequate space in file base/odls_base_default_fns.c at line 335
>> >
>> --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the process
>> > that caused that situation.
>> >
>> --------------------------------------------------------------------------
>> > [x-03-20-b:23316] *** Process received signal ***
>> > [x-03-20-b:23316] Signal: Segmentation fault (11)
>> > [x-03-20-b:23316] Signal code: Address not mapped (1)
>> > [x-03-20-b:23316] Failing at address: 0x6c
>> > [x-03-20-b:23316] [ 0] /lib64/libpthread.so.0 [0x310860ee90]
>> > [x-03-20-b:23316] [ 1]
>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x230)
>> [0x7f0dbe0c5010]
>> > [x-03-20-b:23316] [ 2] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0
>> [0x7f0dbde5c8f8]
>> > [x-03-20-b:23316] [ 3] mpirun [0x403bbe]
>> > [x-03-20-b:23316] [ 4] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0
>> [0x7f0dbde5c8f8]
>> > [x-03-20-b:23316] [ 5]
>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99)
>> [0x7f0dbde50e49]
>> > [x-03-20-b:23316] [ 6]
>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_trigger_event+0x42)
>> [0x7f0dbe0a7ca2]
>> > [x-03-20-b:23316] [ 7]
>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x22d)
>> [0x7f0dbe0c500d]
>> > [x-03-20-b:23316] [ 8] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0
>> [0x7f0dbde5c8f8]
>> > [x-03-20-b:23316] [ 9]
>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99)
>> [0x7f0dbde50e49]
>> > [x-03-20-b:23316] [10]
>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x23d)
>> [0x7f0dbe0c5ddd]
>> > [x-03-20-b:23316] [11]
>> /usr/local/MPI/openmpi-1.4.3/lib/openmpi/mca_plm_rsh.so [0x7f0dbd41d679]
>> > [x-03-20-b:23316] [12] mpirun [0x40373f]
>> > [x-03-20-b:23316] [13] mpirun [0x402a1c]
>> > [x-03-20-b:23316] [14] /lib64/libc.so.6(__libc_start_main+0xfd)
>> [0x3107e1ea2d]
>> > [x-03-20-b:23316] [15] mpirun [0x402939]
>> > [x-03-20-b:23316] *** End of error message ***
>> > [x-01-06-a:25989] [[37568,0],69]-[[37568,0],0] mca_oob_tcp_msg_recv:
>> readv failed: Connection reset by peer (104)
>> > [x-01-06-b:09532] [[37568,0],74]-[[37568,0],0] mca_oob_tcp_msg_recv:
>> readv failed: Connection reset by peer (104)
>> > ./sge_jsb.sh: line 9: 23316 Segmentation fault (core dumped) mpirun
>> -np $NSLOTS ./hello_openmpi.exe
>> >
>> >