Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] scaling issue beyond 1024 processes
From: CB (cbalways_at_[hidden])
Date: 2011-08-11 12:51:38


The job was dispatched by the SGE scheduler, but the MPI hello binary never
gets executed on the compute nodes. It appears that the Open MPI orted
daemons are waiting for something, as shown below:

Master node:

 4465 ? Sl 0:05 /usr/local/sge/latest/bin/lx26-amd64/sge_execd
 4677 ? S 0:00 \_ sge_shepherd-296 -bg
 4679 ? Ss 0:00 \_ /bin/bash ./sge_jsb.sh
 4681 ? S 0:00 \_ mpirun -np 3097 ./hello_openmpi.exe
 4682 ? Sl 0:02 \_
/usr/local/sge/latest/bin/lx26-amd64/qrsh -inherit -nostdin -V x-01-00-a
orted -mca ess env -mca orte_ess_jobid 662831104 -mca orte_ess_vpid 1 -mca
orte_ess_num_procs 130 --hnp-uri "662831104.0;tcp://xxx.xx.4.8:39025"
 4683 ? Sl 0:01 \_
/usr/local/sge/latest/bin/lx26-amd64/qrsh -inherit -nostdin -V x-01-06-b
orted -mca ess env -mca orte_ess_jobid 662831104 -mca orte_ess_vpid 2 -mca
orte_ess_num_procs 130 --hnp-uri "662831104.0;tcp://xxx.xx.4.8:39025"
... <cut the remaining processes> ...

===

A client compute node:

 6290 ? Sl 0:05 /usr/local/sge/latest/bin/lx26-amd64/sge_execd
 6793 ? Sl 0:00 \_ sge_shepherd-296 -bg
 6794 ? Ss 0:00 \_
/usr/local/sge/latest/utilbin/lx26-amd64/qrsh_starter
/var/spool/sge/62u5/x-01-00-a/active_jobs/296.1/1.x-01-00-a
 6801 ? S 0:00 \_ orted -mca ess env -mca
orte_ess_jobid 662831104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 130
--hnp-uri 662831104.0;tcp://xxx.xx.4.8:39025

- Chansup
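For readers hitting a similar hang: a quick way to see what a stuck daemon is blocked on is to inspect it through Linux procfs. This is a sketch, not a diagnosis of this particular job — PID 6801 is the orted from the listing above, and the `ORTED_PID` variable is a placeholder of mine; the sketch falls back to the current shell's PID only so the commands run anywhere.

```shell
# Probe what a stuck daemon is blocked on via Linux procfs.
# ORTED_PID is a placeholder; set it to the orted PID on your node
# (6801 in the listing above). Falls back to the current shell.
pid=${ORTED_PID:-$$}

echo "kernel wait channel:"
cat /proc/$pid/wchan; echo    # which kernel function the process sleeps in

echo "open file descriptors:"
ls /proc/$pid/fd | wc -l      # compare against the process's ulimit -n
```

From there, `strace -p <pid>` on the stuck orted will show whether it is sitting in a blocked `read` or `connect`, which narrows the wait down to a specific peer.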

On Wed, Aug 10, 2011 at 4:19 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> When you say "stuck", what actually happens?
>
> On Aug 10, 2011, at 2:09 PM, CB wrote:
>
> Now I was able to run the MPI hello world example with up to 3096 processes
> across 129 nodes (24 cores per node).
> However, it seems to get stuck with 3097 processes.
>
> Any suggestions for troubleshooting?
>
> Thanks,
> - Chansup
>
>
> On Tue, Aug 9, 2011 at 2:02 PM, CB <cbalways_at_[hidden]> wrote:
>
>> Hi Ralph,
>>
>> Yes, you are right. Those nodes were still pointing to an old version.
>> I'll check the installation on all nodes and try to run it again.
>>
>> Thanks,
>> - Chansup
>>
>>
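Mixed installs like the one suspected here can be confirmed mechanically rather than by eye. A hedged sketch: collect one `host version` line per node (e.g. with `ssh <host> 'ompi_info --version | head -1'`), then count the distinct versions. The file name and the sample lines below are illustrative stand-ins, not data from this thread.

```shell
# Illustrative stand-in for per-node `ompi_info --version` output;
# on a real cluster, generate these lines with ssh/qrsh per host.
printf '%s\n' \
  'node-a 1.4.3' \
  'node-b 1.4.3' \
  'node-c 1.3.2' > /tmp/ompi_versions.txt

# More than one line of output means mixed installs, which can produce
# unpack errors like the ones seen earlier in this thread.
awk '{print $2}' /tmp/ompi_versions.txt | sort -u
# prints two lines here: 1.3.2 and 1.4.3
```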
>> On Tue, Aug 9, 2011 at 1:48 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> That error makes no sense - line 335 is just a variable declaration. Sure
>>> you are not picking up a different version on that node?
>>>
>>>
>>> On Aug 9, 2011, at 11:37 AM, CB wrote:
>>>
>>> > Hi,
>>> >
>>> > Currently I'm having trouble scaling an MPI job beyond a certain
>>> limit.
>>> > I'm running an MPI hello example to test beyond 1024 processes, but
>>> it failed with the following error at 2048 processes.
>>> > It worked fine with 1024 processes. I have a large enough file
>>> descriptor limit (65536) defined for each process.
>>> >
>>> > I would appreciate any suggestions.
>>> > I'm running Open MPI 1.4.3.
>>> >
>>> > [x-01-06-a:25989] [[37568,0],69] ORTE_ERROR_LOG: Data unpack had
>>> inadequate space in file base/odls_base_default_fns.c at line 335
>>> > [x-01-06-b:09532] [[37568,0],74] ORTE_ERROR_LOG: Data unpack had
>>> inadequate space in file base/odls_base_default_fns.c at line 335
>>> >
>>> --------------------------------------------------------------------------
>>> > mpirun noticed that the job aborted, but has no info as to the process
>>> > that caused that situation.
>>> >
>>> --------------------------------------------------------------------------
>>> > [x-03-20-b:23316] *** Process received signal ***
>>> > [x-03-20-b:23316] Signal: Segmentation fault (11)
>>> > [x-03-20-b:23316] Signal code: Address not mapped (1)
>>> > [x-03-20-b:23316] Failing at address: 0x6c
>>> > [x-03-20-b:23316] [ 0] /lib64/libpthread.so.0 [0x310860ee90]
>>> > [x-03-20-b:23316] [ 1]
>>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x230)
>>> [0x7f0dbe0c5010]
>>> > [x-03-20-b:23316] [ 2]
>>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
>>> > [x-03-20-b:23316] [ 3] mpirun [0x403bbe]
>>> > [x-03-20-b:23316] [ 4]
>>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
>>> > [x-03-20-b:23316] [ 5]
>>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99)
>>> [0x7f0dbde50e49]
>>> > [x-03-20-b:23316] [ 6]
>>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_trigger_event+0x42)
>>> [0x7f0dbe0a7ca2]
>>> > [x-03-20-b:23316] [ 7]
>>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x22d)
>>> [0x7f0dbe0c500d]
>>> > [x-03-20-b:23316] [ 8]
>>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
>>> > [x-03-20-b:23316] [ 9]
>>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99)
>>> [0x7f0dbde50e49]
>>> > [x-03-20-b:23316] [10]
>>> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x23d)
>>> [0x7f0dbe0c5ddd]
>>> > [x-03-20-b:23316] [11]
>>> /usr/local/MPI/openmpi-1.4.3/lib/openmpi/mca_plm_rsh.so [0x7f0dbd41d679]
>>> > [x-03-20-b:23316] [12] mpirun [0x40373f]
>>> > [x-03-20-b:23316] [13] mpirun [0x402a1c]
>>> > [x-03-20-b:23316] [14] /lib64/libc.so.6(__libc_start_main+0xfd)
>>> [0x3107e1ea2d]
>>> > [x-03-20-b:23316] [15] mpirun [0x402939]
>>> > [x-03-20-b:23316] *** End of error message ***
>>> > [x-01-06-a:25989] [[37568,0],69]-[[37568,0],0] mca_oob_tcp_msg_recv:
>>> readv failed: Connection reset by peer (104)
>>> > [x-01-06-b:09532] [[37568,0],74]-[[37568,0],0] mca_oob_tcp_msg_recv:
>>> readv failed: Connection reset by peer (104)
>>> > ./sge_jsb.sh: line 9: 23316 Segmentation fault (core dumped)
>>> mpirun -np $NSLOTS ./hello_openmpi.exe
>>> >
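The poster mentions a 65536 file descriptor limit. As a sanity check, here is a rough way to compare mpirun's descriptor budget against that limit for a launch like this one. The per-daemon cost of roughly 4 descriptors (one OOB TCP socket per remote orted plus pipes for the qrsh child) is my estimate, not a documented figure.

```shell
# Rough fd budget for mpirun (the HNP) when launching via qrsh.
# per_daemon=4 is an estimate: ~1 OOB TCP socket per remote orted
# plus ~3 pipes per qrsh child. slack covers stdio, logs, etc.
nodes=129
per_daemon=4
slack=32
need=$(( nodes * per_daemon + slack ))
limit=$(ulimit -n)
echo "estimated need: $need, soft limit: $limit"
if [ "$need" -le "$limit" ]; then
  echo "fd limit looks sufficient"
else
  echo "raise the fd limit before launching"
fi
```

With 129 nodes this estimate stays far below 65536, which suggests the descriptor limit is not what is failing here.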
>>> >
>>> > _______________________________________________
>>> > users mailing list
>>> > users_at_[hidden]
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users