When you say "stuck", what actually happens?

On Aug 10, 2011, at 2:09 PM, CB wrote:

I am now able to run the MPI hello world example with up to 3096 processes across 129 nodes (24 cores per node).
However, it seems to get stuck with 3097 processes.

Any suggestions for troubleshooting?

Thanks,
- Chansup
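
For reference, the figures quoted above work out to exactly 3096 cores, so 3097 is the first process count that exceeds one process per core. A minimal Python sketch of that arithmetic follows. The names `total_slots` and `exceeds_slots` are illustrative only, and whether running past one process per core has anything to do with the hang is an assumption, not something established in this thread.

```python
# Figures taken from the message above: 129 nodes, 24 cores per node.
nodes = 129
cores_per_node = 24
total_slots = nodes * cores_per_node


def exceeds_slots(nprocs, slots=total_slots):
    """Return True if nprocs is larger than the number of available cores."""
    return nprocs > slots


print("total cores:", total_slots)
print("3096 exceeds cores:", exceeds_slots(3096))
print("3097 exceeds cores:", exceeds_slots(3097))
```

If the boundary does matter, it may be worth checking how many slots the scheduler actually granted to the job.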


On Tue, Aug 9, 2011 at 2:02 PM, CB <cbalways@gmail.com> wrote:
Hi Ralph,

Yes, you are right. Those nodes were still pointing to an old version.
I'll check the installation on all nodes and try to run it again.

Thanks,
- Chansup
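
A rough sketch of how such an installation check could be summarized, in Python. It assumes you have already collected one version string per node by some means of your own. The collection step is deliberately left out, and the node names and version strings in the example are hypothetical sample data, not output from this cluster.

```python
from collections import defaultdict


def group_by_version(reported):
    """Group node names by the version string each node reported."""
    groups = defaultdict(list)
    for node, version in reported.items():
        groups[version.strip()].append(node)
    return {version: sorted(names) for version, names in groups.items()}


def mismatched_nodes(reported, expected):
    """Return the sorted list of nodes whose version differs from expected."""
    return sorted(n for n, v in reported.items() if v.strip() != expected)


# Hypothetical sample data for illustration only.
sample = {
    "node-a": "1.4.3",
    "node-b": "1.4.3\n",
    "node-c": "1.4.1",
}

print(group_by_version(sample))
print(mismatched_nodes(sample, "1.4.3"))
```

Any node listed by `mismatched_nodes` would be a candidate for the stale installation described above.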


On Tue, Aug 9, 2011 at 1:48 PM, Ralph Castain <rhc@open-mpi.org> wrote:
That error makes no sense - line 335 is just a variable declaration. Are you sure you are not picking up a different version on that node?


On Aug 9, 2011, at 11:37 AM, CB wrote:

> Hi,
>
> Currently I'm having trouble scaling an MPI job beyond a certain limit.
> So I'm running an MPI hello example to test beyond 1024 processes, but it failed with the following error at 2048 processes.
> It worked fine with 1024 processes.  Each process has a file descriptor limit of 65536, which should be enough.
>
> I would appreciate any suggestions.
> I'm running Open MPI 1.4.3.
>
> [x-01-06-a:25989] [[37568,0],69] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/odls_base_default_fns.c at line 335
> [x-01-06-b:09532] [[37568,0],74] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/odls_base_default_fns.c at line 335
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> [x-03-20-b:23316] *** Process received signal ***
> [x-03-20-b:23316] Signal: Segmentation fault (11)
> [x-03-20-b:23316] Signal code: Address not mapped (1)
> [x-03-20-b:23316] Failing at address: 0x6c
> [x-03-20-b:23316] [ 0] /lib64/libpthread.so.0 [0x310860ee90]
> [x-03-20-b:23316] [ 1] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x230) [0x7f0dbe0c5010]
> [x-03-20-b:23316] [ 2] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> [x-03-20-b:23316] [ 3] mpirun [0x403bbe]
> [x-03-20-b:23316] [ 4] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> [x-03-20-b:23316] [ 5] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f0dbde50e49]
> [x-03-20-b:23316] [ 6] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_trigger_event+0x42) [0x7f0dbe0a7ca2]
> [x-03-20-b:23316] [ 7] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x22d) [0x7f0dbe0c500d]
> [x-03-20-b:23316] [ 8] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> [x-03-20-b:23316] [ 9] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f0dbde50e49]
> [x-03-20-b:23316] [10] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x23d) [0x7f0dbe0c5ddd]
> [x-03-20-b:23316] [11] /usr/local/MPI/openmpi-1.4.3/lib/openmpi/mca_plm_rsh.so [0x7f0dbd41d679]
> [x-03-20-b:23316] [12] mpirun [0x40373f]
> [x-03-20-b:23316] [13] mpirun [0x402a1c]
> [x-03-20-b:23316] [14] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3107e1ea2d]
> [x-03-20-b:23316] [15] mpirun [0x402939]
> [x-03-20-b:23316] *** End of error message ***
> [x-01-06-a:25989] [[37568,0],69]-[[37568,0],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [x-01-06-b:09532] [[37568,0],74]-[[37568,0],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> ./sge_jsb.sh: line 9: 23316 Segmentation fault      (core dumped) mpirun -np $NSLOTS ./hello_openmpi.exe
>
>
> _______________________________________________
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
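
On the file descriptor limit mentioned in the original report: the limit seen in a login shell can differ from the one in effect inside a batch job, so it may be worth confirming the value from within the job environment. A minimal sketch using Python's standard `resource` module (Unix only) follows. The helper name `soft_limit_at_least` is hypothetical, and the threshold of 65536 is simply the figure quoted in the report.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def soft_limit_at_least(required):
    """Return True if the soft open-file limit is unlimited or >= required."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= required


soft, hard = nofile_limits()
print("soft limit:", soft, "hard limit:", hard)
print("soft limit >= 65536:", soft_limit_at_least(65536))
```

Running this from inside the job script, rather than interactively, shows the limit the launched processes would inherit.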

