
Open MPI User's Mailing List Archives


From: Barry Evans (bevans_at_[hidden])
Date: 2007-01-18 18:42:00


Hi,

We tried running with 32 and 16 processes and had some success, but after a
reboot of the cluster it seems that any DLPOLY run attempted falls over,
either interactively or through SGE. Standard benchmarks such as IMB and HPL
execute to completion.

Here's the full output of a typical error:

Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x5107c0
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x5107c0
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x5107c0
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x5107c0
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x5107c0
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x5107c0
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x5107c0
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x5107c0
[0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
*** End of error message ***
[0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
*** End of error message ***
[0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
*** End of error message ***
[0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
*** End of error message ***
17 additional processes aborted (not shown)

Cheers,
Barry
-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Pak Lui
Sent: 17 January 2007 19:16
To: Open MPI Users
Subject: Re: [OMPI users] Problems with ompi1.2b2, SGE and
DLPOLY[Scanned]

Sorry for jumping in late.

I was able to use ~128 SGE slots for my test run, with either of the
SGE allocation rules ($fill_up or $round_robin) and -np 64 on my test
MPI program, but I wasn't able to reproduce your error on Solaris.
Like Brian said, having the stack trace could help. Also, I wonder if
you can try with a non-MPI program, a smaller number of slots, or a
smaller -np to see whether you still hit the issue?

Brian W. Barrett wrote:
> On Jan 15, 2007, at 10:13 AM, Marcelo Maia Garcia wrote:
>
>> I am trying to set up SGE to run DLPOLY compiled with mpif90
>> (Open MPI 1.2b2, Pathscale Fortran compilers and gcc C/C++). In
>> general I am much luckier running DLPOLY interactively than
>> using SGE. The error that I got is: Signal:7 info.si_errno:0
>> (Success) si_code:2()[1]. A previous message in the list[2] says
>> that this is more likely to be a configuration problem. But what
>> kind of configuration? Is it at runtime?
>
> Could you include the entire stack trace next time? That can help
> localize where the error is occurring. The message is saying that a
> process died from a signal 7, which on Linux is a Bus Error. This
> usually points to memory errors, either in Open MPI or in the user
> application. Without seeing the stack trace, it's difficult to pin
> down where the error is occurring.
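[As an aside on the signal numbering Brian mentions: signal numbers are platform-specific, so a quick sanity check on the affected Linux nodes can confirm that signal 7 really is SIGBUS there. A minimal check, assuming Python 3.5+ is available:]

```python
import signal

# Signal numbers vary by platform; on Linux x86-64, 7 maps to SIGBUS
# (bus error), which matches Brian's reading of the crash.
print(signal.Signals(7).name)
```

On a Linux cluster node this prints SIGBUS; on Solaris or other platforms the number-to-name mapping can differ.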
>
>> Another error that I got sometimes is related with "writev"[3]
>> But this is pretty rare.
>
> Usually these point to some process in the job dying and the other
> processes having issues completing outstanding sends to the dead
> process. I would guess that the problem originates with the bus
> error you are seeing. Cleaning that up will likely make these errors
> go away.
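[For context on the writev errors quoted below: errno 104 on Linux is ECONNRESET ("Connection reset by peer"), which is consistent with Brian's reading that peers died mid-transfer. A minimal lookup, assuming a Linux host:]

```python
import errno
import os

# errno 104 on Linux is ECONNRESET: the remote end of the TCP
# connection went away, which is what the surviving MPI processes
# see when a peer process dies during an outstanding send.
print(errno.errorcode[104], "->", os.strerror(104))
```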
>
> Brian
>
>
>
>> [1]
>> [ocf_at_master TEST2]$ mpirun -np 16 --hostfile /home/ocf/SRIFBENCH/
>> DLPOLY3/data/nodes_16_slots4.txt /home/ocf/SRIFBENCH/DLPOLY3/
>> execute/DLPOLY.Y
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107b0
>> (...)
>>
>> [2] http://www.open-mpi.org/community/lists/users/2007/01/2423.php
>>
>>
>> [3]
>> [node007:05003] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node007:05004] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node007:05005] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node007:05006] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>> mpirun noticed that job rank 0 with PID 0 on node node003 exited on
>> signal 48.
>> 15 additional processes aborted (not shown)
>>

-- 
Thanks,
- Pak Lui
pak.lui_at_[hidden]
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users