Open MPI User's Mailing List Archives

From: Pak Lui (Pak.Lui_at_[hidden])
Date: 2007-01-19 10:24:14


It seems from what you said that the DLPOLY program fails whether or not
SGE is being used. Since I am not familiar with DLPOLY, I am a little
clueless as to what else you can try. Perhaps you could look deeper into
DLPOLY by making a debuggable build and running a parallel debugger on
the program to see if you can pinpoint where it actually fails?
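
For what it's worth, a rough sketch of what such a debug run could look
like (the binary path is taken from the earlier mail; the -g/-O0 flags
and the xterm+gdb launch are only suggestions, not a tested recipe):

   # rebuild DL_POLY with debugging symbols and optimization off, e.g.
   #   FFLAGS = -g -O0      (adjust in whatever Makefile target you use)
   #
   # then launch a small run with each rank under gdb in its own xterm
   # (needs a working X display on the host you launch from):
   mpirun -np 4 xterm -e gdb /home/ocf/SRIFBENCH/DLPOLY3/execute/DLPOLY.Y

Once a rank hits the bus error, a "bt" in the corresponding gdb window
should show whether the faulting address (0x5107c0 in your output) is
being touched inside DL_POLY itself or inside the Open MPI libraries.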

Barry Evans wrote:
> Hi,
>
> We tried running with 32 and 16 and had some success, but after a reboot of
> the cluster it seems that any DLPOLY run attempted falls over, either
> interactively or through SGE. Standard benchmarks such as IMB and HPL
> execute to completion.
>
> Here's the full output of a typical error:
>
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> 17 additional processes aborted (not shown)
>
> Cheers,
> Barry
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Pak Lui
> Sent: 17 January 2007 19:16
> To: Open MPI Users
> Subject: Re: [OMPI users] Problems with ompi1.2b2, SGE and
> DLPOLY[Scanned]
>
> Sorry for jumping in late.
>
> I was able to use ~128 SGE slots for my test run, with either of the
> SGE allocation rules ($fill_up or $round_robin) and -np 64 on my test
> MPI program, but I wasn't able to reproduce your error on Solaris.
> Like Brian said, having the stack trace could help. Also, I wonder if
> you can try with a non-MPI program, a smaller number of slots, or a
> smaller -np to see if you still see the issue?
>
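
For reference, a minimal sanity check along those lines might look like
the following (the hostfile name and the -np value are placeholders, not
taken from your setup):

   # confirm that SGE + Open MPI can launch a trivial non-MPI command
   mpirun -np 4 --hostfile nodes.txt hostname

   # then retry DL_POLY itself with a smaller process count
   mpirun -np 4 --hostfile nodes.txt /home/ocf/SRIFBENCH/DLPOLY3/execute/DLPOLY.Y
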
> Brian W. Barrett wrote:
>> On Jan 15, 2007, at 10:13 AM, Marcelo Maia Garcia wrote:
>>
>>> I am trying to set up SGE to run DLPOLY compiled with mpif90
>>> (OpenMPI 1.2b2, PathScale Fortran compilers and gcc c/c++). In
>>> general I am much luckier running DLPOLY interactively than
>>> using SGE. The error that I get is: Signal:7 info.si_errno:0
>>> (Success) si_code:2()[1]. A previous message in the list[2] says
>>> that this is more likely to be a configuration problem. But what
>>> kind of configuration? Is it in the runtime?
>> Could you include the entire stack trace next time? That can help
>> localize where the error is occurring. The message is saying that a
>> process died from a signal 7, which on Linux is a Bus Error. This
>> usually points to memory errors, either in Open MPI or in the user
>> application. Without seeing the stack trace, it's difficult to pin
>> down where the error is occurring.
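
One way to capture such a trace, assuming core dumps are permitted on
the compute nodes, is roughly:

   # allow core files before launching, rerun the failing job, then
   # inspect the resulting core file on the node that crashed
   ulimit -c unlimited
   gdb /home/ocf/SRIFBENCH/DLPOLY3/execute/DLPOLY.Y core
   (gdb) bt
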
>>
>> Another error that I sometimes get is related to "writev"[3],
>> but this is pretty rare.
>> Usually these point to some process in the job dying and the other
>> processes having issues completing outstanding sends to the dead
>> process. I would guess that the problem originates with the bus
>> error you are seeing. Cleaning that up will likely make these errors
>> go away.
>>
>> Brian
>>
>>
>>
>>> [1]
>>> [ocf_at_master TEST2]$ mpirun -np 16 \
>>>     --hostfile /home/ocf/SRIFBENCH/DLPOLY3/data/nodes_16_slots4.txt \
>>>     /home/ocf/SRIFBENCH/DLPOLY3/execute/DLPOLY.Y
>>> Signal:7 info.si_errno:0(Success) si_code:2()
>>> Failing at addr:0x5107b0
>>> (...)
>>>
>>> [2] http://www.open-mpi.org/community/lists/users/2007/01/2423.php
>>>
>>>
>>> [3]
>>> [node007:05003] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node007:05004] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node007:05005] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node007:05006] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>>> mpirun noticed that job rank 0 with PID 0 on node node003 exited on
>>> signal 48.
>>> 15 additional processes aborted (not shown)
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

-- 
Thanks,
- Pak Lui
pak.lui_at_[hidden]