
Open MPI User's Mailing List Archives


From: Pak Lui (Pak.Lui_at_[hidden])
Date: 2007-01-17 14:16:00


Sorry for jumping in late.

I was able to use ~128 SGE slots for my test run, with either of the
SGE allocation rules ($fill_up or $round_robin) and -np 64 on my test
MPI program, but I wasn't able to reproduce your error on Solaris.
Like Brian said, having the stack trace would help (there is a sketch
of one way to capture it after the quoted thread below). Also, I
wonder if you could try a non-MPI program, a smaller number of slots,
or a smaller -np to see whether you still hit the issue?
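
As a sanity check, a minimal MPI test program along the lines of the
sketch below (illustrative only, not taken from your setup) can be
launched through the same SGE parallel environment before involving
DLPOLY itself:

    #include <mpi.h>
    #include <stdio.h>

    /* Each rank reports its host name; enough to confirm that the
     * SGE slots and the Open MPI launch are wired up correctly. */
    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);
        printf("rank %d of %d on %s\n", rank, size, host);
        MPI_Finalize();
        return 0;
    }

Compile it with mpicc and run it with something like "mpirun -np 4
a.out" inside the same SGE job script you use for DLPOLY; if that
already fails, the problem is in the SGE/Open MPI setup rather than
in the application.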

Brian W. Barrett wrote:
> On Jan 15, 2007, at 10:13 AM, Marcelo Maia Garcia wrote:
>
>> I am trying to set up SGE to run DLPOLY compiled with mpif90
>> (Open MPI 1.2b2, PathScale Fortran compilers and gcc c/c++). In
>> general I am much luckier running DLPOLY interactively than
>> through SGE. The error that I get is: Signal:7 info.si_errno:0
>> (Success) si_code:2()[1]. A previous message on the list[2] says
>> that this is most likely a configuration problem. But what kind
>> of configuration? Is it at run time?
>
> Could you include the entire stack trace next time? That can help
> localize where the error is occurring. The message is saying that a
> process died from a signal 7, which on Linux is a Bus Error. This
> usually points to memory errors, either in Open MPI or in the user
> application. Without seeing the stack trace, it's difficult to pin
> down where the error is occurring.
>
>> Another error that I sometimes get is related to "writev"[3],
>> but this is pretty rare.
>
> Usually these point to some process in the job dying and the other
> processes having issues completing outstanding sends to the dead
> process. I would guess that the problem originates with the bus
> error you are seeing. Cleaning that up will likely make these errors
> go away.
>
> Brian
>
>
>
>> [1]
>> [ocf_at_master TEST2]$ mpirun -np 16 --hostfile /home/ocf/SRIFBENCH/
>> DLPOLY3/data/nodes_16_slots4.txt /home/ocf/SRIFBENCH/DLPOLY3/
>> execute/DLPOLY.Y
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107b0
>> (...)
>>
>> [2] http://www.open-mpi.org/community/lists/users/2007/01/2423.php
>>
>>
>> [3]
>> [node007:05003] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node007:05004] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node007:05005] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node007:05006] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>> mpirun noticed that job rank 0 with PID 0 on node node003 exited on
>> signal 48.
>> 15 additional processes aborted (not shown)
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
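
(One more note on capturing the stack trace Brian asked for: if the
job dies before you can attach a debugger and no core file is left
behind, one option is to install a SIGBUS handler that prints a
backtrace. The sketch below only illustrates that technique and is
not taken from this thread; it relies on glibc's backtrace() and
backtrace_symbols_fd() and would have to be added to your own code
or to a small reproducer.)

    #include <execinfo.h>
    #include <signal.h>
    #include <unistd.h>

    /* Print the call stack when a bus error (signal 7 on Linux)
     * arrives, then exit.  backtrace_symbols_fd() writes straight
     * to a file descriptor, so it is reasonably safe to call from
     * a signal handler. */
    static void bus_handler(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);
        (void) sig;
        backtrace_symbols_fd(frames, n, STDERR_FILENO);
        _exit(1);
    }

    /* Somewhere early in main(), before the crash can happen:
     *     signal(SIGBUS, bus_handler);
     */

Linking with -rdynamic (and compiling with -g) helps
backtrace_symbols_fd() print readable symbol names.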

-- 
Thanks,
- Pak Lui
pak.lui_at_[hidden]