Open MPI User's Mailing List Archives

From: Galen Shipman (gshipman_at_[hidden])
Date: 2007-01-19 09:32:09


ah, disregard..

On Jan 19, 2007, at 1:35 AM, Barry Evans wrote:

> It's gigabit attached; "pathscale" is there simply to indicate that
> ompi was compiled with ekopath.
>
> - Barry
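
Since the nodes are gigabit attached, the TCP BTL is the transport in
play; it can be forced explicitly to rule out transport selection. A
minimal sketch, with the process count and executable name borrowed
from later in the thread:

    mpirun -np 16 -mca btl tcp,sm,self ./DLPOLY.Y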
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-
> mpi.org] On
> Behalf Of Galen Shipman
> Sent: 19 January 2007 01:56
> To: Open MPI Users
> Cc: Pak.Lui_at_[hidden]
> Subject: Re: [OMPI users] Problems with ompi1.2b2, SGE and
> DLPOLY[Scanned]
>
>
>
>
> Are you using
>
> -mca pml cm
>
> for pathscale or are you using openib?
>
> - Galen
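
For reference, those two choices map onto mpirun command lines roughly
as follows; the process count and executable name are borrowed from
later in the thread and are illustrative only:

    # MTL-based CM PML, e.g. for PathScale/InfiniPath hardware
    mpirun -np 16 -mca pml cm ./DLPOLY.Y

    # default OB1 PML over the openib (InfiniBand verbs) BTL
    mpirun -np 16 -mca pml ob1 -mca btl openib,sm,self ./DLPOLY.Y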
>
>
> On Jan 18, 2007, at 4:42 PM, Barry Evans wrote:
>
>> Hi,
>>
>> We tried running with 32 and 16 processes and had some success, but
>> after a reboot of the cluster it seems that any DLPOLY run attempted
>> falls over, either interactively or through SGE. Standard benchmarks
>> such as IMB and HPL execute to completion.
>>
>> Here's the full output of a typical error:
>>
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107c0
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107c0
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107c0
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107c0
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107c0
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107c0
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107c0
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x5107c0
>> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
>> *** End of error message ***
>> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
>> *** End of error message ***
>> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
>> *** End of error message ***
>> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
>> *** End of error message ***
>> 17 additional processes aborted (not shown)
>>
>> Cheers,
>> Barry
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-
>> mpi.org] On
>> Behalf Of Pak Lui
>> Sent: 17 January 2007 19:16
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Problems with ompi1.2b2, SGE and
>> DLPOLY[Scanned]
>>
>> Sorry for jumping in late.
>>
>> I was able to use ~128 SGE slots for my test run, with either of the
>> SGE allocation rules ($fill_up or $round_robin) and -np 64 on my test
>> MPI program, but I wasn't able to reproduce your error on Solaris.
>> Like Brian said, having the stack trace could help. Also, I wonder if
>> you can try a non-MPI program, a smaller number of slots, or a
>> smaller -np to see if you still see the issue?
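
The allocation rule lives in the SGE parallel environment definition
and can be inspected with qconf; a rough sketch, assuming a PE named
"ompi" (the PE name and slot count here are illustrative):

    $ qconf -sp ompi
    pe_name            ompi
    slots              128
    allocation_rule    $fill_up      # or $round_robin
    control_slaves     TRUE
    job_is_first_task  FALSE

    # submit against it, letting SGE fill in the slot count:
    $ qsub -pe ompi 64 run_dlpoly.sh   # script runs: mpirun -np $NSLOTS ...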
>>
>> Brian W. Barrett wrote:
>>> On Jan 15, 2007, at 10:13 AM, Marcelo Maia Garcia wrote:
>>>
>>>> I am trying to set up SGE to run DLPOLY compiled with mpif90
>>>> (Open MPI 1.2b2, PathScale Fortran compilers and gcc C/C++). In
>>>> general I have much more luck running DLPOLY interactively than
>>>> through SGE. The error that I get is: Signal:7 info.si_errno:0
>>>> (Success) si_code:2()[1]. A previous message in the list[2] says
>>>> that this is more likely to be a configuration problem. But what
>>>> kind of configuration? Is it in the runtime?
>>>
>>> Could you include the entire stack trace next time? That can help
>>> localize where the error is occurring. The message is saying that a
>>> process died from a signal 7, which on Linux is a Bus Error. This
>>> usually points to memory errors, either in Open MPI or in the user
>>> application. Without seeing the stack trace, it's difficult to pin
>>> down where the error is occurring.
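
One way to capture that stack trace on Linux is to allow core dumps in
the job environment and open the resulting core in gdb afterwards; a
rough sketch (the executable name is taken from the thread, and the
core file name depends on each node's core_pattern setting):

    ulimit -c unlimited          # in the shell or SGE job script, before mpirun
    mpirun -np 16 ./DLPOLY.Y
    gdb ./DLPOLY.Y core.<pid>    # then "bt" at the gdb prompt prints the backtrace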
>>>
>>>> Another error that I sometimes get is related to "writev"[3],
>>>> but this is pretty rare.
>>>
>>> Usually these point to some process in the job dying and the other
>>> processes having issues completing outstanding sends to the dead
>>> process. I would guess that the problem originates with the bus
>>> error you are seeing. Cleaning that up will likely make these
>>> errors go away.
>>>
>>> Brian
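
As an aside, errno=104 in the writev messages quoted below is
ECONNRESET ("Connection reset by peer") on Linux, which is consistent
with the peer process having already died. It can be decoded with, for
example:

    python -c "import errno, os; print errno.errorcode[104], os.strerror(104)"
    # prints: ECONNRESET Connection reset by peer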
>>>
>>>
>>>
>>>> [1]
>>>> [ocf_at_master TEST2]$ mpirun -np 16 --hostfile /home/ocf/SRIFBENCH/
>>>> DLPOLY3/data/nodes_16_slots4.txt /home/ocf/SRIFBENCH/DLPOLY3/
>>>> execute/DLPOLY.Y
>>>> Signal:7 info.si_errno:0(Success) si_code:2()
>>>> Failing at addr:0x5107b0
>>>> (...)
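
The hostfile itself isn't shown in the thread; for readers unfamiliar
with the format, an Open MPI hostfile of that shape would typically
list one node per line with a slot count, along the lines of (host
names purely illustrative):

    node001 slots=4
    node002 slots=4
    ...
    node016 slots=4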
>>>>
>>>> [2] http://www.open-mpi.org/community/lists/users/2007/01/2423.php
>>>>
>>>>
>>>> [3]
>>>> [node007:05003] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node007:05004] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node007:05005] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node007:05006] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>>>> mpirun noticed that job rank 0 with PID 0 on node node003 exited on
>>>> signal 48.
>>>> 15 additional processes aborted (not shown)
>>>>
>>
>> --
>>
>> Thanks,
>>
>> - Pak Lui
>> pak.lui_at_[hidden]
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users