Open MPI User's Mailing List Archives

From: David Burns (3db14_at_[hidden])
Date: 2007-03-19 14:01:09


OK,

Trying

mpirun -v -np 2 --debug-daemons --host talisker4 hostname

yields the error
[talisker4.phy.queensu.ca:00682] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 130.15.29.85:33821 failed, connecting over all interfaces failed!
[talisker2.phy.queensu.ca:28538] ERROR: A daemon on node talisker4 failed to start as expected.
[talisker2.phy.queensu.ca:28538] ERROR: There may be more information available from
[talisker2.phy.queensu.ca:28538] ERROR: the remote shell (see above).
[talisker2.phy.queensu.ca:28538] ERROR: The daemon exited unexpectedly with status 255.
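
For reference, the address and port the daemon failed to reach can be pulled out of output like the above with a few lines of Python. This is only a sketch: the function name `failed_endpoints` is made up for illustration, and the pattern is based solely on the one message shown here, so other Open MPI versions may word it differently.

```python
import re

# Pattern derived from the single error message quoted above. This is an
# assumption for illustration, not a documented Open MPI log format.
_FAILED = re.compile(r"connect to (\d{1,3}(?:\.\d{1,3}){3}):(\d+) failed")


def failed_endpoints(log_text):
    """Return a list of (ip, port) pairs for each failed connect in log_text."""
    return [(ip, int(port)) for ip, port in _FAILED.findall(log_text)]
```

Fed the output above, it reports the address 130.15.29.85 and port 33821, which is the endpoint the daemon on talisker4 could not reach.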

So apparently the error results from talisker4 (remote) being unable to
open a connection to talisker2 (local) in this case. Trying the reverse,

mpirun -v -np 2 --debug-daemons --host talisker2 hostname

executed from talisker4, yields the same error message reversed (i.e. 2
can't connect to 4). This makes me think it's a firewall problem...
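
One way to check the firewall theory independently of Open MPI is a plain TCP test between the two machines. What follows is a minimal sketch, assuming Python 3 is installed on both hosts; `open_listener` and `can_connect` are hypothetical helper names, not part of Open MPI. The port in the error message (33821) appears to be picked at run time, so probing it after mpirun has exited would not tell much. The sketch therefore opens its own listener on an OS-chosen port instead.

```python
import socket


def open_listener(port=0, backlog=1):
    """Listen on all IPv4 interfaces and return (socket, port).

    port=0 asks the OS for a free port, roughly mimicking a dynamically
    chosen port (an assumption about how the failing port was selected).
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(backlog)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

On talisker2, call `open_listener()`, note the returned port, and keep that process alive; on talisker4, call `can_connect("talisker2", port)`. Then swap the roles. If ssh works but this returns False in both directions, that is consistent with a firewall that blocks everything except the ssh port.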

- Dave

Tim Prins wrote:
> David,
>
> Have you tried something like
>
> mpirun -np 1 --host talisker4 hostname
>
> If that hangs, try adding '--debug-daemons' to the command line and
> see if the output from that helps. If not, please send the output to
> the list.
>
> Thanks,
>
> Tim
>
> On Mar 19, 2007, at 1:59 AM, David Burns wrote:
>
>
>> I neglected to mention that the test is currently running on 100 Mbps
>> ethernet. I have also tested the setup using a simple "hello world my
>> rank is_" program and get the same hanging problem.
>>
>>
>> 3db14_at_[hidden] wrote:
>>
>>> If anyone could help me out with this I would greatly appreciate it. I
>>> have already read through the entire FAQ and haven't seen anyone with a
>>> similar problem.
>>>
>>> I have successfully tested and run the ompi application I've coded
>>> locally on both computers, talisker2 and talisker4:
>>>
>>> mpirun -np 1 --host localhost fdtd : -np 2 --host localhost rnode
>>>
>>> However, when attempting to execute processes remotely, e.g.
>>>
>>> mpirun -np 1 --host localhost fdtd : -np 2 --host talisker4 rnode
>>>
>>> nothing happens. The shell just sits there, nothing prints (despite
>>> stdouts), and it does not return until I kill it. I have set up ssh with
>>> RSA authentication, no passphrase. The paths are all set; I have tried
>>> purposefully mis-setting them, and the error is reported and mpirun
>>> returns as expected (so it isn't that).
>>>
>>> More info about the system: Fedora Core 5, Open MPI 1.1.4. config.log
>>> and ompi_info outputs attached. Any help or ideas of where to go next
>>> would be greatly appreciated.
>>>
>>> Thanks,
>>> David
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> ------------------------------------------------------------------------
>>>
>>
>