Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Problem running an mpi application on nodes with more than one interface
From: Richard Bardwell (richard_at_[hidden])
Date: 2012-02-17 11:30:57


Yes, they were on the same subnet. I guess that is the problem.
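
For the archives: I believe the usual recommendation is to give each port
its own IP subnet, since the TCP BTL pairs up interfaces between peers by
subnet. A hypothetical renumbering of the addresses quoted below would be:

    denver:  eth23 10.3.1.1/24, eth24 10.3.2.1/24
    chicago: eth29 10.3.1.3/24, eth30 10.3.2.3/24

With a layout like that, both ports should be able to stay enabled instead
of one being disabled outright.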

----- Original Message -----
From: "Jeff Squyres" <jsquyres_at_[hidden]>
To: "Open MPI Users" <users_at_[hidden]>
Sent: Friday, February 17, 2012 4:20 PM
Subject: Re: [OMPI users] Problem running an mpi application on nodes with more than one interface

> Did you have both of the ethernet ports on the same subnet, or were they on different subnets?
>
>
> On Feb 17, 2012, at 5:36 AM, Richard Bardwell wrote:
>
>> I had exactly the same problem.
>> Trying to run MPI between two separate machines, each with two
>> ethernet ports, causes really weird behaviour even on the most basic code.
>> I had to disable one of the ethernet ports on each machine,
>> and it worked just fine after that. No idea why, though!
>>
>> ----- Original Message -----
>> From: Jingcha Joba
>> To: users_at_[hidden]
>> Sent: Thursday, February 16, 2012 8:43 PM
>> Subject: [OMPI users] Problem running an mpi application on nodes with more than one interface
>>
>> Hello Everyone,
>> This is my first post on the Open MPI users list.
>> I am trying to run a simple program that does a Sendrecv between two nodes, each of which has two interface cards.
>> Both nodes are running RHEL 6 with Open MPI 1.4.4 on an 8-core Xeon processor.
>> What I noticed is that when two or more interfaces are active on both nodes, the MPI job hangs while attempting to connect.
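>> The test itself is just a pairwise exchange; a minimal sketch of the
>> kind of code involved (not my exact source - the partner choice is just
>> to force the traffic across the two nodes) is:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int rank, size, partner, sendbuf, recvbuf;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>     /* pair rank r with r + size/2 (mod size): with slots=2 per node
>>        and -np 4, every exchange has to cross the node boundary */
>>     partner = (rank + size / 2) % size;
>>     sendbuf = rank;
>>
>>     MPI_Sendrecv(&sendbuf, 1, MPI_INT, partner, 0,
>>                  &recvbuf, 1, MPI_INT, partner, 0,
>>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>
>>     printf("rank %d received %d from rank %d\n", rank, recvbuf, partner);
>>     MPI_Finalize();
>>     return 0;
>> }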
>> These details might help:
>> Node 1 - Denver has a single-port "A" card (eth21 - 25.192.xx.xx - which I use to ssh to that machine) and a dual-port "B"
>> card (eth23 - 10.3.1.1 & eth24 - 10.3.1.2).
>> Node 2 - Chicago has the same single-port "A" card (eth19 - 25.192.xx.xx - again used for ssh) and a dual-port "B" card (eth29 -
>> 10.3.1.3 & eth30 - 10.3.1.4).
>> My /etc/hosts looks like:
>> 25.192.xx.xx denver.xxx.com denver
>> 10.3.1.1 denver.xxx.com denver
>> 10.3.1.2 denver.xxx.com denver
>> 25.192.xx.xx chicago.xxx.com chicago
>> 10.3.1.3 chicago.xxx.com chicago
>> 10.3.1.4 chicago.xxx.com chicago
>> ...
>> ...
>> ...
>> This is how I run it:
>> mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude eth21,eth19,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
>> I get a bunch of output from both chicago and denver saying it has
>> found components like tcp, sm, and self, and then it hangs at:
>> [denver.xxx.com:21682] btl: tcp: attempting to connect() to address 10.3.1.3 on port 4
>> [denver.xxx.com:21682] btl: tcp: attempting to connect() to address 10.3.1.4 on port 4
>> However, if I run the same program with eth29 or eth30 also excluded, it works fine. Something like this:
>> mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude eth21,eth19,eth29,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
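>> (I assume the include form would be the equivalent way to pin a single
>> data port per node - something like the line below - but I have not
>> tried it, and I understand if_include and if_exclude cannot be combined.)
>> mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth23,eth29 --mca btl_base_verbose 30 -np 4 ./Sendrecv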
>> My hostfile looks like this
>> [sshuser_at_denver Sendrecv]$ cat host1
>> denver slots=2
>> chicago slots=2
>> I am not sure if I have to provide anything else. If I do, please feel free to ask.
>> thanks,
>> --
>> Joba
>>
>>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users