
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Problem running an mpi application on nodes with more than one interface
From: Jingcha Joba (pukkimonkey_at_[hidden])
Date: 2012-02-17 11:59:46


Yes, I did.
Because it was the same NIC with two ports, each capable of delivering 5 Gb/s,
I never thought they would need to be on different subnets.
But once I changed the subnet for one of the ports on both nodes, it seemed
to work.
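
Concretely, the change amounts to giving the second port on each node an
address on a subnet of its own; something like this (illustrative addresses
and /24 masks assumed, not necessarily the exact ones I used):

    # on denver
    ip addr del 10.3.1.2/24 dev eth24
    ip addr add 10.3.2.2/24 dev eth24
    # on chicago
    ip addr del 10.3.1.4/24 dev eth30
    ip addr add 10.3.2.4/24 dev eth30

(On RHEL 6 the same change would be made persistent in the corresponding
/etc/sysconfig/network-scripts/ifcfg-eth* files.)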

Also, I am looking for a good way to start understanding the
implementation-level details of Open MPI. Can you point me to some good
sources?
(PS: To start with, I have already read the FAQ section.)

thanks a lot for the help

--
Joba
On Fri, Feb 17, 2012 at 8:30 AM, Richard Bardwell <richard_at_[hidden]> wrote:
> Yes, they were on the same subnet. I guess that is the problem.
>
> ----- Original Message -----
> From: "Jeff Squyres" <jsquyres_at_[hidden]>
> To: "Open MPI Users" <users_at_[hidden]>
> Sent: Friday, February 17, 2012 4:20 PM
> Subject: Re: [OMPI users] Problem running an mpi application on nodes
> with more than one interface
>
>
>
>> Did you have both of the ethernet ports on the same subnet, or were they
>> on different subnets?
>>
>>
>> On Feb 17, 2012, at 5:36 AM, Richard Bardwell wrote:
>>
>>> I had exactly the same problem.
>>> Trying to run MPI between two separate machines, with each machine having
>>> two ethernet ports, causes really weird behaviour on the most basic code.
>>> I had to disable one of the ethernet ports on each of the machines,
>>> and it worked just fine after that. No idea why, though!
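>>>
>>> (A note in passing: instead of disabling a port at the OS level, Open MPI
>>> can usually be restricted to particular interfaces from the command line,
>>> e.g. something along the lines of
>>>     mpirun --mca btl_tcp_if_include eth0 ...
>>> where eth0 is only a placeholder for whichever interface should carry the
>>> MPI traffic; the btl_tcp_if_exclude form used in the commands further down
>>> works the same way in reverse.)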
>>>
>>> ----- Original Message -----
>>> From: Jingcha Joba
>>> To: users_at_[hidden]
>>> Sent: Thursday, February 16, 2012 8:43 PM
>>> Subject: [OMPI users] Problem running an mpi application on nodes with
>>> more than one interface
>>>
>>> Hello Everyone,
>>> This is my first post on the Open MPI users list.
>>> I am trying to run a simple program which does a Sendrecv between two
>>> nodes, each of which has two interface cards.
>>> Both nodes are running RHEL 6 with Open MPI 1.4.4 on an 8-core Xeon
>>> processor.
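>>>
>>> (For reference, the test is essentially the textbook paired Sendrecv
>>> exchange; a minimal sketch of that shape, not the actual program:)
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int rank, size, peer, sendval, recvval;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     peer = (rank % 2 == 0) ? rank + 1 : rank - 1;  /* pair even/odd ranks */
>>>     sendval = rank;
>>>     if (peer >= 0 && peer < size) {
>>>         /* each rank sends its own rank and receives the peer's in one call */
>>>         MPI_Sendrecv(&sendval, 1, MPI_INT, peer, 0,
>>>                      &recvval, 1, MPI_INT, peer, 0,
>>>                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>         printf("rank %d received %d from rank %d\n", rank, recvval, peer);
>>>     }
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>> (Built with mpicc and launched with the mpirun command shown below.)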
>>> What I noticed was that when two or more interfaces are in use on both
>>> nodes, MPI "hangs" while attempting to connect.
>>> These details might help:
>>> Node 1 - Denver has a single-port "A" card (eth21 - 25.192.xx.xx - which
>>> I use to ssh to that machine) and a dual-port "B" card (eth23 - 10.3.1.1
>>> & eth24 - 10.3.1.2).
>>> Node 2 - Chicago has the same single-port "A" card (eth19 - 25.192.xx.xx
>>> - again used for ssh) and a dual-port "B" card (eth29 - 10.3.1.3 & eth30 -
>>> 10.3.1.4).
>>> My /etc/hosts looks like:
>>> 25.192.xx.xx denver.xxx.com denver
>>> 10.3.1.1 denver.xxx.com denver
>>> 10.3.1.2 denver.xxx.com denver
>>> 25.192.xx.xx chicago.xxx.com chicago
>>> 10.3.1.3 chicago.xxx.com chicago
>>> 10.3.1.4 chicago.xxx.com chicago
>>> ...
>>> ...
>>> ...
>>> This is how I run it:
>>> mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude
>>> eth21,eth19,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
>>> I get a bunch of output from both chicago and denver saying that it has
>>> found components such as tcp, sm, and self, and then it hangs at
>>> [denver.xxx.com:21682] btl: tcp: attempting to connect() to address
>>> 10.3.1.3 on port 4
>>> [denver.xxx.com:21682] btl: tcp: attempting to connect() to address
>>> 10.3.1.4 on port 4
>>> However, if I run the same program excluding eth29 or eth30, it works
>>> fine. Something like this:
>>> mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude
>>> eth21,eth19,eth29,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
>>> My hostfile looks like this:
>>> [sshuser_at_denver Sendrecv]$ cat host1
>>> denver slots=2
>>> chicago slots=2
>>> I am not sure whether I need to provide anything else; if so, please
>>> feel free to ask.
>>> Thanks,
>>> --
>>> Joba
>>>
>>>
>>>
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>