Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] unknown af_family recieved errors...
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-01-28 07:42:56


(sorry for the delay in this reply; this mail came while I was at the MPI Forum meeting. Travel always makes my disastrous INBOX even worse...)

As a bit of explanation, I can surmise part of what is happening here.

When you run on only one machine, the TCP communications plugin (i.e., the "BTL") is not used -- only the shared memory (sm) BTL is used. Hence, you don't see the warnings. That being said, you could force the TCP BTL to be used instead of the sm BTL by using:

  mpirun --mca btl tcp,self -np 2 my_test_program

When you run across multiple nodes, the TCP BTL is used by default, which is why these warnings appear.

These warnings refer to IP interfaces that Open MPI found that it doesn't recognize. What is the output of ifconfig on your machine?
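
If it helps alongside ifconfig, here is a minimal C sketch (assuming Linux or another system that provides getifaddrs()) that prints each interface address together with its address family number. The helper names family_name and list_interfaces are hypothetical, made up for this illustration; this is not how Open MPI itself enumerates interfaces, and I can't confirm that the numbers in the warnings correspond to these constants.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <ifaddrs.h>

/* Hypothetical helper: map an address family number to a readable name. */
static const char *family_name(int af)
{
    switch (af) {
    case AF_INET:  return "AF_INET";
    case AF_INET6: return "AF_INET6";
    case AF_UNIX:  return "AF_UNIX";
    default:       return "other";
    }
}

/* Hypothetical helper: print one line per interface address.
 * Returns the number of addresses printed, or -1 if getifaddrs() fails. */
static int list_interfaces(FILE *out)
{
    struct ifaddrs *head, *p;
    int n = 0;

    if (getifaddrs(&head) != 0)
        return -1;

    for (p = head; p != NULL; p = p->ifa_next) {
        if (p->ifa_addr == NULL)      /* some interfaces carry no address */
            continue;
        int af = p->ifa_addr->sa_family;
        fprintf(out, "%-10s family=%d (%s)\n",
                p->ifa_name, af, family_name(af));
        ++n;
    }

    freeifaddrs(head);
    return n;
}
```

Calling list_interfaces(stdout) from a small main() on each node and comparing the results should show whether any node has an interface that reports something other than AF_INET or AF_INET6.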

On Jan 16, 2012, at 9:11 PM, Hamilton Fischer wrote:

>
> ----- Forwarded Message -----
> From: Hamilton Fischer <fischerhamilton_at_[hidden]>
> To: "user_at_[hidden]" <user_at_[hidden]>
> Sent: Monday, January 16, 2012 9:09 PM
> Subject: unknown af_family recieved errors...
>
> Hi, I'm having odd issues with my "cluster", I guess. This very simple example works on one machine, but it gives a load of errors and then hangs when I try to parallelize it across the network.
>
> #include <stdio.h>
> #include "mpi.h"
>
> int
> main(int argc, char *argv[])
> {
>     int rank, size;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (rank == 0)
>     {
>         int i;
>         for (i = 1; i < size; ++i)
>         {
>             int s = 1;
>             MPI_Send(&s, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
>         }
>     }
>     else
>     {
>         int r;
>         MPI_Recv(&r, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         printf("%d got a %d\n", rank, r);
>     }
>     MPI_Finalize();
>     return 0;
> }
>
> If I do `mpirun -np 3 a.out', where a.out is the executable, I get obvious output:
>
> 1 got a 1
> 2 got a 1
>
> Now, let's say I go on the network. I use `mpirun --hostfile ../combin_host a.out', where my hostfile is simply:
>
> # Hostfile
> angryrock@192.168.0.1 slots=4
> # Hostfile
> user@192.168.0.102 slots=2
> user@192.168.0.103 slots=2
> user@192.168.0.104 slots=2
> user@192.168.0.105 slots=2
>
> I get this...
>
> [localhost:04756] mca_btl_tcp_proc: unknown af_family received: 1
> [localhost:04756] unknown address family for tcp: 0
> [localhost:04756] mca_btl_tcp_proc: unknown af_family received: 1
> [localhost:04756] unknown address family for tcp: 0
> [localhost:04610] mca_btl_tcp_proc: unknown af_family received: 1
> [localhost:04610] unknown address family for tcp: 0
> [localhost:04048] mca_btl_tcp_proc: unknown af_family received: 1
> ...
> [localhost:04123] unknown address family for tcp: 0
> 1 got a 1
> 2 got a 1
> 3 got a 1
> ^Cmpirun: killing job...
>
> The ellipsis stands for a few more lines of the same thing, probably for each host. The last part is no doubt a.out executing on my machine. As you can see, I have to kill it at the end because it hangs.
>
> Any help as to what my issue might be? It obviously is an installation issue...
>
> Thanks,
> noobermin

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/