
Subject: Re: [OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list
From: yanyg_at_[hidden]
Date: 2012-02-15 11:54:26


> No, there are no others you need to set. Ralph's referring to the fact
> that we set OMPI environment variables in the processes that are
> started on the remote nodes.
>
> I was asking to ensure you hadn't set any MCA parameters in the
> environment that could be creating a problem. Do you have any set in
> files, perchance?
>
> And can you run "env | grep OMPI" from the script that you invoked via
> mpirun?
>
> So just to be clear on the exact problem you're seeing:
>
> - you mpirun on a single node and all works fine
> - you mpirun on multiple nodes and all works fine (e.g., mpirun --host
> a,b,c your_executable)
> - you mpirun on multiple nodes and list a host more than once and it
> hangs (e.g., mpirun --host a,a,b,c your_executable)
>
> Is that correct?
>
> If so, can you attach a debugger to one of the hung processes and see
> exactly where it's hung? (i.e., get the stack traces)
>
> Per a question from your prior mail: yes, Open MPI does create mmapped
> files in /tmp for use with shared memory communication. They *should*
> get cleaned up when you exit, however, unless something disastrous
> happens.

Thank you very much!

Now I am clearer about what Ralph asked.

Yes, what you described is exactly what I see with the sm btl layer.
As I double-checked again, the problem shows up when I enable the
sm btl for MPI communication between processes on the same host
(--mca btl openib,sm,self): everything ran well on a single node,
everything ran well on multiple distinct nodes, but the job hung in
MPI_Init() when I ran on multiple nodes and listed a host more than
once. However, if I instead use the tcp or openib btl without the sm
layer (e.g., --mca btl openib,self), all three cases ran just fine.
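
Concretely, the invocations look like this (hosts a and b are
placeholders for my actual node names, and my_app for the real
executable):

# works: single node
mpirun --mca btl openib,sm,self --host a my_app

# works: multiple distinct nodes
mpirun --mca btl openib,sm,self --host a,b my_app

# hangs in MPI_Init(): one host listed more than once
mpirun --mca btl openib,sm,self --host a,a,b my_app

# works: same node list, but without the sm btl
mpirun --mca btl openib,self --host a,a,b my_app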

I do set the MCA parameters "plm_rsh_agent" to "rsh:ssh" and
"btl_openib_warn_default_gid_prefix" to 0 in all cases, with or
without the sm btl layer. The OMPI environment variables set for
each process are quoted below (as output by env | grep OMPI in my
script invoked by mpirun):
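
(The script itself is essentially just the following sketch, with
my_app again standing in for the real executable:)

#!/bin/sh
# dump the Open MPI environment seen by this rank, then run the real program
env | grep OMPI
exec my_app "$@"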

------
//process #0:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0

//process #1:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=1
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=1
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #3:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=3
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=3
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #2:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=2
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=2
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=2
OMPI_COMM_WORLD_LOCAL_RANK=0

------
Processes #0 and #1 are on one host, while processes #2 and #3
are on the other.

When I use the sm btl layer, my program just hangs in MPI_Init() at
the very beginning.
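
To get the stack traces you asked for, I will attach a debugger to
one of the hung ranks, along these lines (a sketch, assuming gdb is
available on the compute nodes):

# on a node with a hung rank: find the PID of the process
ps -ef | grep my_app

# attach and dump the stack of every thread
gdb -p <PID>
(gdb) thread apply all bt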

I hope I have made myself clear.

Thanks,
Yiguang