Open MPI User's Mailing List Archives

Subject: [OMPI users] orted 1.6.4 and 1.8.1 segv with bonded Cisco P81E
From: Vineet Rawat (vineetrawat0_at_[hidden])
Date: 2014-06-09 17:41:44


We've deployed Open MPI on a small cluster but get a SEGV in orted. Debug
information is very limited because the cluster is at a remote customer
site. The machines have a network card I'm not familiar with (a Cisco
Systems Inc VIC P81E PCIe Ethernet NIC) that appears capable of using the
usNIC BTL, and I suspect it may be at the root of the problem. They're
also bonding the card's two ports.
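
For reference, one way to double-check which BTLs actually made it into a
given build is to grep ompi_info on a machine with the same install (the
snippet below is just a sketch; <prefix> is a placeholder for wherever the
build is installed):

  # Lists the "MCA btl:" components compiled into this build;
  # usnic and openib should be absent if they were excluded.
  <prefix>/bin/ompi_info | grep -i "MCA btl"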

However, we're also doing a few unusual things that could be causing
problems. First, we built Open MPI (I tried 1.6.4 and 1.8.1) without the
ibverbs or usnic BTLs. Second, we ship only what (we think) we need:
orterun, orted, libmpi, libmpi_cxx, libopen-rte and libopen-pal. Could
there be a dependency on some other binary executable or dlopen'ed
library? We also use a custom plm_rsh_agent, but we've used that approach
for some time without issue.
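
On the dlopen question: my understanding is that unless Open MPI is
configured with --disable-dlopen, most MCA components are built as
separate mca_*.so plugins that orted loads at runtime from lib/openmpi/,
so shipping only the binaries and the four libraries would silently drop
them. A rough sketch of what I plan to check (<prefix> is a placeholder,
and the reconfigure line is only an idea, not something I've tested):

  # Plugins orted tries to dlopen at startup; if this directory was
  # not shipped, components (plm, ess, oob, btl, ...) go missing.
  ls <prefix>/lib/openmpi/

  # Direct shared-library dependencies of the shipped binaries.
  ldd <prefix>/bin/orted

  # Possible alternative: fold all components into the libraries so
  # there is nothing left to dlopen.
  ./configure --disable-dlopen --enable-mca-no-build=btl-usnic,btl-openib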

I tried a few different MCA settings, the most restrictive of which led to
the failure of this command:

orted --debug --debug-daemons \
    -mca ess env \
    -mca orte_ess_jobid 1925054464 \
    -mca orte_ess_vpid 1 \
    -mca orte_ess_num_procs 2 \
    -mca orte_hnp_uri \"1925054464.0;tcp://\" \
    --tree-spawn \
    --mca orte_base_help_aggregate 1 \
    --mca plm_rsh_agent yyy \
    --mca btl_tcp_port_min_v4 2000 \
    --mca btl_tcp_port_range_v4 100 \
    --mca btl tcp,self \
    --mca btl_tcp_if_include bond0 \
    --mca orte_create_session_dirs 0 \
    --mca plm_rsh_assume_same_shell 0 \
    -mca plm rsh \
    -mca orte_debug_daemons 1 \
    -mca orte_debug 1 \
    -mca orte_tag_output 1

It seems the host is set up such that a core file is generated but
immediately removed ("ulimit -c" is unlimited), and the abrt daemon is
doing something weird.
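
If I understand abrt correctly, on RHEL-family systems it replaces
kernel.core_pattern with a pipe to its own hook and, by default, discards
cores from binaries that weren't installed from a package (which ours
weren't), which would explain a core that appears and then vanishes.
Something like the following is what I'll ask them to check (the config
path is from my own box, so treat it as an assumption):

  # If this prints a pipe to abrt-hook-ccpp, abrt owns core handling:
  cat /proc/sys/kernel/core_pattern

  # abrt drops cores from unpackaged binaries (like our shipped orted)
  # unless this is set to "yes":
  grep ProcessUnpackaged /etc/abrt/abrt-action-save-package-data.conf

  # Or bypass abrt entirely and write plain core files (as root):
  echo core > /proc/sys/kernel/core_pattern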
"--mca orte orte_daemon_spin" and attach a debugger (if that's how that's
done). If I'm able to debug or obtain a core file I'll provide more
information. I've attached some information regarding the hardware,
OpenMPI's configuration and ompi_info output. Any thoughts?