On Mon, Jun 9, 2014 at 3:21 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> On Jun 9, 2014, at 2:41 PM, Vineet Rawat <vineetrawat0_at_[hidden]> wrote:
> We've deployed OpenMPI on a small cluster but get a SEGV in orted. Debug
> information is very limited as the cluster is at a remote customer site.
> They have a network card with which I'm not familiar (Cisco Systems Inc VIC
> P81E PCIe Ethernet NIC) and it seems capable of using the usNIC BTL. I'm
> suspicious that it might be at the root of the problem. They're also
> bonding the 2 ports.
> This shouldn't matter - the VIC should work fine.
Great, glad to hear that.
> However, we're also doing a few unusual things which could be causing
> problems. Firstly, we built OpenMPI (I tried 1.6.4 and 1.8.1) without the
> ibverbs or usnic BTLs. Then, we only ship what (we think) we need: orterun,
> orted, libmpi, libmpi_cxx, libopen-rte and libopen-pal. Could there be a
> dependency on some other binary executable or dlopen'ed library? We also
> use a special plm_rsh_agent but we've used this approach for some time
> without issue.
> Did you remember to include all the libraries under <prefix>/lib/openmpi?
> We need all of those or else the orted will fail.
No, we only included what seemed necessary (from ldd output and experience
on other clusters). The only things in my <prefix>/lib/openmpi are
libompi_dbg_msgq*. Is that what you're referring to? In <prefix>/lib for
1.8.1 (ignoring the VampirTrace libs) I could add libmpi_mpifh,
libmpi_usempi, libompitrace and/or liboshmem. Anything needed there?
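
For reference, a minimal sketch of the kind of ldd check described above,
run against the shipped tree. The helper name list_missing_deps and the
default PREFIX path are assumptions for illustration, not part of Open MPI.
Note that ldd only reports link-time dependencies, so anything loaded via
dlopen at run time will not show up here.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print every "not found"
# line that ldd reports for the files given, prefixed with the file name.
list_missing_deps() {
    for f in "$@"; do
        # Skip unexpanded globs and anything that is not a regular file.
        [ -f "$f" ] || continue
        ldd "$f" 2>/dev/null | grep "not found" | while IFS= read -r line; do
            printf '%s: %s\n' "$f" "$line"
        done
    done
}

# PREFIX is an assumption; point it at the install tree that was shipped.
PREFIX="${PREFIX:-/opt/openmpi}"
list_missing_deps "$PREFIX"/bin/orted "$PREFIX"/lib/*.so* "$PREFIX"/lib/openmpi/*.so
```

Empty output means ldd resolved every link-time dependency it could see; it
does not prove that no run-time plugin is missing.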
Thanks for the help,
> I tried a few different MCA settings, the most restrictive of which led to
> the failure of this command:
> orted --debug --debug-daemons -mca ess env -mca orte_ess_jobid 1925054464
> -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri
> \"1925054464.0;tcp://10.xxx.xxx.xxx:40547\" --tree-spawn --mca
> orte_base_help_aggregate 1 --mca plm_rsh_agent yyy --mca
> btl_tcp_port_min_v4 2000 --mca btl_tcp_port_range_v4 100 --mca btl tcp,self
> --mca btl_tcp_if_include bond0 --mca orte_create_session_dirs 0 --mca
> plm_rsh_assume_same_shell 0 -mca plm rsh -mca orte_debug_daemons 1 -mca
> orte_debug 1 -mca orte_tag_output 1
> It seems that the host is set up such that the core file is generated and
> immediately removed ("ulimit -c" is unlimited) but the abrt daemon is doing
> something weird. I'll be trying to get access to the system so I can use
> "--mca orte orte_daemon_spin" and attach a debugger (if that's how that's
> done). If I'm able to debug or obtain a core file I'll provide more
> information. I've attached some information regarding the hardware,
> OpenMPI's configuration and ompi_info output. Any thoughts?
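
On the disappearing core file: on Linux, when /proc/sys/kernel/core_pattern
begins with "|", the kernel pipes the core dump to a helper program instead
of writing a file, which may be what is happening with abrt on that host.
A minimal sketch for checking this follows; the helper name
core_pattern_kind is an assumption for illustration.

```shell
#!/bin/sh
# Hypothetical helper: classify a core_pattern string.
# A leading "|" means the kernel pipes the dump to a program.
core_pattern_kind() {
    case "$1" in
        "|"*) echo "piped" ;;
        *)    echo "file" ;;
    esac
}

pattern_file=/proc/sys/kernel/core_pattern
if [ -r "$pattern_file" ]; then
    pattern=$(cat "$pattern_file")
    echo "core_pattern: $pattern ($(core_pattern_kind "$pattern"))"
fi
```

If the result is "piped", the core is handed to whatever program the
pattern names, so "ulimit -c unlimited" alone will not leave a core file in
the working directory.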
> users mailing list