
Open MPI User's Mailing List Archives


From: Alex Tumanov (atumanov_at_[hidden])
Date: 2007-02-14 12:33:56


Hello,

I recently tried running HPLinpack, compiled with OMPI, over the Myrinet
MX interconnect. A simple hello-world program runs fine, but XHPL
fails with an error when it calls MPI_Send:

# mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,self
/opt/hpl/openmpi-hpl/bin/xhpl
[l0-0.local:04707] *** An error occurred in MPI_Send
[l0-0.local:04707] *** on communicator MPI_COMM_WORLD
[l0-0.local:04707] *** MPI_ERR_INTERN: internal error
[l0-0.local:04707] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 4706 on node "l0-0" exited on signal 15.
3 additional processes aborted (not shown)
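For reference, one way to narrow down this kind of internal error is to re-run with component verbosity turned up, or to try Open MPI's alternative MX path. The sketch below uses standard Open MPI MCA parameters (btl_base_verbose, and the cm PML with the mx MTL); the verbosity level is arbitrary, and the exact output depends on the Open MPI version:

```shell
# Re-run XHPL with verbose BTL output to see why the mx BTL fails
# at MPI_Send time (100 is an arbitrary verbosity level; higher is chattier).
mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME \
       --mca btl mx,self --mca btl_base_verbose 100 \
       /opt/hpl/openmpi-hpl/bin/xhpl

# Alternatively, try the MX MTL instead of the MX BTL: the cm PML with
# the mx MTL is a separate code path for MX hardware.
mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME \
       --mca pml cm --mca mtl mx \
       /opt/hpl/openmpi-hpl/bin/xhpl
```

Since hello world succeeds over the same hosts and BTL selection, comparing the verbose logs of the two runs may show where XHPL's larger messages diverge.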

# mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,self ~/atumanov/hello
Hello from Alex' MPI test program
Process 1 on compute-0-2.local out of 4
Hello from Alex' MPI test program
Hello from Alex' MPI test program
Process 0 on l0-0.local out of 4
Process 3 on compute-0-2.local out of 4
Hello from Alex' MPI test program
Process 2 on l0-0.local out of 4

The output from mx_info is as follows:
-------------------------------------------------------------------------------------------------
MX Version: 1.2.0g
MX Build: root_at_[hidden]:/home/install/rocks/src/roll/myrinet_mx10g/BUILD/mx-1.2.0g
Wed Jan 17 18:51:12 PST 2007
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 299.8 MHz LANai, PCI-E x8, 2 MB SRAM
        Status: Running, P0: Link up
        MAC Address: 00:60:dd:47:7d:73
        Product code: 10G-PCIE-8A-C
        Part number: 09-03362
        Serial number: 314581
        Mapper: 00:60:dd:47:7d:73, version = 0x591b1c74, configured
        Mapped hosts: 2

                                                                ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
   0) 00:60:dd:47:7d:73 compute-0-2.local:0 D 0,0
   1) 00:60:dd:47:7d:72 l0-0.local:0 1,0
-------------------------------------------------------------------------------------------------

I have several questions. First, can I launch OMPI-over-MX jobs from
the headnode for execution on the two compute nodes, even though the
headnode itself has no MX hardware? Second, in the next-to-last line of
the mx_info output, what does the letter 'D' stand for? Third, does the
MX support in OMPI cover only MX-2G, or is MX-10G supported as well?

If anybody has encountered a similar problem and managed to work
around it, please let me know.

Many thanks for your time and for bringing the community together.

Sincerely,
Alex.