Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-09-12 11:57:36


Here are the r numbers with notable MX changes recently:

https://svn.open-mpi.org/trac/ompi/changeset/26760
https://svn.open-mpi.org/trac/ompi/changeset/26759
https://svn.open-mpi.org/trac/ompi/changeset/26698
https://svn.open-mpi.org/trac/ompi/changeset/26626
https://svn.open-mpi.org/trac/ompi/changeset/26194
https://svn.open-mpi.org/trac/ompi/changeset/26180
https://svn.open-mpi.org/trac/ompi/changeset/25445
https://svn.open-mpi.org/trac/ompi/changeset/25043
https://svn.open-mpi.org/trac/ompi/changeset/24858
https://svn.open-mpi.org/trac/ompi/changeset/24460
https://svn.open-mpi.org/trac/ompi/changeset/23996
https://svn.open-mpi.org/trac/ompi/changeset/23925
https://svn.open-mpi.org/trac/ompi/changeset/23801
https://svn.open-mpi.org/trac/ompi/changeset/23764
https://svn.open-mpi.org/trac/ompi/changeset/23714
https://svn.open-mpi.org/trac/ompi/changeset/23713
https://svn.open-mpi.org/trac/ompi/changeset/23712

That goes back over a year.
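
For context, the failure Brice describes below boils down to an invalid call ordering. Here's a minimal sketch (assuming the standard MX-compatible API from myriexpress.h; the endpoint id, key, and nic_id values are placeholders for illustration, not what the BTL actually uses, and return codes are unchecked for brevity):

#include <myriexpress.h>

int main(void)
{
    mx_endpoint_t ep;
    mx_endpoint_addr_t addr;

    mx_init();
    mx_open_endpoint(MX_ANY_NIC, MX_ANY_ENDPOINT, 0x12345, NULL, 0, &ep);

    /* While the endpoint is alive, mx_connect() is legal; the MX BTL
       connects lazily on first send, as the backtrace below shows. */

    mx_finalize();

    /* Invalid: connecting after finalize.  This is the pattern that
       trips the EBADF / assert in omx__ioctl_errno_to_return_checked(). */
    mx_connect(ep, 0x001a4d4abf85ULL, 0, 0x12345, MX_INFINITE, &addr);
    return 0;
}

The backtrace in Doug's mail shows the lazy-connect path: mca_btl_mx_send() triggering mca_btl_mx_proc_connect() and then mx_connect() during an MPI_Bcast.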

On Sep 12, 2012, at 11:39 AM, Brice Goglin wrote:

> (I am bringing the OMPI users list back into CC)
>
> I reproduced the problem with OMPI 1.6.1 and found the cause.
> mx_finalize() is called before this error occurs, so the error is
> expected: calling mx_connect() after mx_finalize() is invalid.
> It looks like the MX component changed significantly between OMPI 1.5
> and 1.6.1, and I am pretty sure it worked fine with early 1.5.x
> versions. Can somebody comment on what was changed in the MX BTL
> component in late 1.5 versions?
>
> Brice
>
>
>
>
>
> On 12/09/2012 15:48, Douglas Eadline wrote:
>>> Resending this mail through another SMTP server. Please re-add
>>> open-mx-devel to CC when you reply.
>>>
>>> Brice
>>>
>>>
>>> On 07/09/2012 00:10, Brice Goglin wrote:
>>>> Hello Doug,
>>>>
>>>> Did you use the same Open-MX version when it worked fine? Same kernel
>>>> too?
>>>> Any chance you built OMPI over an old OMX that would not be compatible
>>>> with 1.5.2?
>> I double-checked, and even rebuilt both Open MPI and MPICH2
>> with 1.5.2.
>>
>> Running on a 4-node cluster with Warewulf provisioning. See below:
>>
>>>> The error below indicates that the OMX driver and library don't speak
>>>> the same language. EBADF is almost never returned from the OMX driver;
>>>> the only case is when talking to /dev/open-mx-raw, but normal
>>>> applications don't do this. That's why I suspect OMPI is using an old
>>>> library that cannot talk to a new driver. We have checks to prevent
>>>> this, but we never know.
>>>>
>>>> Do you see anything in dmesg?
>> No.
>>>> Is omx_info OK?
>> Yes, it shows:
>>
>> Open-MX version 1.5.2
>> build: deadline_at_limulus:/raid1/home/deadline/rpms-sl6/BUILD/open-mx-1.5.2
>> Mon Sep 10 08:44:16 EDT 2012
>>
>> Found 1 boards (32 max) supporting 32 endpoints each:
>> n0:0 (board #0 name eth0 addr 00:1a:4d:4a:bf:85)
>> managed by driver 'r8169'
>>
>> Peer table is ready, mapper is 00:00:00:00:00:00
>> ================================================
>> 0) 00:1a:4d:4a:bf:85 n0:0
>> 1) 00:1c:c0:9b:66:d0 n1:0
>> 2) 00:1a:4d:4a:bf:83 n2:0
>> 3) e0:69:95:35:d7:71 limulus:0
>>
>>
>>
>>>> Does a basic omx_perf work? (see
>>>> http://open-mx.gforge.inria.fr/FAQ/index-1.5.html#perf-omxperf)
>> Yes, I checked host -> each node and it works. And MPICH2 compiled
>> with the same libraries works. What else can I check?
>>
>> --
>> Doug
>>
>>
>>>> Brice
>>>>
>>>>
>>>> Le 06/09/2012 23:04, Douglas Eadline a écrit :
>>>>> I built Open MPI 1.6.1 using the Open-MX libraries.
>>>>> This worked previously, but now I get the following
>>>>> error. Here is my system:
>>>>>
>>>>> kernel: 2.6.32-279.5.1.el6.x86_64
>>>>> open-mx: 1.5.2
>>>>>
>>>>> BTW, Open-MX worked previously with Open MPI, and the current
>>>>> version works with MPICH2.
>>>>>
>>>>>
>>>>> $ mpiexec -np 8 -machinefile machines cpi
>>>>> Process 0 on limulus
>>>>> FatalError: Failed to lookup peer by addr, driver replied Bad file descriptor
>>>>> cpi: ../omx_misc.c:89: omx__ioctl_errno_to_return_checked: Assertion `0' failed.
>>>>> [limulus:04448] *** Process received signal ***
>>>>> [limulus:04448] Signal: Aborted (6)
>>>>> [limulus:04448] Signal code: (-6)
>>>>> [limulus:04448] [ 0] /lib64/libpthread.so.0() [0x3324e0f500]
>>>>> [limulus:04448] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x33246328a5]
>>>>> [limulus:04448] [ 2] /lib64/libc.so.6(abort+0x175) [0x3324634085]
>>>>> [limulus:04448] [ 3] /lib64/libc.so.6() [0x332462ba1e]
>>>>> [limulus:04448] [ 4] /lib64/libc.so.6(__assert_perror_fail+0) [0x332462bae0]
>>>>> [limulus:04448] [ 5] /usr/open-mx/lib/libopen-mx.so.0(omx__ioctl_errno_to_return_checked+0x197) [0x7fb587418b37]
>>>>> [limulus:04448] [ 6] /usr/open-mx/lib/libopen-mx.so.0(omx__peer_addr_to_index+0x55) [0x7fb58741a5d5]
>>>>> [limulus:04448] [ 7] /usr/open-mx/lib/libopen-mx.so.0(+0xdc7a) [0x7fb587419c7a]
>>>>> [limulus:04448] [ 8] /usr/open-mx/lib/libopen-mx.so.0(omx_connect+0x8c) [0x7fb58741a27c]
>>>>> [limulus:04448] [ 9] /usr/open-mx/lib/libopen-mx.so.0(mx_connect+0x15) [0x7fb587425865]
>>>>> [limulus:04448] [10] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_proc_connect+0x5e) [0x7fb5876fe40e]
>>>>> [limulus:04448] [11] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_send+0x2d4) [0x7fb5876fbd94]
>>>>> [limulus:04448] [12] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_send_request_start_prepare+0xcb) [0x7fb58777d6fb]
>>>>> [limulus:04448] [13] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_isend+0x4cb) [0x7fb58777509b]
>>>>> [limulus:04448] [14] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_generic+0x37b) [0x7fb58770b55b]
>>>>> [limulus:04448] [15] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_binomial+0xd8) [0x7fb58770b8b8]
>>>>> [limulus:04448] [16] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc) [0x7fb587702d8c]
>>>>> [limulus:04448] [17] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_coll_sync_bcast+0x78) [0x7fb587712e88]
>>>>> [limulus:04448] [18] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(MPI_Bcast+0x130) [0x7fb5876ce1b0]
>>>>> [limulus:04448] [19] cpi(main+0x10b) [0x400cc4]
>>>>> [limulus:04448] [20] /lib64/libc.so.6(__libc_start_main+0xfd) [0x332461ecdd]
>>>>> [limulus:04448] [21] cpi() [0x400ac9]
>>>>> [limulus:04448] *** End of error message ***
>>>>> Process 2 on limulus
>>>>> Process 4 on limulus
>>>>> Process 6 on limulus
>>>>> Process 1 on n0
>>>>> Process 7 on n0
>>>>> Process 3 on n0
>>>>> Process 5 on n0
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec noticed that process rank 0 with PID 4448 on node limulus
>>>>> exited on signal 6 (Aborted).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>
>>
>
>
> _______________________________________________
> Open-mx-devel mailing list
> Open-mx-devel_at_[hidden]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/open-mx-devel

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/