Open MPI User's Mailing List Archives

From: Tim Prins (tprins_at_[hidden])
Date: 2007-04-02 13:56:50


Yes, only the first segfault is fixed in the nightly builds. You can
run mx_endpoint_info to see how many endpoints are available and if
any are in use.
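
(If mx_endpoint_info is not available on your nodes, a rough probe like
the one below gives similar information. This is only a sketch, not
anything shipped with Open MPI or MX: the header name, the 0xdeadbeef
filter value, and the exact mx_open_endpoint() argument order are my
assumptions from the MX documentation, so check them against your MX
version before building it, e.g. with "cc mx_probe.c -lmyriexpress".)

#include <stdio.h>
#include <myriexpress.h>   /* MX API header; install path may vary */

int main(void)
{
    mx_endpoint_t eps[64];
    int count = 0;

    mx_init();

    /* Try to open endpoints 0..63 on board 0; stop at the first failure.
     * A busy endpoint returns MX_BUSY, the same status=20 reported by
     * the mca_btl_mx_init error messages. */
    while (count < 64) {
        mx_return_t rc = mx_open_endpoint(0 /* board */, count /* endpoint id */,
                                          0xdeadbeef /* arbitrary filter */,
                                          NULL, 0, &eps[count]);
        if (rc != MX_SUCCESS) {
            printf("opened %d endpoint(s) before failure: %s\n",
                   count, mx_strerror(rc));
            break;
        }
        count++;
    }

    /* Release everything we grabbed so real jobs are not affected. */
    while (count > 0)
        mx_close_endpoint(eps[--count]);
    mx_finalize();
    return 0;
}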

As for the segfault you are seeing now, I am unsure what is causing
it. Hopefully someone who knows more about that area of the code than
I do can help.

Thanks,

Tim

On Apr 2, 2007, at 6:12 AM, de Almeida, Valmor F. wrote:

>
> Hi Tim,
>
> I installed the openmpi-1.2.1a0r14178 tarball (took this opportunity
> to use the Intel Fortran compiler instead of gfortran). With a simple
> test it seems to work, but note the same messages:
>
> ->mpirun -np 8 -machinefile mymachines a.out
> [x1:25417] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> [x1:25418] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> [x2:31983] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> [x2:31982] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> [x2:31980] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> Hello, world! I am 4 of 7
> Hello, world! I am 0 of 7
> Hello, world! I am 1 of 7
> Hello, world! I am 5 of 7
> Hello, world! I am 2 of 7
> Hello, world! I am 7 of 7
> Hello, world! I am 6 of 7
> Hello, world! I am 3 of 7
>
> and the machinefile is
>
> x1 slots=4 max_slots=4
> x2 slots=4 max_slots=4
>
> However, with a realistic code it starts fine (same messages as
> above) and then somewhere later:
>
> [x1:25947] *** Process received signal ***
> [x1:25947] Signal: Segmentation fault (11)
> [x1:25947] Signal code: Address not mapped (1)
> [x1:25947] Failing at address: 0x14
> [x1:25947] [ 0] [0xb7f00440]
> [x1:25947] [ 1] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x13e) [0xb7a80e6e]
> [x1:25947] [ 2] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1e3) [0xb7a82463]
> [x1:25947] [ 3] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so [0xb7a7ebf8]
> [x1:25947] [ 4] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1813) [0xb7a41923]
> [x1:25947] [ 5] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36) [0xb7a4fdd6]
> [x1:25947] [ 6] /opt/ompi/lib/libopen-pal.so.0(opal_progress+0x79) [0xb7dc41a9]
> [x1:25947] [ 7] /opt/ompi/lib/libmpi.so.0(ompi_request_wait_all+0xb5) [0xb7e90145]
> [x1:25947] [ 8] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc9) [0xb7a167a9]
> [x1:25947] [ 9] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0xe4) [0xb7a1bfb4]
> [x1:25947] [10] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x48) [0xb7a16a18]
> [x1:25947] [11] /opt/ompi/lib/libmpi.so.0(PMPI_Barrier+0x69) [0xb7ea4059]
> [x1:25947] [12] driver0(_ZNK3MPI4Comm7BarrierEv+0x20) [0x806baf4]
> [x1:25947] [13] driver0(_ZN3gms12PartitionSet14ReadData_Case2Ev+0xc92) [0x808bb78]
> [x1:25947] [14] driver0(_ZN3gms12PartitionSet8ReadDataESsSsSst+0xbc) [0x8086f96]
> [x1:25947] [15] driver0(main+0x181) [0x8068c7f]
> [x1:25947] [16] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7b6a824]
> [x1:25947] [17] driver0(__gxx_personality_v0+0xb9) [0x8068991]
> [x1:25947] *** End of error message ***
> mpirun noticed that job rank 0 with PID 25945 on node x1 exited on
> signal 15 (Terminated).
> 7 additional processes aborted (not shown)
>
>
> This code does run to completion using ompi-1.2 if I use only 2 slots
> per machine.
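>
> In case it helps narrow this down: the backtrace ends inside MPI
> Barrier (the tuned barrier over the sm BTL), so a reduced test that
> only loops over barriers with 4 slots per node might show whether the
> crash needs the rest of the application at all. A minimal sketch
> (plain C API for brevity; the real code uses the C++ bindings):
>
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     int rank, i;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     /* Repeated barriers to exercise the coll_tuned / sm BTL path
>      * that shows up in the stack trace above. */
>     for (i = 0; i < 10000; i++)
>         MPI_Barrier(MPI_COMM_WORLD);
>     if (rank == 0)
>         printf("all barriers completed\n");
>     MPI_Finalize();
>     return 0;
> }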
>
> Thanks for any help.
>
> --
> Valmor
>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Tim Prins
>> Sent: Friday, March 30, 2007 10:49 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed with status=20
>>
>> Hi Valmor,
>>
>> What is happening here is that when Open MPI tries to create an MX
>> endpoint for communication, MX returns code 20, which is MX_BUSY.
>>
>> At this point we should gracefully move on, but there is a bug in
>> Open MPI 1.2 which causes a segmentation fault on this type of
>> error. This will be fixed in 1.2.1, and the fix is available now in
>> the 1.2 nightly tarballs.
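>>
>> To illustrate what "gracefully move on" means here: the MX BTL init
>> only needs to check the mx_open_endpoint() return value and, on
>> MX_BUSY, report it and exclude the MX transport so the sm/tcp BTLs
>> carry the traffic. A rough sketch (a hypothetical helper, not the
>> actual btl_mx source; the mx_open_endpoint() argument order is from
>> memory, so treat it as approximate):
>>
>> #include <stdio.h>
>> #include <stdint.h>
>> #include <myriexpress.h>
>>
>> int try_open_mx_endpoint(uint32_t board, uint32_t endpoint_id,
>>                          uint32_t filter, mx_endpoint_t *ep)
>> {
>>     mx_return_t rc = mx_open_endpoint(board, endpoint_id, filter,
>>                                       NULL, 0, ep);
>>     if (rc != MX_SUCCESS) {
>>         /* MX_BUSY is status 20: every endpoint on the NIC is taken. */
>>         fprintf(stderr, "mx_open_endpoint() failed with status=%d (%s)\n",
>>                 rc, mx_strerror(rc));
>>         return -1;   /* caller skips MX and falls back to other BTLs */
>>     }
>>     return 0;        /* *ep is valid; normal BTL setup continues */
>> }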
>>
>> Hope this helps,
>>
>> Tim
>>
>> On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:
>>> Hello,
>>>
>>> I am getting this error any time the number of processes requested
>>> per machine is greater than the number of CPUs. I suspect it is
>>> something in the configuration of mx / ompi that I am missing,
>>> since another machine I have without mx installed runs ompi
>>> correctly with oversubscription.
>>>
>>> Thanks for any help.
>>>
>>> --
>>> Valmor
>>>
>>>
>>> ->mpirun -np 3 --machinefile mymachines-1 a.out
>>> [x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
>>> [x1:23624] *** Process received signal ***
>>> [x1:23624] Signal: Segmentation fault (11)
>>> [x1:23624] Signal code: Address not mapped (1)
>>> [x1:23624] Failing at address: 0x20
>>> [x1:23624] [ 0] [0xb7f7f440]
>>> [x1:23624] [ 1] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25) [0xb7aca825]
>>> [x1:23624] [ 2] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x6f8) [0xb7acc658]
>>> [x1:23624] [ 3] /opt/ompi/lib/libmpi.so.0(mca_btl_base_select+0x1a0) [0xb7f41900]
>>> [x1:23624] [ 4] /opt/openmpi-1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x26) [0xb7ad1006]
>>> [x1:23624] [ 5] /opt/ompi/lib/libmpi.so.0(mca_bml_base_init+0x78) [0xb7f41198]
>>> [x1:23624] [ 6] /opt/openmpi-1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x7d) [0xb7af866d]
>>> [x1:23624] [ 7] /opt/ompi/lib/libmpi.so.0(mca_pml_base_select+0x176) [0xb7f49b56]
>>> [x1:23624] [ 8] /opt/ompi/lib/libmpi.so.0(ompi_mpi_init+0x4cf) [0xb7f0fe2f]
>>> [x1:23624] [ 9] /opt/ompi/lib/libmpi.so.0(MPI_Init+0xab) [0xb7f3204b]
>>> [x1:23624] [10] a.out(_ZN3MPI4InitERiRPPc+0x18) [0x8052cbe]
>>> [x1:23624] [11] a.out(main+0x21) [0x804f4a7]
>>> [x1:23624] [12] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7be9824]
>>>
>>> content of mymachines-1 file
>>>
>>> x1 max_slots=4