Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpiblast + openmpi + gridengine job fails to run
From: Reuti (reuti_at_[hidden])
Date: 2008-12-24 08:55:54


Hi,

On 24.12.2008, at 07:55, Sangamesh B wrote:

> Thanks Reuti. That sorted out the problem.
>
> Now mpiblast is able to run, but only on a single node, i.e. mpiformatdb
> -> 4 fragments, mpiblast -> 4 processes. Since each node has 4
> cores, the job runs on a single node and works fine. With 8
> processes, the job fails with the following error message:

I would suggest searching the SGE mailing list archive for
"mpiblast" in the mail body - there are several entries about solving
this issue, which might also apply to your case.

-- Reuti
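
One quick way to narrow this down would be to take the InfiniBand layer out of
the picture: resubmit the same 8-process job with Open MPI restricted to the
TCP BTL. If that run works across nodes, the openib error quoted below points
at the IB stack rather than at the gridengine integration. A minimal sketch,
reusing the mpirun path and mpiblast arguments from the job script quoted
further down:

   # same command as in sge_submit.sh, but limited to the tcp and self BTLs
   /opt/openmpi_intel/1.2.8/bin/mpirun -np $NSLOTS --mca btl tcp,self \
       /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp \
       -d Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out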

> $ cat err.108.OMPI-Blast-Job
> [0,1,7][btl_openib_component.c:1371:btl_openib_component_progress]
> from compute-0-5.local to: compute-0-11.local error polling HP CQ with
> status LOCAL LENGTH ERROR status number 1 for wr_id 12002616 opcode 42
> [compute-0-11.local:09692] [0,0,0]-[0,1,2] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> [compute-0-11.local:09692] [0,0,0]-[0,1,4] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> 4 0.674234 Bailing out with signal 15
> [compute-0-5.local:10032] MPI_ABORT invoked on rank 4 in communicator
> MPI_COMM_WORLD with errorcode 0
> 5 1.324 Bailing out with signal 15
> [compute-0-5.local:10033] MPI_ABORT invoked on rank 5 in communicator
> MPI_COMM_WORLD with errorcode 0
> 6 1.32842 Bailing out with signal 15
> [compute-0-5.local:10034] MPI_ABORT invoked on rank 6 in communicator
> MPI_COMM_WORLD with errorcode 0
> [compute-0-11.local:09692] [0,0,0]-[0,1,3] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> 0 0.674561 Bailing out with signal 15
> [compute-0-11.local:09782] MPI_ABORT invoked on rank 0 in communicator
> MPI_COMM_WORLD with errorcode 0
> 1 0.808846 Bailing out with signal 15
> [compute-0-11.local:09783] MPI_ABORT invoked on rank 1 in communicator
> MPI_COMM_WORLD with errorcode 0
> 2 0.81484 Bailing out with signal 15
> [compute-0-11.local:09784] MPI_ABORT invoked on rank 2 in communicator
> MPI_COMM_WORLD with errorcode 0
> 3 1.32249 Bailing out with signal 15
> [compute-0-11.local:09785] MPI_ABORT invoked on rank 3 in communicator
> MPI_COMM_WORLD with errorcode 0
>
> I think it's a problem with Open MPI: it's not able to communicate with
> processes on another node.
> Please help me get it working on multiple nodes.
>
> Thanks,
> Sangamesh
>
>
> On Tue, Dec 23, 2008 at 4:45 PM, Reuti <reuti_at_[hidden]>
> wrote:
>> Hi,
>>
>> On 23.12.2008, at 12:03, Sangamesh B wrote:
>>
>>> Hello,
>>>
>>> I've compiled the MPIBLAST-1.5.0-pio app on a Rocks 4.3, Voltaire
>>> InfiniBand-based Linux cluster using Open MPI 1.2.8 + Intel 10
>>> compilers.
>>>
>>> The job is not running. Let me explain the configs:
>>>
>>> SGE job script:
>>>
>>> $ cat sge_submit.sh
>>> #!/bin/bash
>>>
>>> #$ -N OMPI-Blast-Job
>>>
>>> #$ -S /bin/bash
>>>
>>> #$ -cwd
>>>
>>> #$ -e err.$JOB_ID.$JOB_NAME
>>>
>>> #$ -o out.$JOB_ID.$JOB_NAME
>>>
>>> #$ -pe orte 4
>>>
>>> /opt/openmpi_intel/1.2.8/bin/mpirun -np $NSLOTS
>>> /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp -d
>>> Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out
>>>
>>> The PE orte is:
>>>
>>> $ qconf -sp orte
>>> pe_name orte
>>> slots 999
>>> user_lists NONE
>>> xuser_lists NONE
>>> start_proc_args /bin/true
>>> stop_proc_args /bin/true
>>> allocation_rule $fill_up
>>> control_slaves FALSE
>>> job_is_first_task TRUE
>>
>> you will need here:
>>
>> control_slaves TRUE
>> job_is_first_task FALSE
>>
>> -- Reuti
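
A minimal sketch of how these two settings can be applied, assuming the PE
keeps the name "orte" shown above (qconf -mp opens the PE definition in an
editor):

   $ qconf -mp orte
   ...
   control_slaves TRUE
   job_is_first_task FALSE
   ...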
>>
>>
>>> urgency_slots min
>>>
>>> # /opt/openmpi_intel/1.2.8/bin/ompi_info | grep gridengine
>>> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
>>> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
>>>
>>> The SGE error and output files for the job are as follows:
>>>
>>> $ cat err.88.OMPI-Blast-Job
>>> error: executing task of job 88 failed:
>>> [compute-0-1.local:06151] ERROR: A daemon on node compute-0-1.local
>>> failed to start as expected.
>>> [compute-0-1.local:06151] ERROR: There may be more information available from
>>> [compute-0-1.local:06151] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>> [compute-0-1.local:06151] ERROR: If the problem persists, please restart the
>>> [compute-0-1.local:06151] ERROR: Grid Engine PE job
>>> [compute-0-1.local:06151] ERROR: The daemon exited unexpectedly with status 1.
>>>
>>> $ cat out.88.OMPI-Blast-Job
>>>
>>> There is nothing in the output file.
>>>
>>> qstat shows that the job is running on some node, but on that node
>>> there are no mpiblast processes running, as seen with the top command.
>>>
>>> The ps command shows this:
>>>
>>> # ps -ef | grep mpiblast
>>> locuz 4018 4017 0 16:25 ? 00:00:00
>>> /opt/openmpi_intel/1.2.8/bin/mpirun -np 4
>>> /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp -d
>>> Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out
>>> root 4120 4022 0 16:27 pts/0 00:00:00 grep mpiblast
>>>
>>> The ibv_rc_pingpong tests work fine. The output of lsmod:
>>>
>>> # lsmod | grep ib
>>> ib_sdp 57788 0
>>> rdma_cm 38292 3 rdma_ucm,rds,ib_sdp
>>> ib_addr 11400 1 rdma_cm
>>> ib_local_sa 14864 1 rdma_cm
>>> ib_mthca 157396 2
>>> ib_ipoib 83928 0
>>> ib_umad 20656 0
>>> ib_ucm 21256 0
>>> ib_uverbs 46896 8 rdma_ucm,ib_ucm
>>> ib_cm 42536 3 rdma_cm,ib_ipoib,ib_ucm
>>> ib_sa 28512 4 rdma_cm,ib_local_sa,ib_ipoib,ib_cm
>>> ib_mad 43432 5 ib_local_sa,ib_mthca,ib_umad,ib_cm,ib_sa
>>> ib_core 70544 14 rdma_ucm,rds,ib_sdp,rdma_cm,iw_cm,ib_local_sa,ib_mthca,ib_ipoib,ib_umad,ib_ucm,ib_uverbs,ib_cm,ib_sa,ib_mad
>>> ipv6 285089 23 ib_ipoib
>>> libata 124585 1 ata_piix
>>> scsi_mod 144529 2 libata,sd_mod
>>>
>>> What might be the problem?
>>> We've used the Voltaire OFA Roll (GridStack) from Rocks.
>>>
>>> Thanks,
>>> Sangamesh