Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi query
From: Nisha Dhankher -M.Tech(CSE) (nishadhankher-coaeseeit_at_[hidden])
Date: 2014-04-08 05:37:48


And thank you very much.

On Tue, Apr 8, 2014 at 3:07 PM, Nisha Dhankher -M.Tech(CSE) <
nishadhankher-coaeseeit_at_[hidden]> wrote:

> The latest Rocks 6.2 carries this version only.
>
>
> On Tue, Apr 8, 2014 at 3:49 AM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
>
>> Open MPI 1.4.3 is *ancient*. Please upgrade -- we just released Open MPI
>> 1.8 last week.
>>
>> Also, please look at this FAQ entry -- it steps you through a lot of
>> basic troubleshooting steps about getting basic MPI programs working.
>>
>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>
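>> For example, a basic sanity check along the lines of that FAQ entry,
>> reusing the eth0 interface and machinefile from your earlier command,
>> would be something like:
>>
>>   # 1. verify that Open MPI can launch a non-MPI program on every host
>>   mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
>>       -np 16 -machinefile mf hostname
>>
>>   # 2. then verify that MPI communication itself works, e.g. with the
>>   #    ring_c example from Open MPI's examples/ directory (compile it
>>   #    with mpicc first)
>>   mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
>>       -np 16 -machinefile mf ./ring_c
>>
>> If the hostname run already fails or hangs, the problem is in launching
>> and connectivity, not in MPI Blast itself.
>>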
>> Once you get basic MPI programs working, then try with MPI Blast.
>>
>>
>>
>> On Apr 5, 2014, at 3:11 AM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>
>> > mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 -np 16
>> > -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o out.txt
>> >
>> > was the command I executed on the cluster...
>> >
>> >
>> >
>> > On Sat, Apr 5, 2014 at 12:34 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> > Sorry Ralph, my mistake, it's not "names"... it is "it does not happen on
>> > the same nodes."
>> >
>> >
>> > On Sat, Apr 5, 2014 at 12:33 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> > Same VM on all machines, that is, virt-manager.
>> >
>> >
>> > On Sat, Apr 5, 2014 at 12:32 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> > Open MPI version 1.4.3
>> >
>> >
>> > On Fri, Apr 4, 2014 at 8:13 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> > Okay, so if you run mpiBlast on all the non-name nodes, everything is
>> okay? What do you mean by "names nodes"?
>> >
>> >
>> > On Apr 4, 2014, at 7:32 AM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >
>> >> no it does not happen on names nodes
>> >>
>> >>
>> >> On Fri, Apr 4, 2014 at 7:51 PM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> >> Hi Nisha
>> >>
>> >> I'm sorry if my questions appear abrasive - I'm just a little
>> frustrated at the communication bottleneck as I can't seem to get a clear
>> picture of your situation. So you really don't need to keep calling me
>> "sir" :-)
>> >>
>> >> The error you are hitting is very unusual - it means that the
>> processes are able to make a connection, but are failing to correctly
>> complete a simple handshake exchange of their process identifications.
>> There are only a few ways that can happen, and I'm trying to get you to
>> test for them.
>> >>
>> >> So let's try and see if we can narrow this down. You mention that it
>> works on some machines, but not all. Is this consistent - i.e., is it
>> always the same machines that work, and the same ones that generate the
>> error? If you exclude the ones that show the error, does it work? If so,
>> what is different about those nodes? Are they a different architecture?
>> >>
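>> >> For example (the host names below are just placeholders -- substitute
>> >> the ones from your machinefile), you could launch a trivial program on
>> >> only the suspect machines and then on only the good ones:
>> >>
>> >>   # just the hosts that produce the error 113 message
>> >>   mpirun -H nodeA,nodeB -np 2 hostname
>> >>
>> >>   # just the hosts that have been working, for comparison
>> >>   mpirun -H nodeC,nodeD -np 2 hostname
>> >>
>> >> If the first command reliably fails and the second reliably works, then
>> >> the problem follows those particular nodes.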
>> >>
>> >> On Apr 3, 2014, at 11:09 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>
>> >>> Sir,
>> >>> The same virt-manager is being used by all PCs. No, I didn't enable
>> >>> Open MPI heterogeneous support. Yes, the Open MPI version is the same on
>> >>> all nodes, through the same kickstart file.
>> >>> OK... actually, sir... Rocks itself installed and configured Open MPI and
>> >>> MPICH on its own through the HPC roll.
>> >>>
>> >>>
>> >>> On Fri, Apr 4, 2014 at 9:25 AM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> >>>
>> >>> On Apr 3, 2014, at 8:03 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>
>> >>>> Thank you, Ralph.
>> >>>> Yes, the cluster is heterogeneous...
>> >>>
>> >>> And did you configure OMPI --enable-heterogeneous? And are you
>> >>> running it with --hetero-nodes? What version of OMPI are you using anyway?
>> >>>
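>> >>> (For reference, heterogeneous support has to be compiled in and then
>> >>> requested at run time -- roughly like the following, with ./your_app
>> >>> standing in for your own binary:
>> >>>
>> >>>   # at build time
>> >>>   ./configure --enable-heterogeneous ...
>> >>>   make all install
>> >>>
>> >>>   # at run time
>> >>>   mpirun --hetero-nodes -np 16 -machinefile mf ./your_app
>> >>>
>> >>> -- but as noted below, this only matters if the VMs themselves differ.)
>> >>>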
>> >>> Note that we don't care if the host pc's are hetero - what we care
>> about is the VM. If all the VMs are the same, then it shouldn't matter.
>> However, most VM technologies don't handle hetero hardware very well -
>> i.e., you can't emulate an x86 architecture on top of a Sparc or Power chip
>> or vice versa.
>> >>>
>> >>>
>> >>>> And I haven't made compute nodes directly on the physical nodes (PCs)
>> >>>> because in college it is not possible to take the whole lab of 32 PCs
>> >>>> for your work, so I ran on VMs.
>> >>>
>> >>> Yes, but at least it would let you test the setup to run MPI across
>> even a couple of pc's - this is simple debugging practice.
>> >>>
>> >>>> In a Rocks cluster, the frontend gives the same kickstart to all the
>> >>>> PCs, so the Open MPI version should be the same, I guess.
>> >>>
>> >>> Guess? or know? Makes a difference - might be worth testing.
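>> >>>
>> >>> One way to check rather than guess -- assuming ompi_info and mpirun are
>> >>> on the default PATH of every node -- is to ask each host directly:
>> >>>
>> >>>   # print the Open MPI version reported by every host in the machinefile
>> >>>   mpirun -np 16 -machinefile mf ompi_info | grep "Open MPI:"
>> >>>
>> >>>   # and check that every host resolves the same orted
>> >>>   mpirun -np 16 -machinefile mf which orted
>> >>>
>> >>> If the versions or paths differ between lines of output, that is the
>> >>> mismatch to fix first.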
>> >>>
>> >>>> Sir,
>> >>>> mpiformatdb is a command to distribute database fragments to different
>> >>>> compute nodes after partitioning the database.
>> >>>> And sir, have you used mpiBlast?
>> >>>
>> >>> Nope - but that isn't the issue, is it? The issue is with the MPI
>> setup.
>> >>>
>> >>>>
>> >>>>
>> >>>> On Fri, Apr 4, 2014 at 4:48 AM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> >>>> What is "mpiformatdb"? We don't have an MPI database in our system,
>> >>>> and I have no idea what that command means.
>> >>>>
>> >>>> As for that error - it means that the identifier we exchange between
>> processes is failing to be recognized. This could mean a couple of things:
>> >>>>
>> >>>> 1. the OMPI version on the two ends is different - could be you
>> aren't getting the right paths set on the various machines
>> >>>>
>> >>>> 2. the cluster is heterogeneous
>> >>>>
>> >>>> You say you have "virtual nodes" running on various PC's? That would
>> be an unusual setup - VM's can be problematic given the way they handle TCP
>> connections, so that might be another source of the problem if my
>> understanding of your setup is correct. Have you tried running this across
>> the PCs directly - i.e., without any VMs?
>> >>>>
>> >>>>
>> >>>> On Apr 3, 2014, at 10:13 AM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>>
>> >>>>> I first formatted my database with the mpiformatdb command, then I ran:
>> >>>>>   mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i
>> >>>>>   query.fas -o output.txt
>> >>>>> But then it gave this error 113 from some hosts and continued to run on
>> >>>>> the others, but with no results even after 2 hours had elapsed... on a
>> >>>>> Rocks 6.0 cluster with 12 virtual nodes on PCs (2 on each, using
>> >>>>> virt-manager, 1 GB RAM each).
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 3, 2014 at 10:41 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>>> i also made machine file which contain ip adresses of all compute
>> nodes + .ncbirc file for path to mpiblast and shared ,local storage path....
>> >>>>> Sir
>> >>>>> I ran the same command of mpirun on my college supercomputer 8
>> nodes each having 24 processors but it just running....gave no result
>> uptill 3 hours...
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 3, 2014 at 10:39 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>>> I first formatted my database with the mpiformatdb command, then I ran:
>> >>>>>   mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i
>> >>>>>   query.fas -o output.txt
>> >>>>> But then it gave this error 113 from some hosts and continued to run on
>> >>>>> the others, but with no results even after 2 hours had elapsed... on a
>> >>>>> Rocks 6.0 cluster with 12 virtual nodes on PCs (2 on each, using
>> >>>>> virt-manager, 1 GB RAM each).
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 3, 2014 at 8:37 PM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> >>>>> I'm having trouble understanding your note, so perhaps I am getting
>> this wrong. Let's see if I can figure out what you said:
>> >>>>>
>> >>>>> * your perl command fails with "no route to host" - but I don't see
>> any host in your cmd. Maybe I'm just missing something.
>> >>>>>
>> >>>>> * you tried running a couple of "mpirun", but the mpirun command
>> wasn't recognized? Is that correct?
>> >>>>>
>> >>>>> * you then ran mpiblast and it sounds like it successfully started
>> the processes, but then one aborted? Was there an error message beyond just
>> the -1 return status?
>> >>>>>
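>> >>>>> One note on those mpirun lines: mpirun needs an application to launch
>> >>>>> after the MCA parameters, otherwise it has nothing to run and just
>> >>>>> errors out. The general form is, e.g. (hostname used only as a harmless
>> >>>>> test program, and the interface names taken from your mail):
>> >>>>>
>> >>>>>   mpirun --mca btl_tcp_if_include eth1,eth2 -np 2 hostname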
>> >>>>>
>> >>>>> On Apr 2, 2014, at 11:17 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>>>
>> >>>>>> Error: btl_tcp_endpoint.c:638 connection failed due to error 113
>> >>>>>>
>> >>>>>> In Open MPI, this error came when I ran my mpiBlast program on the
>> >>>>>> Rocks cluster. Connecting to the hosts failed on IPs 10.1.255.236 and
>> >>>>>> 10.1.255.244. When I run the following command:
>> >>>>>>
>> >>>>>>   shell$ perl -e 'die$!=113'
>> >>>>>>
>> >>>>>> this message comes: "No route to host at -e line 1."
>> >>>>>>
>> >>>>>>   shell$ mpirun --mca btl ^tcp
>> >>>>>>   shell$ mpirun --mca btl_tcp_if_include eth1,eth2
>> >>>>>>   shell$ mpirun --mca btl_tcp_if_include 10.1.255.244
>> >>>>>>
>> >>>>>> were also executed, but it did not recognize these commands... and
>> >>>>>> aborted... What should I do? When I ran my mpiBlast program for the
>> >>>>>> first time it gave an MPI_ABORT error... bailing out of signal -1 on
>> >>>>>> rank 2... then I removed my public ethernet cable... and then it gave
>> >>>>>> the btl_tcp_endpoint error 113...
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>