On May 2, 2013, at 9:18 PM, Christopher Samuel <samuel_at_[hidden]> wrote:
> Hi Ralph, very quick reply as I've got an SGI engineer waiting for
> me.. ;-)
> On 03/05/13 12:21, Ralph Castain wrote:
>> So the first problem is: how to know the Phi's are present, how
>> many you have on each node, etc? We could push that into something
>> like the hostfile, but that requires that someone build the file.
>> Still, it would only have to be built once, so maybe that's not too
>> bad - could have a "wildcard" entry if every node is the same,
> We're using Slurm, and it supports them already apparently, so I'm not
> sure if that helps?
It does - but to be clear: you're saying that you can directly launch processes onto the Phis via srun? If so, then this may not be a problem, assuming you can get confirmation that the Phis have direct access to the interconnects.
If the answer to both is "yes", then just srun the MPI procs directly - we support direct launch and use PMI to wireup. Problem solved :-)
And yes - that support is indeed in the 1.6 series...just configure --with-pmi. You may need to provide the path to where pmi.h is located under the slurm install, but probably not.
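For the direct-launch path, a minimal sketch might look like the following. The pmi.h location, prefix, and process counts are assumptions; adjust them for your Slurm install:

```shell
# Build Open MPI 1.6.x with PMI support. The path given to --with-pmi
# is an assumption -- point it at wherever your Slurm install keeps
# pmi.h (often you can omit the path entirely).
./configure --with-pmi=/usr/lib64/slurm --prefix=$HOME/ompi-1.6
make -j8 install

# Then launch the MPI processes directly with srun (no mpirun);
# PMI handles the wireup.
srun -n 16 ./my_mpi_app
```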
>> Next, we have to launch processes across the PCI bus. We had to do
>> an "rsh" launch of the MPI procs onto RR's cell processors as they
>> appeared to be separate "hosts", though only visible on the local
>> node (i.e., there was a stripped-down OS running on the cell) -
>> Paul's cmd line implies this may also be the case here. If the same
>> method works here, then we have most of that code still available
>> (needs some updating). We would probably want to look at whether or
>> not binding could be supported on the Phi local OS.
> I believe that is the case - you can login via SSH to them is my
> understanding. We've not got that far with ours yet..
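If the cards really do appear as SSH-reachable pseudo-hosts, the RR-style launch sketched above might look something like this. The host names (node01-mic0) and slot counts are assumptions, not something tested on Phi hardware:

```shell
# Hypothetical hostfile: each Phi card listed as its own host,
# alongside the node that hosts it.
cat > hostfile <<EOF
node01 slots=16
node01-mic0 slots=4
EOF

# Force the rsh/ssh launcher so the Phi "hosts" are reached over SSH
# rather than through the native resource-manager launcher.
mpirun --hostfile hostfile --mca plm rsh -n 20 ./my_mpi_app
```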
>> Finally, we have to wire everything up. This is where RR got a
>> little tricky, and we may encounter the same thing here. On RR, the
>> cell's didn't have direct access to the interconnects - any
>> messaging had to be relayed by a process running on the main cpu.
>> So we had to create the ability to "route" MPI messages from
>> processes running on the cells to processes residing on other
>> nodes.
>> Solving the first two is relatively straightforward. In my mind,
>> the primary issue is the last one - does anyone know if a process
>> on the Phi's can "see" interconnects like a TCP NIC or an
>> Infiniband adaptor?
> I'm not sure, but I can tell you that the Intel RPMs include an OFED
> install that looks like it's used on the Phi (if my reading is correct).
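One quick way to answer the interconnect question, assuming the card is SSH-reachable and the Intel OFED stack really is running on it (the host name here is an assumption):

```shell
# Run on the Phi itself: list any verbs-capable adaptors and the
# network interfaces visible from the card. If ibv_devices prints
# an HCA, the openib BTL has something to work with; otherwise
# traffic would have to be relayed through the host CPU.
ssh node01-mic0 'ibv_devices; /sbin/ip addr show'
```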
> --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci