Hmmm...well, a few points here. First, the Phis sadly don't show up in the hwloc tree, as they are apparently hidden behind the PCIe bridge. I don't know if there is a way for hwloc to "probe" and find processors on PCI cards, so that's something I'll have to defer to Jeff and Brice.
So the first problem is: how do we know the Phis are present, how many there are on each node, etc.? We could push that into something like the hostfile, but that requires that someone build the file. Still, it would only have to be built once, so maybe that's not too bad - we could have a "wildcard" entry if every node is the same, etc.
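To make the hostfile idea a bit more concrete, here is a rough sketch of how a "wildcard" entry could work. The `phis=N` key, the `*` wildcard, and the node names are all made up for illustration; none of this is existing hostfile syntax.

```python
# Hypothetical hostfile format (NOT existing Open MPI syntax):
#   <hostname or *> phis=<count>
# A "*" line sets the default for every node without its own entry.

def parse_phi_hostfile(text):
    """Return (per_host, default) Phi counts parsed from hostfile text."""
    per_host = {}
    default = 0
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        host = fields[0]
        count = 0
        for field in fields[1:]:
            key, _, value = field.partition("=")
            if key == "phis":
                count = int(value)
        if host == "*":
            default = count
        else:
            per_host[host] = count
    return per_host, default


def phis_on(host, per_host, default):
    """Phi cards assumed on `host`; an explicit entry wins over the wildcard."""
    return per_host.get(host, default)


EXAMPLE = """
* phis=2          # every node has two cards...
node07 phis=0     # ...except this one
"""

per_host, default = parse_phi_hostfile(EXAMPLE)
print(phis_on("node01", per_host, default))  # 2 (from the wildcard)
print(phis_on("node07", per_host, default))  # 0 (explicit entry)
```

With a homogeneous cluster the file would be a single wildcard line, which is about as cheap as a hand-built file can get.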
Next, we have to launch processes across the PCI bus. We had to do an "rsh" launch of the MPI procs onto RR's cell processors as they appeared to be separate "hosts", though only visible on the local node (i.e., there was a stripped-down OS running on the cell) - Paul's cmd line implies this may also be the case here. If the same method works here, then we have most of that code still available (needs some updating). We would probably want to look at whether or not binding could be supported on the Phi local OS.
Finally, we have to wire everything up. This is where RR got a little tricky, and we may encounter the same thing here. On RR, the cells didn't have direct access to the interconnects - any messaging had to be relayed by a process running on the main CPU. So we had to create the ability to "route" MPI messages from processes running on the cells to processes residing on other nodes.
Solving the first two is relatively straightforward. In my mind, the primary issue is the last one - does anyone know if a process on the Phis can "see" interconnects like a TCP NIC or an InfiniBand adapter?
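A toy sketch of that routing decision, just to show the shape of the problem. This is not the RR code; the `Proc` naming, the `relay_for` helper, and the choice to deliver same-node messages directly are all assumptions made for the illustration.

```python
from collections import namedtuple

# A process is named by node, unit ("host" for the main CPU, otherwise an
# accelerator name such as "cell0"), and rank. All names are hypothetical.
Proc = namedtuple("Proc", "node unit rank")


def relay_for(node):
    """The (hypothetical) relay process running on a node's main CPU."""
    return Proc(node, "host", "relay")


def route(src, dst):
    """Return the list of hops a message takes from src to dst.

    Assumption: accelerators cannot reach the interconnect themselves, so
    off-node traffic passes through a relay on the host CPU at each end.
    Same-node messages are assumed to be delivered directly.
    """
    hops = [src]
    if src.node != dst.node:
        if src.unit != "host":
            hops.append(relay_for(src.node))  # leave the node via its host CPU
        if dst.unit != "host":
            hops.append(relay_for(dst.node))  # enter the remote accelerator via its host
    hops.append(dst)
    return hops


a = Proc("n0", "cell0", 0)
b = Proc("n1", "cell1", 5)
for hop in route(a, b):
    print(hop)
```

If the accelerators turn out to have direct access to a NIC, `route` collapses to `[src, dst]` and none of the relay machinery is needed.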
I know Intel MPI (MPICH based) "just works" with Phi, but you need to do things like:
mpirun -n 2 -host cpu host.exe : -n 4 -host mic0 mic.exe
if you want to use the Phi for more than just kernel-offload (in which case they won't have/need an MPI rank).
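For anyone scripting this, a small sketch of how that colon-separated command line could be assembled from (count, host, executable) segments. It uses only the `-n` and `-host` flags that appear in the command quoted above; the `mpmd_command` helper is hypothetical, and I have not checked the flags against the Intel MPI documentation.

```python
def mpmd_command(segments, launcher="mpirun"):
    """Join (nprocs, host, executable) segments into one argv list.

    Each segment becomes "-n <nprocs> -host <host> <executable>", and
    segments are separated by a bare ":" as in the command quoted above.
    """
    argv = [launcher]
    for i, (nprocs, host, exe) in enumerate(segments):
        if i > 0:
            argv.append(":")  # separator between per-executable segments
        argv += ["-n", str(nprocs), "-host", host, exe]
    return argv


cmd = mpmd_command([(2, "cpu", "host.exe"), (4, "mic0", "mic.exe")])
print(" ".join(cmd))
# mpirun -n 2 -host cpu host.exe : -n 4 -host mic0 mic.exe
```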
So, launching procs is PART of the problem, but certainly not all of it.
At least, unlike RR, the processing elements all share the same endianness!
devel mailing list