I think the near future in compilers is going to be dominated by accelerator-capable LLVM variants (such as NVVM). Ideally, all that Nvidia, AMD, and Intel would (and should) need to do is provide LLVM-based backends. Frontends then just need a sane way of specifying SIMT. And no, loop + directive-based approaches like OpenMP and OpenACC are IMO just a stopgap measure that serves higher-ups the lie that porting is going to be swift and easy. There's basically one sane way to deal with data-parallel problems, and that's to treat what's going on in one core as a scalar program with precomputed indices (i.e. the CUDA / OpenCL way).
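To make that last point concrete, here's a minimal sketch of the SIMT programming model in plain Python (my own illustration, not from any particular framework): each "thread" is a scalar program over its own precomputed index, and the launcher stands in for the hardware's grid of threads.

```python
import numpy as np

def saxpy_kernel(i, a, x, y, out):
    # The body of one SIMT thread: pure scalar code.
    # The index i is precomputed -- no loop logic in the kernel itself.
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    # On a GPU these iterations would run as parallel hardware threads;
    # here we just loop sequentially to show the programming model.
    for i in range(n):
        kernel(i, *args)

n = 8
x = np.arange(n, dtype=np.float64)
y = np.ones(n)
out = np.empty(n)
launch(saxpy_kernel, n, 2.0, x, y, out)  # out[i] = 2*x[i] + y[i]
```

The appeal is that the kernel reads like ordinary scalar code; parallelism lives entirely in the launch, which is exactly what CUDA's `__global__` functions and OpenCL's work-items give you.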
Python has some good support for this already. I haven't done a project like this yet, but if I were setting up something fresh, I'd immediately jump on NumPy interoperating with my own kernels, plus some Python glue for the node-level parallelism.
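A rough sketch of what that setup could look like, under stated assumptions: `kernel` here is a pure-Python placeholder for a hand-written compiled kernel, and a thread pool stands in for the node-level glue (in practice something like mpi4py or a process pool).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def kernel(chunk):
    # Placeholder for a custom compiled kernel (CUDA, OpenCL, C, ...)
    # that NumPy arrays would be handed to via a thin binding layer.
    return np.sqrt(chunk) + 1.0

def run_distributed(data, n_workers=4):
    # The "Python glue": split the array, dispatch chunks to workers,
    # and reassemble the result.
    chunks = np.array_split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = list(ex.map(kernel, chunks))
    return np.concatenate(results)

data = np.arange(16, dtype=np.float64)
result = run_distributed(data)
```

The division of labor is the point: NumPy owns the array layout, the kernel owns the per-element math, and a few lines of glue own the distribution.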