If it's public domain, then no copyright is violated. I'm not talking about public-domain data; the G-G-GP specifically mentioned the possible legal interpretation that training on large amounts of publicly visible (but not public domain) data is itself a copyright violation.