PD12M

Yesterday, after a conversation on the #ai channel in WAO’s Slack, I published Ways of categorising ethical concerns relating to generative AI. There was some pushback on Mastodon.

Alan Levine asked if I knew of any LLMs which say they’re trained on “open data” and where you can actually see the sources. It’s a good point, and the closest thing I know of is actually a dataset rather than a model: Public Domain 12M (or PD12M for short). LLMs are a class of technologies, so (as I was trying to get at in my original post) we should be clear and specific in our objections to them.

Although I don’t share the concern, I understand the position, which could broadly be stated as: “I have a problem with LLM datasets being scraped from the open web without the explicit consent of copyright holders.” But that’s not a position against LLMs per se; it’s an objection based on the copyright status of the ingested data.

“At 12.4 million image-caption pairs, PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.”

Source: Source.Plus
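
If you want to poke at PD12M yourself, here’s a minimal sketch using the Hugging Face datasets library. It assumes the dataset is mirrored on Hugging Face under Spawning/PD12M and that each row exposes caption and url fields; those names are assumptions on my part, so check the dataset card for the actual schema.

```python
# Minimal sketch for inspecting PD12M metadata.
# Assumptions: the dataset lives on Hugging Face as "Spawning/PD12M"
# and rows include "caption" and "url" fields; verify via the dataset card.
from datasets import load_dataset

# Stream rather than downloading all 12.4 million image-caption pairs up front.
ds = load_dataset("Spawning/PD12M", split="train", streaming=True)

# Print the first few caption/URL pairs.
for row in ds.take(3):
    print(row.get("caption"), "->", row.get("url"))
```

Because each record carries a pointer back to its source image, you can actually see where the training data comes from, which is exactly the kind of transparency Alan was asking about.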