I see the exact opposite - any open source model is going to become prohibitively expensive to train if quality data costs billions of dollars. We're going to be left with the OpenAIs and Googles of the world as the only players in the space until someone solves synthetic data.
Exactly this. I work at a small web scraping company (so I might be a bit biased), and today any small business can collect a fairly capable dataset of public data for model training, sentiment analysis, or whatever else. If using public data is blocked by copyright, as this lawsuit implies, that would just mean only giant corporations and pirates could afford to do this.
This would be a huge blow to open-source and research developers, and I'd even argue it could help OpenAI build a bit of a moat, à la regulatory capture.
Research is fair use, and providing something amazing like Wikipedia is arguably educational (again, fair use). Reselling NYT articles on demand via an API is by itself neither, so likely not fair use.
Fair use is irrelevant here, as no small business would ever risk being dragged through court, even when they're in the right. Especially since breaking ToS and "business damage" are the easiest claims to attach to any lawsuit involving the digital space.
You may remember the Google Books lawsuit, where Google was digitally copying entire books and making them searchable online.
Google won that suit under fair use: the massive searchable database was found to be transformative, along with the non-commercial nature of the use.
So, if your web scraping company's goal is to let people bypass a paywall, I suspect you'll have trouble in the future. If instead your company, say, lets people do market analysis on how many people need a piano tuner in NYC, and it doesn't do that by copying the NYT article that did the original research, I think you'll be fine.