Understanding the Limitations of AI Access to the Web


Summary

Understanding the limitations of AI access to the web involves recognizing challenges like ethical concerns, legal restrictions, and technical barriers in using web data for AI training. These issues highlight the need for responsible data collection practices and regulatory compliance to ensure sustainable AI development.

  • Respect data ownership: Ensure compliance with copyright, intellectual property, and data protection laws to avoid misuse of web data for AI training.
  • Prioritize consent: Provide clear options for website owners to opt in or out of allowing their data to be scraped by AI systems, acknowledging the importance of transparency and trust.
  • Address ethical dilemmas: Consider the societal impacts and potential risks to individuals when developing or deploying AI models trained on large-scale web-scraped data.
  • Barak Turovsky

    Chief AI Officer at GM | Ex Google AI


    Generative AI providers will face growing challenges in scraping the Web for publicly available training data. While courts have historically taken the view that scraping public sites is generally allowed, the complex nature of copyright law (e.g. IP, terms of service, etc.) across many countries, plus the recent trend of API and scraping lockdowns or price hikes (see Twitter, Reddit, Stack Overflow, etc.), could make training data access more prohibitive, slow, and costly, as evidenced by a slew of lawsuits (https://lnkd.in/gDqpU-9m, https://lnkd.in/gM4K5Ekr) against OpenAI, Meta, Google, and others, alleging IP violations in the training of LLMs. On that front, Google is expected to have a huge advantage in being able to leverage its position in search: it is likely to compel most content owners to grant Google the rights to train its models as part of Google's crawling effort.

  • Odia Kagan

    CDPO, CIPP/E/US, CIPM, FIP, GDPRP, PLS, Partner, Chair of Data Privacy Compliance and International Privacy at Fox Rothschild LLP


    The UK Information Commissioner's Office has issued, for public comment, guidance on the lawful basis for scraping data from the web to train AI (Chapter 1 of the Consultation). Key points:
    🔹 Most developers of generative AI rely on publicly accessible sources for their training data, usually through web scraping.
    🔹 To be fair and lawful, your data collection can't be in breach of any laws - this will not be met if the scraping of personal data infringes legislation outside of data protection, such as intellectual property or contract law.
    🔹 Legitimate interests can be a valid lawful basis for training generative AI models on web-scraped data, but only when the model's developer can ensure they pass the three-part test.
    Purpose test: is there a valid interest?
    🔹 Despite the many potential downstream uses of a model, you need to frame the interest in a specific, rather than open-ended, way, based on what information you have access to at the time of collecting the training data.
    🔹 If you don't know what your model is going to be used for, how can you ensure its downstream use will respect data protection and people's rights and freedoms?
    Necessity test: is web scraping necessary given the purpose?
    🔹 The ICO's understanding is that currently, most generative AI training is only possible using the volume of data obtained through large-scale scraping.
    Balancing test: do individuals' rights override the interest of the generative AI developer?
    🔹 Collecting data through web scraping is an 'invisible processing' activity.
    🔹 Invisible processing and AI-related processing are both seen as high-risk activities that require a DPIA under ICO guidance.
    Risk mitigations to consider in the balancing test. If you are the developer and rely on the public interest of wider society for the first part of the test, you should be able to:
    🔹 control and evidence whether the generative AI model is actually used for the stated wider societal benefit;
    🔹 assess risks to individuals (both in advance during generative AI development and as part of ongoing monitoring post-deployment);
    🔹 implement technical and organisational measures to mitigate risks.
    If you deploy a third-party model through an API:
    🔹 The developer should implement TOMs (e.g. output filters) and organizational controls over the deployment, such as limiting queries (preventing those likely to result in risks or harms to individuals) and monitoring the use of the model.
    🔹 Use contractual restrictions and measures, with the developer legally limiting the ways in which the generative AI model can be used by its customers.
    If you provide a model to a third party:
    🔹 Use contractual controls to mitigate the risks of a lack of control over how the model is used - but these might not be effective.
    🔹 You need to evidence that any such controls are being complied with in practice.
    #dataprivacy #dataprotection #privacyFOMO https://lnkd.in/eev_Qhah
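    The "output filters" the ICO consultation mentions as a technical measure (TOM) can be sketched in a few lines. This is a minimal, hypothetical illustration, not anything prescribed by the ICO: it redacts email addresses from a model's output before it reaches the user, and the regex and redaction policy are assumptions for the example.

    ```python
    import re

    # Illustrative output filter: redact email addresses (one kind of
    # personal data) from generated text before returning it to a user.
    # The pattern is a simple heuristic, not a complete PII detector.
    EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    def filter_output(text: str) -> str:
        """Replace anything that looks like an email address with a placeholder."""
        return EMAIL_RE.sub("[redacted email]", text)

    print(filter_output("Contact alice@example.com for details."))
    # Contact [redacted email] for details.
    ```

    A production deployment would pair a filter like this with query limits and usage monitoring, per the organizational controls listed above.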

  • Shelly Palmer

    Professor of Advanced Media in Residence at S.I. Newhouse School of Public Communications at Syracuse University


    OpenAI will now allow website operators to block its web crawler by updating their site's robots.txt file or by directly blocking the IP addresses of OpenAI's GPTBot. Either technique will ensure that a site is not scraped for AI training by OpenAI. This is an obvious approach; I wrote about the need for it back in February while wondering if conversational AI would kill web traffic. AI training techniques are the focus of intense debate. OpenAI's GPT models, like many large language models, rely heavily on vast amounts of internet data for training. However, the ethics of sourcing this data – especially without explicit consent – has been a hot topic. Platforms like Reddit and Twitter have already begun pushing back against the unrestricted use of their content by AI entities. Moreover, legal challenges have arisen, with creatives alleging unauthorized use of their works by AI companies. By allowing sites to opt out, OpenAI is acknowledging the importance of consent in the data collection process. It's a step (albeit a small one) toward a more transparent and ethical AI ecosystem… but what about the data already ingested? ChatGPT is happy to tell you that it has ingested everything it could find prior to September 2021. Who do we see about that? Choose your metaphor: the cat's out of the bag, the genie's out of the bottle, can't put the toothpaste back in the tube, etc. "OpenAI's GPTBot: A Step Towards Ethical Web Crawling?" #ai #artificialintelligence #openai #chatgpt #gptbot #privacy
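    The robots.txt opt-out described above can be verified with Python's standard-library robots.txt parser. The snippet below uses "GPTBot", the user-agent token OpenAI documents for its crawler; the site URL and the exact rules are illustrative assumptions.

    ```python
    from urllib.robotparser import RobotFileParser

    # Illustrative robots.txt rules: block OpenAI's GPTBot from the whole
    # site while leaving all other crawlers unrestricted.
    robots_txt = """\
    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Allow: /
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # GPTBot is denied access to every path on the (hypothetical) site...
    print(parser.can_fetch("GPTBot", "https://example.com/article.html"))    # False
    # ...while other user agents remain free to crawl it.
    print(parser.can_fetch("Googlebot", "https://example.com/article.html")) # True
    ```

    A well-behaved crawler fetches `/robots.txt` and runs exactly this kind of check before requesting a page; the opt-out only works if the crawler honors the protocol.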
