
The Risks of LLMs and Generative AI

[Modified version originally published as International Insights Article: Privacy implications for organizations using generative AI, by Lily Li, on OneTrust DataGuidance, June 2023.]

Well, the cat is out of the bag – or at least the chat is. Generative AI and large language models (“LLMs”) are here to stay. From philosophical conversations between the dead to Murakami-inspired artworks for downtown LA, the possibilities of user-friendly AI are limitless. Regulators are scrambling to enforce existing legislation and enact new legislation to contain this trend. But, like all enforcement, it will take time. As a result, many companies are moving quickly to adopt and deploy these tools, testing the legal and ethical boundaries of AI.

To stay competitive, companies should not wait for data protection regulators to play cat-and-mouse games with these nascent technologies. Instead, companies need to be proactive and adopt strategies to implement transparent and trustworthy AI – not just to avoid lawsuits and regulatory fines – but to protect their data and their brands. Companies also need to be able to account for the data they input into their generative AI or LLM algorithms, or else risk destruction of these algorithms altogether.

In this article, we’ll discuss the latest privacy and security risks from generative AI and LLMs, a few of the existing privacy laws that apply to these technologies, and the potential for algorithmic disgorgement or deletion in response to privacy violations.

 

Social Engineering and Identity Verification

Generative AI has clearly passed the Turing test. From all outward appearances, companies and their employees cannot tell the difference between human-generated and AI-generated text. This makes it easier for traditional phishing emails and other scams to look legitimate to readers — making it far more likely for employees to click on malicious links and download malware.

Going one step further, generative AI can create realistic identities. From resumes and cover letters to online social media profiles and sample work product, these tools can improve a threat actor’s ability to pass itself off as a well-rounded individual, bypassing normal screening tools and even HR processes. In this era of remote work, it is easy to imagine malicious actors being hired and onboarded on the strength of fabricated “skills” and turning into insider threats once they gain access to company systems. This risk increases for companies that rely on virtual assistants and virtual employees, where there are even fewer external validations of identity.

While companies often rely on phishing training and cyber insurance to mitigate traditional cyber-attacks, this is no longer enough. Many cyber insurance policies exclude social engineering attacks, exclude activities involving managers or other high-level employees, or limit coverage of social engineering and phishing to technological attacks rather than traditional identity theft, crime, and fraud. Consequently, companies should consider AI-based email filtering and EDR/MDR systems to combat sophisticated phishing attacks. Security awareness training should extend beyond phishing simulations to include identity verification and the reporting of suspicious activity across the organization. Companies should also update HR and vendor onboarding policies to include in-person vetting or other external validation for recruiting and outsourcing.

 

Privacy and DSAR Risks

  • Is Processing of Personal Data for Generative AI Lawful?

Large language models, and similar machine learning tools, have a privacy problem. These systems rely on processing vast quantities of public and sometimes proprietary data to generate responses and analysis. Absent further safeguards, these inputs will likely contain personal data, which raises the question: where does this data come from, and is the processing lawful?

This question came to a head recently in Italy, where the data protection authority issued a temporary ban on ChatGPT,[1] citing OpenAI’s failure to provide transparent notices regarding how it processes the personal data of users and data subjects (required under Articles 12, 13, and 14 of the GDPR). More importantly, the authority found no legal basis under Article 6 of the GDPR for the collection and processing of personal data to train OpenAI’s algorithms. Impacted data subjects did not consent to the processing and, reading between the lines, OpenAI’s legitimate interest was an insufficient basis for processing given: (i) the failure to provide notice; (ii) the inability to correct and delete data; and (iii) the heightened privacy risks for children due to the lack of age verification techniques.

OpenAI subsequently addressed Italy’s concerns in sufficient detail to resume services,[2] but it remains unclear whether other data protection regulators in the EU will also confront OpenAI over the GDPR’s transparency and lawful bases requirements. If businesses utilize generative AI and LLMs, they should be prepared to provide compliant privacy notices to data subjects, and either obtain their explicit consent or conduct a legitimate interest analysis prior to submitting any personal data to AI or LLM platforms.

These data privacy risks also exist in the United States. The California Consumer Privacy Act (CCPA), as amended by the California Privacy Rights Act (“CPRA”), also requires businesses to provide transparent privacy notices and privacy rights to individuals. In addition, CPRA has imported the GDPR concepts of data minimization and proportionality. Personal data processing needs to be “reasonably necessary and proportionate to achieve the purposes for which the personal information was collected or processed, or for another disclosed purpose that is compatible with the context in which the personal information was collected.”[3] Consequently, companies should be wary of taking existing datasets containing personal information and running them through generative AI systems, if this use runs contrary to the expectations of data subjects when they originally submitted the data. Companies may need to re-evaluate their privacy notices and provide further notices regarding AI processing.

Furthermore, both the GDPR and the CPRA (and similar US state laws) require covered organizations to give individuals the right to opt out of automated processing or automated decision-making, including profiling.[4] While California regulators have yet to issue regulations concerning automated decision-making, any such rules will likely align with GDPR concepts. This means that individuals will have the right to opt out of AI systems making decisions that have legal effects, such as those surrounding employment, housing, or access to services and benefits. So, for those who are wondering, you can’t have chatbots all the way down: eventually, there needs to be a human decisionmaker at the end of the line, as sketched below.
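To make the “human at the end of the line” point more concrete, here is a minimal sketch of how an automated eligibility decision might be routed to a human reviewer when an individual has opted out of automated decision-making. The `Applicant` class, the `model_score` placeholder, and the 0.7 threshold are hypothetical illustrations, not requirements drawn from the GDPR or CPRA.

```python
# Illustrative sketch (not from any statute): route decisions that carry legal
# effects to a human reviewer when the individual has opted out of automated
# decision-making. All names and thresholds here are hypothetical.
from dataclasses import dataclass

@dataclass
class Applicant:
    name: str
    opted_out_of_automation: bool  # e.g., a GDPR Article 22 / CPRA opt-out on file

def model_score(applicant: Applicant) -> float:
    """Placeholder for an AI model's eligibility score between 0.0 and 1.0."""
    return 0.5

def decide(applicant: Applicant, has_legal_effect: bool) -> str:
    # If the decision has legal or similarly significant effects and the person
    # has opted out, queue it for a human decisionmaker instead of the model.
    if has_legal_effect and applicant.opted_out_of_automation:
        return "pending_human_review"
    return "approved" if model_score(applicant) >= 0.7 else "denied"

print(decide(Applicant("Jane Example", opted_out_of_automation=True),
             has_legal_effect=True))  # -> pending_human_review
```

In practice, the routing rule would come from counsel’s reading of the applicable regulations, but the structural point stands: the automated path needs an explicit off-ramp to a human reviewer.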

 

  • Who Owns the Data? Privacy Rights to Correct and Delete

Generative AI and LLMs also call into question the ownership and control of personal data. GDPR, CCPA, HIPAA, and GLBA, among other regulations, require covered entities to obtain contractual commitments from vendors that process personal data, PHI, or NPI on their behalf.[5] By giving company personal data to an AI system without formal review, companies may be violating these laws, trading away the privacy of their customers, and giving up valuable IP to third parties.

To combat this problem, companies should always read the terms and privacy policies of any new AI and LLM tools to confirm, as an initial step:

  • The company owns all content provided to the AI system and any output generated by the AI
  • The AI provider will provide appropriate technical and organizational measures to protect personal data
  • The AI provider will maintain the confidentiality of data and limit use of the data to those purposes disclosed by the AI provider (and similarly, disclosed by the company to the relevant data subjects)
  • The AI provider will assist the company in responding to privacy requests, including those that require correction and deletion of personal data
  • The AI provider has appropriate data transfer mechanisms in place if personal data will cross borders

Assuming the generative AI or LLM terms and privacy policies cover the items above, the company may need to negotiate additional clauses under GDPR, CCPA, HIPAA, and GLBA depending on whether regulated data is provided to these platforms. If these contractual commitments do not exist, then companies should consider policies prohibiting the disclosure of personal or proprietary data — or else risk unauthorized access or even public disclosure of this information.

Even if the terms and privacy policies guarantee the confidentiality of data, companies should still validate whether the generative AI or LLM model appropriately de-identifies or anonymizes personal or proprietary data when it improves its language models. One of the most concerning issues with generative AI is its lack of explainability: often even the programmers creating the model do not understand how the AI generates its output. Thus, even if a data subject submits a deletion or correction request, it is unclear whether this request will propagate through the model to remove or amend information that was previously fed into it. Consequently, companies should test any generative AI or LLM model with known inputs to confirm whether identifiable data comes back out, as in the sketch below.
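By way of illustration, the sketch below shows one way to probe a model for regurgitation of previously submitted personal data: feed it prompts built around known seed records and flag any response that echoes those identifiers back. The seed records, probe prompts, and `query_fn` hook are all hypothetical; a real test would call the provider’s actual API and use a much broader probe set.

```python
# Minimal sketch of a PII regurgitation check for a generative AI system.
# query_fn stands in for the provider's actual completion API; the seed
# records and probe prompts below are purely illustrative.
import re

SEED_RECORDS = [  # examples of personal data previously submitted
    {"name": "Jane Example", "email": "jane.example@example.com"},
    {"name": "John Sample", "phone": "555-0100"},
]

PROBE_TEMPLATES = [
    "What do you know about {name}?",
    "Provide contact details for {name}.",
]

def find_leaks(records, templates, query_fn):
    """Return (prompt, leaked_value) pairs where the model echoes seed PII."""
    leaks = []
    for record in records:
        for template in templates:
            prompt = template.format(name=record["name"])
            response = query_fn(prompt)
            for field, value in record.items():
                if field != "name" and re.search(re.escape(value), response, re.IGNORECASE):
                    leaks.append((prompt, value))
    return leaks

if __name__ == "__main__":
    # Stand-in model: replace with a call to the real LLM endpoint.
    stub_model = lambda prompt: "I have no information about that person."
    for prompt, value in find_leaks(SEED_RECORDS, PROBE_TEMPLATES, stub_model):
        print(f"Possible leak of '{value}' for prompt: {prompt}")
```

A hit in a test like this does not prove non-compliance on its own, but it is a quick signal that deletion or correction requests may not reach data already absorbed into the model.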

Finally, even if a company does not input personal information into a generative AI or LLM platform, employees may be tempted to use these platforms to research or create media about a known individual. Unfortunately, generative AI regularly creates false information about individuals. At best, this may trigger notice obligations to data subjects under Article 14 of the GDPR, including disclosure of “from which source the personal data originate, and if applicable, whether it came from publicly accessible sources,” so that individuals are aware of the processing and can exercise any privacy rights. At worst, publication of this personal data may be grounds for a defamation lawsuit. Once again, companies need to implement robust identity verification and external validation of AI output concerning personal data.

 

  • Children’s Privacy

The impact of generative AI and LLM products on children will be tremendous, given the ease and accessibility of chatbots, and the vast potential for personalized education, gaming, and social services. Companies operating in this space should pay close attention to children’s privacy rules that may impact their use or provision of generative AI and LLM products and services. California’s Age-Appropriate Design Code, modeled after the UK’s Age Appropriate Design Code, for instance, requires data protection impact assessments and a “high level” of privacy for online providers of services, products, or features that are “likely to be accessed by children.”[6] This law covers children under the age of 18. In addition, COPPA – a US federal privacy law – requires clear and conspicuous privacy notices and affirmative consent by parents prior to collection of personal information from children under 13. Companies that offer products and services that may be attractive to children will need to implement these heightened privacy requirements, or in the alternative, implement robust age-gating techniques, as sketched below.
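As a loose illustration of the age-gating point, the snippet below sketches a neutral age gate that blocks personal data collection for users under 13 absent verifiable parental consent and defaults users under 18 to heightened privacy settings. The function names and returned settings are hypothetical; this is a sketch of the control, not a complete compliance mechanism.

```python
# Illustrative age-gate sketch: the thresholds track COPPA (under 13) and the
# California Age-Appropriate Design Code (under 18); the settings returned
# are hypothetical names, not terms from either law.
from datetime import date
from typing import Optional

def age_from_birthdate(birthdate: date, today: Optional[date] = None) -> int:
    today = today or date.today()
    return today.year - birthdate.year - (
        (today.month, today.day) < (birthdate.month, birthdate.day)
    )

def privacy_settings(birthdate: date, verified_parental_consent: bool) -> dict:
    age = age_from_birthdate(birthdate)
    if age < 13 and not verified_parental_consent:
        # COPPA: no collection of personal information from children under 13
        # without verifiable parental consent.
        return {"allow_data_collection": False, "reason": "parental_consent_required"}
    if age < 18:
        # Age-Appropriate Design Code: default minors to a high level of privacy.
        return {"allow_data_collection": True, "profiling": False, "precise_geolocation": False}
    return {"allow_data_collection": True, "profiling": True, "precise_geolocation": True}

print(privacy_settings(date(2015, 6, 1), verified_parental_consent=False))
```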

 

Regulatory Enforcement and Algorithmic Disgorgement

Once an AI system is trained on bad data, can it be saved? According to the U.S. Federal Trade Commission (FTC) – perhaps not. While there is currently no comprehensive federal legislation in the United States governing privacy or AI, the FTC does have the ability to regulate “unfair or deceptive acts or practices in or affecting commerce.”[7] The FTC has interpreted its enforcement power to include unfair and misleading practices regarding the collection and use of personal data – including, for example, actions against Cambridge Analytica for its harvesting of Facebook user data, and against GoodRx Holdings for its unauthorized disclosures of consumers’ personal health information to Facebook, Google, and other companies.[8]

The FTC’s scrutiny of privacy and security practices extends to AI. In January 2021, the FTC entered a settlement order with the photo storage service Everalbum over allegations that it deceived consumers about its use of facial recognition technology.[9] While Everalbum allegedly represented that it would not apply facial recognition to users’ content unless they opted in, it applied facial recognition technology by default for most users without any ability to turn this feature off. As part of the settlement order, the FTC required Everalbum to delete all facial recognition models or algorithms developed with Everalbum users’ photos or videos.

More recently, the FTC required algorithmic destruction in an action against WW International, Inc., formerly known as Weight Watchers, and a subsidiary called Kurbo, Inc.[10] According to FTC Chair Lina Khan, “Weight Watchers and Kurbo marketed weight management services for use by children as young as eight, and then illegally harvested their personal and sensitive health information…. Our order against these companies requires them to delete their ill-gotten data, destroy any algorithms derived from it, and pay a penalty for their lawbreaking.”

Thus, AI companies face potential deletion or disgorgement of their algorithms if they collect personal data in an unfair or deceptive manner. While it may be tempting to amass larger and larger datasets to build the best algorithms, companies that rely on improper collection of data may find themselves bereft of their most valuable intellectual property.

 

Move Deliberately and Create Things

Generative AI and LLMs do not operate in a vacuum. They derive from the voices, both inspired and insipid, from all corners of the world wide web. And they create fabulous and fabulously weird content. We encourage companies to take advantage of generative AI and LLMs to create the next generation of personalized education, medicine, and creative exploration. At the same time, we encourage companies to be mindful of the existing rules that protect our privacy, so that transparent and trustworthy AI can be the foundation of these new creations.

 


[1] https://www.garanteprivacy.it/web/guest/home/docweb/-/docweb-display/docweb/9870847

[2] https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/9881490#english

[3] Cal. Civ. Code Section 1798.100(c)

[4] GDPR, Article 22; Cal. Civ. Code Section 1798.185(a)(16)

[5] See, e.g., GDPR, Article 28; Cal. Civ. Code Section 1798.140(ag)(1); 45 CFR Section 164.504(e) (Business Associate requirements under HIPAA)

[6] Cal. Civ. Code Section 1798.99.31(a)

[7] 15 U.S.C. Sec. 45(a)(1)

[8] See https://www.ftc.gov/news-events/topics/protecting-consumer-privacy-security/privacy-security-enforcement for a list of FTC enforcement actions concerning privacy and cybersecurity

[9] https://www.ftc.gov/news-events/news/press-releases/2021/01/california-company-settles-ftc-allegations-it-deceived-consumers-about-use-facial-recognition-photo

[10] https://www.ftc.gov/news-events/news/press-releases/2022/03/ftc-takes-action-against-company-formerly-known-weight-watchers-illegally-collecting-kids-sensitive