Datasets used for Apple's generative AI systems and services
January 1, 2026
This page provides a high-level summary of datasets used in the development of the generative AI systems and services Apple offers. Apple integrates powerful generative AI into the apps and experiences people use every day, all while respecting the privacy of user data. We believe in training our models using diverse and high-quality data.
Sources and owners of datasets. Apple trains generative AI models using a mixture of data, including publicly available information crawled by Apple’s web crawler Applebot, data licensed or purchased from third parties, open-source data, data obtained through user studies, and synthetic data. Applebot crawls information that is publicly available on the internet; it does not crawl websites that require login credentials or that are protected by a paywall. Applebot respects standard robots.txt directives, which web publishers can use to direct Applebot not to crawl their website, or to direct Apple not to use their website content to train foundation models.
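As an illustration of the robots.txt mechanism described above, a publisher could include directives like the following. The user-agent names here reflect Apple's published crawler documentation (Applebot for crawling, Applebot-Extended for controlling training use), but should be verified against that documentation before relying on them:

```
# Allow Applebot to crawl the site for search-related features,
# but direct Apple not to use the content to train foundation models.
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /
```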
Intended purpose. Apple uses this data to develop and train generative AI models to understand and create language and images, and to simplify and accelerate everyday tasks. That purpose requires a large and diverse set of data reflecting, for example, the structure of language, imagery, audio, software code, and mathematical reasoning.
Data points, including counts and types. Training data for generative AI models collectively includes trillions of individual data points. Data points may include a text label describing the content of the data, such as a caption for images or a transcription for audio. Some labels or annotations are included in the original data. Others are manually added by a human reviewer or automatically generated.
Inclusion of public domain data or data protected by copyright, trademark, or patent. Datasets for model training include both data from the public domain and data subject to intellectual property rights. For example, data used to train generative AI models includes data that has been directly licensed to Apple and data made available pursuant to licenses, such as common open-source licenses, that permit use of the data in the development of generative AI systems.
Purchase or licensing of datasets. Datasets for generative AI models include data licensed or purchased from third parties.
Inclusion of personal information. Apple does not use its users’ private personal data or user interactions when training its foundation models. Additionally, for content publicly available on the internet that has been crawled by Applebot, Apple applies filters to remove certain categories of personally identifiable information, such as Social Security numbers and credit card numbers, from training data. Apple does not attempt to identify individuals or create profiles from publicly available data on the internet. Apple also provides the ability to object to the crawling of URLs containing personal data (for example, your blog) for purposes of training generative AI models.
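The kind of filtering described above can be sketched with a toy pattern-based redactor for the two PII categories named in this document. This is purely illustrative, assuming simple regular-expression detection plus a Luhn checksum to reduce false positives; production pipelines use far more robust detection than this sketch:

```python
import re

# Toy patterns (assumptions, not a production PII detector):
# a US Social Security number in dashed form, and a 13-16 digit
# card-like run that starts and ends with a digit.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits, to reduce false positives
    on arbitrary digit runs that merely look card-like."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def redact_pii(text: str) -> str:
    """Replace SSN-like spans with [SSN] and Luhn-valid
    card-like spans with [CARD]."""
    text = SSN_RE.sub("[SSN]", text)

    def card_repl(match: re.Match) -> str:
        return "[CARD]" if luhn_valid(match.group(0)) else match.group(0)

    return CARD_RE.sub(card_repl, text)
```

A digit run that fails the Luhn check (for example, an order number) is left untouched, which is why the checksum step matters.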
Inclusion of aggregate consumer information. For Apple users who opt in to share Device Analytics with Apple, Apple may use privacy-preserving techniques to collect data about aggregated trends, including about the content processed by Apple Intelligence, in order to improve Apple Intelligence. As a result of these protections, Apple can use this aggregate data to understand how to improve Apple Intelligence features without collecting individual user data or content.
Cleaning, processing, or other modification of datasets. Apple filters web-crawled data and publicly available datasets both at the time the data is crawled or imported and as part of post-acquisition processing prior to training. The data is managed both to limit the use of low-quality data and to remove content that is undesirable or unsafe. For example, Apple performs plain-text extraction and several filtering passes on data crawled by Applebot: filtering of unsafe, profane, inappropriate, spam, and financial-data content, as well as quality filtering, using heuristics and model-based classifiers; global fuzzy de-duplication using locality-sensitive n-gram hashing; and decontamination against common pre-training benchmark datasets. Different techniques are used to filter datasets, including manual and algorithmic ranking of content, use of heuristics, and use of machine learning models.
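The fuzzy de-duplication step mentioned above can be sketched with a small MinHash over word n-grams, which is one standard locality-sensitive-hashing approach to near-duplicate detection. This is a generic illustration of the technique, not Apple's actual pipeline; the shingle size, hash count, and threshold below are arbitrary assumptions:

```python
import hashlib

def ngrams(text: str, n: int = 3) -> set:
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingles: set, num_hashes: int = 64) -> list:
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over all shingles. Similar shingle sets tend to share
    minimum values, which is the locality-sensitive property."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(),
                                digest_size=8).digest(), "big")
            for s in shingles))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates the
    Jaccard similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_duplicates(a: str, b: str, threshold: float = 0.8) -> bool:
    return estimated_jaccard(minhash_signature(ngrams(a)),
                             minhash_signature(ngrams(b))) >= threshold
```

At corpus scale, signatures are typically banded and bucketed so that only documents sharing a bucket are compared, avoiding the quadratic pairwise comparison this sketch implies.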
Time period for collection. Apple has been collecting textual data for training since 2018 and image data for training since 2020. Data collection remains ongoing.
Dates the datasets were first used. Text data was first used in 2018 and image data in 2020 for the development of generative AI systems and services.
Use of synthetic data. Apple uses generated text, images, audio, and other content to supplement datasets containing real-world data. This category of data is used to enhance the other corpora, including synthetic image caption data, question-answer pairs, and language data. Apple also uses synthetic data generation for post-training, including supervised fine-tuning.