Learning Transferable Visual Models From Natural Language Supervision
https://arxiv.org/abs/2103.00020
-
Datasets
from AI Engineering:
Resources for Publicly Available Datasets
Here are a few resources where you can look for publicly available datasets. While you should take advantage of available data, you should never fully trust it. Data needs to be thoroughly inspected and validated.
Always check a dataset's license before using it. Try your best to understand where the data comes from. Even if a dataset has a license that allows commercial use, it's possible that part of it comes from a source that doesn't.
- Hugging Face (https://oreil.ly/tlt5h) and Kaggle (https://oreil.ly/g8A4a) each host hundreds of thousands of datasets.
- Google has a wonderful and underrated Dataset Search (https://oreil.ly/TgOaR).
- Governments are often great providers of open data. Data.gov (https://data.gov) hosts hundreds of thousands of datasets, and data.gov.in (https://data.gov.in) hosts tens of thousands.
- University of Michigan's Institute for Social Research (https://oreil.ly/VhVzp) ICPSR has data from tens of thousands of social studies.
- UC Irvine's Machine Learning Repository (https://oreil.ly/jAR9e) and OpenML (https://oreil.ly/d-Yty) are two older dataset repositories, each hosting several thousand datasets.
- The Open Data Network (https://oreil.ly/_tW6P) lets you search among tens of thousands of datasets.
- Cloud service providers often host a small collection of open datasets; the most notable one is AWS's Open Data (https://oreil.ly/DZ5uV).
- ML frameworks often have small pre-built datasets that you can load while using the framework, such as TensorFlow datasets (https://oreil.ly/HMJX_).
- Some evaluation harness tools host evaluation benchmark datasets that are sufficiently large for PEFT finetuning. For example, EleutherAI's lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) hosts 400+ benchmark datasets, averaging 2,000+ examples per dataset.
- The Stanford Large Network Dataset Collection (hts://oreilye_B) is a great repository for graph datasets.
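Note to self on "inspect and validate": a minimal sketch of a first-pass check, assuming the dataset has been downloaded as a CSV. The column names (`text`, `label`, `license`), the sample rows, and the helper `inspect_rows` are all hypothetical, made up for illustration. It only counts empty required fields and exact duplicate rows; real validation would go much further.

```python
import csv
import io
from collections import Counter


def inspect_rows(rows, required_columns):
    """Return a small report of basic quality problems in a list of dict rows."""
    missing = Counter()  # column name -> number of rows with an empty or absent value
    seen = set()         # fingerprints of rows already encountered
    duplicates = 0       # number of rows that exactly repeat an earlier row

    for row in rows:
        for col in required_columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                missing[col] += 1

        # A row's fingerprint is its sorted (column, value) pairs.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


# Hypothetical sample standing in for a downloaded dataset file.
SAMPLE_CSV = """text,label,license
good movie,positive,CC-BY-4.0
,negative,CC-BY-4.0
good movie,positive,CC-BY-4.0
bad plot,negative,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = inspect_rows(rows, required_columns=["text", "label", "license"])
print(report)
```

A row with an empty `license` field is the kind of thing worth chasing down, since part of a dataset can come from a source with different terms than the dataset as a whole.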
Free Datasets
UI Modelling - annotated UIs
Datasets - Roboflow (like HuggingFace for data)
Dataset - Anthropic use of AI by job category
-
MoE (Mixture of Experts)
An Introduction to Vision-Language Modeling
Vision AI - VLMs and CNNs
- Vision Language Models
- Convolutional Neural Networks
Multi-modal LLMs
Movies: rated [for collaborative filtering]
Book by Marvin Minsky
Same author: Designing Machine Learning Systems (DMLS). DMLS focuses on building applications on top of traditional ML models, which involves more tabular data annotations, feature engineering, and model training.