Learning Transferable Visual Models From Natural Language Supervision
https://arxiv.org/abs/2103.00020

Datasets

from AI Engineering:

Resources for Publicly Available Datasets

Here are a few resources where you can look for publicly available datasets. While you should take advantage of available data, you should never fully trust it. Data needs to be thoroughly inspected and validated.

Always check a dataset's license before using it. Try your best to understand where the data comes from. Even if a dataset has a license that allows commercial use, it's possible that part of it comes from a source that doesn't:

Hugging Face (https://oreil.ly/tlt5h) and Kaggle (https://oreil.ly/g8A4a) each host hundreds of thousands of datasets.
Google has a wonderful and underrated Dataset Search (https://oreil.ly/TgOaR).
Governments are often great providers of open data. Data.gov (https://data.gov) hosts hundreds of thousands of datasets, and data.gov.in (https://data.gov.in) hosts tens of thousands.
University of Michigan's Institute for Social Research (https://oreil.ly/VhVzp) ICPSR has data from tens of thousands of social studies.
UC Irvine's Machine Learning Repository (https://oreil.ly/jAR9e) and OpenML (https://oreil.ly/d-Yty) are two older dataset repositories, each hosting several thousand datasets.
The Open Data Network (https://oreil.ly/_tW6P) lets you search among tens of thousands of datasets.
Cloud service providers often host a small collection of open datasets; the most notable one is AWS's Open Data (https://oreil.ly/DZ5uV).
ML frameworks often have small pre-built datasets that you can load while using the framework, such as TensorFlow datasets (https://oreil.ly/HMJX_).
Some evaluation harness tools host evaluation benchmark datasets that are sufficiently large for PEFT finetuning. For example, EleutherAI's lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) hosts 400+ benchmark datasets, averaging 2,000+ examples per dataset.

The Stanford Large Network Dataset Collection (hts://oreilye_B) is a great repository for graph datasets.
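
The advice above to inspect and validate any public dataset can be made concrete with a small check. This is only a minimal sketch, not a full validation pipeline: it assumes the data arrives as a CSV with a header row, and the column names (id, text, label), the sample rows, and the helper validate_rows are all hypothetical, picked purely for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """id,text,label
1,good movie,pos
2,,neg
3,bad movie,neg
3,bad movie,neg
4,fine,neutral
"""


def validate_rows(csv_text, required, allowed_labels):
    """Count basic problems: missing values, exact duplicates, unexpected labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    report = Counter()
    for row in reader:
        report["rows"] += 1
        # A required field that is empty or whitespace-only counts as missing.
        if any(not (row.get(col) or "").strip() for col in required):
            report["missing"] += 1
        # Exact duplicate rows are flagged from their second occurrence on.
        key = tuple(row.items())
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        # Labels outside the expected set are flagged.
        if row.get("label") not in allowed_labels:
            report["bad_label"] += 1
    return dict(report)


report = validate_rows(
    SAMPLE_CSV,
    required=["id", "text", "label"],
    allowed_labels={"pos", "neg"},
)
print(report)
```

Checks like these do not replace reading the license or tracing where the data came from; they only catch the mechanical problems early.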

Free Datasets

https://www.v7labs.com/blog/best-free-datasets-for-machine-learning

UI Modelling - annotated UIs

https://uimodeling.github.io/

Datasets - Roboflow (like HuggingFace for data)

https://public.roboflow.com/

Dataset - Anthropic use of AI by job category

https://huggingface.co/datasets/Anthropic/EconomicIndex

MoE (Mixture of Experts)

LLM Mixture of Experts Explained (tensorops.ai)
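
As a rough reminder of what the MoE idea looks like in code, here is a minimal sketch of top-k routing. It makes several simplifying assumptions: the experts are plain scalar functions, and the gate logits are fixed numbers rather than the output of a learned router applied to the input. The names moe_forward, experts, and gate_logits are hypothetical and do not come from the linked article.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_logits, experts, k=2):
    """Route x to the k experts with the highest gate probability.

    Only the selected experts are evaluated; their outputs are combined
    using the gate probabilities renormalized over the selected set.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    output = sum((probs[i] / norm) * experts[i](x) for i in top)
    return output, top


# Toy experts (assumption): in a real model each would be a feed-forward block.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
# Fixed logits (assumption): a real router computes these from the input.
gate_logits = [0.0, 2.0, -1.0, 2.0]

y, chosen = moe_forward(3.0, gate_logits, experts, k=2)
print(chosen, y)
```

The point of the sketch is the sparsity: with k=2 only two of the four experts run for this input, which is where the compute savings of MoE models come from.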

An Introduction to Vision-Language Modeling

Vision AI - VLMs and CNNs
- Vision Language Models
- Convolutional Neural Networks

https://arxiv.org/pdf/2405.17247

Multi-modal LLMs

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (cambrian-mllm.github.io)

Movies: rated [for collaborative filtering]

MovieLens Latest Datasets | GroupLens
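
To show what "for collaborative filtering" means in practice, here is a minimal user-based sketch. It assumes ratings come as (userId, movieId, rating) triples, which mirrors the general shape of a ratings file; the sample values and the helpers by_user, cosine, and predict are hypothetical, and real use would read the actual dataset and likely use a library rather than this hand-rolled version.

```python
import math

# Hypothetical (userId, movieId, rating) triples, not real MovieLens data.
RATINGS = [
    (1, 10, 5.0), (1, 20, 3.0), (1, 30, 4.0),
    (2, 10, 5.0), (2, 20, 3.0), (2, 40, 5.0),
    (3, 10, 1.0), (3, 20, 5.0), (3, 40, 2.0),
]


def by_user(ratings):
    """Group ratings into {user: {movie: rating}}."""
    table = {}
    for user, movie, rating in ratings:
        table.setdefault(user, {})[movie] = rating
    return table


def cosine(a, b):
    """Cosine similarity over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict(table, user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    num = den = 0.0
    for other, their in table.items():
        if other == user or movie not in their:
            continue
        sim = cosine(table[user], their)
        num += sim * their[movie]
        den += abs(sim)
    return num / den if den else None


table = by_user(RATINGS)
pred = predict(table, 1, 40)
print(pred)
```

User 1 has not rated movie 40; the prediction leans toward user 2's rating because user 2's ratings on the shared movies match user 1's exactly.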

The Society of Mind

Book by Marvin Minsky

AI Engineering [Book] [O'Reilly] [Zeki recommends]

Same author: Designing Machine Learning Systems (DMLS). DMLS focuses on building applications on top of traditional ML models, which involves more tabular data annotations, feature engineering, and model training.

LLM Engineer's Handbook [O'Reilly] by Paul Iusztin, Maxime Labonne

Stuff I Want To Read

Wednesday, March 20, 2024

AI: LLMs, prompting, platforms (hardware) and projects

Wednesday, March 6, 2024

AI Papers, Books and Datasets
