Wednesday, March 6, 2024

AI Papers, Books and Datasets

Learning Transferable Visual Models From Natural Language Supervision: https://arxiv.org/abs/2103.00020

-

Datasets

from AI Engineering:

Resources for Publicly Available Datasets

Here are a few resources where you can look for publicly available datasets. While you should take advantage of available data, you should never fully trust it. Data needs to be thoroughly inspected and validated.

Always check a dataset's license before using it. Try your best to understand where the data comes from. Even if a dataset has a license that allows commercial use, it's possible that part of it comes from a source that doesn't:

  1. Hugging Face (https://oreil.ly/tlt5h) and Kaggle (https://oreil.ly/g8A4a) each host
    hundreds of thousands of datasets.
  2. Google has a wonderful and underrated Dataset Search (https://oreil.ly/TgOaR).
  3. Governments are often great providers of open data. Data.gov (https://data.gov) hosts hundreds of thousands of datasets, and data.gov.in (https://data.gov.in) hosts tens of thousands.
  4. University of Michigan's Institute for Social Research (https://oreil.ly/VhVzp)
    ICPSR has data from tens of thousands of social studies.
  5. UC Irvine's Machine Learning Repository (https://oreil.ly/jAR9e) and OpenML (https://oreil.ly/d-Yty) are two older dataset repositories, each hosting several thousand datasets.
  6. The Open Data Network (https://oreil.ly/_tW6P) lets you search among tens of thousands of datasets.
  7. Cloud service providers often host a small collection of open datasets; the most notable one is AWS's Open Data (https://oreil.ly/DZ5uV).
  8. ML frameworks often have small pre-built datasets that you can load while using the framework, such as TensorFlow datasets (https://oreil.ly/HMJX_).
  9. Some evaluation harness tools host evaluation benchmark datasets that are sufficiently large for PEFT finetuning. For example, Eleuther AI's lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) hosts 400+ benchmark datasets, averaging 2,000+ examples per dataset.

  10. The Stanford Large Network Dataset Collection (hts://oreilye_B) is a great repository for graph datasets.
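
The advice above to inspect and validate data before trusting it can be made concrete with a quick first-pass check. What follows is only a minimal sketch, not a method from the book: the function name inspect_rows, the column names, and the tiny inline sample are all made up for illustration, and a real dataset would need checks specific to its domain (label distribution, license and provenance review, and so on).

```python
import csv
import io
from collections import Counter


def inspect_rows(rows, required_columns):
    """Count empty values and repeated rows in a list of dict rows.

    Hypothetical helper for illustration only. Two rows count as
    duplicates when they agree on every column in required_columns.
    """
    missing = Counter()
    seen = set()
    duplicates = 0
    for row in rows:
        for col in required_columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                missing[col] += 1
        key = tuple(row.get(col) for col in required_columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


# Made-up sample standing in for a CSV file downloaded from one of the sites above.
SAMPLE_CSV = (
    "id,text,label\n"
    "1,good movie,pos\n"
    "2,,neg\n"
    "1,good movie,pos\n"
    "3,bad plot,\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = inspect_rows(rows, ["id", "text", "label"])
print(report)
```

For a real file, the same helper could be fed from csv.DictReader over an opened file instead of the inline string. Counting empty fields and repeated rows is only a starting point, but it tends to surface the most obvious problems before any training run.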


Free Datasets

UI Modelling - annotated UIs

Datasets - Roboflow (like HuggingFace for data)

Dataset - Anthropic use of AI by job category 

-

MoE (Mixture of Experts)


An Introduction to Vision-Language Modeling

Vision AI - VLMs and CNNs
- Vision Language Models
- Convolutional Neural Networks

Multi-modal LLMs

Movies: rated [for collaborative filtering]

The Society of Mind
Book by Marvin Minsky

AI Engineering [Book] [O'Reilly] [Zeki recommends]
Same author: Designing Machine Learning Systems (DMLS). DMLS focuses on building applications on top of traditional ML models, which involves more tabular data annotations, feature engineering, and model training.
