Wednesday, March 20, 2024

AI: LLMs, platforms (hardware) and projects

(AI in Business) - Applied Artificial Intelligence

https://www.bookdepository.com/Applied-Artificial-Intelligence-Where-AI-Can-Be-Used-Business-Francesco-Corea/9783319772516?ref=grid-view&qid=1592465308969&sr=1-1 


Web element detection

OpenDILabCommunity/webpage_element_detection · Hugging Face


Layout parser - doc ingester

https://layout-parser.github.io/




Data generation - Langchain


DistilBERT: Faster version of BERT


DBRX - a 'mixture of experts' LLM

MoE in depth article 
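
As a rough illustration of the "mixture of experts" idea: a router scores every expert for a given input, only the top-k experts are run, and their outputs are mixed using the normalized router scores. The sketch below is a toy in plain Python, not DBRX's implementation. The expert functions and the gate logits are made-up values; in a real model the logits come from a learned router layer.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_logits, experts, k=2):
    """Route x to the top-k experts by gate score and mix their outputs."""
    # Indices of the k highest-scoring experts (ties keep original order).
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Renormalize the scores over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top

# Hypothetical experts: tiny functions standing in for feed-forward blocks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
# Hypothetical router scores for one input (assumed, not learned here).
gate_logits = [0.1, 2.0, -1.0, 2.0]

y, chosen = moe_forward(3.0, gate_logits, experts, k=2)
print(chosen, y)
```

Only the chosen experts are evaluated, which is why an MoE model can have many parameters while spending compute on just a fraction of them per token.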

Deploying and Fine-tuning Deepseek

Hermes-2 Mistral 7B - good for "Function Calling" (the LLM delegates an action, such as creating or modifying an app, to the client)
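
A minimal sketch of what the client side of function calling can look like: the model returns a structured request naming a function and its arguments, and the client looks the function up and runs it. The JSON shape (`name` and `arguments` keys) and the `create_app` tool are assumptions made for illustration; they are not taken from Hermes-2's actual prompt format.

```python
import json

# Hypothetical tool exposed by the client; name and signature are illustrative only.
def create_app(name: str, template: str = "blank") -> dict:
    return {"status": "created", "name": name, "template": template}

# Registry mapping tool names to client-side callables.
TOOLS = {"create_app": create_app}

def dispatch(model_output: str) -> dict:
    """Parse a JSON tool call emitted by the model and run it on the client."""
    call = json.loads(model_output)
    func = TOOLS.get(call["name"])
    if func is None:
        return {"status": "error", "reason": f"unknown tool: {call['name']}"}
    return func(**call.get("arguments", {}))

# Stand-in for text the model might return (assumed shape, not a real transcript).
example_output = '{"name": "create_app", "arguments": {"name": "todo", "template": "flask"}}'
result = dispatch(example_output)
print(result)
```

Keeping an explicit registry means the model can only trigger functions the client has chosen to expose.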


multimodal LLMs
HPT 1.5 Air: A New Open-Sourced 8B Multimodal LLM with Llama 3


Datasets - Roboflow (like HuggingFace for data)

AWS Inferentia2 (inf2 AWS instances - cheaper than g5)

Prompt hacking and testing

Token counts - for cost estimates
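
A back-of-the-envelope sketch of turning token counts into a cost estimate. It assumes roughly 4 characters per token, which is only a heuristic for English text; a real tokenizer for the target model will give different counts. The prices below are placeholders, not any provider's actual rates.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)

def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost when input and output tokens are billed separately per 1,000 tokens."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)

prompt = "Summarize the following support ticket in two sentences."
n_in = estimate_tokens(prompt)
# Placeholder prices (assumed): replace with the rates for the model in use.
cost = estimate_cost(n_in, output_tokens=200,
                     input_price_per_1k=0.001, output_price_per_1k=0.002)
print(n_in, f"{cost:.6f}")
```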

LLM powered autonomous agents

Building a NN from scratch

Data science and data trends (via ML) Python libraries

evaluating LLMs
prometheus-2

LLM GitHub accelerator projects 



DistilBERT - can be fine-tuned to classify intent or toxicity

Prompt engineering guides

Tiny Multi-modal LLM - regions + OCR

AWS Bedrock Tips
- batch inference

AWS Inferentia - speculative decoding

AWS Prompt Routing and prompt caching
- can auto pick an LLM?

SOTA LLM by Richard

Wednesday, March 6, 2024

AI Papers, Books and Datasets

Learning Transferable Visual Models From Natural Language Supervision (the CLIP paper):
https://arxiv.org/abs/2103.00020

-

Datasets

from AI Engineering:

Resources for Publicly Available Datasets

Here are a few resources where you can look for publicly available datasets. While you should take advantage of available data, you should never fully trust it. Data needs to be thoroughly inspected and validated.

Always check a dataset's license before using it. Try your best to understand where the data comes from. Even if a dataset has a license that allows commercial use, it's possible that part of it comes from a source that doesn't.

  1. Hugging Face (https://oreil.ly/tlt5h) and Kaggle (https://oreil.ly/g8A4a) each host
    hundreds of thousands of datasets.
  2. Google has a wonderful and underrated Dataset Search (https://oreil.ly/TgOaR).
  3. Governments are often great providers of open data. Data.gov (https://data.gov) hosts hundreds of thousands of datasets, and data.gov.in (https://data.gov.in) hosts tens of thousands.
  4. University of Michigan's Institute for Social Research (https://oreil.ly/VhVzp)
    ICPSR has data from tens of thousands of social studies.
  5. UC Irvine's Machine Learning Repository (https://oreil.ly/jAR9e) and OpenML (https://oreil.ly/d-Yty) are two older dataset repositories, each hosting several thousand datasets.
  6. The Open Data Network (https://oreil.ly/_tW6P) lets you search among tens of thousands of datasets.
  7. Cloud service providers often host a small collection of open datasets; the most notable one is AWS's Open Data (https://oreil.ly/DZ5uV).
  8. ML frameworks often have small pre-built datasets that you can load while using the framework, such as TensorFlow datasets (https://oreil.ly/HMJX_).
  9. Some evaluation harness tools host evaluation benchmark datasets that are sufficiently large for PEFT finetuning. For example, Eleuther AI's lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) hosts 400+ benchmark datasets, averaging 2,000+ examples per dataset.
  10. The Stanford Large Network Dataset Collection (hts://oreilye_B) is a great repository for graph datasets.


Free Datasets

UI Modelling - annotated UIs

Datasets - Roboflow (like HuggingFace for data)

Dataset - Anthropic use of AI by job category 

-

MoE (Mixture of Experts)


An Introduction to Vision-Language Modeling

Vision AI - VLMs and CNNs
- Vision Language Models
- Convolutional Neural Networks

Multi-modal LLMs

Movies: rated [for collaborative filtering]
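
For context, a minimal sketch of user-based collaborative filtering on rated movies: a missing rating is predicted as a similarity-weighted average of other users' ratings for that movie. The ratings below are made-up toy data, not drawn from the linked dataset, and cosine similarity over co-rated movies is just one common choice.

```python
import math

# Toy ratings (hypothetical): user -> {movie: rating on a 1-5 scale}
ratings = {
    "ann": {"Alien": 5, "Brazil": 3, "Clue": 4},
    "bob": {"Alien": 5, "Brazil": 3, "Dune": 5},
    "cat": {"Alien": 1, "Brazil": 5, "Dune": 2},
}

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity computed over the movies both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = math.sqrt(sum(u[m] ** 2 for m in common))
    norm_v = math.sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user: str, movie: str):
    """Similarity-weighted average of other users' ratings; None if nobody rated it."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[movie]
        den += abs(sim)
    return num / den if den else None

pred = predict("ann", "Dune")
print(pred)
```

Here "ann" rates like "bob" more than like "cat", so the prediction for "Dune" lands nearer to bob's rating than to cat's.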

The Society of Mind
Book by Marvin Minsky

AI Engineering [Book] [O'Reilly] [Zeki recommends]
Same author: Designing Machine Learning Systems (DMLS). DMLS focuses on building applications on top of traditional ML models, which involves more tabular data annotations, feature engineering, and model training.