Data Machina #232


New Mixture-of-Experts (MoE) Models. MS Phi-2 2.7B Small Model. StripedHyena 7B Models. DeepMind Imagen2. Diffusion Models + XGBoost. promptbase. Automated Continual Learning. CogAgent V-L Model.

New Mixture-of-Experts (MoE) Models. I've read somewhere that Jeff Bezos once said that "consensus & compromise between experts is not good for seeking truth." Probably he is right. Well, it seems Mixture-of-Experts models are all the rage in the AI community these days. Let's see why.

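Dense transformer models are hugely demanding in terms of computational resources and model pipeline execution. MoE models provide faster pre-training and faster inference than dense models with the same total parameter count, because only a few experts are active for each token. They need less compute per token, although all experts must still be loaded in memory, so VRAM requirements remain high. Many of the new MoE models that have popped up recently also seem to outperform GPT-3.5 and Llama 2 models.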

How do Mixture-of-Experts models work? A group of leading AI researchers just posted this excellent blogpost on MoEs. The researchers take a look at the building blocks of MoEs, how they're trained, and the tradeoffs to consider when serving them for inference. Blogpost: Mixture of Experts Explained
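To make the routing idea concrete, here is a minimal sketch of a sparse MoE layer in plain Python. It is not taken from the blogpost or from any specific model: the names `make_expert`, `moe_layer` and `router_weights` are hypothetical, each expert is assumed to be a single random linear map, and the gate is assumed to be a softmax over the top-k router scores, which is one common variant among several.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Hypothetical expert: one random linear map (real experts are MLPs)."""
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return expert


def moe_layer(x, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    # The router produces one score (logit) per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Only the top_k highest-scoring experts are evaluated for this token.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Assumption: gate values are a softmax over the selected logits only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, index in zip(gates, chosen):
        expert_out = experts[index](x)
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output, chosen, gates


rng = random.Random(0)
dim, num_experts = 4, 8
experts = [make_expert(dim, rng) for _ in range(num_experts)]
router_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen, gates = moe_layer(token, router_weights, experts, top_k=2)
print("experts used:", chosen)
print("gate weights:", gates)
```

Only 2 of the 8 experts run for this token, which is where the per-token compute saving comes from. All 8 experts still have to exist in memory, because a different token may be routed to any of them.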

Mixtral-8x7B: A new SMoE model. Mistral AI just announced this high-quality, sparse mixture-of-experts (SMoE) model with open weights, licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. According to Mistral AI, it's the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs. Blogpost: Mixtral of experts, A high quality Sparse Mixture-of-Experts.
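A rough back-of-the-envelope sketch of why a sparse model can be cheap per token yet still large in memory. The two parameter counts below (46.7B total, 12.9B used per token) are quoted from memory of Mistral AI's announcement and should be checked against the linked blogpost. The split into "shared" and "per-expert" parameters is a simplifying assumption for illustration, not Mistral's published breakdown, and the 2 bytes per parameter assumes 16-bit weights.

```python
# Figures as reported in Mistral AI's announcement (verify against the post).
TOTAL_PARAMS_B = 46.7    # total parameters, in billions
ACTIVE_PARAMS_B = 12.9   # parameters used per token, in billions
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2

# Assumed simplified model:
#   total  = shared + NUM_EXPERTS * expert
#   active = shared + EXPERTS_PER_TOKEN * expert
expert_b = (TOTAL_PARAMS_B - ACTIVE_PARAMS_B) / (NUM_EXPERTS - EXPERTS_PER_TOKEN)
shared_b = ACTIVE_PARAMS_B - EXPERTS_PER_TOKEN * expert_b

# Fraction of the weights that actually take part in computing one token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Assumption: 16-bit weights, i.e. 2 bytes per parameter, weights only.
weights_gb = TOTAL_PARAMS_B * 2

print(f"per-expert parameters (approx.): {expert_b:.2f}B")
print(f"shared parameters (approx.):     {shared_b:.2f}B")
print(f"active fraction per token:       {active_fraction:.1%}")
print(f"weights in memory at 16-bit:     {weights_gb:.1f} GB")
```

Under these assumptions, roughly a quarter of the weights are used for any single token, while the full set of weights still has to be held in memory.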

Mixtral 8x7B: Overview and fine-tuning. A great video explainer in which Greg reviews the architecture of Mixtral 8x7B, and explains where it stands relative to other models and how it differs from a classic transformer architecture. The video also includes a section on how to run inference using Mixtral and how to instruct-tune the model using Mosaic Instruct V3!

SwitchHead: A new MoE Attention model. A few days ago, a diverse group of AI researchers (including Schmidhuber) released SwitchHead. The model uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. This novel attention model can also be combined with MoE MLP layers, resulting in an efficient fully-MoE "SwitchAll" Transformer model. Paper and source code: SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention.

CausalLM / Qwen with 8 MoEs. The awesome team at CausalLM have come up with a new model trained (not merged) on 8 completely different expert models based on Qwen-7B / CausalLM. Six of them are domain-specific models: a Toolformer/Agent expert model, a multilingual translation expert model, a mathematics expert model, a visual expert model, a coding and computer expert model, and an uncensored knowledge model. The other two are the Qwen-Chat and Qwen-Base models. Check out the model description and repo: CausalLM / Qwen 8x7B MoE - This is not Mixtral / Mistral 7B

Running Mixtral 8x7B on the new Apple MLX. A couple of days ago, Apple published a repo on how to run the Mixtral 8x7B MoE model on the brand new MLX framework. This example also supports the instruction fine-tuned Mixtral model. Repo: Mixtral 8x7B on Apple MLX example.

Have a nice week.


10 Link-o-Troned

Google Research - Advancements in ML for ML

MSR Phi-2 2.7B: The Surprising Power of Small LMs

DeepMind Imagen 2 - Our Most Advanced Txt-to-Img Model

The AI Trust Crisis

The New StripedHyena 7B Models: Beyond Transformers

A Hacker's Guide to Open Source LLMs (12/2023)

A Systems Programmer's Perspectives on Generative AI

MS promptbase - A Repo on All Things Prompt Engineering

[free book] Deep Learning: Foundations and Concepts, Nov 2023

Bash One-Liners for LLMs


the ML Pythonista

Samsung AI: Diffusion Models + Flow XGBoost for Tabular Data

Google AI Gemini API - Getting Started Notebook

Spin up a Swarm of 10,000 Internet Agents, Let Them Work for You

Deep & Other Learning Bits

High Dimensional, Tabular DL Aided with a Knowledge Graph

[free course] RL with Human Feedback (RLHF)

[free NeurIPS 2023 tutorial] On World Models, Agents & LLMs

AI/DL ResearchDocs

Introducing Automated Continual Learning (ACL)

Dense X Retrieval: What Retrieval Granularity Should We Use?

CogAgent: A SOTA Visual Language Model for GUI Agents

data v-i-s-i-o-n-s

1,374 Days: My Life with Long COVID

[interactive] Cost of Living: Why Things are Expensive?

How Many Hobbits? 3,000 Years of Middle Earth Population History

MLOps Untangled

How to Setup VS Code for AI/ML & MLOps in Python

BricksLLM: AI Gateway for Putting LLM In Production

The Big Dictionary of MLOps & LLMOps

AI startups -> radar

Relevance - Build & Deploy Your Own AI Agents with No Code

Delphina - A Copilot for Data Science & ML

Typeface - A Platform for Personalised Enterprise GenAI

ML Datasets & Stuff

The AI Art Dataset - 200k Txt-to-Img Prompts

UTD19: Largest, Public Multi-city Traffic Dataset Available

Toxic DPO - A Highly Toxic, Harmful Dataset for DPO & AI Unalignment

Postscript, etc

Enjoyed this post? Tell your friends about Data Machina. Thanks for reading.


Tips? Suggestions? Feedback? Email Carlos

Curated by @ds_ldn in the middle of the night.

