## LLM Compressor Workbench -- Getting Started

This notebook demonstrates how common [LLM Compressor](https://github.com/vllm-project/llm-compressor) flows can be run on Alauda AI.

We will show how to compress a Large Language Model using a calibration dataset and then evaluate the result.

The notebook will detect if a GPU is available. If one is not available, it will demonstrate an abbreviated run, so users without GPU access can still get a feel for `llm-compressor`.


<div class="alert alert-block alert-info">
<b>Note:</b> To evaluate the compressed model, make sure lm_eval>=0.4.8 is installed.
</div>

### 1\) Calibrated Compression with a Dataset

Some more advanced compression algorithms require a small calibration dataset: a representative random sample of the text the model will see at inference.

This section shows how to compress a model using a calibration dataset and GPTQ, one of the first published LLM compression algorithms.

<div class="alert alert-block alert-info">
<b>Note:</b> This will take several minutes if no GPU is available
</div>

In [None]:
import torch

use_gpu = torch.cuda.is_available()

In [None]:
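# We will use a recipe running GPTQ (https://arxiv.org/abs/2210.17323)
# to reduce error caused by quantization. GPTQ requires a calibration dataset.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model to compress (local path)
model_id = "./TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Tokenizer used later to preprocess the calibration data
tokenizer = AutoTokenizer.from_pretrained(model_id)

# W4A16: 4-bit weights, 16-bit activations; lm_head is left unquantized
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])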
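
To give a feel for what a 4-bit weight scheme such as `W4A16` implies, here is a minimal sketch of symmetric round-to-nearest quantization of a single group of weights. It is not GPTQ and not how `llm-compressor` is implemented: the helper names and the example weights are hypothetical, and details fixed by the real preset (such as group size) are ignored. GPTQ goes further by using the calibration data to compensate for the rounding error introduced here.

```python
def quantize_group_int4(weights):
    """Symmetric round-to-nearest quantization of one weight group to signed 4-bit integers."""
    max_abs = max(abs(w) for w in weights)
    # Signed 4-bit integers cover [-8, 7]; map the largest magnitude to 7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_group(quantized, scale):
    """Map 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.70, -0.33, 0.12, -0.02]
q, scale = quantize_group_int4(weights)        # q is [7, -3, 1, 0]
restored = dequantize_group(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as a small integer plus one shared scale per group, so the rounding error per weight is bounded by half the scale.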

In [None]:
from datasets import load_dataset

# Create the calibration dataset, using the Hugging Face datasets API
dataset_id = "./ultrachat_200k"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
num_calibration_samples = 512 if use_gpu else 4
max_sequence_length = 2048 if use_gpu else 16

# Load dataset
ds = load_dataset(dataset_id, split="train_sft")
# Shuffle and grab only the number of samples we need
ds = ds.shuffle(seed=42).select(range(num_calibration_samples))


# Preprocess and tokenize into format the model uses
def preprocess(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
    )
    return tokenizer(
        text,
        padding=False,
        max_length=max_sequence_length,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(preprocess, remove_columns=ds.column_names)

In [None]:
# Load the model to compress; oneshot modifies it in place
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
# Run oneshot with the GPTQ recipe and the calibration dataset
model = oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_sequence_length,
    num_calibration_samples=num_calibration_samples,
)

In [None]:
# Save model and tokenizer
model_dir = "./" + model_id.split("/")[-1] + "-GPTQ-W4A16"
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir);
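
One quick sanity check after saving is to compare the on-disk size of the original and compressed checkpoints. The helper below is a hypothetical convenience function, not part of `llm-compressor`; it simply sums file sizes under a directory, and the demo runs on a temporary directory so it works without any model files. How much smaller the compressed directory is depends on how the weights were saved.

```python
import os
import tempfile


def directory_size_mb(path):
    """Return the total size of all files under `path`, in MiB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024 * 1024)


# Self-contained demo: a temporary directory holding one 2 MiB file.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "weights.bin"), "wb") as f:
        f.write(b"\0" * (2 * 1024 * 1024))
    demo_size_mb = directory_size_mb(tmp_dir)

print(demo_size_mb)

# In this notebook, one might compare the two checkpoint directories:
# print(directory_size_mb(model_id), directory_size_mb(model_dir))
```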

### 2\) Run `lm_eval`

Evaluate the compressed model with `lm_eval` on the `my-wikitext` task. Note that the perplexity score (lower is better) has improved for this `TinyLlama` model.
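
As a reminder of what the metric measures, perplexity is the exponential of the negative mean log-probability the model assigns to the reference text. The sketch below uses made-up log-probabilities purely for illustration; the exact normalization (per token, word, or byte) depends on the task definition, so this is not a reproduction of what `lm_eval` computes for `my-wikitext`.

```python
import math


def perplexity(logprobs):
    """Perplexity from natural-log probabilities: exp(-mean(logprobs))."""
    return math.exp(-sum(logprobs) / len(logprobs))


# Illustrative values only: a model that assigns probability 0.25 to every
# token versus one that assigns 0.5 to every token.
uncertain = [math.log(0.25)] * 8
confident = [math.log(0.5)] * 8

ppl_uncertain = perplexity(uncertain)   # about 4.0
ppl_confident = perplexity(confident)   # about 2.0
print(ppl_uncertain, ppl_confident)
```

A perplexity of 4 corresponds to the model being, on average, as uncertain as a uniform choice among four options, which is why lower values indicate a better fit.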

In [None]:
import os

os.environ["VLLM_USE_V1"] = "0"

import lm_eval
from lm_eval.tasks import TaskManager
from lm_eval.utils import make_table

task_manager = TaskManager(include_path="./my-wikitext.yaml")

results = lm_eval.simple_evaluate(
    model="vllm" if use_gpu else "hf",
    model_args={
        "pretrained": model_dir,
        "add_bos_token": True,
        "device": "auto",
        "gpu_memory_utilization": 0.8,
    },
    tasks=["my-wikitext"],
    # Pass the task manager so the custom "my-wikitext" task can be resolved
    task_manager=task_manager,
    batch_size="auto" if use_gpu else 4,
    limit=None if use_gpu else 4,
)

In [None]:
print(make_table(results))
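
To work with the numbers programmatically rather than reading the table, the per-task metrics can be pulled out of the returned dictionary. The sketch below assumes that `results["results"]` maps each task name to a dictionary of metrics; the helper name is hypothetical, and the mock dictionary contains placeholder values, not real evaluation output.

```python
def numeric_metrics(results, task_name):
    """Return only the numeric metric entries recorded for one task."""
    task_results = results["results"][task_name]
    return {
        key: value
        for key, value in task_results.items()
        if isinstance(value, (int, float))
    }


# Mock structure with placeholder values, for illustration only.
mock_results = {
    "results": {
        "my-wikitext": {
            "alias": "my-wikitext",
            "word_perplexity,none": 12.5,
            "word_perplexity_stderr,none": "N/A",
        }
    }
}

metrics = numeric_metrics(mock_results, "my-wikitext")
print(metrics)
```

In the notebook, calling `numeric_metrics(results, "my-wikitext")` would make it easy to log scores or compare them against a baseline run.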