## LLM Compressor Workbench -- Getting Started

This notebook demonstrates how common [LLM Compressor](https://github.com/vllm-project/llm-compressor) flows can be run on Alauda AI.

We will show how a user can compress a Large Language Model without calibration data and then evaluate the compressed model.

The notebook will detect if a GPU is available. If one is not available, it will demonstrate an abbreviated run, so users without GPU access can still get a feel for `llm-compressor`.


<div class="alert alert-block alert-info">
<b>Note:</b> To evaluate the compressed model, make sure lm_eval>=0.4.8 is installed.
</div>

### 1\) Data-Free Model Compression

In [None]:
import torch

use_gpu = torch.cuda.is_available()

In [None]:
from llmcompressor.modifiers.quantization import QuantizationModifier

# model to compress
model_id = "./TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# This recipe will quantize all Linear layers except those in the `lm_head`,
#  which is often sensitive to quantization. The W4A16 scheme compresses
#  weights to 4-bit integers while retaining 16-bit activations.
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
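
To make the idea behind 4-bit weight quantization concrete, the following is a minimal pure-Python sketch of symmetric, group-wise rounding to the int4 range. It is an illustration only: the helper name `quantize_w4_symmetric` and the tiny `group_size` are hypothetical, and the grouping, rounding, and packing that `llm-compressor` actually applies for `W4A16` may differ.

```python
def quantize_w4_symmetric(weights, group_size=4):
    """Simulate symmetric 4-bit quantization of a flat list of weights.

    Hypothetical helper for illustration; not part of llm-compressor.
    Returns (codes, scales, dequantized).
    """
    codes, scales, dequantized = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Signed int4 covers [-8, 7]; map the largest magnitude in the group to 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = max(-8, min(7, round(w / scale)))
            codes.append(q)
            # Activations stay in higher precision, so weights are
            # rescaled back to floats before being used in a matmul.
            dequantized.append(q * scale)
    return codes, scales, dequantized


weights = [0.7, -0.35, 0.1, 0.0, 2.0, -1.0, 0.5, 0.25]
codes, scales, dequantized = quantize_w4_symmetric(weights, group_size=4)
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print(codes)
print(f"largest round-trip error: {max_error:.4f}")
```

Each weight is stored as a small integer code plus a shared per-group scale, so the round-trip error for any weight is bounded by half of its group's scale.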

In [None]:
# Load the model and tokenizer using the Hugging Face API
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

In [None]:
# Run compression using `oneshot`
from llmcompressor import oneshot

model = oneshot(model=model, recipe=recipe, tokenizer=tokenizer)

In [None]:
# Save model and tokenizer
model_dir = "./" + model_id.split("/")[-1] + "-W4A16"
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir);
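
As an optional sanity check, the saved directory can be inspected for quantization metadata. The sketch below assumes that `save_pretrained` writes a `config.json` containing a `quantization_config` entry, which is typical for compressed checkpoints but is an assumption here; the helper name `read_quantization_config` is hypothetical.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` entry from config.json, or None.

    Hypothetical helper; assumes the checkpoint stores its quantization
    metadata under the `quantization_config` key of config.json.
    """
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open() as f:
        config = json.load(f)
    return config.get("quantization_config")
```

Calling `read_quantization_config(model_dir)` after saving should return a dictionary if the metadata was written, and `None` otherwise.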

### 2\) Evaluate compressed model using open-source `lm_eval` framework

We will evaluate the performance of the compressed model on the [`wikitext`](https://huggingface.co/datasets/EleutherAI/wikitext_document_level) language modeling dataset.
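
Language modeling tasks such as `wikitext` are usually scored with perplexity-style metrics, where lower is better. The sketch below shows how word perplexity, byte perplexity, and bits per byte can be derived from per-document log-likelihoods. The helper name `perplexity_metrics` and its input layout are hypothetical, and the exact aggregation used by `lm_eval` for a custom task depends on that task's configuration.

```python
import math


def perplexity_metrics(docs):
    """Aggregate perplexity-style metrics over documents.

    `docs` is a list of (loglikelihood, num_words, num_bytes) tuples, where
    loglikelihood is the natural-log probability the model assigns to the
    whole document. Hypothetical helper for illustration.
    """
    total_ll = sum(ll for ll, _, _ in docs)
    total_words = sum(words for _, words, _ in docs)
    total_bytes = sum(nbytes for _, _, nbytes in docs)
    return {
        # exp of the average negative log-likelihood per word
        "word_perplexity": math.exp(-total_ll / total_words),
        # same quantity normalized per byte of text
        "byte_perplexity": math.exp(-total_ll / total_bytes),
        # average negative log-likelihood per byte, converted to bits
        "bits_per_byte": -total_ll / total_bytes / math.log(2),
    }


# Toy input: one 2-word, 8-byte document that costs exactly 1 bit per byte.
print(perplexity_metrics([(-8 * math.log(2), 2, 8)]))
```

Comparing these numbers between the original and the compressed model gives a rough measure of how much quality the quantization cost.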

In [None]:
import os
os.environ["VLLM_USE_V1"] = "0"

import lm_eval
from lm_eval.utils import make_table

from lm_eval.tasks import TaskManager
task_manager = TaskManager(include_path="./my-wikitext.yaml")

results = lm_eval.simple_evaluate(
    model="vllm" if use_gpu else "hf",
    model_args={
        "pretrained": model_dir,
        "add_bos_token": True,
        "device": "auto",
        "gpu_memory_utilization": 0.8,
    },
    tasks=["my-wikitext"],
    task_manager=task_manager,  # pass the manager so the custom task can be resolved
    batch_size="auto" if use_gpu else 4,
    limit=None if use_gpu else 4,
)

In [None]:
print(make_table(results))
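
Beyond the printed table, individual metric values can be pulled out of the returned dictionary for logging or comparison. The sketch below assumes the layout used by recent `lm_eval` releases, where `results["results"][task]` maps keys such as `"word_perplexity,none"` to values; that layout and the helper name `numeric_metrics` are assumptions, so check the actual keys in your own `results` object.

```python
def numeric_metrics(results, task_name):
    """Collect the numeric metrics reported for one task.

    Hypothetical helper; assumes keys of the form "<metric>,<filter>" and
    skips non-numeric entries such as aliases or "N/A" standard errors.
    """
    task_results = results.get("results", {}).get(task_name, {})
    metrics = {}
    for key, value in task_results.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        metric_name = key.split(",")[0]  # drop the filter suffix
        metrics[metric_name] = value
    return metrics
```

For this notebook the call would be `numeric_metrics(results, "my-wikitext")`.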