Training Mistral 7B on a Local Machine with CUDA (RTX 4090)
This guide walks you through training the Mistral 7B language model on a local machine with a CUDA GPU. It was tested on an RTX 4090, but it should also work on GPUs such as the Tesla T4, provided they have at least 16 GB of VRAM. The guide covers continual pre-training, meaning that after training the model will be suited to text completion rather than question-answering tasks.
1. System Requirements
GPU: NVIDIA RTX 4090 (or similar) with CUDA support.
OS: Linux (Ubuntu preferred).
CUDA Toolkit: CUDA 12.1 or newer.
Python 3.8+.
Git and a Python package manager such as pip.
2. Install Dependencies
First, make sure your GPU drivers and CUDA are properly installed. You can install CUDA with:
sudo apt update
sudo apt install nvidia-cuda-toolkit
After installing CUDA, verify the installation:
nvidia-smi
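nvidia-smi reports the driver version and the visible GPUs. If you also want to confirm which CUDA toolkit apt installed, you can optionally check the compiler:
nvcc --version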
Next, set up Python dependencies. Create a new virtual environment to keep your workspace clean:
python3 -m venv mistral_env
source mistral_env/bin/activate
Then, install the required packages, including torch with CUDA support and unsloth for training:
pip install --upgrade pip
pip install torch==2.3.0+cu121 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/unslothai/unsloth.git
You can also get the latest nightly version of Unsloth if needed:
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
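Before moving on, it is worth confirming that the installed PyTorch build can actually see your GPU. A minimal check, run inside the activated virtual environment, should print the torch version and True:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"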
3. Download the Dataset
In this example, we will use the Tiny Stories dataset from Hugging Face (https://huggingface.co/datasets/roneneldan/TinyStories). We only sample the first 5,000 rows to speed up training. You can download it with the datasets library:
from datasets import load_dataset
# Load the Tiny Stories dataset
dataset = load_dataset("roneneldan/TinyStories", split="train[:5000]")
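To get a feel for the data before formatting it, you can print the dataset summary and the start of one sample (a quick sanity check; the "text" field matches the TinyStories schema):
# Inspect the dataset size and one example
print(dataset)
print(dataset[0]["text"][:200])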
4. Model Setup
We will use the Mistral 7B model with 4-bit quantization to optimize GPU memory usage. Here is how to set up the model with Unsloth:
from unsloth import FastLanguageModel
import torch
# Define model parameters
max_seq_length = 2048
load_in_4bit = True # Use 4-bit quantization to reduce memory usage
dtype = torch.float16 if torch.cuda.get_device_capability()[0] >= 7 else None  # float16 on compute capability 7.0+, otherwise let Unsloth auto-detect
# Load the pre-trained model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.3",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
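With 4-bit quantization, the 7B model should fit comfortably within 24 GB of VRAM. As a rough sanity check after loading, you can print how much memory is reserved using standard PyTorch calls (a small sketch; torch is already imported above):
import torch

# Report reserved vs. total GPU memory after loading the quantized model
gpu = torch.cuda.get_device_properties(0)
reserved_gb = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
total_gb = round(gpu.total_memory / 1024**3, 3)
print(f"{gpu.name}: {reserved_gb} GB reserved of {total_gb} GB total")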
5. Add LoRA Adapters
To further optimize training, use LoRA (Low-Rank Adaptation) adapters. LoRA lets you train large language models efficiently by updating only a small fraction (roughly 1 to 10%) of the parameters, significantly reducing computational requirements. We also add embed_tokens and lm_head to the target modules so the model can learn out-of-distribution data, which is important for continual pre-training. This way, only specific parts of the model are updated, making training more efficient:
model = FastLanguageModel.get_peft_model(
    model,
    r=128,  # Choose any number > 0! Suggested values are 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],  # embed_tokens and lm_head added for continual pretraining
    lora_alpha=32,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=True,
    loftq_config=None,  # And LoftQ
)
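You can confirm that only a small fraction of the weights will be updated by counting trainable parameters. This quick sketch uses plain PyTorch and does not rely on any PEFT helper:
# Count trainable vs. total parameters to confirm only the LoRA adapters
# (plus embed_tokens and lm_head) are being updated
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")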
6. Prepare Dataset for Training
Prepare the dataset for training. It is important to append EOS_TOKEN (tokenizer.eos_token) to every example so the model learns when to stop generating instead of continuing indefinitely:
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    return {"text": [example + EOS_TOKEN for example in examples["text"]]}

# Apply formatting
formatted_dataset = dataset.map(formatting_prompts_func, batched=True)
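A quick check that the EOS token was actually appended (the exact token string depends on the tokenizer; for Mistral it is typically </s>):
# The formatted text should now end with the EOS token
print(EOS_TOKEN)
print(repr(formatted_dataset[0]["text"][-20:]))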
7. Training the Model
Now, configure the training arguments and start training using UnslothTrainer:
from unsloth import UnslothTrainer, UnslothTrainingArguments
training_args = UnslothTrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # Effective batch size = 2 * 8 = 16
    warmup_ratio=0.1,
    num_train_epochs=1,
    learning_rate=5e-5,
    embedding_learning_rate=5e-6,    # Lower learning rate for embed_tokens / lm_head
    fp16=True,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.00,
    lr_scheduler_type="cosine",
    seed=3407,
    output_dir="outputs",
)
trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=training_args,
)
# Start training
trainer.train()
# Save the trained model
model.save_pretrained("outputs/mistral_7b_trained")
tokenizer.save_pretrained("outputs/mistral_7b_trained")
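Note that save_pretrained on a LoRA-wrapped model stores only the adapter weights; when you load that directory later, the base model is fetched and the adapter applied on top. If you prefer a single standalone checkpoint, Unsloth provides a merged-save helper; the call below is a sketch, so check that your installed Unsloth version exposes it:
# Optionally save merged 16-bit weights as a standalone checkpoint
# (assumes your Unsloth version provides save_pretrained_merged)
model.save_pretrained_merged(
    "outputs/mistral_7b_trained_merged",
    tokenizer,
    save_method="merged_16bit",
)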
8. Running Inference
After training, you can use the model to generate text. Here's how to load the trained model and generate text starting with a specific prompt:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread
import textwrap
import torch

# Load the trained model; the saved LoRA adapters are applied on top of the base weights
model = AutoModelForCausalLM.from_pretrained(
    "outputs/mistral_7b_trained", torch_dtype=torch.float16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("outputs/mistral_7b_trained")
text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100
inputs = tokenizer(["Once upon a time, in a galaxy, far far away,"], return_tensors="pt").to("cuda")
generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
# Stream the generated text, wrapping output at max_print_width characters
length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width=max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end="")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end="")
Notes
Ensure your CUDA and GPU drivers are up-to-date to fully utilize the RTX 4090's capabilities.
Training may take several hours depending on the dataset size and batch configuration. Adjust gradient_accumulation_steps and per_device_train_batch_size as needed to balance GPU memory usage and throughput; see the quick arithmetic note below.
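As a rule of thumb, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps. The settings above give 2 × 8 = 16 sequences per optimizer step, so halving the per-device batch size while doubling the accumulation steps keeps the effective batch size the same while lowering peak memory.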
Author Bio
Rafal Jackiewicz is an author of books about programming in C and Java. You can find more information about him and his work on Amazon.