Training Mistral 7B on a Local Machine with CUDA (RTX 4090)
This guide walks you through training the Mistral 7B language model on a local machine with a CUDA GPU. It was tested on an RTX 4090, but it should also work on GPUs such as the Tesla T4, provided they have at least 16 GB of VRAM. The guide covers continual pre-training, meaning that after training the model will be suited to text completion rather than question-answering tasks.
1. System Requirements
GPU: NVIDIA RTX 4090 (or similar) with CUDA support.
OS: Linux (Ubuntu preferred).
CUDA Toolkit: CUDA 12.1 or newer.
Python 3.8+.
Git and a Python package manager such as pip.
2. Install Dependencies
First, make sure your GPU drivers and CUDA are properly installed. You can install CUDA with:
sudo apt update
sudo apt install nvidia-cuda-toolkit
After installing CUDA, verify the installation:
nvidia-smi
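nvidia-smi reports the driver version and the visible GPUs. If you also want to confirm which CUDA toolkit apt installed, you can optionally check the compiler:
nvcc --version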
Next, set up Python dependencies. Create a new virtual environment to keep your workspace clean:
python3 -m venv mistral_env
source mistral_env/bin/activate
Then, install the required packages, including torch with CUDA support and unsloth for training:
pip install --upgrade pip
pip install torch==2.3.0+cu121 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/unslothai/unsloth.git
You can also get the latest nightly version of Unsloth if needed:
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
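Before moving on, it is worth confirming that the installed PyTorch build can actually see your GPU. A minimal check, run inside the activated virtual environment, should print the torch version and True:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"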
3. Download the Dataset
In this example, we will use the Tiny Stories dataset from Hugging Face (https://huggingface.co/datasets/roneneldan/TinyStories). We only sample the first 5,000 rows to speed up training. You can download it with the datasets library:
from datasets import load_dataset
# Load the Tiny Stories dataset
dataset = load_dataset("roneneldan/TinyStories", split="train[:5000]")
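To get a feel for the data before formatting it, you can print the dataset summary and the start of one sample (a quick sanity check; the "text" field matches the TinyStories schema):
# Inspect the dataset size and one example
print(dataset)
print(dataset[0]["text"][:200])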
4. Model Setup
We will use the Mistral 7B model with 4-bit quantization to optimize GPU memory usage. Here is how to set up the model with Unsloth:
from unsloth import FastLanguageModel
import torch
# Define model parameters
max_seq_length = 2048
load_in_4bit = True # Use 4-bit quantization to reduce memory usage
dtype = torch.float16 if torch.cuda.get_device_capability()[0] >= 7 else None  # float16 on compute capability 7.0+, otherwise let Unsloth auto-detect
# Load the pre-trained model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.3",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
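With 4-bit quantization, the 7B model should fit comfortably within 24 GB of VRAM. As a rough sanity check after loading, you can print how much memory is reserved using standard PyTorch calls (a small sketch; torch is already imported above):
import torch

# Report reserved vs. total GPU memory after loading the quantized model
gpu = torch.cuda.get_device_properties(0)
reserved_gb = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
total_gb = round(gpu.total_memory / 1024**3, 3)
print(f"{gpu.name}: {reserved_gb} GB reserved of {total_gb} GB total")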
5. Add LoRA Adapters
To further optimize training, use LoRA (Low-Rank Adaptation) adapters. LoRA lets you train large language models efficiently by updating only a small fraction (roughly 1 to 10%) of the parameters, significantly reducing computational requirements. We also add embed_tokens and lm_head to the target modules so the model can learn out-of-distribution data, which is important for continual pre-training. This way, only specific parts of the model are updated, making training more efficient:
model = FastLanguageModel.get_peft_model(
    model,
    r=128,  # Choose any number > 0! Suggested values are 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],  # embed_tokens and lm_head added for continual pretraining
    lora_alpha=32,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=True,
    loftq_config=None,  # And LoftQ
)
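You can confirm that only a small fraction of the weights will be updated by counting trainable parameters. This quick sketch uses plain PyTorch and does not rely on any PEFT helper:
# Count trainable vs. total parameters to confirm only the LoRA adapters
# (plus embed_tokens and lm_head) are being updated
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")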
6. Prepare Dataset for Training
Prepare the dataset for training. It is important to append EOS_TOKEN (tokenizer.eos_token) to every example so the model learns when to stop generating instead of continuing indefinitely:
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    return {"text": [example + EOS_TOKEN for example in examples["text"]]}

# Apply formatting
formatted_dataset = dataset.map(formatting_prompts_func, batched=True)
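A quick check that the EOS token was actually appended (the exact token string depends on the tokenizer; for Mistral it is typically </s>):
# The formatted text should now end with the EOS token
print(EOS_TOKEN)
print(repr(formatted_dataset[0]["text"][-20:]))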
7. Training the Model
Now, configure the training arguments and start training using UnslothTrainer:
from unsloth import UnslothTrainer, UnslothTrainingArguments
training_args = UnslothTrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # Effective batch size = 2 * 8 = 16
    warmup_ratio=0.1,
    num_train_epochs=1,
    learning_rate=5e-5,
    embedding_learning_rate=5e-6,    # Lower learning rate for embed_tokens / lm_head
    fp16=True,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.00,
    lr_scheduler_type="cosine",
    seed=3407,
    output_dir="outputs",
)
trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=training_args,
)
# Start training
trainer.train()
# Save the trained model
model.save_pretrained("outputs/mistral_7b_trained")
tokenizer.save_pretrained("outputs/mistral_7b_trained")
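Note that save_pretrained on a LoRA-wrapped model stores only the adapter weights; when you load that directory later, the base model is fetched and the adapter applied on top. If you prefer a single standalone checkpoint, Unsloth provides a merged-save helper; the call below is a sketch, so check that your installed Unsloth version exposes it:
# Optionally save merged 16-bit weights as a standalone checkpoint
# (assumes your Unsloth version provides save_pretrained_merged)
model.save_pretrained_merged(
    "outputs/mistral_7b_trained_merged",
    tokenizer,
    save_method="merged_16bit",
)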
8. Running Inference
After training, you can use the model to generate text. Here's how to load the trained model and generate text starting with a specific prompt:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread
import textwrap
import torch

# Load the trained model; the saved LoRA adapters are applied on top of the base weights
model = AutoModelForCausalLM.from_pretrained(
    "outputs/mistral_7b_trained", torch_dtype=torch.float16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("outputs/mistral_7b_trained")
text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100
inputs = tokenizer(["Once upon a time, in a galaxy, far far away,"], return_tensors="pt").to("cuda")
generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
# Stream the generated text, wrapping output at max_print_width characters
length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width=max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end="")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end="")
Notes
Ensure your CUDA and GPU drivers are up-to-date to fully utilize the RTX 4090's capabilities.
Training may take several hours depending on the dataset size and batch configuration. Adjust gradient_accumulation_steps and per_device_train_batch_size as needed to balance GPU memory usage and throughput; see the quick arithmetic note below.
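As a rule of thumb, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps. The settings above give 2 × 8 = 16 sequences per optimizer step, so halving the per-device batch size while doubling the accumulation steps keeps the effective batch size the same while lowering peak memory.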
Author Bio
Rafal Jackiewicz is an author of books about programming in C and Java. You can find more information about him and his work on Amazon.