Voice cloning

The ultimate goal of this project is to have a friend's voice read me random e-books.

Whisper.cpp

whisper.cpp is a C/C++ implementation of OpenAI's Whisper automatic speech recognition (ASR) model. You feed it audio and it outputs text.

Unsloth

Unsloth is a lightweight fine-tuning engine. It allows you to fine-tune (train) AI models efficiently.
One can train a Text-To-Speech model that way, using a custom dataset consisting of audio/transcript pairs.

Orpheus-TTS

Orpheus TTS is a SOTA open-source text-to-speech system built from the Llama-3B LLM.
It lets users generate human-like voices from text prompts and supports emotion tags like <yawn>, <giggle>, <sigh>, etc.
Multilingual models are also available on their website, which is neat as my friend speaks rather poor English.

To make use of this model, I downloaded a local client that leverages an LM Studio local server.

Simply put:

  • LMStudio loads the OrpheusTTS model
  • LMStudio launches a local server (the Orpheus model can thus be reached through an API served @localhost:1234)
  • The local client (a Python script and a few dependencies) uses the aforementioned local server to run queries and generate audio files (a minimal sketch of such a query is shown below)
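
Just to illustrate the transport layer, here is a minimal sketch of a query against LM Studio's OpenAI-compatible endpoint. The model name and prompt are placeholders; the real client (gguf_orpheus.py) builds the Orpheus-specific prompt and turns the returned audio tokens into a wav file:

import requests

# Minimal sketch: ask the LM Studio server (default port 1234) for a completion.
# "orpheus-3b-fr" is a placeholder; use the name LM Studio shows for the loaded model.
response = requests.post(
    "http://localhost:1234/v1/completions",
    json={
        "model": "orpheus-3b-fr",
        "prompt": "pierre: Salut, ceci est un test",
        "max_tokens": 1200,
        "temperature": 0.6,
    },
    timeout=300,
)
print(response.json()["choices"][0]["text"][:200])  # raw generated tokens, not audio yet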

LMStudio is available as an AppImage through their official website, i.e.: no installation required.
The software is rather simple to use and one can download models directly from a search field inside its UI.
As the official fine-tuned OrpheusTTS models are gated behind an HF access token, I downloaded the model called freddyaboulton/3b-fr-ft-research_release-Q4_K_M-GGUF, which was forked from the official French fine-tuned OrpheusTTS.

Using the local client I got great results from the default TTS. For multilingual use though, I had to adjust the script to add the French voices (pierre, amelie, and marie).

# ...
AVAILABLE_VOICES = ["tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe", "pierre", "amelie", "marie"]
# ...
$ python gguf_orpheus.py --text "Salut, <giggle> ceci est un test" --voice pierre

This generates audio in wav format in orpheus-tts-local/outputs/.

Resources:
OrpheusTTS GitHub
Local Orpheus client
LM Studio
Available Orpheus languages

Creating a custom dataset

The first step is to gather raw material.

Using OBS, I recorded roughly one hour of audio to have a sample of my friend's voice.

Then, with Audacity, I split the audio track into small samples.

  • Analyze->Label Sounds
  • File->Export Audio->Export multiple->Split by labels

To simplify further processing, I had to bulk rename the exported files to remove whitespace from their filenames. This is achieved with the following one-liner:

$ for f in *.wav; do mv -- "$f" "${f// /}"; done

Using a custom-made script and whisper.cpp, I then transcribed each sample into a text file.

#!/bin/bash

# Transcribe every audio file passed as an argument with whisper.cpp
# (large-v3 model, 8 threads, plain-text output next to each wav).
while [ $# -gt 0 ]; do
    audio=$1
    /home/user/whisper.cpp/build/bin/whisper-cli -m /home/user/whisper.cpp/models/ggml-large-v3.bin -otxt -t 8 -f "$audio"
    shift
done

This basic bash script is used like this:

$ ./transcribe_audio.sh ./Sound*.wav

This gave me a directory containing SoundXXX.wav and associated SoundXXX.wav.txt files.

Ideally, the transcripts should also include emotion tags like <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>, etc. These tags are treated as special tokens by the model. During training, the model then learns to associate these tags with the corresponding audio patterns.

For starters, I'll try without them, mainly because I'm lazy and don't want to manually add them to the previously generated transcripts.

Just to clean the transcriptions a bit, I wrote a basic script to normalize the text files and remove any special characters that could throw Unsloth off guard.

#!/bin/bash

for file in ./*.txt
do
    if [[ -f "$file" ]]; then
        # Join all lines into a single line, then strip commas (they would break the CSV built below).
        sed -i ':a;$!{N;s/\n/ /;ba;}' "$file"
        sed -i "s/,//g" "$file"
    fi
done

Unsloth accepts custom datasets. A custom dataset is built by storing every audio sample in a directory and creating a CSV file containing filename/transcript pairs like this:

audio,text
0001.wav,Hello there!
0002.wav,<sigh> I am very tired.

To combine all transcripts and file paths into a CSV I used the following script:

#!/bin/bash

# Write the header, then one "absolute/path/to/file.wav,transcript" line per sample.
echo "audio,text" > ./custom_dataset.csv
for file in ./*.wav
do
    if [[ -f "$file" ]]; then
        echo "$(dirname "$(realpath "$file")")/$(basename "$file"),$(<"$file.txt")" >> ./custom_dataset.csv
    fi
done
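
Before feeding this CSV to Unsloth, a small sanity check avoids surprises later. This is a hypothetical helper, not part of the original tooling; it verifies that every row has exactly two fields and that each referenced wav file exists:

import csv
import os

# Check the generated CSV: two fields per row, and the audio path must exist.
with open("custom_dataset.csv", newline="") as f:
    rows = list(csv.reader(f))

header, data = rows[0], rows[1:]
assert header == ["audio", "text"], f"Unexpected header: {header}"

for lineno, row in enumerate(data, start=2):
    if len(row) != 2:
        print(f"Line {lineno}: expected 2 fields, got {len(row)} (stray comma in transcript?)")
    elif not os.path.isfile(row[0]):
        print(f"Line {lineno}: missing audio file {row[0]}")

print(f"Checked {len(data)} rows.")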

Now that I have the raw material, I can get to fine-tuning the model.

In Unsloth, use load_dataset("csv", data_files="/path/to/custom_dataset.csv", split="train") to load it. You might need to tell the dataset loader how to handle audio paths. An alternative is to use the datasets.Audio feature to load audio data on the fly:

from datasets import load_dataset, Audio

dataset = load_dataset("csv", data_files="custom_dataset.csv", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))

Then dataset[i]["audio"] will contain the audio array.
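
A quick look at the first example confirms that the Audio feature decodes the file on access (field names are those of the datasets library):

# Inspect the first example: the audio column is decoded into a numpy array on access.
sample = dataset[0]
print(sample["text"])
print(sample["audio"]["sampling_rate"])  # should be 24000 after cast_column
print(sample["audio"]["array"].shape)    # 1-D array of samples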

Ensure transcripts are normalized (no unusual characters that the tokenizer might not know, except the emotion tags if used). Also ensure all audio files have a consistent sampling rate (resample them if necessary to the target rate the model expects, e.g. 24 kHz for Orpheus).

To ensure all audio is at 24 kHz, one can use ffmpeg:

$ mkdir 24khz && for i in *.wav; do ffmpeg -i "$i" -ar 24000 24khz/"$i"; done
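
To double-check the result, a short Python pass over the resampled files (standard library only) can confirm the rate:

import wave
from pathlib import Path

# Verify that every resampled file reports a 24 kHz sample rate.
for path in sorted(Path("24khz").glob("*.wav")):
    with wave.open(str(path), "rb") as w:
        if w.getframerate() != 24000:
            print(f"{path.name}: unexpected sample rate {w.getframerate()}")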

Installing JupyterLab

The easiest way to run the fine-tuning process is to use a Python notebook and run it in Google Colab or JupyterLab.

$ mkdir jupyter && cd jupyter
$ python3.10 -m venv venv
$ source venv/bin/activate
$ pip install jupyterlab
$ jupyter lab

## Or alternatively, to activate venv inside jupyter notebooks
$ pip install ipykernel
$ python -m ipykernel install --user --name=venv
$ jupyter lab
## Then make sure to select the kernel named venv inside jupyter lab

This launches a JupyterLab instance locally at http://localhost:8888/lab

One can then import the sample Colab notebook from Unsloth and follow its instructions.

Fine-tuning

The notebook consists of multiple sections.

Installation

%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth
!pip install snac

This basically installs the unsloth package (and its dependencies) in the Python virtual environment.

Loading the model

from unsloth import FastLanguageModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
    # Qwen3 new models
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    # Other very popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./models/3b-fr-ft-research_release_4bit",
    max_seq_length= 16384, # Choose any for long context!
    dtype = None, # Select None for auto detection
    load_in_4bit = False, # Select True for 4bit which reduces memory usage
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

This loads the model called 3b-fr-ft-research_release_4bit located in ./models/.

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

We then add LoRA adapters so that only a small fraction of the model's parameters gets fine-tuned, which saves memory and training time.
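
To see how small that fraction is, the PEFT-wrapped model can report it (print_trainable_parameters comes from the peft library, not from this notebook):

# Report the number of trainable (LoRA) parameters vs. the total parameter count.
model.print_trainable_parameters()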

Data preparation

from datasets import load_dataset
from datasets import Audio

dataset = load_dataset("csv", data_files="/path/to/custom_dataset.csv", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))

Here we load the dataset.

import locale
import torchaudio.transforms as T
import os
import torch
from snac import SNAC
locale.getpreferredencoding = lambda: "UTF-8"

#pdb.set_trace()
ds_sample_rate = dataset[0]["audio"]["sampling_rate"]

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")
def tokenise_audio(waveform):
  waveform = torch.from_numpy(waveform).unsqueeze(0)
  waveform = waveform.to(dtype=torch.float32)
  resample_transform = T.Resample(orig_freq=ds_sample_rate, new_freq=24000)
  waveform = resample_transform(waveform)

  waveform = waveform.unsqueeze(0).to("cuda")

  #generate the codes from snac
  with torch.inference_mode():
    codes = snac_model.encode(waveform)

  all_codes = []
  for i in range(codes[0].shape[1]):
    all_codes.append(codes[0][0][i].item()+128266)
    all_codes.append(codes[1][0][2*i].item()+128266+4096)
    all_codes.append(codes[2][0][4*i].item()+128266+(2*4096))
    all_codes.append(codes[2][0][(4*i)+1].item()+128266+(3*4096))
    all_codes.append(codes[1][0][(2*i)+1].item()+128266+(4*4096))
    all_codes.append(codes[2][0][(4*i)+2].item()+128266+(5*4096))
    all_codes.append(codes[2][0][(4*i)+3].item()+128266+(6*4096))


  return all_codes

def add_codes(example):
    # Always initialize codes_list to None
    codes_list = None

    try:
        answer_audio = example.get("audio")
        # If there's a valid audio array, tokenise it
        if answer_audio and "array" in answer_audio:
            audio_array = answer_audio["array"]
            codes_list = tokenise_audio(audio_array)
    except Exception as e:
        print(f"Skipping row due to error: {e}")
        # Keep codes_list as None if we fail
    example["codes_list"] = codes_list

    return example

dataset = dataset.map(add_codes, remove_columns=["audio"])

tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009

start_of_speech = tokeniser_length + 1
end_of_speech = tokeniser_length + 2

start_of_human = tokeniser_length + 3
end_of_human = tokeniser_length + 4

start_of_ai = tokeniser_length + 5
end_of_ai =  tokeniser_length + 6
pad_token = tokeniser_length + 7

audio_tokens_start = tokeniser_length + 10

dataset = dataset.filter(lambda x: x["codes_list"] is not None)
dataset = dataset.filter(lambda x: len(x["codes_list"]) > 0)

def remove_duplicate_frames(example):
    vals = example["codes_list"]
    if len(vals) % 7 != 0:
        raise ValueError("Input list length must be divisible by 7")

    result = vals[:7]

    removed_frames = 0

    for i in range(7, len(vals), 7):
        current_first = vals[i]
        previous_first = result[-7]

        if current_first != previous_first:
            result.extend(vals[i:i+7])
        else:
            removed_frames += 1

    example["codes_list"] = result

    return example

dataset = dataset.map(remove_duplicate_frames)

tok_info = '''*** HERE you can modify the text prompt
If you are training a multi-speaker model (e.g., canopylabs/orpheus-3b-0.1-ft),
ensure that the dataset includes a "source" field and format the input accordingly:
- Single-speaker: f"{example['text']}"
- Multi-speaker: f"{example['source']}: {example['text']}"
'''
print(tok_info)

def create_input_ids(example):
    # Determine whether to include the source field
    text_prompt = f"{example['source']}: {example['text']}" if "source" in example else example["text"]

    text_ids = tokenizer.encode(text_prompt, add_special_tokens=True)
    text_ids.append(end_of_text)

    example["text_tokens"] = text_ids
    input_ids = (
        [start_of_human]
        + example["text_tokens"]
        + [end_of_human]
        + [start_of_ai]
        + [start_of_speech]
        + example["codes_list"]
        + [end_of_speech]
        + [end_of_ai]
    )
    example["input_ids"] = input_ids
    example["labels"] = input_ids
    example["attention_mask"] = [1] * len(input_ids)

    return example


dataset = dataset.map(create_input_ids, remove_columns=["text", "codes_list"])
columns_to_keep = ["input_ids", "labels", "attention_mask"]
columns_to_remove = [col for col in dataset.column_names if col not in columns_to_keep]

dataset = dataset.remove_columns(columns_to_remove)

And here the audio is encoded into SNAC codec tokens, duplicate frames are dropped, and each example's text and audio tokens are assembled into a single input_ids sequence framed by the special tokens defined above (start/end of human, AI, text and speech).
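
Before training, a quick structural check of one prepared example helps catch formatting mistakes (a small sketch using the special-token constants defined above):

# Sanity check: every example should be framed by the special tokens, and the
# audio codes all live at or above audio_tokens_start (128266).
sample = dataset[0]
ids = sample["input_ids"]
print("sequence length:", len(ids))
print("starts with start_of_human:", ids[0] == start_of_human)
print("ends with end_of_ai:", ids[-1] == end_of_ai)
audio_token_count = sum(1 for t in ids if t >= audio_tokens_start)
print("audio tokens (should be a multiple of 7):", audio_token_count)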

Training

HuggingFace provides a Trainer implementation; more info is available in the transformers documentation.

from transformers import TrainingArguments,Trainer,DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = Trainer(
    model = model,
    train_dataset = dataset,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 180, # Set this to None for full trainning run
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Here we define the trainer's settings. Note that max_steps is set to speed up the process; better results can be obtained by running a full training run (one or more full epochs). The effective batch size is per_device_train_batch_size × gradient_accumulation_steps = 4.

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

This simply prints the current GPU and memory info.

trainer_stats = trainer.train()

And finally this launches the training. The process can take a while depending on your hardware and the number of steps defined in the trainer settings.
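
The object returned by trainer.train() is a standard transformers TrainOutput, so a few summary numbers can be printed right after:

# Summary numbers from the TrainOutput returned by trainer.train().
print(f"Final training loss: {trainer_stats.training_loss:.4f}")
print(f"Training runtime: {trainer_stats.metrics['train_runtime']:.0f} seconds")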

Inference and testing the fine-tuned model

prompts = [
    "Salut bon bah c'est moi. Ceci est un test.",
]

chosen_voice = None # None for single-speaker

Here we define a prompt.

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Moving snac_model cuda to cpu
snac_model.to("cpu")

prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=128258,
     use_cache = True
  )
token_to_find = 128257
token_to_remove = 128258

token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
    cropped_tensor = generated_ids

mask = cropped_tensor != token_to_remove

processed_rows = []

for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

code_lists = []

for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = row[:new_length]
    trimmed_row = [t - 128266 for t in trimmed_row]
    code_lists.append(trimmed_row)


def redistribute_codes(code_list):
  layer_1 = []
  layer_2 = []
  layer_3 = []
  for i in range((len(code_list)+1)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes]
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
  raise Exception("Number of prompts and samples do not match")
else:
  for i in range(len(my_samples)):
    print(prompts[i])
    samples = my_samples[i]
    display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
# Clean up to save RAM
del my_samples,samples

And here we launch the inference process to generate samples from the previously defined prompt. We can then check the audio and the quality of the cloned voice.
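
The notebook only plays the audio inline. To also keep the generated samples on disk, one could re-decode the code lists and write them out with soundfile (an extra dependency, not part of the original notebook):

import soundfile as sf

# Rebuild each waveform from its code list and write it out as a 24 kHz wav file.
for i, code_list in enumerate(code_lists):
    audio = redistribute_codes(code_list).detach().squeeze().to("cpu").numpy()
    sf.write(f"cloned_sample_{i}.wav", audio, 24000)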

And voilà. We successfully cloned a voice.

Saving the model to gguf format to be used with local orpheus-TTS and LMStudio

To save the trained model to a gguf file, simply run the following inside the notebook:

# Export to GGUF format
if True: model.save_pretrained_gguf("gguf", tokenizer, quantization_method = "q4_k_m")

This will create a directory named gguf containing the 4-bit quantized model in GGUF format.
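
If you also want a copy that is not 4-bit quantized, Unsloth's standard notebooks expose a couple of other save methods (shown here as a sketch; the directory names are arbitrary):

# Save only the LoRA adapters (small; the base model must be reloaded alongside them).
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Or save a merged float16 checkpoint.
if False: model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")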

And voilà. (BIS)

Results

Using only one hour's worth of raw audio recording (which amounts to roughly 20 minutes once the silences and weird sentences are set aside), and even omitting emotion tags, I got impressive results in a matter of minutes.

The model still hallucinates sometimes though, which I imagine could be mitigated by carefully crafting a larger dataset and enhancing it with emotion tags.