How to build a tiny language model (LLM) in Ruby: a step-by-step guide
In this article, we will walk through how to create a very simple language model in Ruby. While real large language models (LLMs) require enormous amounts of data and computational resources, we can build a toy model that demonstrates many of the core concepts behind language modeling. In our example, we will build a basic Markov chain model that "learns" from input text and then generates new text based on the patterns it has observed.
Note: This tutorial is intended for educational purposes and demonstrates a simplified approach to language modeling. It is not a substitute for modern LLMs such as GPT-4, but rather an introduction to the underlying ideas.
Table of contents
- Understanding the basics of language models
- Setting up the Ruby environment
- Preparing and preprocessing the data
- Building the Markov chain model
- Training the model
- Generating and testing text
- Conclusion
Understanding the basics of language models
A language model is a system that assigns probabilities to sequences of words. In essence, it is designed to capture the statistical structure of language by learning how likely a particular word is to appear in a given context. This means the model analyzes large bodies of text to understand how words typically follow one another, allowing it to predict the word or phrase that is likely to come next in a sequence. Such capabilities are essential not only for tasks like text generation and autocompletion, but also for a variety of natural language processing (NLP) applications, including translation, summarization, and sentiment analysis.
Widely used modern large language models (LLMs) such as GPT-4 rely on deep learning techniques and huge datasets to capture complex patterns in language. They work by processing input text through many layers of artificial neurons, which allows them to understand and generate human-like text with remarkable fluency. Behind these advanced systems, however, lies the same basic idea: understanding and predicting sequences of words based on learned probabilities.
One of the simplest ways to build a language model is with a Markov chain. A Markov chain is a statistical model that assumes the probability of a word depends only on a limited set of preceding words, rather than on the entire history of the text. This assumption is known as the Markov property. In practical terms, the model predicts the next word in a sequence by looking only at the most recent word (or words), a simplification that makes the problem tractable while still capturing useful patterns in the data.
In a Markov chain language model:
- The future state (the next word) depends only on the current state (the preceding words): once we know the last few words (how many is determined by the model's order), we have enough context to predict what may come next. There is no need to consider the entire conversation or document history, which reduces complexity.
- We build a probability distribution over what comes next, given the preceding word(s): as the model is trained on a body of text, it learns how likely different words are to follow a particular sequence. This distribution is then used during generation to pick the next word, typically by random sampling that respects the learned probabilities.
In our implementation, we will use a configurable "order" to determine how many previous words are taken into account when making predictions. A higher order provides more context, which may lead to more coherent and relevant text because the model has more information about what came before. Conversely, a lower order introduces more randomness and can produce more creative, though less predictable, output. This trade-off between coherence and creativity is a key consideration in language modeling.
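As a quick illustration of what the order changes, the snippet below prints the transitions an order-1 and an order-2 chain would record for one sentence. The sentence and the hash contents in the comments are only a sketch, not output from the class we build later:

sentence = "the wind whispered secrets through the trees"
words = sentence.split

# Order 1: a single word is the key.
# e.g. { "the" => ["wind", "trees"], "wind" => ["whispered"], ... }
words.each_cons(2) { |prev, nxt| puts "#{prev} -> #{nxt}" }

# Order 2: a pair of words is the key, so each prediction sees more context.
# e.g. { "the wind" => ["whispered"], "wind whispered" => ["secrets"], ... }
words.each_cons(3) { |a, b, nxt| puts "#{a} #{b} -> #{nxt}" }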
Understanding these basic principles lets us appreciate both the simplicity of Markov chain models and the foundational ideas that underpin more complex neural models. This broader view not only illuminates the statistical machinery behind language prediction, but also provides a solid footing for exploring more advanced natural language processing techniques.
Setting up the Ruby environment
Before starting, make sure Ruby is installed on your system. You can check your Ruby version by running:
ruby -v
If Ruby is not installed, you can download it from ruby-lang.org.
For our project, you may want to create a dedicated directory and file:
mkdir tiny_llm
cd tiny_llm
touch llm.rb
You are now ready to write some Ruby code.
Preparing and preprocessing the data
Training data collection
For a language model, you need a corpus of text. You can use any text file for training. For our simple example, a small sample of text will do, for example:
sample_text = <<~TEXT
Once upon a time in a land far, far away, there was a small village.
In this village, everyone knew each other, and tales of wonder were told by the elders.
The wind whispered secrets through the trees and carried the scent of adventure.
TEXT
Preprocessing the data
Before training, it is useful to preprocess the text:
- Tokenization: split the text into words.
- Normalization: optionally convert the text to lowercase, remove punctuation, and so on.
For our purposes, Ruby's String#split method works well enough as a tokenizer.
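Here is a minimal preprocessing sketch; the punctuation handling shown is one possible choice and is not required by the model below:

raw = "Once upon a time, in a land far, far away..."

# Normalize: lowercase, drop everything except letters, apostrophes, and spaces, then tokenize.
tokens = raw.downcase.gsub(/[^a-z'\s]/, "").split
p tokens
# => ["once", "upon", "a", "time", "in", "a", "land", "far", "far", "away"]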
Building the Markov chain model
We will create a Ruby class named MarkovChain to encapsulate the model's behavior. The class will include:
- An initializer that sets the order (the number of previous words) for the chain.
- A train method that builds the chain from the input text.
- A generate method that produces new text by sampling from the chain.
Below is the full code for the model:
class MarkovChain
  def initialize(order = 2)
    @order = order
    # The chain is a hash that maps a sequence of words (key) to an array of possible next words.
    @chain = Hash.new { |hash, key| hash[key] = [] }
  end

  # Train the model using the provided text.
  def train(text)
    # Optionally normalize the text (e.g., downcase)
    processed_text = text.downcase.strip
    words = processed_text.split

    # Iterate over the words using a sliding window.
    words.each_cons(@order + 1) do |words_group|
      key = words_group[0...@order].join(" ")
      next_word = words_group.last
      @chain[key] << next_word
    end
  end

  # Generate new text using the Markov chain.
  def generate(max_words = 50, seed = nil)
    # Choose a random seed from the available keys if none is provided or if the seed is invalid.
    if seed.nil? || !@chain.key?(seed)
      seed = @chain.keys.sample
    end

    generated = seed.split
    while generated.size < max_words
      # Form the key from the last 'order' words.
      key = generated.last(@order).join(" ")
      possible_next_words = @chain[key]
      break if possible_next_words.nil? || possible_next_words.empty?

      # Randomly choose the next word from the possibilities.
      next_word = possible_next_words.sample
      generated << next_word
    end

    generated.join(" ")
  end
end
Explanation of the code
- **Initialization:** The initialize method sets the order (default 2) and creates an empty hash for our chain. The hash is given a default block so that each new key starts out as an empty array.
- **Training the model:** The train method takes a string of text, normalizes it, and splits it into words. Using each_cons, it builds consecutive groups of words of length order + 1. The first order words act as the key, and the last word is appended to the array of possible continuations for that key.
- **Text generation:** The generate method starts from a seed key. If none is provided, a random key is chosen. It then builds a sequence by looking up the last order words and sampling a next word until the maximum number of words is reached.
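To make the training step concrete, the sketch below trains an order-2 chain on a single sentence and peeks at the internal hash. The instance_variable_get call is used purely for illustration, and the contents shown in the comments are what the class above would record for this sentence:

model = MarkovChain.new(2)
model.train("The wind whispered secrets through the trees")

# Inspect the learned transitions (illustration only).
p model.instance_variable_get(:@chain)
# => { "the wind"          => ["whispered"],
#      "wind whispered"    => ["secrets"],
#      "whispered secrets" => ["through"],
#      "secrets through"   => ["the"],
#      "through the"       => ["trees"] }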
Training the model
Now that we have the MarkovChain class, let's train it on some text data.
# Sample text data for training
sample_text = <<~TEXT
Once upon a time in a land far, far away, there was a small village.
In this village, everyone knew each other, and tales of wonder were told by the elders.
The wind whispered secrets through the trees and carried the scent of adventure.
TEXT
# Create a new MarkovChain instance with order 2
model = MarkovChain.new(2)
model.train(sample_text)
puts "Training complete!"
When you run the code above (for example, by saving it in llm.rb and executing ruby llm.rb), the model will be trained on the sample text.
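If you would rather train on something larger than the inline sample, you could read a corpus from a file instead; a small sketch, where corpus.txt is a hypothetical file you supply yourself:

# Train from a plain-text file (corpus.txt is a hypothetical path).
corpus_path = "corpus.txt"
if File.exist?(corpus_path)
  model.train(File.read(corpus_path))
  puts "Trained on #{corpus_path}"
end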
Generating and testing text
Once the model is trained, you can generate new text. Let's add a few lines of code to generate and print a sample:
# Generate new text using the trained model.
generated_text = model.generate(50)
puts "Generated Text:"
puts generated_text
You can also provide a seed to guide generation. For example, if you know one of the keys in the model (such as "once upon"), you can do:
seed = "once upon"
generated_text_with_seed = model.generate(50, seed)
puts "\nGenerated Text with seed '#{seed}':"
puts generated_text_with_seed
By experimenting with different seeds and parameters (such as the order and the maximum number of words), you can see how the output changes.
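One convenient way to experiment is to loop over a few orders and compare the results side by side; a quick sketch (output will vary from run to run because sampling is random):

# Compare outputs for a few different orders.
[1, 2, 3].each do |order|
  m = MarkovChain.new(order)
  m.train(sample_text)
  puts "Order #{order}: #{m.generate(20)}"
end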
Full example: training and testing the tiny LLM
Below is the full Ruby script that combines all of the steps above:
#!/usr/bin/env ruby
# llm.rb

# Define the MarkovChain class
class MarkovChain
  def initialize(order = 2)
    @order = order
    @chain = Hash.new { |hash, key| hash[key] = [] }
  end

  def train(text)
    processed_text = text.downcase.strip
    words = processed_text.split
    words.each_cons(@order + 1) do |words_group|
      key = words_group[0...@order].join(" ")
      next_word = words_group.last
      @chain[key] << next_word
    end
  end

  def generate(max_words = 50, seed = nil)
    if seed.nil? || !@chain.key?(seed)
      seed = @chain.keys.sample
    end

    generated = seed.split
    while generated.size < max_words
      key = generated.last(@order).join(" ")
      possible_next_words = @chain[key]
      break if possible_next_words.nil? || possible_next_words.empty?

      next_word = possible_next_words.sample
      generated << next_word
    end

    generated.join(" ")
  end
end

# Sample text data for training
sample_text = <<~TEXT
  Once upon a time in a land far, far away, there was a small village.
  In this village, everyone knew each other, and tales of wonder were told by the elders.
  The wind whispered secrets through the trees and carried the scent of adventure.
TEXT

# Create and train the model
model = MarkovChain.new(2)
model.train(sample_text)
puts "Training complete!"

# Generate text without a seed
generated_text = model.generate(50)
puts "\nGenerated Text:"
puts generated_text

# Generate text with a specific seed
seed = "once upon"
generated_text_with_seed = model.generate(50, seed)
puts "\nGenerated Text with seed '#{seed}':"
puts generated_text_with_seed
Running the script
- Save the script as llm.rb.
- Open your terminal and navigate to the directory containing llm.rb.
- Run the script with:
ruby llm.rb
You should see output indicating that the model has been trained, followed by two examples of generated text.
Benchmarks
The following table summarizes some benchmark figures for different versions of the tiny LLM implementation. Each metric is explained below:
- Model: the name or version of the language model.
- Order: the number of previous words used by the Markov chain to predict the next word. A higher order generally means more context is used, and potentially greater coherence.
- Training time (ms): the approximate time taken to train the model on the provided text data, measured in milliseconds.
- Generation time (ms): the time needed to generate a text sample, measured in milliseconds.
- Memory usage (MB): how much memory the model consumes during training and generation.
- Coherence rating: a subjective rating (out of 5) of how coherent the generated text is in context.
Below is the table with the benchmark data:
Model | Order | Training time (ms) | Generation time (ms) | Memory usage (MB) | Coherence rating
---|---|---|---|---|---
Tiny LLM v1 | 2 | 50 | 10 | 10 | 3/5
Tiny LLM v2 | 3 | 70 | 15 | 12 | 3.5/5
Tiny LLM v3 | 4 | 100 | 20 | 15 | 4/5
These benchmarks give a quick overview of the trade-offs between model configurations. As the order increases, the model tends to take slightly longer to train and to generate text, and it uses more memory. However, these increases in resource consumption are often accompanied by improvements in the coherence of the generated text.
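If you want to take similar measurements on your own machine, Ruby's built-in Benchmark module is one simple option; a rough sketch (your numbers will depend on your hardware and corpus):

require "benchmark"

model = MarkovChain.new(2)

# Benchmark.realtime returns elapsed seconds; convert to milliseconds.
train_ms    = Benchmark.realtime { model.train(sample_text) } * 1000
generate_ms = Benchmark.realtime { model.generate(50) } * 1000

puts format("Training time:   %.2f ms", train_ms)
puts format("Generation time: %.2f ms", generate_ms)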
Conclusion
In this tutorial, we showed how to create a very simple language model in Ruby. Using the Markov chain technique, we built a system that:
- Trains on sample text by learning word transitions.
- Generates new text based on the learned patterns.
While this toy model is far from a production-level LLM, it serves as a stepping stone to understanding how language models work at a basic level. You can expand on the idea by incorporating more advanced techniques, handling punctuation better, or even combining Ruby with machine learning libraries for more sophisticated models.
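As one small example of the punctuation idea, you could keep sentence-ending periods as their own tokens so the chain can learn where sentences tend to stop; a rough sketch, not part of the class above:

# Keep periods as separate tokens (illustrative sketch only).
def tokenize_keeping_periods(text)
  text.downcase.gsub(".", " . ").split
end

p tokenize_keeping_periods("It was a small village. The wind whispered.")
# => ["it", "was", "a", "small", "village", ".", "the", "wind", "whispered", "."]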
Happy coding!