SpaCy
Some of the concepts, methods, and objects used in the SpaCy library (for NLP) will be covered here. There will be another set of NLP notes under the Deep Learning section.
NLP using SpaCy
This is mainly a brief summary of the NLP course offered by Kaggle, supplemented with some other information that can be useful. SpaCy is the leading library for NLP and one of the most popular Python frameworks. Here I will mention some of the important methods used for Natural Language Processing with SpaCy.
First, we need to specify the model we are using (essentially, the language). Here is a small snippet that loads the English language model:
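A minimal sketch, assuming the spaCy 2.x shortcut model is installed (e.g., via `python -m spacy download en`):

```python
import spacy

# Load the English language model (must be downloaded beforehand)
nlp = spacy.load('en')
```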
After loading the model, we can process a text in English using the `nlp` variable we just defined:
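For example:

```python
# Process a text with the loaded pipeline; the result is a Doc object
doc = nlp("The quick brown fox jumps over the lazy dog.")
```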
Notice: The sentence above actually uses every letter in the English alphabet at least once.
Now, let me introduce some of the most common terms used in NLP.
1. Tokenization
A token is a string with an assigned, identified meaning. It is structured as a pair consisting of a token name and an optional token value. The token name is a category of a lexical unit. Common token names are:

- identifier: names the programmer chooses;
- keyword: names already in the programming language;
- separator (also known as punctuator): punctuation characters and paired delimiters;
- operator: symbols that operate on arguments and produce results;
- literal: numeric, logical, textual, and reference literals;
- comment: line or block.
Note: Simply, a token is a unit of text in the document.
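A quick sketch of tokenization with SpaCy:

```python
import spacy

nlp = spacy.load('en')
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Iterating over a Doc yields Token objects
print([token.text for token in doc])
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```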
2. Text Processing
There are preprocessing methods that can improve an NLP model. One of them is lemmatizing: the lemma of a word is its base form. For example, `jump` is the lemma of the words `jumps`, `jumping`, `jumped`, etc. In our example, lemmatizing the word "jumps" converts it to "jump".
Removing stopwords is another common practice in NLP. A stopword is a word that occurs frequently in the language while carrying little useful information. Some of the stopwords in English are "the", "is", "and", "or", "but", "not", "over", etc.
After creating a token, we can use the attributes `lemma_` and `is_stop`, which give the lemma of the word and whether the given token is a stopword, respectively.
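Continuing with the same `doc` from above, a small sketch:

```python
# Inspect the lemma and stopword flag of each token
print("Token\t\tLemma\t\tStopword")
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")
```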
Lemmas and stopwords are important because text data is often very noisy, with informative content mixed in. In the example above, the important words are `quick`, `brown`, `fox`, `jump`, `lazy`, and `dog`.
Removing stopwords often helps the model home in on relevant words. Lemmatizing also helps reduce model complexity by collapsing words with the same base form into a single term.
On the other hand, lemmatizing and dropping stopwords can sometimes worsen your model's performance, so it is best to treat preprocessing choices like hyperparameter optimization: try different options and keep what improves your results.
3. Pattern Matching
Matching tokens or phrases within a piece of text (or an entire document) is a common NLP task. It can be done with regular expressions, but SpaCy's matching tools are much easier to use.
First, we will create a `Matcher` or a `PhraseMatcher` object to match individual tokens or lists of tokens, respectively. Let's create a `PhraseMatcher` object as an example:
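A minimal sketch, assuming the spaCy 2.x API that the rest of these notes follow; the term list here is made up for illustration:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en')
# attr='LOWER' makes the matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

# Hypothetical list of phrases to look for
terms = ['quick brown fox', 'lazy dog']
patterns = [nlp(term) for term in terms]
matcher.add("ExamplePhrases", None, *patterns)  # spaCy 2.x signature

doc = nlp("The quick brown fox jumps over the lazy dog.")
matches = matcher(doc)
print(matches)  # [(match_id, start, end), ...] — e.g., start=1, end=4 for 'quick brown fox'
```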
As can be seen in the code snippet above, the matcher returns a list of tuples, each of the form `(match_id, start, end)`, where `start` and `end` are the token positions marking the beginning and end of the matched span.
4. Bag of Words
Most machine learning models require numerical data, so you need to transform the text into numeric values somehow.
One way to do this is to use an idea similar to one-hot encoding: each document is represented as a vector of frequencies for each term in the vocabulary, where the vocabulary is built from all the tokens (terms) in the corpus.
Consider the earlier sentence "The quick brown fox jumps over the lazy dog." together with "The fox bit the dog." Then the vocabulary is (excluding punctuation):

`{'the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', 'bit'}`
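A pure-Python sketch of the resulting frequency vectors, with terms counted in vocabulary order:

```python
sentences = ["The quick brown fox jumps over the lazy dog.",
             "The fox bit the dog."]
vocab = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', 'bit']

for sentence in sentences:
    words = sentence.lower().rstrip('.').split()
    # Count how many times each vocabulary term occurs in the sentence
    print([words.count(term) for term in vocab])
# [2, 1, 1, 1, 1, 1, 1, 1, 0]
# [2, 0, 0, 1, 0, 0, 0, 1, 1]
```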
Notice that `0`'s represent terms missing from a sentence; for example, sentence 1 does not include the term `bit`. This representation is called a bag of words. Vocabularies may contain thousands of terms depending on the context, so these vectors can become very large.
Once you have the bag of words representation of your documents, you can feed those vectors to any machine learning model. SpaCy's `TextCategorizer` class handles this conversion and builds a simple linear model for you.
The `TextCategorizer` is a pipe, where pipes are classes for processing and transforming tokens. When you create a SpaCy model with `nlp = spacy.load('en')`, there are default pipes performing different transformations. When you run text through the model with `doc = nlp("Example text")`, the output of the pipes is attached to the tokens in the `doc` object; the lemmas from `token.lemma_` come from one of these pipes. First, we will create a model without any pipes except for a tokenizer. Then, we'll create a `TextCategorizer` pipe and add it to the empty model:
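A sketch of this setup, assuming the spaCy 2.x `create_pipe` API:

```python
import spacy

# Create an empty English model; it contains only a tokenizer
nlp = spacy.blank("en")

# Create a TextCategorizer with exclusive classes and the "bow" architecture
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": True, "architecture": "bow"})
nlp.add_pipe(textcat)

# Register the labels for the spam/ham problem
textcat.add_label("ham")
textcat.add_label("spam")
```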
In the problem above, we are modeling a binary classification problem whose classes are `spam` and `ham`; hence, the classes are exclusive. Finally, `bow` stands for the bag of words architecture; here we picked a simple architecture.
Other common representations are TF-IDF (Term Frequency - Inverse Document Frequency), and Word Embeddings (or Vectors). Using those can improve model performance.
5. TF-IDF
TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value grows with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. It is one of the most popular term-weighting schemes. According to Wikipedia, a survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use TF-IDF. The TF-IDF formula combines two different metrics:
The term frequency (TF) of a word in a specific document, which measures how often the term appears in that document. The simplest approach is a binary weight of `0` or `1`, depending on whether or not the term appears in the document. Some of its variants are:
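Following the standard definitions, with $f_{t,d}$ denoting the raw count of term $t$ in document $d$:

```latex
\begin{aligned}
\text{binary:}\qquad & \mathrm{tf}(t,d) \in \{0, 1\} \\
\text{raw count:}\qquad & \mathrm{tf}(t,d) = f_{t,d} \\
\text{relative frequency:}\qquad & \mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \\
\text{log normalization:}\qquad & \mathrm{tf}(t,d) = \log\left(1 + f_{t,d}\right) \\
\text{augmented frequency:}\qquad & \mathrm{tf}(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t'} f_{t',d}}
\end{aligned}
```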
The inverse document frequency (IDF) of the word across a set of documents, which measures how common (or rare) a term is in the entire set of documents. The closer IDF gets to 0, the more common the term is across the documents. Some of its variants are:
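Again following the standard definitions, with $N$ the number of documents in the corpus and $n_t$ the number of documents containing term $t$:

```latex
\begin{aligned}
\text{idf:}\qquad & \mathrm{idf}(t) = \log\frac{N}{n_t} \\
\text{idf, smooth:}\qquad & \mathrm{idf}(t) = \log\frac{N}{1 + n_t} + 1 \\
\text{probabilistic idf:}\qquad & \mathrm{idf}(t) = \log\frac{N - n_t}{n_t}
\end{aligned}
```

The TF-IDF weight of a term in a document is then the product of the two: $\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)$.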
6. Word Embeddings (Vectors)
Word embeddings are numerical representations of the words in documents such that the vector corresponds to how the word is used or what it possibly means. Vector encodings are learned by taking the context into consideration. Words showing up in similar contexts will have similar vectors. For example, vectors for "cat", "lion", and "tiger" will be close together, while they'll be far away from "space" and "house".
Moreover, the relations between words can be expressed mathematically using similarity measures such as cosine similarity. As a result, operations available on vectors work on word embeddings as well. For example, subtracting the vectors corresponding to "man" and "woman" will return another vector. Adding the resulting vector to the vector corresponding to "king" will give a vector close to the one for "queen."
SpaCy provides embeddings learned from the `Word2Vec` model. These embeddings can be accessed by loading language models like `en_core_web_lg`. After loading the model, the embeddings are available on tokens via the `.vector` attribute.
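A small sketch, assuming the large English model has been installed (`python -m spacy download en_core_web_lg`):

```python
import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

doc = nlp("The quick brown fox jumps over the lazy dog.")
# Stack the per-token embeddings into a single matrix
vectors = np.array([token.vector for token in doc])
print(vectors.shape)  # (10, 300): one 300-dimensional vector per token
```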
7. Training a TextCategorizer Model
We need to convert the data labels into the form `TextCategorizer` requires. For example, if a text (in an email) is spam, we will create the dictionary `{'spam': True, 'ham': False}`.
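A sketch with a hypothetical two-example dataset; spaCy 2.x expects the label dictionary under a `'cats'` key:

```python
# Hypothetical training texts and their labels
train_texts = ["WINNER!! Claim your FREE prize now!!!",
               "Are we still meeting for lunch today?"]
train_labels = [{'cats': {'spam': True, 'ham': False}},
                {'cats': {'spam': False, 'ham': True}}]

# Pair each text with its label dictionary
train_data = list(zip(train_texts, train_labels))
```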
We are now ready to train our first model. First, we will create an `optimizer` object; then we will split our data into mini-batches to increase training efficiency. Finally, we will update the model's parameters using its `update` method.
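A sketch of a single training pass, again assuming the spaCy 2.x API (`nlp` and `train_data` as defined above):

```python
import spacy
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# One pass over the data in mini-batches
for batch in minibatch(train_data, size=8):
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)
```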
Notice that this makes only one pass (epoch) through the whole dataset. To get better results, we need to randomly shuffle the data and go through it multiple times, i.e., train for several epochs.
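A sketch of the full training loop with shuffling; the number of epochs here is an arbitrary choice:

```python
import random
from spacy.util import minibatch

random.seed(1)
losses = {}
for epoch in range(10):
    # Reshuffle the data before every epoch
    random.shuffle(train_data)
    for batch in minibatch(train_data, size=8):
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)
```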
8. Making Predictions
Now that we have trained a model, we can make predictions using the `predict()` method. We need to be careful here: the input must be tokenized with `nlp.tokenizer` before being fed to the model.
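A sketch using hypothetical example texts, with `nlp` as the trained model from above:

```python
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA"]

# Tokenize first; predict() expects Doc objects, not raw strings
docs = [nlp.tokenizer(text) for text in texts]

textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)  # one row of class probabilities per doc

# Choose the label with the highest probability for each document
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])
```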
These scores are turned into a single predicted class by choosing the label with the highest probability: the index of the highest probability is obtained with `argmax` on the scores, and that index is then used to look up the label in `textcat.labels`. To measure model performance, multiple metrics are available, such as accuracy, precision, recall, F1-score, the ROC curve, and AUC.
These topics are covered in the Hands-on Machine Learning section, Chapter 3.