A Guide to Natural Language Processing
Introduce yourself to the world of natural language processing (NLP) by learning about some basic algorithms for stemming and splitting words automatically.
Natural language processing (NLP) is a set of techniques that can be used to achieve many different objectives. Take a look at the following table to figure out which technique(s) can solve your particular problem.
WHAT YOU NEED | WHERE TO LOOK
Grouping similar words for search | Stemming, Splitting Words (k-grams)
Finding words with the same meaning for search | Latent Semantic Analysis
Generating realistic names | Generating Names
Understanding how much time it takes to read a text | Reading Time
Understanding how difficult a text is to read | Calculating the Readability of a Text (Flesch-Kincaid, Dale-Chall)
Identifying the language of a text | Language identification libraries (e.g., Polyglot)
Generating a summary of a text | SumBasic (word-based), Graph-Based Methods: TextRank (relationship-based), Latent Semantic Analysis (semantic-based)
Finding similar documents | Latent Semantic Analysis
Identifying entities (i.e. cities, people) in a text | Named-Entity Recognition
Understanding the attitude expressed in a text | Sentiment Analysis
Translating a text | Translation (e.g., Moses)
We are going to talk about parsing in the general sense of analyzing a document and extracting its meaning. We will touch on the actual parsing of natural languages, but we will spend most of the time on other techniques. When it comes to understanding programming languages, parsing is the way to go; for natural languages, however, you can pick specific alternatives. In other words, we are mostly going to talk about what you would use instead of parsing to accomplish your goals.
For instance, if you wanted to find all for statements in a programming language file, you would parse it and then count the number of fors. Instead, you are probably going to use something like stemming to find all mentions of cats in a natural language document.
This is necessary because the theory behind the parsing of natural languages might be the same one that is behind the parsing of programming languages; however, the practice is very dissimilar. In fact, you are not going to build a parser for a natural language — that is, unless you work in artificial intelligence or as a researcher, and even in those cases, you are rarely going to use one. Rather, you are going to find an algorithm that works as a simplified model of the document and that can only solve your specific problem.
In short, you are going to find tricks to avoid actually having to parse a natural language. That is why this area of computer science is usually called natural language processing rather than natural language parsing.
Algorithms That Require Data
We are going to see specific solutions to each problem. Mind you, these specific solutions can be quite complex themselves. The more advanced they are, the less they rely on simple algorithms: usually, they need a vast amount of data about the language. A logical consequence of this is that it is rarely easy to adapt a tool built for one language to another one. Or rather, the tool might work with few adaptations, but building the database would require a lot of investment. So, for example, you would probably find a ready-to-use tool to create a summary of an English text, but maybe not one for an Italian one.
For this reason, in this series, we concentrate mostly on English language NLP tools. Although we mention whether these tools work for other languages, you do not need to know the theoretical differences between languages, such as the number of genders or cases they have. However, you should be aware that the more different a language is from English, the harder it will be to apply these techniques or tools to it.
For example, you should not expect to find tools that can work with Chinese (or rather the Chinese writing system). It is not necessarily that these languages are harder to understand programmatically, but there might be less research on them or the methods might be completely different from the ones adopted for English.
The Structure of This Guide
This article is organized according to the tasks we want to accomplish, which means that the tools and explanations are grouped according to the task they are used for. For instance, there is a section about measuring the properties of a text, such as its difficulty. The sections are also generally in ascending order of difficulty: it is easier to classify words than entire documents. We start with simple information retrieval techniques and end with techniques from the proper field of natural language processing.
We think this is the most useful way to provide the information you need: If you need to do X, we directly show the methods and tools you can use.
Classifying Words
With the expression classifying words, we include techniques and libraries that group words together.
Grouping Similar Words
We are going to talk about two methods that can group together similar words for the purpose of information retrieval. Basically, these are methods used to find documents with the words we care about from a pool of documents. That is useful because if a user searches for documents containing the word "friend," they are probably equally interested in documents containing "friends" and possibly "friended" and "friendship."
So, to be clear, in this section, we are not going to talk about methods to group semantically connected words, such as identifying all pets or all English towns.
The two methods are stemming and splitting words into groups of characters. The algorithms for the former are language-dependent, while the ones for the latter are not. We are going to examine each of them in separate paragraphs.
Stemming
Stemming is the process of finding the stem, or root, of a word. In this context, the stem is not necessarily the morphological root according to linguists. So, it is not the form of a word that you would find, say, in a dictionary. For example, an algorithm may produce the stem "consol" for the word "consoling," while in a dictionary, as a root, you would find "console."
A typical application of stemming is grouping together all instances of words with the same stem for usage in a search library. So, if a user searches for documents containing "friend" they can also find ones with "friends" or "friended."
Porter Stemming Algorithm
Let’s talk about an algorithm that removes suffixes to find the stem: the effective and widely used Porter Stemming Algorithm. The algorithm was originally created by Martin Porter for English. There are also Porter-based/inspired algorithms for other languages, such as French or Russian. You can find all of them on the website Snowball. Snowball is a simple language to describe stemming algorithms, but the algorithms are also described in plain English.
A complete description of the algorithm is beyond the scope of this guide. However, its foundation is easy to grasp. Fundamentally, the algorithm divides a word into regions and then replaces or removes certain suffixes if they are completely contained in said region. So, for example, the Porter2 algorithm (i.e. the updated version) states that:
R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
And then that you should replace "-tional" with "-tion" if it is found inside R1.
For example:
The word "confrontational" has as R1 region "-frontational"
"-tional" is completely contained in its R1
"Confrontational" becomes "confrontation"
Porter Stemmer is purely algorithmic; it does not rely on an external database or computed rules (i.e. rules created according to a training database). This is a great advantage because it makes it predictable and easy to implement. The disadvantage is that it cannot handle exceptional cases and known mistakes cannot be easily solved. For example, the algorithm creates the same stem for university and universal.
Porter Stemmer is not perfect — but it is simple, effective, and easy to implement. For a language like English, a stemmer can be realized by any competent developer. So, there are many out there for all notable programming languages and we are not going to list them here.
Typical Issues With Other Languages
Most languages that are somewhat close to English, like German or even the Romance languages, are generally easy to stem. Mind you, the creation of the algorithm itself is complex and requires great knowledge of the language. However, once somebody has done the hard work of creating an algorithm, implementing it is easy.
In stemming, there are problems with two kinds of languages you will usually encounter. The first kind is agglutinative languages. Setting aside the linguistic meaning of the expression, the issue is that agglutinative languages pile up prefixes and suffixes on the root of a word.
In particular, Turkish is problematic because it is both an agglutinative language and a concatenative one, which means that, basically, a Turkish word can represent a whole English sentence. This makes it hard to develop a stemming algorithm for Turkish, but it also makes stemming less useful: if you stem a Turkish word, you might end up with one stem for each sentence, so you lose a lot of information.
The second kind of issue is related to languages with no clearly defined words. Chinese is the prime example of a language that has no alphabet, but only symbols that represent concepts. So, stemming has no meaning for Chinese. Even determining the boundaries of concepts is hard. The problem of dividing a text into its component words is called word segmentation. In English, you can find the boundaries of words just by looking at whitespace or punctuation. There are no such things in Chinese.
Splitting Words
An alternative method to group similar words together relies on splitting them. The foundation of this method is taking words apart into sequences of characters. These sequences are called k-grams, also known as character n-grams (n-grams might also indicate groups of words). The sequence of characters is built in a sliding manner, advancing by one character at each step, starting and ending with a special symbol that indicates the boundaries of the word. For example, the 3-grams for happy are:
$ha
hap
app
ppy
py$
The symbol $ is used to indicate the beginning and the end of the word.
The exact method used for search is beyond the scope of this article. In general terms, you apply the same process to the search term(s) and then compare the occurrences of the k-grams of the input with those of the words in the documents. Usually, you apply a statistical coefficient, like the Jaccard coefficient, to determine how similar the words have to be to be grouped together (i.e. how many grams they have to have in common). For example, with a high coefficient, you might group together cat and cats but keep cat and catty apart.
It is important to note a couple of things: the order of the k-grams and spelling mistakes. The order of the k-grams does not matter; in theory, you could have completely different words that happen to have the same k-grams, but in practice, this does not happen. The method is imprecise, which also means it can protect against the user's spelling mistakes. For example, even if the user inputs "locamotive" instead of "locomotive," it will probably still show the correct results. That is because 7 of the 10 3-grams match. Exact matches would rank higher, but the word "locamotive" does not exist, so it probably has no matches.
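As a concrete illustration, here is a minimal sketch of k-gram extraction and a Jaccard comparison (the function names are ours, and a real search library would index the grams rather than compare words pairwise):

def kgrams(word, k=3):
    # Pad with boundary symbols so prefixes and suffixes form their own grams.
    padded = f"${word}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b):
    # Jaccard coefficient: size of the intersection over size of the union.
    return len(a & b) / len(a | b)

print(sorted(kgrams("happy")))  # ['$ha', 'app', 'hap', 'ppy', 'py$']
# "locomotive" and "locamotive" share 7 of their 3-grams.
print(round(jaccard(kgrams("locomotive"), kgrams("locamotive")), 2))  # 0.54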
Limits and Effectiveness
The great advantage of this technique is that it is not just purely algorithmic and very simple, but it also works with all languages. You do not need to build k-grams for English differently from the ones for French; you take apart the words in the same way. That said, the effectiveness is in the details: you have to pick the right value of k to get the best results.
The ideal number depends on the average length of words in the language. It should be lower than or equal to that. Different languages might have different ideal values, but in general, you can get away with four or five. You will not have the absolute best results with a single choice, but it will work.
The disadvantage is that, let's face it, it looks extremely stupid: it is so simple that it should not work. But it actually does; it works as well as, if not better than, stemming (PDF). It is shamelessly effective, and it has many other uses. We are going to see one right now.
Generating Names
The general case of generating fake words that look like real words is hard and of limited use. You could create phrases for a fake language, but that is pretty much it. However, it is possible to programmatically create realistic fake names for use in games or for any world-building need.
There are several feasible methods. The easiest works roughly like this:
Create a database of names of the same kind you want to generate (i.e. Roman names, Space Lizards Names, etc.).
Divide the input names in k-grams (i.e. 3-grams of Mark -> $ma – mar – ark – rk$).
Associate a probability to the k-grams: The more frequently they appear in the original database, the higher the chance they appear in the generated name.
Generate the new names!
There are many variants. For instance, you could combine a different number of k-grams for special purposes (i.e. all names start with a 2-gram but end in a 4-gram).
You could also improve the soundness of the generated names simply by looking at the probabilities of the sequences appearing in a certain order. For example, if you randomly start with "ar," the following group might more likely be "th" than "or."
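As a rough illustration, here is a minimal sketch that learns character k-gram frequencies from a handful of sample names and then generates new ones by choosing each character according to the characters that precede it. The training list is only a placeholder; a real generator would use a much larger database:

import random

def build_model(names, k=3):
    # Map each (k-1)-character prefix to the characters that follow it in the
    # training names, repeated as often as they occur, so sampling is weighted.
    model = {}
    for name in names:
        padded = "$" * (k - 1) + name.lower() + "$"
        for i in range(len(padded) - k + 1):
            prefix, nxt = padded[i:i + k - 1], padded[i + k - 1]
            model.setdefault(prefix, []).append(nxt)
    return model

def generate(model, k=3, max_len=12):
    prefix, out = "$" * (k - 1), ""
    while len(out) < max_len:
        nxt = random.choice(model[prefix])
        if nxt == "$":  # reached an end-of-name boundary
            break
        out += nxt
        prefix = prefix[1:] + nxt
    return out.capitalize()

roman_names = ["marcus", "lucius", "gaius", "titus", "quintus", "aulus", "decimus"]
model = build_model(roman_names)
print([generate(model) for _ in range(5)])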
This method is not perfect, but it generally works well enough. You can see a few simple examples in langgen or VNameGenerator, which show variations of the mentioned method and a few others.
Classifying Documents
In this section, we include techniques and libraries that measure and identify documents. For example, they can detect the language in which a document is written or measure how difficult it is to read it.
Text Metrics
There are two popular metrics of a text that can be easily implemented: reading time and difficulty of the text. These measurements are useful to inform the reader or to help the writer check that the document respects certain criteria, such as being accessible to a young audience (i.e. low difficulty).
Reading Time
The simplest way of measuring the reading time of a text is to count the words in the document and then divide them by a predefined words-per-minute (wpm) number. The words-per-minute figure represents the words read by an average reader in a minute. So, if a text has 1,000 words and the wpm is set to 200, the text would take five minutes to read.
That is easy to understand and easy to implement. The catch is that you have to pick the correct wpm rate and that the rate varies according to each language. For example, English readers might read 230 wpm, but French readers might instead read 200 wpm. This is related to the length of the words and the natural flow of the language (i.e. a language could be more concise than another; for instance, it might frequently omit subjects).
The first issue is easily solved. For English, most estimates put the correct wpm between 200 and 230. However, there is still the problem of dealing with different languages. This requires having the correct data for each language and being able to understand the language in which a text is written.
To mitigate both problems, you might opt to use a character count to estimate the reading time. Basically, you remove the punctuation and spaces, then count the characters and divide the sum by 900-1,000 (characters read per minute).
On linguistic grounds, the measure makes less sense since people do not read single characters. However, the difference between languages is less evident by counting characters. For example, an agglutinative language might have very long words and thus fewer of them. So, it ends up with a similar number of characters to a fusional language like English.
This works better because the differences between languages in the speed of reading characters are smaller as a percentage of the total speed. Imagine, for example, that the typical reader of English can read 200 wpm and 950 cpm, while the typical reader of French can read 250 wpm and 1,000 cpm. The absolute difference is the same, but it is less relevant for reading characters. Of course, this is still less than ideal, but it is a simple solution.
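Both estimates fit in a few lines. A minimal sketch, with wpm and cpm defaults taken from the ranges mentioned above:

import re

def reading_time_minutes(text, wpm=220, cpm=950):
    # Word-based estimate: split on whitespace.
    words = len(text.split())
    # Character-based estimate: drop whitespace and punctuation first.
    chars = len(re.sub(r"[\s\W]", "", text))
    return words / wpm, chars / cpm

sample = "The quick brown fox jumps over the lazy dog. " * 200
by_words, by_chars = reading_time_minutes(sample)
print(f"{by_words:.1f} min (by words), {by_chars:.1f} min (by characters)")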
Neither of the measures considers the difficulty of the text, yet texts that are difficult to read take more time to read, even with the same number of words or characters.
Calculating the Readability of a Text
Usually, the calculation of the readability of a text is linked to grades of education (i.e. years of schooling). So, an easy text might be one that can be read by fourth-graders, while a harder one might need a tenth-grade education. That is both a byproduct of the fact that the algorithms were created for educational purposes and the fact that education is a useful anchor for ranking difficulty. Saying that a text is difficult in absolute terms is somewhat meaningless; saying that it is difficult for seventh-graders makes more sense.
There are several formulas, but they are generally all based on the number of words and sentences in addition to either syllables or the number of difficult words. Let’s see two of the most famous: Flesch-Kincaid and Dale-Chall.
Neither of these formulas is perfect, but both have been scientifically tested. The only caveat is that they should be used only for checking, and not as a guideline. They work if you write normally. If you try to edit a text just to lower the score, the results might be misleading: for example, if you use short words only to make a text seem easy, the score will improve, but the text will read worse.
Flesch-Kincaid Readability Formula
There are two variants of this formula: Flesch reading ease and Flesch-Kincaid grade level. They are equivalent, but one outputs a score (the higher it is, the easier the text is) and the other a corresponding US grade level. We are going to show the first one.
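For reference, the Flesch reading ease score combines the average sentence length with the average number of syllables per word. A minimal sketch of the standard formula (you still have to supply the syllable count, which is the hard part discussed below):

def flesch_reading_ease(total_words, total_sentences, total_syllables):
    # Flesch reading ease: the higher the score, the easier the text.
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# e.g. 1,000 words, 60 sentences, 1,500 syllables
print(round(flesch_reading_ease(1000, 60, 1500), 1))  # 63.0, roughly plain English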
The readability score is generally between 100 and 20. A result of 100 indicates a document that can be easily understood by an 11-year-old student, while a result of 30 or less indicates a document suited for university graduates.
The different parameters can be obtained easily. Only calculating the total number of syllables requires a significant amount of work. The good news is that it is actually doable and there is a reliable algorithm for it. The bad news is that the reliable algorithm is the hyphenation algorithm Frank Liang created for TeX, and he wrote his whole PhD thesis about it. You can find an implementation and an accessible explanation of the algorithm in Hyphenopoly.js. The two problems are equivalent since you can only divide a word into two parts between two syllables.
An alternative is to use a hack: instead of calculating syllables, count the vowels. This hack has been reported to work for English, but it is not applicable to other languages. Also, if you use it, you lose the scientific validity of the formula; you just get a somewhat accurate number.
The general structure of the formula has been applied to other languages (i.e. Flesch-Vacca for Italian), but each language has different coefficients.
Dale-Chall Readability Formula
This formula relies also on the number of words and sentences, but instead of syllables, it uses the number of difficult words present in the text.
A difficult word is defined as one that does not belong to a list of 3,000 simple words that 80% of fourth graders understand.
Thus, the formula is easy to use and calculate. The only inconvenience is that you have to maintain a database of these 3,000 words. We are not aware of the formula having been adapted to languages other than English.
The formula generally outputs a score between 4 and 10. A score of less than 5 indicates a text suitable for fourth-graders, while a result of 9 or more indicates a text for college students. You can find a complete table of results at The New Dale-Chall Readability Formula.
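For reference, the New Dale-Chall formula combines the percentage of difficult words with the average sentence length. A minimal sketch with its standard coefficients (counting the difficult words, i.e. the ones outside the easy-word list, is up to you):

def dale_chall(num_words, num_sentences, num_difficult_words):
    pct_difficult = 100.0 * num_difficult_words / num_words
    score = 0.1579 * pct_difficult + 0.0496 * (num_words / num_sentences)
    # The adjustment applies when more than 5% of the words are difficult.
    if pct_difficult > 5:
        score += 3.6365
    return score

# e.g. 1,000 words, 60 sentences, 80 words outside the easy-word list
print(round(dale_chall(1000, 60, 80), 1))  # 5.7, roughly fifth to sixth grade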
It is natural to think that you could modify the formula, to calculate the difficulty in understanding a specialized text. That is to say, you could define difficult words as words belonging to technical terminology. For example, you could calculate how difficult it would be to understand a text for an average person, according to how much computer jargon it contains. Words like parser or compiler could be difficult, while computer or mouse could be considered easy. This might work; however, you would have to calculate the correct coefficients yourself.
Understanding Documents
This section contains more advanced libraries to understand documents. We use the term somewhat loosely; we talk about how a computer can extract or manage the content of a document beyond simple manipulation of words and characters.
We are going to see how you can:
Generate a summary of a document (i.e. an algorithmic answer to the question, What is this article about?)
Perform sentiment analysis (does this document contain a positive or negative opinion?)
Parse a document written in a natural language
Translate a document into another language
For the methods listed in the sections above, you could build a library yourself with reasonable effort. From now on, it will get harder. That is because they might require a vast amount of annotated data (i.e. a vocabulary having each word with the corresponding part of speech) or rely on complex machine learning methods. So, we will mostly suggest using libraries.
This is an area with many open problems and active research, so most libraries are in Python, a language widely adopted by the research community, though you can find the occasional research-ready library in other languages.
Statistics and machine learning are the current kings of natural language processing. So, there is probably somebody trying to use TensorFlow to accomplish each of these tasks (i.e. deep news summarization). You might try that, too, if you take into account a considerable amount of time for research.
Generation of Summaries
The creation of a summary, or a headline, to correctly represent the meaning of a document is achievable with several methods. Some of them rely on information retrieval techniques, while others are more advanced. The theory is also divided into two strategies: extracting sentences or parts thereof from the original text, and generating abstract summaries.
The second strategy is still an open area of research, so we will concentrate on the first one.
SumBasic
SumBasic is a method that relies on the probability of individual words being present in a sentence to determine the most representative sentence:
First, you count the number of times each word appears in the whole document. With that, you calculate the probability of each word appearing in the document. For instance, if a word appears 5 times and the document has 525 words, its probability is 5/525.
You calculate a weight for each sentence that is the average of the probabilities of all the words in the sentence. For example, if a sentence contains three words with probability 3/525, 5/525, and 10/525, the weight would be 6/525.
Finally, you score the sentences by multiplying the probability of the highest-probability word of each sentence with its weight. For example, a sentence with a weight of 0.1 and whose best word has a probability of 0.5 would score 0.1 * 0.5 = 0.05, while another with weight 0.2 and a word with probability 0.4 would score 0.2 * 0.4 = 0.08.
Having found the best sentence, you recalculate the probabilities for each word in the chosen sentence. You recalculate the probabilities as if the chosen sentence was removed from the document. The idea is that the included sentence already contains a part of the whole meaning of the document. So, that part becomes less important — and this helps to avoid excessive repetition. You repeat the process until you reach the needed summary length.
This technique is quite simple. It does not require a database of documents to build a general probability of a word appearing in any document; you just need to calculate the probabilities in each input document. However, for this to work, you have to exclude what are called stopwords. These are common words present in most documents, such as the or is. Otherwise, you might include meaningless sentences that contain lots of them. You could also include stemming to improve the results.
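A rough sketch of the procedure follows. The tokenization and stopword list are deliberately naive, and the update step squares the probabilities of the words already covered (the update used by the original SumBasic paper), which has the same down-weighting effect as the recalculation described above:

import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it", "that"}

def sumbasic(text, summary_sentences=2):
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    tokenize = lambda s: [w for w in re.findall(r"[a-z]+", s.lower())
                          if w not in STOPWORDS]

    counts = Counter(w for s in sentences for w in tokenize(s))
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    summary = []
    for _ in range(summary_sentences):
        def score(s):
            # Weight = average word probability; score = weight * best word probability.
            words = tokenize(s)
            weight = sum(prob[w] for w in words) / len(words)
            return weight * max(prob[w] for w in words)
        candidates = [s for s in sentences if s not in summary and tokenize(s)]
        if not candidates:
            break
        best = max(candidates, key=score)
        summary.append(best)
        # Down-weight the words already covered by the chosen sentence.
        for w in set(tokenize(best)):
            prob[w] = prob[w] ** 2
    return summary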
The approach based on frequencies is an old and popular one because it is generally effective and simple to implement. SumBasic is good enough that it is frequently used as a baseline in the literature. However, there are even simpler methods. For example, Open Text Summarizer is a 2003 library that uses an even simpler approach: basically, you count the frequency of each word, exclude the common English words (e.g., the, is), and finally calculate the score of a sentence according to the frequencies of the words it contains.
Graph-Based Methods: TextRank
There are more complex methods of calculating the relevance of the individual sentences. A couple of them take inspiration from PageRank — they are called LexRank and TextRank. They both rely on the relationship between different sentences to obtain a more sophisticated measurement of the importance of sentences, but they differ in the way they calculate the similarity of sentences.
PageRank measures the importance of a document according to the importance of other documents that link to it. The importance of each document, and thus each link, is computed recursively until a balance is reached.
TextRank works on the same principle: The relationship between elements can be used to understand the importance of each individual element. TextRank actually uses a more complex formula than the original PageRank algorithm because a link can be only present or not, while textual connections might be partially present. For instance, you might calculate that two sentences containing different words with the same stem (i.e. cat and cats both have “cat” as their stem) are only partially related.
The original paper describes a generic approach, rather than a specific method. In fact, it also describes two applications: keyword extraction and summarization. The key differences are:
The units you choose as a foundation of the relationship.
The way you calculate the connection and its strength.
For instance, you might choose as units n-grams of words or whole phrases. N-grams of words are sequences of n words, computed the same way you compute k-grams of characters. So, for the phrase dogs are better than cats, there are these 3-grams:
dogs are better
are better than
better than cats
Phrases might create weighted links according to how similar they are. Or they might simply create links according to their position (i.e. a phrase might link to the previous and following one). The method works the same.
TextRank for Sentence Extraction
TextRank for extracting phrases uses whole sentences as its unit and the number of words they have in common as its similarity measure. So, if two phrases contain the words tornado, data, and center, they are more similar than if they contain only two common words. The similarity is normalized based on the length of the phrases, to avoid longer phrases having higher similarity than shorter ones simply because they contain more words.
The words used for the similarity measure could be stemmed. Stopwords are usually excluded from the calculation. A further improvement could be to also exclude verbs, although that might be complicated if you do not already have a way to identify the parts of speech.
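As an illustration, here is a minimal sketch of that similarity measure: the word overlap normalized by the log lengths of the two sentences, as in the original TextRank paper. Stemming is omitted, and sentences are assumed to have more than one distinct word:

import math

def textrank_similarity(sentence_a, sentence_b, stopwords=frozenset()):
    # Shared words divided by the sum of the log lengths of the two sentences.
    words_a = {w for w in sentence_a.lower().split() if w not in stopwords}
    words_b = {w for w in sentence_b.lower().split() if w not in stopwords}
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))

print(textrank_similarity(
    "the tornado hit the data center",
    "the data center survived the tornado",
    stopwords={"the"},
))  # ~1.08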
LexRank differs mainly because, as a similarity measure, it uses standard TF-IDF (Term Frequency-Inverse Document Frequency). Basically, with TF-IDF, the value of individual words is weighted according to how frequently they appear in all documents and in each specific document. For example, if you are summarizing articles for a car magazine, there will be a lot of occurrences of the word car in every document, so the word car will be of little relevance for each document. However, the word explosion will appear in few documents (hopefully), so it will matter more in each document in which it appears.
The paper TextRank: Bringing Order into Texts (PDF) describes the approach. ExplainToMe contains a Python implementation of TextRank.
Latent Semantic Analysis
The methods we have seen so far have a weakness: they do not take into account semantics. This weakness is evident when you consider that there are words that have similar meanings (i.e. synonyms) and that most words can have different meanings depending on the context (i.e. polysemy). Latent semantic analysis attempts to overcome these issues.
The expression latent semantic analysis describes a technique more than a specific method: a technique that can be used whenever you need to represent the meaning of words. It can be used for summarization, but also for search purposes, to find words similar to the user's query. For instance, if the user searches for happiness, a search library using LSA could also return results for joy.
A Simple Description
The specific mathematical formulas are a bit complex and involve matrices and operations on them. However, the founding idea is quite simple: words with similar meanings will appear in similar parts of a text. So you start with a normal TF-IDF matrix. Such a matrix contains nothing more than the frequencies of individual words, both inside a specific document and across all the documents evaluated.
The problem is that we want to find a relation between words that do not necessarily appear together. For example, imagine that different documents contain phrases with the words joy and happiness together with other words, such as cookie or chocolate. The two words do not appear in the same sentence, but they appear in the same document. One document contains a certain number of such phrases, like a dog creates happiness and dogs bring joy to children. In this document, LSA should be able to find a connection between joy and happiness through their mutual connection with dog.
The connection is built based on the frequency with which the words appear together, or with related words, in the whole set of documents. This allows building connections even from sentences or documents where they do not appear together. So, if joy and happiness frequently appear with dog, LSA will associate the specific document with the words joy, happiness, and dog.
Basically, this technique will transform the original matrix from one linking each term with its frequency, into one with a (weighted) combination of terms linked to each document.
The issue is that there are a lot of words, and combinations thereof, so you need to make a lot of calculations and simplifications. And that is where complex math is needed.
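If you are willing to lean on a library, the heavy lifting amounts to a truncated singular value decomposition of the TF-IDF matrix. A minimal sketch with scikit-learn (assuming it is installed; the toy documents echo the example above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "a dog creates happiness and dogs bring joy to children",
    "chocolate cookies bring joy",
    "my dog eats chocolate cookies",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
# Reduce the term-document space to a small number of latent dimensions.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(lsa)  # one row per document, one column per latent dimension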
Once you have this matrix, the world is your oyster. That is to say, you could use this measurement of meaning in any number of ways. For instance, you could find the most relevant phrase and then find the phrases which are closest to it, using a graph-based method.
Text summarization and singular value decomposition describe one way to find the best sentences. The Python library sumy offers an implementation.
Parsing Documents
Most computer languages are easy to parse. This is not true for natural languages. There are approaches that give good results, but ultimately, this is still an open area of research. Fundamentally, the issue is that the parsing of a sentence (i.e. analyzing its syntax) and its meaning are interconnected in a natural language. A subject, a verb, a noun, or an adverb are all words and most words that can be subjects can also be objects.
In practical terms, this means that there are no ready-to-use libraries that are good for every use you can think of. We present some libraries that can be used for restricted tasks, such as recognizing parts of speech, which can also be useful for improving other methods, like the ones for the creation of summaries.
There is also the frustrating fact that a lot of software is made by academic researchers, which means that it could be easily abandoned for another approach or lack documentation. You cannot really use a work-in-progress, badly maintained software for anything productive. Especially if you care about a language other than English, you might find yourself seeing a good working demo, which was written ten years ago, by somebody with no contact information and without any open-source code available.
You Need Data
To achieve any kind of result with parsing, or generally extracting information from a natural language document, you need a lot of data to train the algorithms. This group of data is called a corpus. For use in a system that uses statistical or machine learning techniques, you might just need a lot of real-world data, possibly divided into the proper groups (i.e. Wikipedia articles divided by category).
However, if you are using a smart system, you might need this corpus of data to be manually constructed or annotated (i.e. the word dog is a noun that has these X possible meanings). A smart system is one that tries to imitate human understanding, or at least that uses a process that can be followed by humans. For instance, a parser that relies on a grammar that uses rules such as phrase > subject verb (a phrase is made of a subject and a verb), but also defines several classes of verbs that humans would not normally use (i.e. verbs related to motion).
In these cases, the corpus often uses a custom format and is built for specific needs. For example, this system that can answer geographical questions about the United States uses information stored in a Prolog format. The natural consequence is that even what is generally available information, such as dictionary data, can be incompatible between different programs.
On the other hand, there are also good databases that are so valuable that many programs are built around them. WordNet is an example of such a database. It is a lexical database that links groups of words with similar meanings (i.e. synonyms) with their associated definition. It works thus as both a dictionary and a thesaurus. The original version is for English, but it has inspired similar databases for other languages.
What You Can Do
We have presented some of the practical challenges of building your own library to understand text. And we have not even mentioned all the issues related to the ambiguity of human languages. So, differently from what we did in past sections, we are just going to explain what you can do. We are not going to explain the algorithms used to realize them, both because there is no space and because without the necessary data they would be worthless. Instead, in the next paragraphs, we are going to introduce the most used libraries that you can use to achieve what you need.
Named-Entity Recognition
Named-entity recognition basically means finding the entities mentioned in the document. For example, in the phrase “John Smith is going to Italy,” it should identify John Smith and Italy as entities. It should also be able to correctly keep track of them in different documents.
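With an off-the-shelf library, this takes a couple of lines. A minimal sketch with spaCy, assuming the library and its small English model are installed:

import spacy

nlp = spacy.load("en_core_web_sm")  # the small English model, installed separately
doc = nlp("John Smith is going to Italy")
print([(ent.text, ent.label_) for ent in doc.ents])
# Typically: [('John Smith', 'PERSON'), ('Italy', 'GPE')]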
Sentiment Analysis
Sentiment analysis classifies the sentiment represented by a phrase. In the most basic terms, it means understanding whether a phrase indicates a positive or negative statement. A naive Bayes classifier can suffice for this level of understanding. It works similarly to a spam filter: it divides the messages into two categories (i.e. spam and non-spam) relying on the probability of each word being present in either of the two categories.
An alternative is to manually associate an emotional ranking to a word. For example, a value between -10/-5 and 0 for catastrophic and one between 0 and 5/10 for nice.
If you need a subtler evaluation you need to resort to machine learning techniques.
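For the basic level of understanding, a naive Bayes classifier really is a few lines with scikit-learn. A minimal sketch (the tiny training set is made up purely for illustration; a real classifier needs far more labeled examples):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["I love this movie", "what a great product",
         "terrible and boring", "I hate it"]
labels = ["positive", "positive", "negative", "negative"]

# Word counts feed a multinomial naive Bayes, much like a simple spam filter.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["this is great", "boring movie"]))  # ['positive' 'negative']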
Parts of Speech Tagging
Parts of Speech Tagging (usually abbreviated as POS-tagging) indicates the identification and labeling of the different parts of speech (i.e. what is a noun, verb, adjective, etc.). While it is an integral part of parsing, it can also be used to simplify other tasks. For instance, it can be used in the creation of summaries to simplify the sentences chosen for the summary (i.e. removing subordinate clauses).
Lemmatizer
A lemmatizer returns the lemma for a given word and a part of a speech tag. Basically, it gives the corresponding dictionary form of a word. In some ways, it can be considered an advanced form of a stemmer. It can also be used for similar purposes; namely, it can ensure that all different forms of a word are correctly linked to the same concepts.
For instance, it can transform all instances of cats into cat for search purposes. However, it can also distinguish between the cases of run as the verb to run and run as the noun that is a synonym of jog.
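A minimal sketch with NLTK's WordNet-based lemmatizer, assuming the WordNet data has been downloaded:

from nltk.stem import WordNetLemmatizer
# Requires the WordNet data: import nltk; nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))              # cat
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("run", pos="v"))      # run (the verb)
print(lemmatizer.lemmatize("run", pos="n"))      # run (the noun, as in a jog)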
Chunking
Parts of speech tagging can be considered the equivalent of lexing for natural languages. Chunking, also known as shallow parsing, is a step above parts of speech tagging, but one below the final parsing. It connects parts of speech into higher units of meaning, for example, complements. Imagine the phrase John always wins our matches of Russian roulette:
A POS-tagger identifies that Russian is an adjective and roulette a noun
A chunker groups together (of) Russian roulette as a complement or two related parts of speech
The chunker might work to produce units that are going to be used by a parser. It can also work independently, for example, to help in named-entity recognition.
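A minimal sketch of shallow parsing with NLTK's regular-expression chunker, using a simple noun-phrase pattern of our own (the tokenizer and tagger data must be downloaded first):

import nltk
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "John always wins our matches of Russian roulette"
tags = nltk.pos_tag(nltk.word_tokenize(sentence))

# Group an optional determiner/possessive and any adjectives with one or more nouns.
chunker = nltk.RegexpParser(r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}")
print(chunker.parse(tags))  # a shallow tree with NP chunks such as (Russian roulette)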
Parsing
The end result is the same as for computer languages: a parse tree. However, the process is quite different — it might start with probabilistic grammar or even with no grammar at all. It also usually continues with a lot of probabilities and statistical methods.
The following is a parse tree created by the Stanford Parser (we are going to see it later) for the phrase My dog likes hunting cats and people. Groups of letters such as NP indicate parts of speech or complements.
(ROOT
(S
(NP (PRP$ My) (NN dog))
(VP (VBZ likes)
(NP (NN hunting) (NNS cats)
(CC and)
(NNS people)))))
Translation
The current best methods for automatic machine translation rely on machine learning. The good news is that this means you just need a great number of documents in the languages you care about, without any annotation. Typical sources of such texts are Wikipedia and the official documentation of the European Union (which requires documents to be translated into all the official languages of the Union).
As anybody that has tried Google Translate or Bing Translator can attest, the results are generally good enough for understanding, but still often a bit off. They cannot substitute a human translator.
Other Uses of NLP
You can apply the same techniques employed to create summaries to different tasks. That is particularly true for the more advanced and semantic-based ones. Note that the creation of a single summary for many documents is also a different task, because you have to take into account different document lengths and avoid repetitions, among other things.
A natural application is the identification of similar documents. If you can devise a method to identify the most meaningful sentences of one document, you can also compare the meaning of two documents.
Another objective with common techniques is information retrieval. In short, if a user searches for one word, say car, you could use some of these techniques to also find documents containing automobile.
Finally, there is topic modeling, which consists of finding the topics of a collection of documents. In simple terms, it means grouping together words with similar themes. It uses more complex statistical methods than the ones used for the creation of summaries. The current state of the art is based on a method called latent Dirichlet allocation.
Gensim is a very popular and production-ready library that has many such applications. It is written in Python.
Mallet is a Java library mainly designed for topic modeling.
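As a quick taste, topic modeling with Gensim's LDA implementation boils down to building a dictionary, turning the documents into bags of words, and fitting the model. A minimal sketch with toy, pre-tokenized documents:

from gensim import corpora, models

docs = [
    ["dog", "cat", "pet", "vet"],
    ["car", "engine", "wheel", "driver"],
    ["cat", "pet", "food"],
    ["driver", "car", "road"],
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
print(lda.print_topics())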
Other Methods and Libraries
The creation of summaries is a fertile area of research with many valid methods already devised, many more than the ones we have described here. They vary in their approach and in the objective they are designed for. For example, some are created specifically to provide an answer to a question of the user, others to summarize multiple documents, etc.
You can read a brief taxonomy of other methods in Automatic Text Summarization (PDF). The Python library sumy, which we have already mentioned, implements several methods, though not necessarily the ones mentioned in the paper.
Classifier4J (Java), NClassifier (C#), and Summarize (Python) implement a Bayes classifier in an algorithm described as such:
In order to summarize a document, this algorithm first determines the frequencies of the words in the document. It then splits the document into a series of sentences. Then, it creates a summary by including the first sentence that includes each of the most frequent words. Finally, summary’s sentences are reordered to reflect that of those in the original document. — Summarize.py
These projects that implement a Bayes classifier are all dead, but they are useful to understand how the method could be implemented.
DataTeaser and PyTeaser (both in Python, but originally DataTeaser was in Scala) use a custom approach that combines several simple measurements to create a summary of an article.
The Best Libraries Available
The following libraries can be used for multiple purposes, so we are going to divide this section by the title of the libraries. Most of them are in Python or Java.
Apache OpenNLP
The Apache OpenNLP library is a machine learning-based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron-based machine learning.
Apache OpenNLP is a Java library with excellent documentation that can fulfill most of the tasks we have just discussed, except for sentiment analysis and translation. The developers provide language models for a few languages in addition to English; the most notable are German, Spanish, and Portuguese.
The Classical Language Toolkit
The Classical Language Toolkit (CLTK) offers natural language processing (NLP) support for the languages of Ancient, Classical, and Medieval Eurasia. Greek and Latin functionality are currently most complete.
As the name implies the major feature of the Classical Language Toolkit is the support for classical (ancient) languages, such as Greek and Latin. It has basic NLP tools, such as a lemmatizer, but also indispensable tools to work with ancient languages, such as transliteration support, and peculiar things like Clausulae Analysis. It has good documentation and it is your only choice for ancient languages.
FreeLing
FreeLing is a C++ library providing language analysis functionalities (morphological analysis, named entity detection, PoS-tagging, parsing, Word Sense Disambiguation, Semantic Role Labelling, etc.) for a variety of languages (English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, among others).
It is a library with good documentation and even a demo. It supports many languages usually excluded by other tools, but it is released under the Affero GPL, which is probably the least user-friendly license ever conceived.
Moses
Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (parallel corpus). Once you have a trained model, an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices.
The only thing to add is that the system is written in C++ and there is ample documentation.
NLTK
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
Natural Language Toolkit (NLTK) is probably the most-known NLP library for Python. The library can accomplish many tasks in different ways (i.e. using different algorithms). It even has good documentation (if you include the freely available book).
Simply put, it is the standard library for NLP research. Though one issue that some people have is exactly that: it is designed for research and educational purposes. If there are ten ways to do something, NLTK will allow you to choose among them all. The intended user is a person with a deep understanding of NLP.
TextBlob is a library that builds upon NLTK (and Pattern) to simplify the processing of textual data. The library also provides translation, but it does not implement it directly: it is simply an interface for Google Translate.
Pattern
Pattern is the most peculiar software in our collection because it is a collection of Python libraries for web mining. It has support for data mining from services such as Google and Twitter (i.e., it provides functions to directly search from Google/Twitter), an HTML parser and many other things. Among these things, there is natural language processing for English and a few other languages, including German, Spanish, French, and Italian — though English support is more advanced than the rest.
The pattern.en module contains a fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment analysis, tools for English verb conjugation and noun singularization and pluralization, and a WordNet interface.
The modules for the other languages only support POS-tagging.
Polyglot
Polyglot is a set of NLP libraries for many natural languages in Python. It looks great, although it has little documentation.
It supports fewer languages for the more advanced tasks, such as POS tagging (16) or named entity recognition (40). However, sentiment analysis and language identification can work with more than a hundred of them.
Sentiment and Sentiment
Sentiment is a JavaScript (Node.js) library for sentiment analysis. The library relies on AFINN (a collection of English words with an associated emotional value) and a similar database for emoji. These databases associate a positive or negative value with each word or emoji to indicate a positive or negative sentiment. For example, the word joy has a score of 3, while sad has -2.
The code for the library itself is quite trivial, but it works, and it is easy to use.
var sentiment = require('sentiment');
var r1 = sentiment('Cats are stupid.');
console.dir(r1); // Score: -2, Comparative: -0.666
var r2 = sentiment('Cats are totally amazing!');
console.dir(r2); // Score: 4, Comparative: 1
The other Sentiment library takes a different approach, based on a naive Bayes classifier. Its authors describe it as follows:
We have explored different methods of improving the accuracy of a Naive Bayes classifier for sentiment analysis. We observed that a combination of methods like negation handling, word n-grams, and feature selection by mutual information results in a significant improvement in accuracy.
This means that it can be a good starting point to understanding how to build your own sentiment analysis library.
spaCy: Industrial-Strength Natural Language Processing in Python
The library spaCy claims to be a much more efficient, ready-for-the-real-world, and easy-to-use library than NLTK. In practical terms, it has two advantages over NLTK:
Better performance.
It does not give you the chance of choosing, among the many algorithms, the one you think is best. Instead, it chooses the best one for each task. While fewer choices might seem bad, it can actually be a good thing, at least if you have no idea what the algorithms do and you would have to learn about them before making a decision.
In practical terms, it is a library that supports most of the basic tasks we mentioned (i.e. things like named entity recognition and POS-tagging, but not translation or parsing) with great, code-first documentation.
Textacy is a library built on top of spaCy for higher-level NLP tasks. Basically, it simplifies some things, including features for cleaning data or managing it better.
The Stanford Natural Language Processing Group Software
The Stanford NLP Group makes some of our natural language processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs. These packages are widely used in industry, academia, and government.
The Stanford NLP group creates and supports many great tools that cover all the purposes we have just mentioned. The only thing missing is sentiment analysis. The most notable software are CoreNLP and the Stanford Parser. The parser can be seen in action in a web demo. CoreNLP is a combination of several tools, including the parser.
The tools are all in Java. The parser supports a few languages: English, Chinese, Arabic, Spanish, etc. The only downside is that the tools are licensed under the GPL. Commercial licensing is available for proprietary software.
Excluded Software
We think that the libraries we chose are the best ones for parsing, or processing, natural languages. However, we excluded some other interesting software that is usually mentioned, like CogCompNLP or GATE, for several reasons:
There might be little to no documentation.
It might have a purely educational or any non-standard license.
It might not be designed for developers, but for end-users.
Summary
We have seen many ways to deal with a document in a natural language to get the information you need from it. Most try to find smart ways to bypass the complex task of parsing natural language.
Despite natural languages being hard to parse, it is still possible to do so if you use the available libraries. When dealing with natural languages, hacking a solution is the suggested way of doing things, since nobody has figured out how to do it properly.