Using "Natural": A NLP Module for node.js
Whether it's for Twitter sentiment analysis or for solving search problems, natural language processing (NLP) has become the fulcrum of much of my hobby work in recent years. Initially I found myself relying on the Natural Language Toolkit (NLTK), a rich library of NLP algorithms for Python. The NLTK is simply fantastic. It's a true one-stop NLP shop that's widely adopted, well documented, and open source. Certainly I had to learn what the algorithms did and how they fit together, but for the most part the hard work was done for me. It was a very productive situation, to be sure!
Last year, however, brought a new platform to my hobby work: node.js. Node and its community were young, but maturing rapidly. When the need for natural language facilities arose, I found the pickings pretty slim. I have to be honest: that's *exactly* what I was hoping for; an opportunity to sink my teeth into the algorithms themselves and contribute them back to a young, but growing, community.
Thus I began work on "natural", a module of base natural language processing algorithms for node.js. The idea was loosely based on the Python NLTK in that all algorithms are in the same package. Initially I didn't think "natural" could be as complete as the NLTK, but as my own understanding as well as community contributions picked up I've become much more hopeful. Also, merging with Rob Ellis's node-nltools back in August of 2011 strengthened "natural" further by rapidly bringing new algorithms and features into the fold.
As of version 0.1.5 Rob, other contributors, and I have managed to get the following feature list together:
- Stemming
  - Porter
  - Lancaster
- Phonetic
  - SoundEx
  - Metaphone
  - Double Metaphone
- Classification
  - Naive Bayes
  - Logistic Regression
- String Distance
  - Levenshtein (thanks Sid Nallu)
  - Jaro-Winkler (thanks Adam Phillabaum)
  - Dice's Coefficient (thanks John Crepezzi)
- Tokenization
  - Treebank
  - Word
  - Word-Punctuation
- Inflection
  - Numeric
  - Nouns Singular/Pluralization
  - Present-tense verb Singular/Pluralization
- tf*idf
- n-grams
- WordNet
I won't cover every single module and feature in this article, but will instead outline the most commonly used and most mature.
Installing
Like most node modules "natural" is packaged as an NPM and can be installed from the command line as such:
npm install natural
If you want to install from source (or contribute, for that matter) it can be found here on GitHub.
Stemming
The first class of algorithms I'd like to outline is stemming. Stemming is the process of reducing a word to a root (not necessarily the morphological root). In other words, the idea is to boil all conjugations, tenses and forms down to a single root word. That root may not end up looking exactly like the English root, but should be close enough for comparison.
Stemming is a typical step in preparing text for use by other algorithms or storage such as classification or even full-text indexing. Both the Lancaster and Porter algorithms are supported as of 0.1.5. Here's a basic example of stemming a word with a Porter Stemmer.
var natural = require('natural'),
    stemmer = natural.PorterStemmer;
var stem = stemmer.stem('stems');
console.log(stem);
stem = stemmer.stem('stemming');
console.log(stem);
stem = stemmer.stem('stemmed');
console.log(stem);
stem = stemmer.stem('stem');
console.log(stem);
Above I simply required-up the main "natural" module and grabbed the PorterStemmer sub-module from within. The stem function takes an arbitrary string and returns the stem. The above code returns the following output:
stem
stem
stem
stem
For convenience, stemmers can patch String with methods to simplify the process by calling the attach method. String objects will then have a stem method.
stemmer.attach();
stem = 'stemming'.stem();
console.log(stem);
It's very possible you'd be interested in stemming a string composed of many words, perhaps an entire document. The attach method provides a tokenizeAndStem method to accomplish this. It breaks the owning string up into an array of strings, one for each word, and stems them all. For example:
var stems = 'stems returned'.tokenizeAndStem();
console.log(stems);
produces the output:
[ 'stem', 'return' ]
Note that the tokenizeAndStem method will, by default, omit certain words considered irrelevant (stop words) from the return array. To instruct the stemmer not to omit stop words, pass a true in to tokenizeAndStem for the keepStops parameter. Consider:
console.log('i stemmed words.'.tokenizeAndStem());
console.log('i stemmed words.'.tokenizeAndStem(true));
outputting:
[ 'stem', 'word' ]
[ 'i', 'stem', 'word' ]
All of the code above would also work with a Lancaster stemmer by requiring the LancasterStemmer module instead, like:
var natural = require('natural'),
    stemmer = natural.LancasterStemmer;
Of course, the actual stems produced could differ depending on the algorithm chosen. The Lancaster stemmer tends to be a bit more aggressive, resulting in roots that look less like their English equivalents, but it will likely perform better.
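If you're curious how the two differ in practice, it's easy to run both stemmers side by side. Here's a quick sketch; the particular stems it prints depend on your installed version of "natural", so treat them as illustrative:
var natural = require('natural');
['maximum', 'presumably', 'provision'].forEach(function(word) {
    // stem the same word with both algorithms and compare the results
    console.log(word + ' -> porter: ' + natural.PorterStemmer.stem(word)
        + ', lancaster: ' + natural.LancasterStemmer.stem(word));
});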
Phonetics
Phonetic algorithms are also provided to determine what words sound like and compare them accordingly. The old (and I mean pre-electronic computers old... like 1918 old) SoundEx and the more modern Metaphone/Double Metaphone algorithms are supported as of 0.1.5.
The following example compares the string "phonetics" and the intentional misspelling "fonetix" and determines they sound alike according to the Metaphone module, but the same pattern could be applied to the DoubleMetaphone or SoundEx modules.
var natural = require('natural'),
    phonetic = natural.Metaphone;
var wordA = 'phonetics';
var wordB = 'fonetix';
if(phonetic.compare(wordA, wordB))
    console.log('they sound alike!');
The raw code the phonetic algorithm produces can be retrieved with the process method:
var phoneticCode = phonetic.process('phonetics');
console.log(phoneticCode);
resulting in:
FNTKS
Like the stemming implementations, the phonetic modules have an attach method that patches String with shortcut methods, most notably soundsLike for comparison:
phonetic.attach();
if(wordA.soundsLike(wordB))
    console.log('they sound alike!');
attach also patches in phonetics and tokenizeAndPhoneticize methods to retrieve the phonetic code for a single word and an entire corpus, respectively.
console.log('phonetics'.phonetics());
console.log('phonetics rock'.tokenizeAndPhoneticize());
which outputs:
FNTKS
[ 'FNTKS', 'RK' ]
The above code could also use SoundEx by substituting the following in for the require.
var natural = require('natural'),
    phonetic = natural.SoundEx;
Note that SoundEx and Metaphone may have trouble with non-English words, but Double Metaphone should have some degree of success with many other languages.
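The same compare pattern works with the DoubleMetaphone module. The algorithm itself computes two codes per word, a primary and an alternate, which is where its multilingual flexibility comes from, so expect process to hand back a pair of codes rather than a single one:
var natural = require('natural'),
    phonetic = natural.DoubleMetaphone;
if(phonetic.compare('phonetics', 'fonetix'))
    console.log('they sound alike!');
// prints both codes the algorithm computes for the word
console.log(phonetic.process('phonetics'));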
tf*idf
tf*idf weights can be used to judge how important a given word is to a given document in a broader corpus (collection of documents). There are two components to a tf*idf weight: the term frequency and the inverse document frequency. To guarantee that a frequently used, albeit semantically less important, word doesn't gain too much favor, you'll want to ensure you have many documents in your TfIdf instance.
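To build a little intuition before diving into the module, here's the classic textbook weighting hand-rolled in a few lines. Note this is purely illustrative; it's not necessarily the exact formula "natural" uses internally:
// tf*idf in miniature: term frequency times inverse document frequency
function tfidf(term, doc, corpus) {
    // tf: how often the term appears in this document
    var tf = doc.filter(function(w) { return w === term; }).length;
    // df: how many documents in the corpus contain the term
    var df = corpus.filter(function(d) { return d.indexOf(term) >= 0; }).length;
    // rarer terms across the corpus score higher; 1 + df avoids division by zero
    return tf * Math.log(corpus.length / (1 + df));
}
var corpus = [
    ['i', 'code', 'in', 'ruby'],
    ['i', 'code', 'in', 'node'],
    ['node', 'node', 'everywhere']
];
console.log(tfidf('ruby', corpus[0], corpus)); // positive: "ruby" is distinctive here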
Consider the following code which adds a few documents to a corpus and then determines how important the words "ruby" and "node" are to them.
var natural = require('natural'),
    TfIdf = natural.TfIdf,
    tfidf = new TfIdf();
tfidf.addDocument('i code in c.');
tfidf.addDocument('i code in ruby.');
tfidf.addDocument('i code in ruby and node, but node more often.');
tfidf.addDocument('this document is about natural, written in node');
tfidf.addDocument('i code in fortran.');
console.log('node --------------------------------');
tfidf.tfidfs('node', function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});
console.log('ruby --------------------------------');
tfidf.tfidfs('ruby', function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});
The previous code will output the tf*idf weights for "node" and "ruby". The higher the weight the more important the word is to the document.
node --------------------------------
document #0 is 0
document #1 is 0
document #2 is 3.347952867143343
document #3 is 1.6739764335716716
document #4 is 0
ruby --------------------------------
document #0 is 0
document #1 is 1.6739764335716716
document #2 is 1.6739764335716716
document #3 is 0
document #4 is 0
Additionally, you can measure a word against a single document.
console.log(tfidf.tfidf('node', 0 /* document index */));
console.log(tfidf.tfidf('node', 1));
You can also get a list of all terms in a document ordered by their importance.
tfidf.listTerms(4 /* document index */).forEach(function(item) {
    console.log(item.term + ': ' + item.tfidf);
});
yielding:
fortran: 1.7047480922384253
code: 1.6486586255873816
Inflection
Basic inflectors are in place to convert nouns between plural and singular forms and to turn integers into string counters (e.g. '1st', '2nd', '3rd', '4th', etc.).
The following example converts the word "radius" into its plural form "radii".
var natural = require('natural'),
    nounInflector = new natural.NounInflector();
var plural = nounInflector.pluralize('radius');
console.log(plural);
Singularization follows the same pattern, as is illustrated in the following example which converts the word "beers" to its singular form, "beer".
var singular = nounInflector.singularize('beers');
console.log(singular);
Just like the stemming and phonetic modules, an attach method is provided to patch String with shortcut methods.
nounInflector.attach();
console.log('radius'.pluralizeNoun());
console.log('beers'.singularizeNoun());
A NounInflector instance can do custom conversion if you provide expressions via the addPlural and addSingular methods. Because these conversions aren't always symmetric (sometimes more patterns may be required to singularize forms than to pluralize them), there needn't be a one-to-one relationship between addPlural and addSingular calls.
nounInflector.addPlural(/(code|ware)/i, '$1z');
nounInflector.addSingular(/(code|ware)z/i, '$1');
console.log('code'.pluralizeNoun());
console.log('ware'.pluralizeNoun());
console.log('codez'.singularizeNoun());
console.log('warez'.singularizeNoun());
which would result in:
codez
warez
code
ware
Here's an example of using the CountInflector module to produce string counters for integers.
var natural = require('natural'),
    countInflector = natural.CountInflector;
console.log(countInflector.nth(1));
console.log(countInflector.nth(2));
console.log(countInflector.nth(3));
console.log(countInflector.nth(4));
console.log(countInflector.nth(10));
console.log(countInflector.nth(11));
console.log(countInflector.nth(12));
console.log(countInflector.nth(13));
console.log(countInflector.nth(100));
console.log(countInflector.nth(101));
console.log(countInflector.nth(102));
console.log(countInflector.nth(103));
console.log(countInflector.nth(110));
console.log(countInflector.nth(111));
console.log(countInflector.nth(112));
console.log(countInflector.nth(113));
producing:
1st
2nd
3rd
4th
10th
11th
12th
13th
100th
101st
102nd
103rd
110th
111th
112th
113th
Classification
Classification is currently supported by the Naive Bayes and logistic regression algorithms, although natural's Naive Bayes implementation is the more mature of the two. You can use them for tasks like spam detection and sentiment analysis.
There are two fundamental steps involved in using a classifier: training and classification.
The following example takes care of the first step by requiring-up the classifier and training it with data. Naturally, this is only a sample. To do any production tasks you'd want many more training documents (hundreds per class depending on their size).
var natural = require('natural'),
    classifier = new natural.BayesClassifier();
classifier.addDocument("my unit-tests failed.", 'software');
classifier.addDocument("tried the program, but it was buggy.", 'software');
classifier.addDocument("the drive has a 2TB capacity.", 'hardware');
classifier.addDocument("i need a new power supply.", 'hardware');
classifier.train();
By default the classifier will tokenize the corpus and stem it with a PorterStemmer. You can use a LancasterStemmer by passing it in to the BayesClassifier constructor as such:
var natural = require('natural'),
    stemmer = natural.LancasterStemmer,
    classifier = new natural.BayesClassifier(stemmer);
With the classifier trained it can now classify documents via the classify method:
console.log(classifier.classify('did the tests pass?'));
console.log(classifier.classify('did you buy a new drive?'));
resulting in the output:
software
hardware
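If you want the scores behind the winning label rather than just the label itself, the getClassifications method returns each class with its value (at least in the versions of "natural" I've worked with; check yours):
classifier.getClassifications('did the tests pass?').forEach(function(classification) {
    // one line per class: its label and the score the classifier assigned
    console.log(classification.label + ': ' + classification.value);
});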
Similarly, the classifier can be trained on arrays rather than strings, bypassing tokenization and stemming. This allows the consumer to perform custom tokenization and stemming, if any at all. This is especially useful if the corpus is not English.
classifier.addDocument(['unit', 'test'], 'software');
classifier.addDocument(['bug', 'program'], 'software');
classifier.addDocument(['drive', 'capacity'], 'hardware');
classifier.addDocument(['power', 'supply'], 'hardware');
classifier.train();
It's possible to persist and recall the results of training via the save method:
var natural = require('natural'),
    classifier = new natural.BayesClassifier();
classifier.addDocument(['unit', 'test'], 'software');
classifier.addDocument(['bug', 'program'], 'software');
classifier.addDocument(['drive', 'capacity'], 'hardware');
classifier.addDocument(['power', 'supply'], 'hardware');
classifier.train();
classifier.save('classifier.json', function(err, classifier) {
    // the classifier is saved to the classifier.json file!
});
The training could then be recalled later with the load method:
var natural = require('natural'),
    classifier = new natural.BayesClassifier();
natural.BayesClassifier.load('classifier.json', null, function(err, classifier) {
    console.log(classifier.classify('did the tests pass?'));
});
Note that substituting LogisticRegressionClassifier for BayesClassifier should generally work as a drop-in replacement.
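For instance, the training sample from earlier needs nothing more than a different constructor to run through logistic regression:
var natural = require('natural'),
    classifier = new natural.LogisticRegressionClassifier();
classifier.addDocument("my unit-tests failed.", 'software');
classifier.addDocument("tried the program, but it was buggy.", 'software');
classifier.addDocument("the drive has a 2TB capacity.", 'hardware');
classifier.addDocument("i need a new power supply.", 'hardware');
classifier.train();
console.log(classifier.classify('did the tests pass?'));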
n-grams
n-grams are essentially the decomposition of a sentence into overlapping, contiguous lists of size n, and are useful for building probabilistic language models. In this case the n-grams are composed of words, but outside of "natural", or even natural language processing, they could be lists of any countable objects.
Consider the following examples, which illustrate the production of trigrams (n-grams of length 3), bigrams (n-grams of length 2), and arbitrary n-grams using the trigrams, bigrams and ngrams functions respectively.
var natural = require('natural'),
    NGrams = natural.NGrams;
console.log(NGrams.trigrams('some other words here'));
console.log(NGrams.trigrams(['some', 'other', 'words', 'here']));
both of which produce:
[ [ 'some', 'other', 'words' ], [ 'other', 'words', 'here' ] ]
console.log(NGrams.bigrams('some words here'));
console.log(NGrams.bigrams(['some', 'words', 'here']));
both of which produce:
[ [ 'some', 'words' ], [ 'words', 'here' ] ]
console.log(NGrams.ngrams('some other words here for you', 4));
which outputs:
[ [ 'some', 'other', 'words', 'here' ], [ 'other', 'words', 'here', 'for' ], [ 'words', 'here', 'for', 'you' ] ]
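If you're curious about the mechanics, n-gram production is nothing more than a window of size n sliding across the token array one step at a time. A toy version for illustration (use natural's NGrams module for real work):
function ngrams(tokens, n) {
    var result = [];
    // slide a window of n tokens across the array, one step at a time
    for(var i = 0; i <= tokens.length - n; i++)
        result.push(tokens.slice(i, i + n));
    return result;
}
console.log(ngrams(['some', 'other', 'words', 'here'], 3));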
String Distance
"natural" supplies the Dice's coefficient, Levenshtein distance, and Jaro-Winkler distance algorithms for determining string similarity. These algorithms are concerned with orthographic (spelling) similarity, not necessarily phonetics.
Each algorithm produces a number indicating its perception of similarity, but each is determined differently and can even move in opposite directions. For instance, the more dissimilar two strings are the greater the Levenshtein distance, but Jaro-Winkler considers two totally dissimilar strings to have a value of 0 with identical strings having a value of 1.
The following example shows each algorithm's perception of the difference between the words "execution" and "intention".
var natural = require('natural');
console.log(natural.JaroWinklerDistance('execution', 'intention'));
console.log(natural.LevenshteinDistance('execution', 'intention'));
console.log(natural.DiceCoefficient('execution', 'intention'));
resulting in the output:
0.48148148148148145
8
0.375
Now consider totally identical strings.
var natural = require('natural');
console.log(natural.JaroWinklerDistance('same', 'same'));
console.log(natural.LevenshteinDistance('same', 'same'));
console.log(natural.DiceCoefficient('same', 'same'));
which yields:
1
0
1
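One practical consequence of those directions: when fuzzy-matching a word against a list of candidates, with Jaro-Winkler you keep the maximum score, whereas with Levenshtein you'd keep the minimum. A small sketch using Jaro-Winkler (bestMatch is a hypothetical helper of mine, not part of "natural"):
var natural = require('natural');
function bestMatch(word, candidates) {
    var best = null, bestScore = -1;
    candidates.forEach(function(candidate) {
        // higher Jaro-Winkler values mean more similar, so keep the maximum
        var score = natural.JaroWinklerDistance(word, candidate);
        if(score > bestScore) {
            bestScore = score;
            best = candidate;
        }
    });
    return best;
}
console.log(bestMatch('natral', ['native', 'natural', 'nature']));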
Conclusion and Roadmap
Well, that was a summary of a sizable portion of "natural". Many of the algorithms have additional parameters that can be used to tweak their operation, and a few modules weren't represented at all, but the official README can help fill that gap.
There's still plenty in store for "natural". While the current plan is certainly not limited to the following points, these are indeed slated for at least some kind of attention by fall 2012.
- Non-English-specific stemming algorithms
- Pure javascript version
- Maximum entropy classifier
- Clustering algorithms (k-means in development)
- Part of speech tagging
- Punkt sentence segmentation
With the exception of k-means, which is near completion, I'd love community help on nearly every one! To either help out or follow along, check out the GitHub repository.
Published at DZone with permission of Christopher Umbel, DZone MVB. See the original article here.