Visualizing NLP with Pretrained Models – spaCy and StanfordNLP

Natural Language Processing (NLP) uses computational methods to analyze and process human language. In this tutorial, we explore two popular NLP libraries, spaCy and StanfordNLP, and demonstrate what their pretrained models can do.

spaCy – English NLP

Let’s start with spaCy and an English example. We’ll use a snippet about Donald John Trump and visualize various linguistic features.

import spacy

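# Load spaCy's small English model by its full package name
# (the "en" shortcut only works where a shortcut link has been created)
en = spacy.load("en_core_web_sm")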

text = ("Donald John Trump (born June 14, 1946) is the 45th and current president of "
        "the United States. Before entering politics, he was a businessman and television personality.")

# Run the full pipeline (tokenizer, tagger, parser, NER) on the text
doc_en = en(text)

# Display the sentences found by the parser
list(doc_en.sents)

spaCy splits the text into sentences and individual tokens. Each token exposes attributes such as orth_ (the original text), lemma_, pos_ (coarse part of speech), and tag_ (fine-grained tag).

from IPython.display import HTML, display
import tabulate

# Display each token with its text, lemma, coarse POS, and fine-grained tag
tokens = [[t.orth_, t.lemma_, t.pos_, t.tag_] for t in doc_en]
display(HTML(tabulate.tabulate(tokens, tablefmt='html')))

Named Entity Recognition (NER) with spaCy

spaCy provides pretrained models for named entity recognition. Let’s identify entities in our text.

# Identify named entities
entities = [(t.orth_, t.ent_iob_, t.ent_type_) for t in doc_en]
display(HTML(tabulate.tabulate(entities, tablefmt='html')))

Entities like “Donald John Trump,” “June 14, 1946,” “45th,” and “the United States” are recognized with their respective types (PERSON, DATE, ORDINAL, GPE).
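
The table above lists one row per token, so a multi-word entity such as a full name is spread over several rows marked B (begin) and I (inside). A minimal sketch of how such rows could be merged back into whole entities follows. The helper name group_entities and the sample rows are hypothetical and hand-written for illustration, and joining tokens with a single space is a simplification. In practice spaCy already exposes merged spans through doc_en.ents.

```python
def group_entities(triples):
    """Merge (text, iob, type) token triples into (entity_text, type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for text, iob, ent_type in triples:
        if iob == "B":
            # A new entity starts; flush any entity that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [text], ent_type
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(text)
        else:
            # An "O" token (or an "I" with nothing open) ends any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hand-written sample rows in the (text, IOB, type) shape; not real model output.
sample = [
    ("Donald", "B", "PERSON"),
    ("John", "I", "PERSON"),
    ("Trump", "I", "PERSON"),
    ("is", "O", ""),
    ("the", "O", ""),
    ("45th", "B", "ORDINAL"),
    ("president", "O", ""),
]
print(group_entities(sample))
```

Passing the entities list built earlier to group_entities should give one row per entity instead of one row per token.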

Dependency Parsing with spaCy

The dependency parser in spaCy helps analyze grammatical relations between tokens.

# Dependency parsing
syntax = [[token.text, token.dep_, token.head.text] for token in doc_en]
display(HTML(tabulate.tabulate(syntax, tablefmt='html')))

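Each row pairs a token with its dependency label and its head, the token it attaches to. The root of each sentence is its own head, so following the head links upward from any token recovers the sentence's structure.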
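
A flat table of heads can be hard to read as a tree. The sketch below shows one way to turn such rows into an indented outline. It assumes each row carries the token's position and its head's position (in spaCy these would be token.i and token.head.i) rather than the head's text, since the same word can occur twice in a sentence. The helper names build_tree and render, and the sample rows, are hypothetical and hand-written for illustration.

```python
from collections import defaultdict


def build_tree(rows):
    """Index children by head. rows: (index, text, dep, head_index) tuples."""
    children = defaultdict(list)
    roots = []
    for index, _text, _dep, head in rows:
        if head == index:
            roots.append(index)  # a root token is its own head
        else:
            children[head].append(index)
    return roots, children


def render(rows, node, children, depth=0):
    """Return indented lines for the subtree rooted at position `node`."""
    _, text, dep, _ = rows[node]
    lines = ["  " * depth + f"{text} ({dep})"]
    for child in children.get(node, []):
        lines.extend(render(rows, child, children, depth + 1))
    return lines


# Hand-written rows for illustration; not actual parser output.
rows = [
    (0, "he", "nsubj", 1),
    (1, "was", "ROOT", 1),
    (2, "a", "det", 3),
    (3, "businessman", "attr", 1),
]
roots, children = build_tree(rows)
for root in roots:
    print("\n".join(render(rows, root, children)))
```

For the document above, the rows could be built with [(t.i, t.text, t.dep_, t.head.i) for t in doc_en]. spaCy's built-in displacy visualizer is an alternative when a graphical rendering is preferred.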

StanfordNLP – Dutch NLP

Now, let’s switch to StanfordNLP and process a Dutch sentence about Charles Michel.

import stanfordnlp

# Download the Dutch model (if not already downloaded)
# stanfordnlp.download('nl')

# Load StanfordNLP Dutch model
nl_stanford = stanfordnlp.Pipeline(lang="nl")
text_nl = "Charles Michel is de eerste minister van België."
doc_nl_stanford = nl_stanford(text_nl)
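
The snippet above processes the sentence but does not inspect the result. A minimal sketch of how the annotations might be collected is shown below. It assumes the returned document exposes sentences, each with a list of words that carry text, lemma, and upos attributes, as StanfordNLP's documentation describes. The helper name word_rows is hypothetical.

```python
def word_rows(doc):
    """Collect (text, lemma, upos) for every word in a StanfordNLP-style document.

    Assumes `doc.sentences` is iterable and each sentence has a `words` list
    whose items expose `text`, `lemma`, and `upos`.
    """
    return [
        (word.text, word.lemma, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

Calling word_rows(doc_nl_stanford) should then return one tuple per Dutch word, which can be passed to tabulate in the same way as the spaCy tables.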

Combining spaCy and StanfordNLP

spaCy and StanfordNLP can be used together. The spacy_stanfordnlp wrapper exposes a StanfordNLP pipeline as a spaCy Language object, so its output arrives as a regular spaCy Doc.

from spacy_stanfordnlp import StanfordNLPLanguage

# Create a combined pipeline
nl_combined = StanfordNLPLanguage(nl_stanford)
doc_nl_combined = nl_combined(text_nl)

# Display combined information
info = [(t.orth_, t.lemma_, t.pos_, t.tag_) for t in doc_nl_combined]
display(HTML(tabulate.tabulate(info, tablefmt='html')))

This combination provides Dutch lemmatization, part-of-speech tagging, and dependency parsing.

Enhancing with spaCy’s NER

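You can extend the combined pipeline with the named entity recognizer from spaCy’s Dutch model.

# Assumption: the NER component should come from spaCy's Dutch model, not the
# English pipeline loaded earlier; "nl_core_news_sm" must be installed separately.
nl_spacy = spacy.load("nl_core_news_sm")
nl_combined = StanfordNLPLanguage(nl_stanford)
nl_ner = nl_spacy.get_pipe("ner")
nl_combined.add_pipe(nl_ner)
# Register the Dutch person label in the combined pipeline's string store
nl_combined.vocab.strings.add("PER")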

doc_nl_combined = nl_combined(text_nl)

# Display enhanced information
info = [(t.orth_, t.lemma_, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_nl_combined]
display(HTML(tabulate.tabulate(info, tablefmt='html')))

The resulting table places StanfordNLP’s lemmas and tags next to spaCy’s entity labels for each token, all read from a single Doc.

Conclusion

spaCy and StanfordNLP both offer pretrained models for multiple languages, and the spacy_stanfordnlp wrapper lets you use them in a single pipeline. From here, you can experiment with other languages and with other combinations of components from the two libraries.