Visualizing U.S. utility patents

Andrey Kamenov, Ph.D. Probability and Statistics

Exploring utility patents is no easy task. The system currently in place, the Cooperative Patent Classification (CPC), has its issues. For example, many patent applications carry multiple classifications that are only partially related to the invention itself. This is useful if you are searching for something specific, but it is less helpful if you are interested in patent mining or visualization.

Luckily, modern computational techniques make most of these tasks much easier. A particularly promising approach is to use natural language processing (NLP) to classify and visualize patents based on their abstracts.

Let’s take a look at the algorithm known as Latent Dirichlet Allocation (or LDA for short). Given a number of topics chosen in advance, its primary goal is to find the topics, each a distribution over words, that best describe a corpus of documents (patents, in our case), and to express every document as a mixture of those topics.
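
For readers who prefer notation, the generative model behind LDA can be written as follows. The symbols are the conventional ones and are an assumption here, not taken from the article: K is the number of topics (20 in our case), D the number of documents, N_d the number of words in document d, and α and β are Dirichlet hyperparameters.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here φ_k is the word distribution of topic k, θ_d is the topic mixture of document d, z_{d,n} is the topic assigned to the n-th word of document d, and w_{d,n} is the word itself. Fitting the model means inferring φ and θ from the observed words.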

The algorithm is computationally expensive, but there are ways to improve performance. The first is to fit the model with Gibbs sampling, a Markov chain Monte Carlo technique. The second is to parallelize the routine to take advantage of modern multi-core CPUs. After we combined both approaches, running the task on the corpus of 2.5 million documents took just under an hour. You can find a link to the library we used at the bottom of the page.
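
To make the sampling idea concrete, here is a minimal single-threaded sketch of collapsed Gibbs sampling for LDA in plain Python. It is not the library used for the results in this article, and it is far too slow for millions of documents. The function name `lda_gibbs`, the hyperparameter values and the toy corpus are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word, vocab): per-document topic proportions,
    per-topic word probabilities, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables maintained by the sampler.
    ndk = [[0] * n_topics for _ in docs]             # document-topic counts
    nkw = [[0] * n_words for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                              # tokens per topic

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                k = z[d][n]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # Conditional p(z = t | all other assignments), unnormalized.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta)
                    / (nk[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    # Smoothed estimates of the topic mixtures and word distributions.
    doc_topic = [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(nkw[t][v] + beta) / (nk[t] + n_words * beta) for v in range(n_words)]
        for t in range(n_topics)
    ]
    return doc_topic, topic_word, vocab


# Hypothetical toy corpus standing in for patent abstracts.
docs = [
    "engine fuel combustion piston engine".split(),
    "fuel piston cylinder engine combustion".split(),
    "antibody protein cell antibody receptor".split(),
    "protein receptor cell antibody binding".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
```

Each row of `doc_topic` is one document's topic mixture and each row of `topic_word` is one topic's word distribution. The inner loop visits every token on every sweep, which is why parallelization matters so much at the scale of a real patent corpus.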

So, let’s see what the 20 suggested topics are. Here are the words that are most important for each one:

Most of these topics (numbers 2 and 4, for example) are rather specific. Interestingly, others (such as number 1) are quite general: the most important words for that topic are “invention,” “methods” and “present.”
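
Ranking the most important words per topic is a simple operation once the model is fitted. The sketch below assumes the topics are available as rows of word weights. The helper `top_words` and the numbers in `topic_word` are hypothetical and are not values from our model.

```python
def top_words(topic_word, vocab, n=3):
    """Return the n highest-weight words for each topic."""
    result = []
    for weights in topic_word:
        # Sort vocabulary indices by this topic's weight, largest first.
        ranked = sorted(range(len(vocab)), key=lambda v: weights[v], reverse=True)
        result.append([vocab[v] for v in ranked[:n]])
    return result


# Hypothetical toy values, not taken from the article's model.
vocab = ["invention", "methods", "present", "engine", "fuel"]
topic_word = [
    [0.40, 0.30, 0.20, 0.05, 0.05],   # a "general" topic
    [0.05, 0.05, 0.10, 0.45, 0.35],   # a more specific topic
]
summary = top_words(topic_word, vocab, n=3)
print(summary)
# [['invention', 'methods', 'present'], ['engine', 'fuel', 'present']]
```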

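In effect, we represented each patent as a vector of 20 real numbers, each measuring how close its abstract is to one of the topics. Twenty dimensions are still too many to grasp or plot directly, so the obvious next step is to apply a dimensionality-reduction technique such as t-SNE (t-distributed stochastic neighbor embedding), which maps the vectors onto a low-dimensional chart while trying to keep similar patents near each other.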
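
As a rough illustration of this step, the sketch below runs scikit-learn's `TSNE` on randomly generated 20-dimensional topic vectors. The article does not say which t-SNE implementation was used, so the choice of scikit-learn, the parameter values and the stand-in data are all assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in data: 200 documents x 20 topic weights, each row summing to 1.
# Real input would be the per-patent topic vectors produced by LDA.
doc_topic = rng.dirichlet(alpha=np.full(20, 0.1), size=200)

# Project the 20-dimensional vectors onto a 2-D map.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
coords = tsne.fit_transform(doc_topic)   # one (x, y) pair per document
```

Each row of `coords` can then be drawn as a dot on a scatter plot. For a corpus of millions of patents, a plain run like this would likely be too slow, so a random sample or a faster implementation may be needed.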

t-SNE map of U.S. patents

A multitude of tightly knit groups can be seen in this chart.

The “general topic” clusters, shown as red and black dots, are clearly visible here; patents in these clusters have relatively vague abstracts. Nevertheless, our method still does a good job of placing these patents close to similar ones.
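
The text above does not spell out how the dot colors were chosen. One common choice, assumed here purely for illustration, is to color each patent by its dominant topic, meaning the topic with the largest weight in its vector. The helper `dominant_topic`, the four-topic vectors and the palette are hypothetical.

```python
def dominant_topic(weights):
    """Index of the largest topic weight in one document's topic vector."""
    return max(range(len(weights)), key=lambda k: weights[k])


# Hypothetical 4-topic vectors for three abstracts.
doc_topic = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10],
    [0.55, 0.25, 0.10, 0.10],
]
palette = ["red", "blue", "green", "black"]   # one hypothetical color per topic
colors = [palette[dominant_topic(row)] for row in doc_topic]
print(colors)
# ['red', 'green', 'red']
```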




About Andrey Kamenov

Andrey Kamenov is a data scientist working for Advameg Inc. His background includes teaching statistics, stochastic processes and financial mathematics in Moscow State University and working for a hedge fund. His academic interests range from statistical data analysis to optimal stopping theory. Andrey also enjoys his hobbies of photography, reading and powerlifting.
