Andrey Kamenov, Ph.D. Probability and Statistics
Andrey Kamenov is a data scientist working for Advameg Inc. His background includes teaching statistics, stochastic processes and financial mathematics at Moscow State University and working for a hedge fund. His academic interests range from statistical data analysis to optimal stopping theory. Andrey also enjoys photography, reading and powerlifting.
The recently released 2016 data on motor vehicle crashes from the NHTSA shows a grim fact: the number of pedestrian fatalities has reached an all-time high. It’s worth looking at the possible reasons behind this growth, since the FARS database contains rather detailed information about each crash.
First of all, a quick glance at the numbers reveals that the increase in fatalities can be narrowed down mostly to crashes that happened at night. In fact, daytime fatalities are roughly in line with the early 2000s numbers.
A lot of news articles suggest that SUVs are inherently dangerous to both pedestrians and other drivers. This seems logical — these vehicles are usually heavier, so the energy of impact should be greater. In addition, SUVs are taller, which most likely puts pedestrians at greater risk in a possible crash.
Automatic braking is still considered a novelty in the car world. But the technology is maturing, and many people wonder whether it will appear in more and more new cars each year. And can it really help significantly reduce the number of road fatalities?
Eleven percent of all fatal crashes in 2016 involved at least one driver who was distracted or drowsy — a record low figure (that means just under 3,800 crashes in absolute terms). We can consider this number a ballpark estimate for how many lives could have been saved by Automatic Emergency Braking (AEB).
The U.S. government has several rules about awarding its contracts. One of them states that every agency should award a specific percentage of its total spending to small businesses. For instance, the Department of Defense set its goal for small business prime contracting in 2018 at 22 percent. It also sets several other goals, including those for women-owned small businesses and small disadvantaged businesses.
Naturally, the goal differs from one agency to another, and the definition of small business itself changes depending on the industry.
On paper, the government’s efforts have been pretty effective. Small firms received 30.5 percent of the government’s total contract spending in 2017, up 1.8 percent from 2015. But these overall numbers don’t mean much to your small business, since the percentage varies significantly between industries and states, so let’s take a more detailed look. For example, it appears that small business participation is markedly lower in the Midwestern states.
Let’s have a look at the quietest and loudest cities in the U.S. To give you a broader perspective, only the top values are listed for each Census division. To compare the listed numbers, please keep in mind that a 10 dBA difference is perceived as a doubling of loudness. So an average background noise level of 50 dBA (typical for big cities like Las Vegas or Los Angeles) can be described as four times as loud as a quiet town (30 dBA).
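The doubling rule above can be written as a one-line formula. The sketch below is only an illustration of that rule of thumb; the function name `loudness_ratio` is our own and does not come from any noise dataset or library.

```python
def loudness_ratio(level_a_dba: float, level_b_dba: float) -> float:
    """Perceived loudness of A relative to B, assuming every
    10 dBA step doubles the perceived loudness."""
    return 2 ** ((level_a_dba - level_b_dba) / 10)


# A 50 dBA city versus a 30 dBA quiet town: two 10 dBA steps.
print(loudness_ratio(50, 30))  # 4.0
```

The same function shows that a 10 dBA gap gives a ratio of 2 and that equal levels give a ratio of 1.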
According to a U.S. Environmental Protection Agency press release dating back as far as 1974, average outdoor noise levels below 55 dB “are identified as preventing activity interference and annoyance.” At the same time, more than 18.5 million Americans live in areas where the average existing noise levels are above this threshold.
Two states, New York and Illinois, stand out on the map below. The share of the total population living in areas with high noise levels in these states is 18.1 percent and 15.7 percent respectively. The percentage registered in California is surprisingly small for such a densely populated and urbanized area: only 1.5 percent.
Vehicle sales in the U.S. bottomed out in 2009. The financial crisis hit light truck sales especially hard — the numbers here fell by more than 40 percent. In comparison, car sales saw a more modest 30 percent decrease. High gasoline prices in 2008 also certainly didn’t help in keeping sales afloat.
After a few years of steady recovery, the vehicle sales dynamic has been somewhat mixed in recent years, as you can observe in the chart below.
What are the most innovative IT fields? The patent classification system can sometimes be hard to grasp. Additionally, the Patent and Trademark Office has recently finished the transition from one system to another. However, this doesn’t help us follow the hottest topics either. The mapping is available on the official website, but it is purely statistical. This means that you can’t really compare different classifications.
Another approach that we used was determining the topics ourselves. We based our method on applying the Latent Dirichlet Allocation algorithm to the (parsed and stemmed) application abstracts. Granted, this approach is relatively computationally expensive. The results are quite consistent with the original classification, but are, in fact, universal.
Here are the top five topics, based on the entire corpus of IT patent grants.
Exploring utility patents is no easy task. The system that is currently in place, Cooperative Patent Classification, has its issues. For example, many patent applications include multiple classifications only partially related to the invention itself. This proves useful if you are searching for something specific — on the other hand, it is not as helpful if you are interested in patent mining or visualization.
Luckily, one can easily accomplish most of these tasks with the help of advanced computational techniques. The most promising approach is the use of Natural Language Processing to classify and visualize patents based on their abstracts.
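Let’s take a look at the algorithm known as Latent Dirichlet Allocation (or LDA for short). Given a number of topics chosen in advance, its primary goal is to find the topics that are best suited for describing a corpus of documents (patents, in our case), where each topic is a distribution over words and each document is a mixture of topics.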
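To make the idea more concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. It is not the pipeline used for the patent analysis. The function name `lda_gibbs`, the hyperparameter values and the four toy "abstracts" are all assumptions made up for illustration. Note that the number of topics is passed in by the caller.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists; num_topics is chosen by the caller.
    Returns per-document topic mixtures and the top words of each topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic mixture for each document.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [w for w, c in topic_word[t].most_common(3) if c > 0]
        for t in range(num_topics)
    ]
    return theta, top_words


# Hypothetical, already-stemmed toy "abstracts" (not real patent data).
docs = [
    "network packet router network protocol".split(),
    "router packet protocol wireless network".split(),
    "image pixel camera image sensor".split(),
    "camera sensor pixel image lens".split(),
]
theta, top_words = lda_gibbs(docs, num_topics=2)
for mixture in theta:
    print([round(p, 2) for p in mixture])
print(top_words)
```

On a real corpus one would normally reach for an established library implementation. The hand-written sampler is shown only because it fits on one screen and exposes what the algorithm estimates.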
Today we’ll take a look at the seasonality of peer-to-peer lending. Thanks to data provided by Prosper, we can perform a quantitative analysis of loan charge-off times and find if there are any seasonal patterns present.
First, let’s take a look at the lifespan distribution of defaulted loans. We see that the default rates increase significantly during the first several months, with the largest number of defaults registered in the eighth and ninth months after origination.
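A lifespan distribution of this kind can be computed by counting how many whole months pass between origination and charge-off. The sketch below is only an assumption about how such a tally might look. The helper `months_between` and the four loan records are made up for illustration and are not taken from the Prosper data.

```python
from collections import Counter
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the last month is not complete yet
    return months


# Hypothetical (origination, charge-off) date pairs, for illustration only.
loans = [
    (date(2015, 1, 15), date(2015, 9, 20)),
    (date(2015, 3, 1), date(2015, 11, 5)),
    (date(2015, 6, 10), date(2016, 3, 12)),
    (date(2015, 7, 30), date(2015, 12, 1)),
]

# Lifespan distribution: number of defaults per months since origination.
lifespans = Counter(months_between(o, c) for o, c in loans)

# Seasonality view: number of defaults per calendar month of charge-off.
by_calendar_month = Counter(c.month for _, c in loans)

print(sorted(lifespans.items()))
print(sorted(by_calendar_month.items()))
```

Keeping the two tallies separate helps distinguish the effect of a loan's age from any effect of the time of year.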