The curse of dimensionality - Dataconomy


The curse of dimensionality comes into play when we deal with data that has many dimensions or features. The dimensionality of the data is simply the number of characteristics, or columns, in a dataset.

High-dimensional data presents several challenges, the most notable of which is that it becomes extremely difficult to find meaningful correlations while processing and visualizing it. In addition, as the number of dimensions increases, training a model becomes much slower. More dimensions also invite more chances for multicollinearity, a condition in which two or more variables are highly correlated with one another.
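As a minimal sketch of how multicollinearity shows up in practice (the data here is made up purely for illustration, not from any particular dataset), a correlation matrix makes nearly collinear columns easy to spot:

```python
import numpy as np

# Illustrative data: three features, where x2 is almost a linear copy of x0.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = 2.0 * x0 + rng.normal(scale=0.05, size=200)  # nearly collinear with x0
X = np.column_stack([x0, x1, x2])

# Pairwise correlation matrix; off-diagonal values close to +/-1
# signal multicollinearity between those columns.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```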

The curse of dimensionality is a term used to describe the issues that arise when classifying, organizing, and analyzing high-dimensional data, particularly data sparsity and the "closeness" of data points.

Data sparsity is an issue that arises as you move to higher dimensions. Because the volume of the space grows so quickly that the data cannot keep up, the data becomes sparse. Sparsity is a serious problem for statistical significance: as the data space grows from one dimension to two and then to three, the fraction of it that the available data fills shrinks, so the amount of data needed for a meaningful analysis grows dramatically.
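A small simulation makes the effect concrete. The sketch below assumes a fixed budget of 1,000 random points and bins each axis into 10 intervals; the fraction of occupied cells collapses as the number of dimensions grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, bins_per_dim = 1000, 10

for d in (1, 2, 3, 5):
    # Bin each coordinate into 10 intervals and count distinct occupied cells.
    points = rng.random((n_points, d))
    cells = {tuple((points[i] * bins_per_dim).astype(int)) for i in range(n_points)}
    total_cells = bins_per_dim ** d
    print(f"{d}D: {len(cells)} of {total_cells} cells occupied "
          f"({len(cells) / total_cells:.4%})")
```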


Consider a data set with four points in one dimension (only one feature in the data set). It can be represented on a line, and the space the data must cover amounts to 4 regions, one per data point. Add a second feature and the space grows to 4 x 4 = 16 regions; add a third and it expands to 4 x 4 x 4 = 64 regions. The size of the space grows exponentially as the number of dimensions rises, while the number of data points stays the same.
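The arithmetic behind the example is easy to verify directly:

```python
# Four distinct values per feature: the number of regions the data must
# cover grows as 4**d, while the number of data points stays at four.
for d in range(1, 6):
    print(f"{d} feature(s): {4 ** d} regions for 4 data points")
```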

The second issue is how to sort or classify the data. Data points may appear similar in low-dimensional spaces, but as the dimension increases, those same points can seem farther apart: two points that look close together in two dimensions can appear distant when viewed in three. The curse of dimensionality has exactly this effect on data.

As the number of dimensions increases, calculating the distance between observations becomes increasingly difficult, and every algorithm that relies on distance or correlation finds the computation an uphill struggle.
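A quick simulation with uniformly random points (an illustration with assumed point counts and dimensions, not a result from the article) shows how the contrast between the nearest and farthest neighbor shrinks as dimensions grow, which is what makes distance-based methods struggle:

```python
import numpy as np

rng = np.random.default_rng(2)
n_points = 500

for d in (2, 10, 100, 1000):
    points = rng.random((n_points, d))
    # Distances from the first point to every other point.
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # Relative contrast: how much farther the farthest neighbor is
    # than the nearest one. It shrinks toward zero as d grows.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast (max-min)/min = {contrast:.3f}")
```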

Neural networks are instantiated with a certain number of features (dimensions). Each data point has its own set of characteristics, each one falling somewhere along one of those dimensions. We might want one feature to handle color, for example, while another handles weight. Each feature adds information, and if we could encode every conceivable feature, we could convey exactly which fruit we are thinking about. However, an infinite number of features would require infinite training instances, rendering our network's real-world usefulness doubtful.

The amount of training data required grows drastically with each new feature. Even if we only had 15 features, each being a 'yes' or 'no' question, the number of training samples needed to cover every combination would be 2^15, or roughly 32,000.
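The exponential growth is easy to see by printing the counts directly:

```python
# Each new binary feature doubles the number of possible input combinations,
# so covering every combination at least once needs 2**n_features examples.
for n_features in (5, 10, 15, 20):
    print(f"{n_features} binary features -> {2 ** n_features:,} combinations")
```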

The following are just a few domains where the direct consequences of the curse of dimensionality can be observed; machine learning takes the worst hit.

In machine learning, even a marginal increase in dimensionality necessitates a substantial expansion in the amount of data to maintain comparable results; the curse of dimensionality is the direct by-product of this phenomenon.

Anomaly detection is the task of finding unusual items or events in the data. In high-dimensional data, anomalies frequently carry many irrelevant attributes, and some points appear in the neighbor lists of other points far more often than others, which makes genuine outliers harder to isolate.
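As a hedged sketch of the problem, the example below uses scikit-learn's LocalOutlierFactor on synthetic data with one planted anomaly and many irrelevant noise columns; the column counts and values are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
# 200 ordinary points in 2 informative dimensions plus 50 irrelevant noise
# dimensions, and one planted anomaly far from the cluster.
normal = rng.normal(0.0, 1.0, size=(200, 2))
anomaly = np.array([[8.0, 8.0]])
informative = np.vstack([normal, anomaly])
noise = rng.normal(0.0, 1.0, size=(201, 50))
X = np.hstack([informative, noise])

# LOF flags points whose local density is much lower than their neighbours';
# the irrelevant dimensions dilute the distances it relies on.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
print("points flagged as anomalies:", np.where(labels == -1)[0])
```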

As the number of possible input combinations grows, complexity increases quickly, and the curse of dimensionality strikes.

To deal with the curse of dimensionality caused by high-dimensional data, a collection of methods known as “Dimensionality Reduction Techniques” is employed. Dimensionality reduction procedures are divided into “Feature selection” and “Feature extraction.”

In feature selection methods, features are evaluated for usefulness and then either kept or eliminated, as in the sketch below.
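The following is a minimal, illustrative sketch of feature selection using scikit-learn's SelectKBest with an ANOVA F-score on synthetic data; the task, feature counts, and value of k are assumptions for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic task: 20 features, only 5 of them actually informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Score every feature against the target and keep the 5 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print("kept feature indices:", np.where(selector.get_support())[0])
print("reduced shape:", X_reduced.shape)
```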

In feature extraction methods, the high-dimensional features are combined into a smaller set of low-dimensional components, as in Principal Component Analysis (PCA) and Independent Component Analysis (ICA), or factored into latent components, as in Factor Analysis (FA).
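As a sketch of feature extraction, the example below applies scikit-learn's PCA to synthetic 100-dimensional data that really lives on a 3-dimensional subspace; the dimensions and noise level are assumptions chosen only to make the point:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 100-dimensional data that actually lies near a 3-dimensional subspace.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 100))
X = latent @ mixing + rng.normal(scale=0.1, size=(500, 100))

# PCA combines the 100 correlated features into a few uncorrelated components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print("reduced shape:", X_reduced.shape)
print("variance explained:", np.round(pca.explained_variance_ratio_, 3))
```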
