CONTACT: PRAVEEN KUMAR L (+91 9791938249)
MAIL ID: praveen@nexgenproject.com
Web: www.nexgenproject.com
AN INFORMATION THEORY-BASED FEATURE
SELECTION FRAMEWORK FOR BIG DATA UNDER APACHE SPARK
ABSTRACT
With the advent of extremely high-dimensional datasets, dimensionality reduction techniques are
becoming mandatory. Of the many techniques available, feature selection (FS) is of growing
interest for its ability to identify both relevant features and frequently repeated instances in huge
datasets. We aim to demonstrate that standard FS methods can be parallelized in big data
platforms like Apache Spark so as to boost both performance and accuracy. We propose a
distributed implementation of a generic FS framework that includes a broad group of well-known
information theory-based methods. Experimental results for a broad set of real-world datasets
show that our distributed framework is capable of rapidly dealing with ultrahigh-dimensional
datasets as well as those with a huge number of samples, outperforming the sequential version in
all the cases studied.
EXISTING SYSTEM:
Existing FS methods are not expected to scale well when dealing with big data, because their
efficiency may deteriorate significantly or the methods may even become inapplicable. Scalable
distributed programming models and frameworks have emerged in the last decade to manage the
problem of big data. The first programming model developed was MapReduce, along with its
open-source implementation Apache Hadoop. More recently, Apache Spark has been presented as a
new distributed framework with a fast, general large-scale data processing engine that is
popular with ML researchers due to its suitability for iterative procedures. Likewise, several
libraries for approaching ML tasks in big data environments have emerged in recent years. The
first such library was Mahout, followed by MLlib, built on the
Spark system. Thanks to Spark's capacity for in-memory computation, which speeds up iterative
processes, algorithms developed for this kind of platform have become pervasive in industry.
Although several gold-standard algorithms for ML tasks have been redesigned with distributed
implementations for big data technologies, this is not yet the case for FS algorithms.
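To make the point about iterative workloads concrete, the following is a minimal sketch (Scala,
plain Spark RDD API) of the pattern that benefits from Spark's in-memory caching: the dataset is
parsed once, cached, and then reused on every pass of an iterative procedure, whereas a
MapReduce-style job would re-read it from disk on each pass. The file path, record layout, and
the simple least-squares update are illustrative assumptions, not part of the framework
discussed in this paper.

    import org.apache.spark.sql.SparkSession

    object IterativeCachingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("iterative-caching-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Parse once: each record becomes (label, feature vector); cache() keeps the
        // deserialized partitions in executor memory for reuse in every iteration.
        val data = sc.textFile("hdfs:///path/to/dataset.csv") // hypothetical path
          .map { line =>
            val values = line.split(',').map(_.toDouble)
            (values.head, values.tail)
          }
          .cache()

        val numFeatures = data.first()._2.length
        val n = data.count().toDouble
        var weights = Array.fill(numFeatures)(0.0)
        val stepSize = 0.01

        // Each pass reuses the cached RDD; only the small weight vector changes.
        for (_ <- 1 to 20) {
          val gradient = data.map { case (label, features) =>
            val prediction = features.zip(weights).map { case (x, w) => x * w }.sum
            val residual = label - prediction
            features.map(_ * residual) // per-example contribution to the least-squares gradient
          }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })

          weights = weights.zip(gradient).map { case (w, g) => w + stepSize * g / n }
        }

        println(weights.take(5).mkString(", "))
        spark.stop()
      }
    }

Under MapReduce, each of the 20 passes would be a separate job re-reading the input; with the
cached RDD, only the first pass touches the file system.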
PROPOSED SYSTEM:
This paper aims to fill that gap by demonstrating that standard FS methods can be redesigned for
big data platforms so that they can be usefully applied to big datasets while boosting both
performance and accuracy. We propose a new distributed design for a generic FS framework based
on information theory, implemented using the Apache Spark paradigm. To make this adaptation
feasible, a wide variety of techniques from the distributed environment have been used,
including information caching, data partitioning, and replication of relevant variables. Note
that adapting this framework to Spark is a major challenge, as it requires deep restructuring of
the classic algorithms. To test the effectiveness of our framework, we applied it to a complete
set of real-world datasets [up to O(10⁷) features and instances]. The results point to
competitive performance (in terms of generalization and efficiency) when dealing with datasets
that are huge in terms of both number of features and instances. As an illustrative example, we
were able to select 100 features from a dataset with 29 Ɨ 10⁶ features and 19 Ɨ 10⁶ instances in
under 60 min (using a 432-core cluster).
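As a rough illustration of the distributed techniques listed above (column-wise data
partitioning, caching, and replication of relevant variables via broadcast), the sketch below
selects features greedily with an mRMR-style criterion over discretized data. It is a simplified
stand-in written for this summary, not the authors' implementation: the columnar RDD layout, the
integer-valued features, and all names are assumptions made for the example.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    object GreedyInfoTheoreticFS {

      // Mutual information I(A; B) in bits from two aligned, discretized columns.
      def mutualInfo(a: Array[Int], b: Array[Int]): Double = {
        val n = a.length.toDouble
        val joint = mutable.Map[(Int, Int), Long]().withDefaultValue(0L)
        val pa = mutable.Map[Int, Long]().withDefaultValue(0L)
        val pb = mutable.Map[Int, Long]().withDefaultValue(0L)
        var i = 0
        while (i < a.length) {
          joint((a(i), b(i))) += 1; pa(a(i)) += 1; pb(b(i)) += 1; i += 1
        }
        joint.map { case ((x, y), c) =>
          val pxy = c / n
          pxy * math.log(pxy / ((pa(x) / n) * (pb(y) / n)))
        }.sum / math.log(2)
      }

      // columns: one record per feature, holding that feature's value for every instance
      // (a column-wise layout, so each feature can be scored independently in parallel).
      def select(sc: SparkContext,
                 columns: RDD[(Int, Array[Int])],
                 labels: Array[Int],
                 numToSelect: Int): Seq[Int] = {
        columns.cache()                     // reused on every greedy step
        val bLabels = sc.broadcast(labels)  // the class column is small: replicate it to workers

        // Relevance I(X; Y) of every feature, computed in one distributed pass.
        val relevance = columns.mapValues(col => mutualInfo(col, bLabels.value)).collectAsMap()
        require(numToSelect <= relevance.size)

        val selected = mutable.ArrayBuffer[Int]()
        val redundancy = mutable.Map[Int, Double]().withDefaultValue(0.0)

        while (selected.size < numToSelect) {
          val already = selected.toSet
          // mRMR-style score: relevance minus average redundancy with the selected features.
          val next = relevance.keys.filterNot(already.contains).maxBy { f =>
            relevance(f) - (if (already.isEmpty) 0.0 else redundancy(f) / already.size)
          }
          selected += next

          // Broadcast the winning column and update all redundancy terms in one pass.
          val bChosen = sc.broadcast(columns.lookup(next).head)
          columns.filter { case (f, _) => f != next && !already.contains(f) }
            .mapValues(col => mutualInfo(col, bChosen.value))
            .collectAsMap()
            .foreach { case (f, mi) => redundancy(f) += mi }
          bChosen.unpersist()
        }
        selected.toSeq
      }
    }

A call such as select(sc, columns, labels, 100) would mirror, in spirit, the 100-feature
experiment mentioned above: the heavy mutual-information computations run where the feature
columns live, and only small per-feature scores return to the driver at each step.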
CONCLUSION
In discussing the problem of processing big data, especially from the perspective of
dimensionality, we have highlighted the impact of correctly identifying relevant features in
datasets and the corresponding difficulties caused by the combinatorial effects of incoming data
growing in terms of both instances and features. Despite the growing interest in dimensionality
reduction for big data, few FS methods developed to date can adequately deal with
high-dimensionality problems. Adopting an information theory approach, we adapted a generic
FS framework for big data from a proposal by Brown et al. The framework contains implementations
of many state-of-the-art FS algorithms, including mRMR and JMI. The adaptation entailed a
radical redesign of Brown et al.'s framework so as to fit it to a distributed paradigm.
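For reference, the criteria these two methods optimize are the standard definitions from the
information-theoretic FS literature (they are not restated in this summary); in the notation of
Brown et al.'s unifying framework, with S the set of already selected features, X_k a candidate
feature, and Y the class:

    J_{\mathrm{mRMR}}(X_k) = I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} I(X_k; X_j)

    J_{\mathrm{JMI}}(X_k) = \sum_{X_j \in S} I(X_k, X_j; Y)

Both are instances of the generic criterion underlying the framework,

    J(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) + \gamma \sum_{X_j \in S} I(X_k; X_j \mid Y)

with beta = 1/|S| and gamma = 0 for mRMR, and beta = gamma = 1/|S| for JMI (up to terms that do
not depend on X_k); at each greedy step the candidate maximizing J is added to S.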
This paper has also contributed an FS module to the emerging Spark and MLlib platforms, which,
to date, included no complex FS algorithm. Our experimental results demonstrate the usefulness
of our FS solution when applied to a broad set of large real-world problems. Our solution
performed well along both dimensions of big data (samples and features), yielding competitive
results on both ultrahigh-dimensional datasets and datasets with a huge number of samples. Our
distributed approach consistently outperformed the sequential version, enabling the resolution
of problems that could not be usefully resolved using the classical approach. Future research
will focus on the following:
1) Designing new information theory-based approaches for high-speed data streams and extending
these approaches to handle concept drift.
2) Analyzing the impact of approximate selection on high-dimensional data via faster solutions
that do not incur a high penalty on accuracy.
3) Designing a new, fully automatic FS system that selects the most relevant subset of features
from the full set, thereby eliminating the need to specify the number of features to select in
each execution.
