CONTACT: PRAVEEN KUMAR L (+91 9791938249)
MAIL ID: praveen@nexgenproject.com
Web: www.nexgenproject.com
AN INFORMATION THEORY-BASED FEATURE
SELECTION FRAMEWORK FOR BIG DATA UNDER APACHE SPARK
ABSTRACT
With the advent of extremely high-dimensional datasets, dimensionality reduction techniques are
becoming mandatory. Of the many techniques available, feature selection (FS) is of growing
interest for its ability to identify both relevant features and frequently repeated instances in huge
datasets. We aim to demonstrate that standard FS methods can be parallelized in big data
platforms like Apache Spark so as to boost both performance and accuracy. We propose a
distributed implementation of a generic FS framework that includes a broad group of well-known
information theory-based methods. Experimental results for a broad set of real-world datasets
show that our distributed framework is capable of rapidly dealing with ultrahigh-dimensional
datasets as well as those with a huge number of samples, outperforming the sequential version in
all the cases studied.
EXISTING SYSTEM:
Existing FS methods are not expected to scale well when dealing with big data, because their
efficiency may deteriorate significantly or the methods may even become inapplicable. Scalable
distributed programming models and frameworks have emerged in the last decade to manage the
problem of big data. The first programming model developed was MapReduce, along with its
open-source implementation Apache Hadoop. More recently, Apache Spark has been presented as a
new distributed framework with a fast, general large-scale data processing engine that is
popular with ML researchers due to its suitability for iterative procedures. Likewise, several
libraries for approaching ML tasks in big data environments have emerged in recent years. The
first such library was Mahout, followed by MLlib, built on the
Spark system. Thanks to Spark's capacity for in-memory computation, which speeds up iterative
processes, algorithms developed for this kind of platform have become pervasive in industry.
Although several gold-standard algorithms for ML tasks have been redesigned with distributed
implementations for big data technologies, this is not yet the case for FS algorithms.
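To make the point about iterative workloads concrete, the following is a minimal sketch (Scala,
plain Spark RDD API) of the pattern that benefits from Spark's in-memory caching: the dataset is
parsed once, cached, and then reused on every pass of an iterative procedure, whereas a
MapReduce-style job would re-read it from disk on each pass. The file path, record layout, and
the simple least-squares update are illustrative assumptions, not part of the framework
discussed in this paper.

    import org.apache.spark.sql.SparkSession

    object IterativeCachingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("iterative-caching-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Parse once: each record becomes (label, feature vector); cache() keeps the
        // deserialized partitions in executor memory for reuse in every iteration.
        val data = sc.textFile("hdfs:///path/to/dataset.csv") // hypothetical path
          .map { line =>
            val values = line.split(',').map(_.toDouble)
            (values.head, values.tail)
          }
          .cache()

        val numFeatures = data.first()._2.length
        val n = data.count().toDouble
        var weights = Array.fill(numFeatures)(0.0)
        val stepSize = 0.01

        // Each pass reuses the cached RDD; only the small weight vector changes.
        for (_ <- 1 to 20) {
          val gradient = data.map { case (label, features) =>
            val prediction = features.zip(weights).map { case (x, w) => x * w }.sum
            val residual = label - prediction
            features.map(_ * residual) // per-example contribution to the least-squares gradient
          }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })

          weights = weights.zip(gradient).map { case (w, g) => w + stepSize * g / n }
        }

        println(weights.take(5).mkString(", "))
        spark.stop()
      }
    }

Under MapReduce, each of the 20 passes would be a separate job re-reading the input; with the
cached RDD, only the first pass touches the file system.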
PROPOSED SYSTEM:
This paper aims to fill that gap by demonstrating that standard FS methods can be redesigned for
big data platforms so that they can be usefully applied to big datasets while boosting both
performance and accuracy. We propose a new distributed design for a generic FS framework based
on information theory, implemented using the Apache Spark paradigm. To make this adaptation
feasible, a wide variety of techniques from the distributed environment have been used,
including information caching, data partitioning, and replication of relevant variables. Note
that adapting this framework to Spark is a major challenge, as it requires deep restructuring of
the classic algorithms. To test the effectiveness of our framework, we applied it to a complete
set of real-world datasets [up to O(10⁷) features and instances]. The results point to
competitive performance (in terms of generalization and efficiency) when dealing with datasets
that are huge in terms of both number of features and instances. As an illustrative example, we
were able to select 100 features from a dataset with 29 Ɨ 10⁶ features and 19 Ɨ 10⁶ instances in
under 60 min (using a 432-core cluster).
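As a rough illustration of the distributed techniques listed above (column-wise data
partitioning, caching, and replication of relevant variables via broadcast), the sketch below
selects features greedily with an mRMR-style criterion over discretized data. It is a simplified
stand-in written for this summary, not the authors' implementation: the columnar RDD layout, the
integer-valued features, and all names are assumptions made for the example.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    object GreedyInfoTheoreticFS {

      // Mutual information I(A; B) in bits from two aligned, discretized columns.
      def mutualInfo(a: Array[Int], b: Array[Int]): Double = {
        val n = a.length.toDouble
        val joint = mutable.Map[(Int, Int), Long]().withDefaultValue(0L)
        val pa = mutable.Map[Int, Long]().withDefaultValue(0L)
        val pb = mutable.Map[Int, Long]().withDefaultValue(0L)
        var i = 0
        while (i < a.length) {
          joint((a(i), b(i))) += 1; pa(a(i)) += 1; pb(b(i)) += 1; i += 1
        }
        joint.map { case ((x, y), c) =>
          val pxy = c / n
          pxy * math.log(pxy / ((pa(x) / n) * (pb(y) / n)))
        }.sum / math.log(2)
      }

      // columns: one record per feature, holding that feature's value for every instance
      // (a column-wise layout, so each feature can be scored independently in parallel).
      def select(sc: SparkContext,
                 columns: RDD[(Int, Array[Int])],
                 labels: Array[Int],
                 numToSelect: Int): Seq[Int] = {
        columns.cache()                     // reused on every greedy step
        val bLabels = sc.broadcast(labels)  // the class column is small: replicate it to workers

        // Relevance I(X; Y) of every feature, computed in one distributed pass.
        val relevance = columns.mapValues(col => mutualInfo(col, bLabels.value)).collectAsMap()
        require(numToSelect <= relevance.size)

        val selected = mutable.ArrayBuffer[Int]()
        val redundancy = mutable.Map[Int, Double]().withDefaultValue(0.0)

        while (selected.size < numToSelect) {
          val already = selected.toSet
          // mRMR-style score: relevance minus average redundancy with the selected features.
          val next = relevance.keys.filterNot(already.contains).maxBy { f =>
            relevance(f) - (if (already.isEmpty) 0.0 else redundancy(f) / already.size)
          }
          selected += next

          // Broadcast the winning column and update all redundancy terms in one pass.
          val bChosen = sc.broadcast(columns.lookup(next).head)
          columns.filter { case (f, _) => f != next && !already.contains(f) }
            .mapValues(col => mutualInfo(col, bChosen.value))
            .collectAsMap()
            .foreach { case (f, mi) => redundancy(f) += mi }
          bChosen.unpersist()
        }
        selected.toSeq
      }
    }

A call such as select(sc, columns, labels, 100) would mirror, in spirit, the 100-feature
experiment mentioned above: the heavy mutual-information computations run where the feature
columns live, and only small per-feature scores return to the driver at each step.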
CONCLUSION
In discussing the problem of processing big data, especially from the perspective of
dimensionality, we have highlighted the impact of correctly identifying relevant features in
datasets and the corresponding difficulties caused by the combinatorial effects of incoming data
growing in terms of both instances and features. Despite the growing interest in dimensionality
reduction for big data, few FS methods developed to date can adequately deal with
high-dimensionality problems. Adopting an information theory approach, we adapted a generic
FS framework for big data from a proposal by Brown et al. The framework contains implementations
of many state-of-the-art FS algorithms, including mRMR and JMI. The adaptation entailed a
radical redesign of Brown et al.'s framework so as to fit it to a distributed paradigm.
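For reference, the criteria these two methods optimize are the standard definitions from the
information-theoretic FS literature (they are not restated in this summary); in the notation of
Brown et al.'s unifying framework, with S the set of already selected features, X_k a candidate
feature, and Y the class:

    J_{\mathrm{mRMR}}(X_k) = I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} I(X_k; X_j)

    J_{\mathrm{JMI}}(X_k) = \sum_{X_j \in S} I(X_k, X_j; Y)

Both are instances of the generic criterion underlying the framework,

    J(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) + \gamma \sum_{X_j \in S} I(X_k; X_j \mid Y)

with beta = 1/|S| and gamma = 0 for mRMR, and beta = gamma = 1/|S| for JMI (up to terms that do
not depend on X_k); at each greedy step the candidate maximizing J is added to S.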
This paper has also contributed an FS module to the emerging Spark and MLlib platforms, which,
to date, included no complex FS algorithm. Our experimental results demonstrate the usefulness
of our FS solution when applied to a broad set of large real-world problems. Our solution
performed well along both dimensions of big data (samples and features), yielding competitive
results on both ultrahigh-dimensional datasets and datasets with a huge number of samples. Our
distributed approach consistently outperformed the sequential version, enabling the resolution
of problems that could not be usefully resolved using the classical approach. Future research
will focus on the following:
1) Designing new information theory-based approaches for high-speed data streams and extending
these approaches to handle concept drift.
2) Analyzing the impact of approximate selection on high-dimensional data via faster solutions
that do not incur a high penalty on accuracy.
3) Designing a new, fully automatic FS system that selects the most relevant subset of features
from the full set, thereby eliminating the need to specify the number of features to select in
each execution.
