Saturday, 12 March 2022

Sign Language Recognition System using TensorFlow Object Detection API

  Communication is defined as the act of sharing or exchanging information, ideas or feelings. To establish communication between two people, both of them are required to have knowledge and understanding of a common language. But in the case of deaf and dumb people, the means of communication are different. Deaf is the inability to hear and dumb is the inability to speak. They communicate using sign language among themselves and with normal people, but normal people do not take the importance of sign language seriously. Not everyone possesses the knowledge and understanding of sign language, which makes communication difficult between a normal person and a deaf and dumb person. To overcome this barrier, one can build a model based on machine learning. A model can be trained to recognize different gestures of sign language and translate them into English. This will help a lot of people in communicating and conversing with deaf and dumb people. The existing Indian Sign Language Recognition systems are designed using machine learning algorithms with single and double-handed gestures, but they are not real-time. In this paper, we propose a method to create an Indian Sign Language dataset using a webcam and then, using transfer learning, train a TensorFlow model to create a real-time Sign Language Recognition system. The system achieves a good level of accuracy even with a limited-size dataset.

Full Text: PrePrint-Version, Publisher-Version


Wednesday, 19 January 2022

A Data Structure Perspective to the RDD-based Apriori Algorithm on Spark

  In recent years, many researchers have proposed efficient and scalable frequent itemset mining algorithms for big data analytics. Initially, MapReduce-based frequent itemset mining algorithms on Hadoop clusters were proposed. Although Hadoop was developed as a cluster computing system for handling and processing big data, its performance does not meet expectations for the iterative algorithms of data mining because of its high I/O cost: intermediate results are written to disk and then read back. Consequently, Spark has been developed as another cluster computing infrastructure, which is much faster than Hadoop due to its in-memory computation. It is highly suitable for iterative algorithms and supports batch, interactive, iterative, and stream processing of data. Many frequent itemset mining algorithms have been re-designed on Spark, and most of them are Apriori based. All these Spark-based Apriori algorithms use Hash Tree as the underlying data structure. This paper investigates the efficiency of various data structures for Spark-based Apriori. The data structure perspective has been investigated previously, but only for MapReduce-based Apriori, so it must be re-investigated in the distributed computing environment of Spark. The underlying data structures considered are Hash Tree, Trie, and Hash Table Trie. The experimental results on the benchmark datasets show that Spark-based Apriori performs almost the same with Trie and Hash Table Trie, and that both perform many times better than Hash Tree in the distributed computing environment of Spark.

Full Text: PrePrint-Version, Publisher-Version
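
As a rough illustration of the trie option compared in the abstract above, the following single-machine Python sketch counts the support of candidate k-itemsets with a trie. It is not the paper's Spark implementation. The names TrieNode, build_trie, count_transaction and count_supports are hypothetical, and storing children in a dict is only loosely in the spirit of the Hash Table Trie variant.

```python
class TrieNode:
    """One node of a candidate trie; a node at depth k represents a k-itemset."""

    def __init__(self):
        self.children = {}  # item -> TrieNode (dict lookup, an illustrative choice)
        self.count = 0      # support count, only meaningful at depth k


def build_trie(candidates):
    """Insert every candidate itemset (items in sorted order) into a trie."""
    root = TrieNode()
    for cand in candidates:
        node = root
        for item in sorted(cand):
            node = node.children.setdefault(item, TrieNode())
    return root


def count_transaction(node, items, start, depth_left):
    """Walk the trie along every k-subset of the sorted transaction."""
    if depth_left == 0:
        node.count += 1
        return
    # Stop early enough that the remaining items can still complete a k-itemset.
    for i in range(start, len(items) - depth_left + 1):
        child = node.children.get(items[i])
        if child is not None:
            count_transaction(child, items, i + 1, depth_left - 1)


def count_supports(candidates, transactions, k):
    """Return {candidate tuple: support count} for candidate k-itemsets."""
    root = build_trie(candidates)
    for t in transactions:
        count_transaction(root, sorted(set(t)), 0, k)

    result = {}

    def collect(node, prefix):
        if len(prefix) == k:
            result[tuple(prefix)] = node.count
            return
        for item, child in node.children.items():
            collect(child, prefix + [item])

    collect(root, [])
    return result
```

In a distributed setting, each worker could plausibly build the same candidate trie and count its own partition of transactions, with the partial counts summed afterwards. That step is an assumption here and is not shown.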


RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework

  Initially, a number of frequent itemset mining (FIM) algorithms were designed on Hadoop MapReduce, a distributed big data processing framework. However, due to heavy disk I/O, MapReduce is found to be inefficient for such highly iterative algorithms. Therefore, Spark, a more efficient distributed data processing framework, has been developed with in-memory computation and resilient distributed dataset (RDD) features to support iterative algorithms. On the Spark RDD framework, Apriori and FP-Growth based FIM algorithms have been designed, but Eclat-based algorithms have not been explored yet. In this paper, RDD-Eclat, a parallel Eclat algorithm on the Spark RDD framework, is proposed with its five variants. The proposed algorithms are evaluated on various benchmark datasets, and the results show that RDD-Eclat outperforms Spark-based Apriori by many times. The experimental results also show the scalability of the proposed algorithms as the number of cores and the size of the dataset increase.

Full Text: PrePrint-Version, Publisher-Version
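
For readers unfamiliar with Eclat, the sketch below shows the basic sequential idea in Python: transactions are converted to a vertical layout (item to set of transaction ids), and the support of a longer itemset is obtained by intersecting tidsets. It is a minimal sketch only and does not reproduce RDD-Eclat or any of its five variants. The function names to_vertical, eclat and mine are hypothetical.

```python
def to_vertical(transactions):
    """Convert horizontal transactions to a vertical layout: item -> set of tids."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            vertical.setdefault(item, set()).add(tid)
    return vertical


def eclat(prefix, items, min_sup, out):
    """Depth-first search over one prefix class.

    `items` is a list of (item, tidset) pairs that are all frequent
    when appended to `prefix`.
    """
    for i, (item, tids) in enumerate(items):
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Extend the current itemset with each later item by tidset intersection.
        suffix = []
        for other, other_tids in items[i + 1:]:
            common = tids & other_tids
            if len(common) >= min_sup:
                suffix.append((other, common))
        if suffix:
            eclat(itemset, suffix, min_sup, out)


def mine(transactions, min_sup):
    """Return {itemset tuple: support} for all frequent itemsets."""
    vertical = to_vertical(transactions)
    frequent = sorted(
        ((item, tids) for item, tids in vertical.items() if len(tids) >= min_sup),
        key=lambda pair: pair[0],
    )
    out = {}
    eclat((), frequent, min_sup, out)
    return out
```

Each top-level prefix class is processed independently of the others in this sketch, which is the property a parallel version would typically exploit. How the paper partitions the work across Spark is not shown here.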


Performance optimization of MapReduce-based Apriori algorithm on Hadoop cluster

  Many techniques have been proposed to implement the Apriori algorithm on the MapReduce framework, but only a few have focused on performance improvement. The FPC (Fixed Passes Combined-counting) and DPC (Dynamic Passes Combined-counting) algorithms combine multiple passes of Apriori in a single MapReduce phase to reduce the execution time. In this paper, we propose improved MapReduce-based Apriori algorithms, VFPC (Variable Size based Fixed Passes Combined-counting) and ETDPC (Elapsed Time based Dynamic Passes Combined-counting), over FPC and DPC. Further, we optimize the multi-pass phases of these algorithms by skipping the pruning step in some passes, and propose the Optimized-VFPC and Optimized-ETDPC algorithms. Quantitative analysis reveals that the cost of counting the additional unpruned candidates produced by skipping pruning is less significant than the computation cost saved by skipping it. Experimental results show that VFPC and ETDPC are more robust and flexible than FPC and DPC, whereas their optimized versions are more efficient in terms of execution time.

Full Text: PrePrint-Version, Publisher-Version
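
The idea of generating candidates for several consecutive passes at once, with or without the pruning step, can be sketched as follows. This is an illustrative Python sketch of standard Apriori candidate generation (join, then optional subset-based prune), not the VFPC or ETDPC algorithms themselves. How those algorithms choose the number of combined passes is not reproduced. The names join, prune, gen_candidates and combined_candidates are hypothetical.

```python
from itertools import combinations


def join(itemsets, k):
    """Join step: merge pairs of sorted k-itemsets that share their first k-1 items."""
    ordered = sorted(itemsets)
    out = set()
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            a, b = ordered[i], ordered[j]
            if a[:k - 1] == b[:k - 1]:
                out.add(a + (b[-1],))
            else:
                break  # sorted order: no later itemset shares this prefix
    return out


def prune(candidates, prev, k):
    """Prune step: keep a candidate only if all of its k-subsets are in `prev`."""
    prev = set(prev)
    return {c for c in candidates if all(s in prev for s in combinations(c, k))}


def gen_candidates(prev, k, do_prune=True):
    """Generate (k+1)-candidates from k-itemsets, optionally skipping the prune step."""
    cands = join(prev, k)
    return prune(cands, prev, k) if do_prune else cands


def combined_candidates(frequent_k, k, passes, do_prune=True):
    """Generate candidates for `passes` consecutive levels without counting in between.

    Levels after the first are generated from candidates rather than from
    frequent itemsets, so they may contain extra itemsets.
    """
    levels = {}
    current = set(frequent_k)
    for step in range(1, passes + 1):
        current = gen_candidates(current, k + step - 1, do_prune)
        if not current:
            break
        levels[k + step] = current
    return levels
```

With pruning skipped, the candidate sets are supersets of the pruned ones, so more candidates have to be counted, but the subset checks are avoided. This is the trade-off the abstract above describes.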


Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster

  Designing fast and scalable algorithms for mining frequent itemsets has always been one of the most prominent and promising problems in data mining. Apriori is one of the most popular and widely used algorithms for frequent itemset mining. Designing efficient algorithms on the MapReduce framework to process and analyze big datasets is an active research topic. In this paper, we focus on the performance of MapReduce-based Apriori on homogeneous as well as heterogeneous Hadoop clusters. We investigate a number of factors that significantly affect the execution time of MapReduce-based Apriori running on homogeneous and heterogeneous Hadoop clusters. The factors relate to both algorithmic and non-algorithmic improvements. The algorithmic factors considered are filtered transactions and data structures. Experimental results show how an appropriate data structure and the filtered transactions technique drastically reduce the execution time. The non-algorithmic factors include speculative execution, nodes with poor performance, data locality and distribution of data blocks, and parallelism control through the input split size. We apply strategies against these factors and fine-tune the relevant parameters for our particular application. Experimental results show that when cluster-specific parameters are taken care of, there is a significant reduction in execution time. We also discuss issues in the MapReduce implementation of Apriori that may significantly influence performance.

Full Text: PrePrint-Version, Publisher-Version


Mining Association Rules in Various Computing Environments: A Survey

  Association Rule Mining (ARM) is one of the best-known and most researched techniques of data mining. So many ARM algorithms have been designed that they are hard to count. In this paper, we survey the various ARM algorithms in four computing environments: sequential computing, parallel and distributed computing, grid computing and cloud computing. With the emergence of each new computing paradigm, researchers have designed ARM algorithms that use the new paradigm to improve efficiency. This paper traces the journey of ARM algorithms from sequential algorithms, through parallel and distributed algorithms and grid-based algorithms, to the current state of the art, along with the motives for adopting new machinery.

Full Text: PrePrint-Version, Publisher-Version


Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

  Mining frequent itemsets from massive datasets has always been one of the most important problems of data mining. Apriori is the most popular and simplest algorithm for frequent itemset mining. To enhance the efficiency and scalability of Apriori, a number of algorithms have been proposed that address the design of efficient data structures, minimizing database scans, and parallel and distributed processing. MapReduce is the emerging parallel and distributed technology for processing big datasets on Hadoop clusters. To mine big datasets, it is essential to re-design data mining algorithms on this new paradigm. In this paper, we implement three variations of the Apriori algorithm on the MapReduce paradigm using the data structures hash tree, trie and hash table trie, i.e. a trie with a hash technique. We emphasize and investigate the significance of these three data structures for the Apriori algorithm on a Hadoop cluster, which has not yet received attention. Experiments are carried out on both real-life and synthetic datasets, and they show that the hash table trie data structure performs far better than trie and hash tree in terms of execution time. Moreover, hash tree gives the worst performance.

Full Text: Publisher-Version


Review of Apriori Based Algorithms on MapReduce Framework

  The Apriori algorithm, which mines frequent itemsets, is one of the most popular and widely used data mining algorithms. Nowadays, many algorithms have been proposed on parallel and distributed platforms to enhance the performance of the Apriori algorithm. They differ from each other in the load balancing technique, memory system, data decomposition technique and data layout used to implement them. The problems with most distributed frameworks are the overhead of managing a distributed system and the lack of a high-level parallel programming language. Also, with grid computing there is always a chance of node failures, which cause multiple re-executions of tasks. These problems can be overcome by the MapReduce framework introduced by Google. MapReduce is an efficient, scalable and simplified programming model for large-scale distributed data processing on a large cluster of commodity computers, and it is also used in cloud computing. In this paper, we present an overview of parallel Apriori algorithms implemented on the MapReduce framework. They are categorized on the basis of the Map and Reduce functions used to implement them, e.g. 1-phase vs. k-phase, I/O of Mapper, Combiner and Reducer, using the functionality of the Combiner inside the Mapper, etc. This survey discusses and analyzes the various implementations of Apriori on the MapReduce framework on the basis of their distinguishing characteristics. Moreover, it also includes the advantages and limitations of the MapReduce framework.

Full Text: Publisher-Version
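
The Mapper, Combiner and Reducer roles mentioned in the abstract above can be pictured with a small in-memory simulation in Python. This is only a sketch of one counting pass for itemsets of size k. It does not use Hadoop and does not correspond to any specific implementation covered by the survey. For simplicity the mapper emits every k-subset of a transaction instead of checking a candidate set. The function names mapper, combiner, reducer and run_job are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mapper(split, k):
    """Emit (itemset, 1) for every k-subset of each transaction in one input split."""
    for transaction in split:
        for itemset in combinations(sorted(set(transaction)), k):
            yield itemset, 1


def combiner(pairs):
    """Sum the counts locally, per split, before anything is sent to the reducer."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reducer(pairs, min_sup):
    """Sum the partial counts and keep itemsets that meet the minimum support."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return {key: count for key, count in totals.items() if count >= min_sup}


def run_job(splits, k, min_sup):
    """Simulate one job: map and combine each split, then reduce everything."""
    shuffled = []
    for split in splits:
        shuffled.extend(combiner(mapper(split, k)))
    return reducer(shuffled, min_sup)
```

A k-phase style of implementation would run one such job per itemset size, while combining the local summation with the map step corresponds to using the Combiner's functionality inside the Mapper.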


Big Data: Technologies, Trends and Applications

  Big Data is an excessive amount of imprecise data in a variety of formats, generated from a variety of sources at rapid speed. It is one of the most talked-about terms among researchers, industry and academia. Big Data is not limited to the data perspective; it has emerged as a stream that includes associated technologies, tools and real-world applications. The objective of this paper is to provide a simple, comprehensive and brief introduction to Big Data for beginners in the subject. In this paper, we provide an overview of Hadoop and its subprojects and a brief review of various technologies developed for Big Data. We also discuss some recent trends and prominent applications in Big Data. Although this paper does not touch every dimension of Big Data, which is not possible in a single paper, it covers the essential aspects, which may benefit people new to the Big Data world.

Full Text: Publisher-Version


A Comparative Study of Association Rule Mining Algorithms on Grid and Cloud Platform

  Association rule mining is a time-consuming process because it is both data-intensive and computation-intensive. In order to mine large volumes of data and to enhance the scalability and performance of existing sequential association rule mining algorithms, parallel and distributed algorithms have been developed. These traditional parallel and distributed algorithms are based on homogeneous platforms and are not well suited to heterogeneous platforms such as grid and cloud. This requires the design of new algorithms that address the issues of good data set partitioning and distribution, load balancing strategy, and optimization of communication and synchronization among processors in such heterogeneous systems. Grid and cloud are the emerging platforms for distributed data processing, and various association rule mining algorithms have been proposed on such platforms. This survey article brings together a brief architectural view of distributed systems and various recent approaches to grid-based and cloud-based association rule mining algorithms, with a comparative perspective. We differentiate between the approaches of association rule mining algorithms developed on these architectures on the basis of data locality, programming paradigm, fault tolerance, communication cost, and partition and distribution of data sets. Although it does not cover all algorithms, it can be very useful for new researchers working on distributed association rule mining algorithms.

Full Text: PrePrint-Version, Publisher-Version