Big Data / Hadoop

Big Data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.

  • Large in quantity
  • Captured at a rapid rate
  • Both structured and unstructured
  • Of a scale, diversity, and complexity that demand new architectures, techniques, and algorithms


These characteristics are commonly summarized as the three Vs:

  • Volume: the sheer quantity of data
  • Variety: structured and unstructured forms
  • Velocity: the rate at which data is captured

Apache Hadoop - The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
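The "simple programming model" referred to here is MapReduce: you write a map function and a reduce function, and the framework handles data distribution, scheduling, and fault tolerance across the cluster. Below is a minimal sketch of the classic word-count job using Hadoop's org.apache.hadoop.mapreduce Java API; the HDFS input and output paths are assumed to arrive as command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in an input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sums the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Run it with "hadoop jar wordcount.jar WordCount /input /output": the framework splits the input across the cluster, runs the mapper on each split, shuffles the intermediate (word, count) pairs by key, and hands each key's values to a reducer.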

The project includes the following core modules, along with related Apache projects that build on them:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data (see the client-API sketch after this list).
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying (see the JDBC sketch after this list).
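
Applications talk to HDFS through the org.apache.hadoop.fs.FileSystem API. Here is a minimal sketch, assuming a running cluster whose NameNode address is configured via fs.defaultFS in core-site.xml; the path /user/demo/hello.txt is hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml; assumes a running HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write a file; HDFS replicates its blocks across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same FileSystem API.
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
          System.out.println(in.readLine());
        }
      }
    }

HDFS splits files into large blocks and replicates each block across DataNodes, which is what gives it high throughput and tolerance of machine failures.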
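Hive exposes its ad hoc querying through HiveQL, an SQL-like language, and one common access path is the HiveServer2 JDBC driver. A minimal sketch follows, assuming HiveServer2 is listening on localhost:10000 and that a table named words exists; both are assumptions for illustration only.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        // Hive's JDBC driver; assumes HiveServer2 is reachable on port 10000.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
          // HiveQL looks like SQL but is compiled into distributed jobs
          // that scan data stored in HDFS.
          ResultSet rs = stmt.executeQuery(
              "SELECT word, COUNT(*) AS freq FROM words GROUP BY word");
          while (rs.next()) {
            System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
          }
        }
      }
    }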