Big Data Hadoop

Introduction
About This Course:
Data sets that grow rapidly can quickly become unmanageable. This course provides the knowledge to use new Big Data tools and ways of storing information that allow efficient processing and analysis for informed business decision-making. You also learn to store, manage, process and analyze massive amounts of unstructured data.
You Will Learn How To
  • Unleash the power of Big Data for competitive advantage
  • Select and apply the correct Big Data stores for your disparate data sets
  • Leverage Hadoop to process large data sets to benefit business and technical decisions
  • Apply sophisticated analytic techniques and tools to process and analyze Big Data
  • Evaluate and select appropriate vendor products as a part of a Big Data implementation plan
Hands-On Training
  • Creating an interactive Hadoop MapReduce job flow
  • Querying Hadoop MapReduce jobs using Hive
  • Loading unstructured data into Hadoop Distributed File System (HDFS)
  • Simplifying Big Data processing and communicating with Pig Latin
  • Creating and customizing applications to analyze data
  • Implementing a targeted Big Data strategy
Big Data Contents:
Introduction to Big Data (1.5 Hrs)
Defining Big Data
  • The four dimensions of Big Data: volume, velocity, variety, veracity
  • Introducing the Storage, MapReduce and Query Stack
Delivering business benefit from Big Data
  • Establishing the business importance of Big Data
  • Addressing the challenge of extracting useful data
  • Integrating Big Data with traditional data
Storing Big Data (4.5 Hrs)
Analyzing your data characteristics
  • Selecting data sources for analysis
  • Eliminating redundant data
  • Establishing the role of NoSQL
Overview of Big Data stores
  • Data models: key value, graph, document, column-family
  • Hadoop Distributed File System
  • HBase
  • Hive
  • Cassandra
  • Hypertable
  • Amazon S3
  • BigTable
  • DynamoDB
  • MongoDB
  • Redis
  • Riak
  • Neo4J
Selecting Big Data stores
  • Choosing the correct data stores based on your data characteristics
  • Moving code to data
  • Implementing polyglot data store solutions
  • Aligning business goals to the appropriate data store
Processing Big Data (3.5 Hrs)
Integrating disparate data stores
  • Mapping data to the programming framework
  • Connecting and extracting data from storage
  • Transforming data for processing
  • Subdividing data in preparation for Hadoop MapReduce
Employing Hadoop MapReduce
  • Creating the components of Hadoop MapReduce jobs
  • Distributing data processing across server farms
  • Executing Hadoop MapReduce jobs
  • Monitoring the progress of job flows
The Building Blocks of Hadoop MapReduce
  • Distinguishing Hadoop daemons
  • Investigating the Hadoop Distributed File System
  • Selecting appropriate execution modes: local, pseudo-distributed, fully distributed
Tools and Techniques to Analyze Big Data (3 Hrs)
Abstracting Hadoop MapReduce jobs with Pig
  • Communicating with Hadoop in Pig Latin
  • Executing commands using the Grunt Shell
  • Streamlining high-level processing
Performing ad-hoc Big Data querying with Hive
  • Persisting data in the Hive MetaStore
  • Performing queries with HiveQL
  • Investigating Hive file formats
Creating business value from extracted data
  • Mining data with Mahout
  • Visualizing processed results with reporting tools
Developing a Big Data Strategy (3 Hrs)
Defining a Big Data strategy for your organization
  • Establishing your Big Data needs
  • Meeting business goals with timely data
  • Evaluating commercial Big Data tools
  • Managing organizational expectations
Enabling analytic innovation
  • Focusing on business importance
  • Framing the problem
  • Selecting the correct tools
  • Achieving timely results
Statistical analysis of Big Data
  • Leveraging RHadoop functionality
  • Generating statistical reports with RHadoop
  • Exploiting RHadoop visualization
  • Making use of analytical results
Implementing a Big Data Solution (1.5 Hrs)
  • Selecting suitable vendors and hosting options
  • Balancing costs against business value
  • Keeping ahead of the curve

Who should attend?
Programmers, architects, administrators and data analysts who want a foundational overview of the key components required to effectively analyze Big Data. Familiarity working with computers and business applications is assumed. Programming experience is beneficial but not required.

More Info about the course

What is this course about?

This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees learn to store, manage, process and analyze massive amounts of unstructured data for competitive advantage, to select and implement the correct Big Data stores, and to apply sophisticated analytic techniques and tools to process and analyze Big Data. They also learn to leverage Hadoop to mine large data sets to inform business and technical decision-making, and to evaluate and select appropriate vendor products as part of a Big Data implementation plan for their organization.

Who will benefit from this course?

Anyone seeking to exploit the benefits of Big Data technologies. The course provides an overview of how to plan and implement a Big Data solution and the various technologies that comprise Big Data. Many examples and exercises involving Big Data systems are provided throughout the course. The programming examples are in Java, but the primary focus is on best practices that can be applied to any supported programming language.
Attendees with a technical background will gain an understanding of the inner workings of a Big Data solution and how to implement it in their workplace. Management attendees will gain an understanding of where Big Data can be used to benefit their businesses.

What background do I need?

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

I am from a non-technical background. Will I benefit from the course?

Yes! The course presents both the business and technical benefits of Big Data. The technical discussions are at a level that attendees with a business background can understand and apply. Where technical knowledge is required, sufficient guidance for all backgrounds is provided to enable activities to be completed and the learning objectives achieved.

What is Big Data?

Big Data is a term used to define data sets that have the potential to rapidly grow so large that they become unmanageable. The Big Data movement includes new tools and ways of storing information that allow efficient processing and analysis for informed business decision-making.

What is MapReduce?

MapReduce is a parallel programming model that allows distributed processing of large data sets on a cluster of computers. MapReduce was originally implemented by Google as part of its web search and indexing infrastructure. It has since grown in popularity and been adopted across many industries.
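To make the model concrete, below is a minimal word-count example in Java, the classic introductory MapReduce program. It is an illustrative sketch rather than course material; the class names and tokenization are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit a (word, 1) pair for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum the counts collected for each word across all mappers.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The framework takes care of splitting the input, shuffling the mapper output so that all values for a key reach the same reducer, and writing the final results.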

What is Hadoop?

Hadoop is an open source implementation of MapReduce from the Apache Software Foundation. It is a high-performance distributed storage and processing system. Hadoop fills a gap in the market by cost-effectively storing and providing computational capabilities over substantial amounts of data. Commercial support is available from multiple vendors, along with prepackaged cloud solutions.

Which Big Data products and tools does this course use?

The course provides hands-on exposure to a number of Big Data products including Redis, MongoDB, Cassandra, Neo4J, Hadoop/MapReduce, Pig, Hive, RHadoop, Mahout. Other data stores are also discussed during the course.

Will there be any programming in the course?

While programming experience is not required to attend the course, we discuss programming examples so attendees gain practical experience working with Big Data solutions. The exercises are structured so that all experience levels are challenged. Attendees without programming experience are given guided instructions to complete the programming exercises, while those with programming experience are challenged with bonus activities that showcase specific Big Data capabilities.
Hadoop Development for Big Data Solutions
About This Course:
The availability of large data sets presents new opportunities and challenges to organizations of all sizes. This course provides the hands-on programming skills to develop solutions that run on the Apache Hadoop platform to efficiently process a variety of Big Data. Additionally, you learn to test and deploy Big Data solutions on commodity clusters.
You Will Learn How To:
  • Implement Hadoop jobs to extract business value from large and varied data sets
  • Write, customize and deploy MapReduce jobs to summarize data
  • Load and retrieve unstructured data from HDFS and HBase
  • Develop Hive and Pig queries to simplify data analysis
  • Test and debug jobs using MRUnit
  • Monitor task execution and cluster health
Hands-On Exercises:
  • Developing efficient parallel algorithms
  • Analyzing log files and developing multi-stage Java MapReduce jobs
  • Developing custom combiners for more efficient processing
  • Loading and retrieving data from Hadoop Distributed File System and HBase
  • Analyzing data with HiveQL queries and Pig Latin scripts
  • Validating requirements with MRUnit

Course Contents
Introduction to Hadoop (1.75 Hrs)
  • Identifying the business benefits of Hadoop
  • Surveying the Hadoop ecosystem
  • Selecting a suitable distribution
Parallelizing Program Execution (2.5 Hrs)
Meeting the challenges of parallel programming
  • Investigating parallelizable challenges: algorithms, data and information exchange
  • Estimating the storage and complexity of Big Data
Parallel programming with MapReduce
  • Dividing and conquering large-scale problems
  • Uncovering jobs suitable for MapReduce
  • Solving typical business problems
Implementing Real-World MapReduce Jobs (3.75 Hrs)
Applying the Hadoop MapReduce paradigm
  • Configuring the development environment
  • Exploring the Hadoop distribution
  • Creating the components of MapReduce jobs (a driver sketch follows this list)
  • Introducing the Hadoop daemons
  • Analyzing the stages of MapReduce processing: splitting, mapping, shuffling and reducing
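To tie these components together, here is a minimal Java driver sketch that configures and submits a MapReduce job. It assumes a word-count mapper and reducer like the sketch earlier on this page, and the input/output paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");           // job name shown in the cluster UI
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);                // map stage
        job.setCombinerClass(WordCountReducer.class);             // optional combiner to cut shuffle traffic
        job.setReducerClass(WordCountReducer.class);              // reduce stage

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory (must not already exist)

        // Submit the job and wait; splitting, shuffling and sorting happen inside the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}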
Building complex MapReduce jobs
  • Selecting and employing multiple mappers and reducers
  • Leveraging built-in mappers, reducers and partitioners
  • Coordinating jobs with Oozie workflow scheduler
  • Streaming tasks through various programming languages
Customizing MapReduce (4 Hrs)
Solving common data manipulation problems
  • Executing algorithms: parallel sorts, joins and searches
  • Analyzing log files, social media data and emails
Implementing partitioners and combiners
  • Identifying network-bound, CPU-bound and disk I/O-bound parallel algorithms
  • Reducing network traffic with combiners
  • Dividing the workload efficiently using partitioners (see the sketch after this list)
  • Collecting metrics with counters
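As a sketch of the custom partitioner idea, the class below routes each key to a reducer based on its first character so related keys end up in the same output partition; the class name and routing rule are assumptions for illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Custom partitioner: choose the reducer from the key's first character.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0 || numPartitions == 1) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return first % numPartitions;   // char promotes to a non-negative int, so this is a valid partition index
    }
}

In the driver, it would be registered with job.setPartitionerClass(FirstLetterPartitioner.class) and paired with an appropriate job.setNumReduceTasks(...) setting.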
Persisting Big Data with Distributed Data Stores (4.25 Hrs)
Making the case for distributed data
  • Achieving high performance data throughput
  • Recovering from media failure through redundancy
Interfacing with Hadoop Distributed File System (HDFS)
  • Breaking down the structure and organization of HDFS
  • Loading raw data and retrieving results
  • Reading and writing data programmatically (a minimal sketch follows this list)
  • Partitioning text or binary data
  • Manipulating Hadoop SequenceFile types
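The following is a minimal sketch of programmatic HDFS access through the FileSystem API; the path, file name and contents are illustrative assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml and hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");   // illustrative HDFS path

        // Write a small text file into HDFS, overwriting any existing copy.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back line by line.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}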
Structuring data with HBase
  • Migrating from structured to unstructured storage
  • Applying NoSQL concepts with schema on read
  • Transferring relational data to HBase with Sqoop
  • Comparing HBase to other types of NoSQL data stores
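To give a flavour of the HBase client API behind these topics, here is a minimal sketch that writes and reads a single cell; the table name, column family, qualifier and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("web_logs"))) {   // hypothetical table

            // Write one cell: row key "row-001", column family "d", qualifier "url".
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("url"));
            System.out.println(Bytes.toString(value));
        }
    }
}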
Simplifying Data Analysis with Query Languages (4 Hrs)
Unleashing the power of SQL with Hive and Impala
  • Structuring data with the Hive MetaStore
  • Extracting, Transforming and Loading (ETL) data
  • Querying with HiveQL
  • Extending HiveQL with User-Defined Functions (UDF)
  • Performing real-time queries with Impala
Executing workflows with Pig
  • Developing Pig Latin scripts to consolidate workflows
  • Integrating Pig queries with Java
  • Interacting with data through the grunt console
  • Extending Pig Latin with User-Defined Functions (UDF)
Managing and Deploying Big Data Solutions (2.75 Hrs)
Testing and debugging Hadoop code
  • Logging significant events for auditing and debugging
  • Debugging in local mode
  • Validating requirements with MRUnit
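As a minimal MRUnit sketch, the test below runs a word-count style mapper (like the one sketched earlier on this page, emitting a (word, 1) pair per token) entirely in memory and checks its output; the package names follow MRUnit 1.x and should be verified against the version in use.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Wrap the mapper under test; no cluster or HDFS is needed.
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void mapperEmitsOneCountPerToken() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data big"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .runTest();   // fails the test if the actual output differs
    }
}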
Deploying, monitoring and tuning performance
  • Deploying to a production cluster
  • Optimizing performance with administrative tools
  • Monitoring cluster health with Hadoop User Experience (HUE) and Ganglia
Who should attend?
This course is for developers, architects and testers who desire hands-on experience writing code for Hadoop. It can be helpful to technical managers interested in the development process.

More info about the Course
What is this course about?
The availability of large data sets presents new opportunities and challenges to organizations of all sizes. This course provides the hands-on programming skills to leverage the Apache Hadoop platform to efficiently process a variety of Big Data. Additionally, you learn to test and deploy Big Data solutions on commodity clusters. This course also covers Pig, Hive, HBase and other components of the Hadoop ecosystem. Further, this course teaches testing, deployment and best practices to architect and develop a complete Big Data solution.
Who will benefit from this course?
This course is for developers, architects and testers who desire hands-on experience writing code for Hadoop. It can also be helpful to technical managers interested in the development process.
What background do I need?
You should have Java experience at the level of Course 471, Java Programming Introduction: Hands-On or equivalent experience. Exposure to SQL is helpful.
Will there be any programming in the course?
Yes! Approximately 40 percent of the course time is devoted to hands-on programming.
What tools and platforms are used?
The platform is Java running on Red Hat Linux. The tools used include Eclipse and various text editors.
Which Big Data products does this course use?
The course covers a number of Big Data products including Apache Hadoop, MapReduce, Hadoop Distributed File System (HDFS), HBase, Hive, and Pig. Additional parts of the Hadoop ecosystem will be covered such as Sqoop, Oozie, Impala and MRUnit. Other datastores will be mentioned for comparison.

What is Big Data?
Big Data is a term used to define data sets that have the potential to rapidly grow so large that they become unmanageable. The Big Data movement includes new tools and ways of storing information that allow efficient processing and analysis for informed business decision-making.
What is Hadoop?
Hadoop is an open source implementation of MapReduce from the Apache Software Foundation, and the most widely used platform for processing large, complex data sets that would otherwise be intractable with conventional means. It is a high-performance distributed storage and processing system. Hadoop fills a gap in the market by cost-effectively storing and providing computational capabilities over substantial amounts of data. Commercial support is available from multiple vendors, along with prepackaged cloud solutions.
What is MapReduce?
MapReduce is a parallel programming model that allows distributed processing of large data sets on a cluster of computers. MapReduce was originally implemented by Google as part of its web search and indexing infrastructure. It has since grown in popularity and been adopted across many industries.
How are Hadoop programs developed?
Programs are primarily written in Java, although Hadoop has facilities to handle programs written in other languages and platforms such as C++, Python, and .NET. Programs can also be written in higher-level scripting languages such as Pig Latin, and data in HDFS can be queried using a SQL-like syntax with Hive.
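For example, a Java program can issue HiveQL through the standard JDBC interface via HiveServer2. The sketch below is illustrative only: the connection URL, credentials, table and columns are assumptions, and the hive-jdbc driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Older driver versions may require: Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";   // illustrative HiveServer2 endpoint
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL reads like SQL; the table and columns here are hypothetical.
            String hql = "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10";
            try (ResultSet rs = stmt.executeQuery(hql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}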
What are the advantages of using Hadoop?
  • Hadoop provides the ability to process and analyze more data than was previously possible at a lower cost
  • It runs on scalable commodity clusters
  • It has self-healing capabilities to survive hardware failures
  • It operates on various types of data and adapts to meet varying degrees of structure
  • HDFS automatically provides robustness and redundancy for performance and reliability
  • There are many associated projects that enhance the Hadoop ecosystem and ease development



