Big Data Hadoop
Introduction
Data sets that have the potential to grow rapidly can quickly become unmanageable.
This course provides the knowledge to use new Big Data tools and learn ways of
storing information that allow for efficient processing and analysis for
informed business decision-making. Further, you learn to store, manage, process
and analyze massive amounts of unstructured data.
You Will Learn How To
- Unleash the power of Big Data for competitive
advantage
- Select and apply the correct Big Data stores
for your disparate data sets
- Leverage Hadoop to process large data sets to
benefit business and technical decisions
- Apply sophisticated analytic techniques and
tools to process and analyze Big Data
- Evaluate and select appropriate vendor
products as a part of a Big Data implementation plan
Hands-On Training
- Creating an interactive Hadoop MapReduce job
flow
- Querying Hadoop MapReduce jobs using Hive
- Loading unstructured data into Hadoop
Distributed File System (HDFS)
- Simplifying Big Data processing and
communicating with Pig Latin
- Creating and customizing applications to
analyze data
- Implementing a targeted Big Data strategy
Big Data Contents:
Introduction to Big Data (1.5 Hrs)
Defining Big Data
- The four dimensions of Big Data: volume,
velocity, variety, veracity
- Introducing the Storage, MapReduce and Query
Stack
Delivering business benefit from Big Data
- Establishing the business importance of Big
Data
- Addressing the challenge of extracting useful
data
- Integrating Big Data with traditional data
Storing Big Data (4.5 Hrs)
Analyzing your data characteristics
- Selecting data sources for analysis
- Eliminating redundant data
- Establishing the role of NoSQL
Overview of Big Data stores
- Data models: key value, graph, document,
column-family
- Hadoop Distributed File System
- HBase
- Hive
- Cassandra
- Hypertable
- Amazon S3
- BigTable
- DynamoDB
- MongoDB
- Redis
- Riak
- Neo4J
Selecting Big Data stores
- Choosing the correct data stores based on your
data characteristics
- Moving code to data
- Implementing polyglot data store solutions
- Aligning business goals to the appropriate
data store
Processing Big Data (3.5 Hrs)
Integrating disparate data stores
- Mapping data to the programming framework
- Connecting and extracting data from storage
- Transforming data for processing
- Subdividing data in preparation for Hadoop
MapReduce
Employing Hadoop MapReduce
- Creating the components of Hadoop MapReduce
jobs
- Distributing data processing across server
farms
- Executing Hadoop MapReduce jobs
- Monitoring the progress of job flows
The Building Blocks of Hadoop MapReduce
- Distinguishing Hadoop daemons
- Investigating the Hadoop Distributed File
System
- Selecting appropriate execution modes: local,
pseudo-distributed, fully distributed (a configuration sketch follows)
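To make the execution-mode item above concrete, here is a minimal Java sketch. The property names are standard Hadoop configuration keys, but the host and port are placeholders, and in practice these settings usually live in core-site.xml and mapred-site.xml rather than in code:

import org.apache.hadoop.conf.Configuration;

// A minimal sketch: execution mode is governed by configuration.
// Local mode (the default) uses file:/// and the local job runner;
// pseudo-distributed mode points at a single-node HDFS and YARN.
public class ExecutionModes {
    public static Configuration pseudoDistributed() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder host:port
        conf.set("mapreduce.framework.name", "yarn");
        return conf;
    }
}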
Tools and Techniques to Analyze Big Data (3 Hrs)
Abstracting Hadoop MapReduce jobs with Pig
- Communicating with Hadoop in Pig Latin
- Executing commands using the Grunt Shell
- Streamlining high-level processing
Performing ad-hoc Big Data querying with Hive
- Persisting data in the Hive MetaStore
- Performing queries with HiveQL
- Investigating Hive file formats
Creating business value from extracted data
- Mining data with Mahout
- Visualizing processed results with reporting
tools
Developing a Big Data Strategy (3 Hrs)
Defining a Big Data strategy for your organization
- Establishing your Big Data needs
- Meeting business goals with timely data
- Evaluating commercial Big Data tools
- Managing organizational expectations
Enabling analytic innovation
- Focusing on business importance
- Framing the problem
- Selecting the correct tools
- Achieving
timely results
Statistical analysis of Big Data
- Leveraging RHadoop functionality
- Generating statistical reports with RHadoop
- Exploiting RHadoop visualization
- Making use of analytical results
Implementing a Big Data Solution (1.5 Hrs)
- Selecting suitable vendors and hosting options
- Balancing costs against business value
- Keeping ahead of the curve
Who should attend?
Programmers, architects, administrators and data analysts who want a
foundational overview of the key components required to effectively analyze Big
Data. Familiarity with computers and business applications is assumed.
Programming experience is beneficial but not required.
More Info About the Course
What is this course about?
This course is an
overview of Big Data tools and technologies. It establishes a strong working
knowledge of the concepts, techniques, and products associated with Big Data.
Attendees learn to store, manage, process and analyze massive amounts of unstructured
data for competitive advantage, select and implement the correct Big Data
stores and apply sophisticated analytic techniques and tools to process and
analyze big data. They also leverage Hadoop to mine large data sets to inform
and benefit business and technical decision making, evaluate and select
appropriate vendor products as part of a Big Data implementation plan for their
organization.
Who will benefit from this course?
Anyone seeking to
exploit the benefits of Big Data technologies. The course provides an overview
of how to plan and implement a Big Data solution and the various technologies
that comprise Big Data. Many examples and exercises of Big Data systems are
provided throughout the course. The programming examples are in Java but the primary
focus is on best practices that can be applied to any supported programming
language. Attendees with a technical background will gain an understanding of the inner workings of a Big Data solution and how to implement it in their workplace. Management attendees will gain an understanding of where Big Data can be used to benefit their businesses.
What background do I need?
You should have a
working knowledge of the Microsoft Windows platform. A basic knowledge of
programming is helpful but not required.
I am from a non-technical background. Will I benefit from the course?
Yes! The course
presents both the business and technical benefits of Big Data. The technical
discussions are at a level that attendees with a business background can
understand and apply. Where technical knowledge is required, sufficient
guidance for all backgrounds is provided to enable activities to be completed
and the learning objectives achieved.
What is Big Data?
Big Data is a term
used to define data sets that have the potential to rapidly grow so large that
they become unmanageable. The Big Data movement includes new tools and ways of
storing information that allow efficient processing and analysis for informed
business decision-making.
What is MapReduce?
MapReduce is a parallel programming model that enables distributed processing
of large data sets on a cluster of computers. MapReduce was originally
implemented by Google as part of its searching and indexing of the Internet. It
has since grown in popularity and is being adopted quickly across most industries.
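To make the model concrete, the classic word-count job can be sketched in Java, the course's example language; the class names here are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in the input split.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts gathered for each word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The map phase runs in parallel across input splits, and the framework groups all values for a key before the reduce phase sums them.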
What is Hadoop?
Hadoop is an open
source implementation of MapReduce by the Apache group. It is a high
performance distributed storage and processing system. Hadoop fills the gap in
the market by effectively storing and providing computational capabilities for
substantial amounts of data. There is commercial support from multiple vendors
and prepackaged cloud solutions.
Which Big Data products and tools does this course use?
The course provides
hands-on exposure to a number of Big Data products including Redis, MongoDB,
Cassandra, Neo4J, Hadoop/MapReduce, Pig, Hive, RHadoop and Mahout. Other data
stores are also discussed during the course.
Will there be any programming in the course?
While programming
experience is not required to attend the course, we will discuss programming
examples to enable attendees to gain practical experience working with Big Data
solutions. The exercises are structured in such a way that all experience
levels will be challenged. For those attendees that do not have any programming
experience, the exercises include guided instructions to enable them to
complete the programming exercises. Those that have programming experience are
challenged with bonus activities to help showcase specific Big Data capabilities.
Hadoop Development for Big Data Solutions
About This Course: The availability of large data sets
presents new opportunities and challenges to organizations of all sizes. This
course provides the hands-on programming skills to develop solutions that run
on the Apache Hadoop platform to efficiently process a variety of Big Data.
Additionally, you learn to test and deploy Big Data solutions on commodity
clusters.
You Will Learn How To:
- Implement Hadoop jobs to extract business
value from large and varied data sets
- Write, customize and deploy MapReduce jobs to
summarize data
- Load and retrieve unstructured data from HDFS
and HBase
- Develop Hive and Pig queries to simplify data
analysis
- Test and debug jobs using MRUnit
- Monitor task execution and cluster health
Hands-On Exercises:
- Developing efficient parallel algorithms
- Analyzing log files and developing multi-stage
Java MapReduce jobs
- Developing custom combiners for more efficient
processing
- Loading and retrieving data from Hadoop
Distributed File System and HBase
- Analyzing data with HiveQL queries and Pig
Latin scripts
- Validating requirements with MRUnit
Course Contents
Introduction to Hadoop (1.75 Hrs)
- Identifying the business benefits of Hadoop
- Surveying the Hadoop ecosystem
- Selecting a suitable distribution
Parallelizing Program Execution (2.5 Hrs)
Meeting the challenges of parallel programming
- Investigating parallelizable challenges:
algorithms, data and information exchange
- Estimating the storage and complexity of Big
Data
Parallel programming with MapReduce
- Dividing and conquering large-scale problems
- Uncovering jobs suitable for MapReduce
- Solving typical business problems
Implementing Real-World MapReduce Jobs (3.75 Hrs)
Applying the Hadoop MapReduce paradigm
- Configuring the development environment
- Exploring the Hadoop distribution
- Creating the components of MapReduce jobs
- Introducing the Hadoop daemons
- Analyzing the stages of MapReduce processing:
splitting, mapping, shuffling and reducing
Building complex MapReduce jobs
- Selecting and employing multiple mappers and
reducers
- Leveraging built-in mappers, reducers and
partitioners
- Coordinating jobs with Oozie workflow
scheduler
- Streaming tasks through various programming
languages
Customizing MapReduce (4 Hrs)
Solving common data manipulation problems
- Executing algorithms: parallel sorts, joins
and searches
- Analyzing log files, social media data and
emails
Implementing partitioners and combiners
- Identifying network bound, CPU bound, and disk
I/O bound parallel algorithms
- Reducing network traffic with combiners
- Dividing the workload efficiently using
partitioners (see the sketch after this list)
- Collecting metrics with counters
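As a rough illustration of the partitioner and combiner items above, the following minimal Java sketch assumes Text keys and IntWritable counts:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer by hashing, so the workload is divided
// evenly; a combiner (often the reducer class itself) can be set with
// job.setCombinerClass(...) to cut network traffic before the shuffle.
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A job would adopt it with job.setPartitionerClass(HashKeyPartitioner.class); the behavior shown mirrors Hadoop's default hash partitioner.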
Persisting Big Data with Distributed Data Stores (4.25 Hrs)
Making the case for distributed data
- Achieving high performance data throughput
- Recovering from media failure through
redundancy
Interfacing with Hadoop Distributed File System (HDFS)
- Breaking down the structure and organization
of HDFS
- Loading raw data and retrieving results
- Reading and writing data programmatically (sketched after this list)
- Partitioning text or binary data
- Manipulating Hadoop SequenceFile types
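As a sketch of programmatic HDFS access, the following Java fragment writes a file and reads it back; the path is a placeholder:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writes a line to HDFS and reads it back.
public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt"); // placeholder path

        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello, hdfs\n");
        }
        try (BufferedReader in =
                new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}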
Structuring data with HBase (a basic read/write sketch follows this list)
- Migrating from structured to unstructured
storage
- Applying NoSQL concepts with schema on read
- Transferring relational data to HBase with
Sqoop
- Comparing HBase to other types of NoSQL data stores
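A minimal HBase read/write in Java might look like the following sketch; the table, column family and qualifier names are placeholders and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Stores one cell and retrieves it again.
public class HBaseRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Ada"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("row1"));
            byte[] value = table.get(get)
                    .getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}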
Simplifying Data Analysis with Query Languages (4 Hrs)
Unleashing the power of SQL with Hive and Impala
- Structuring data with the Hive MetaStore
- Extracting, Transforming and Loading (ETL)
data
- Querying with HiveQL
- Extending HiveQL with User-Defined Functions
(UDF), as sketched after this list
- Performing real-time queries with Impala
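As a sketch of the UDF item above, a simple Hive UDF is an ordinary Java class; the class name here is illustrative:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial UDF that lower-cases a string; after packaging it in a jar,
// it is registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION.
public class LowerCaseUdf extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}

Once registered, it is invoked in HiveQL like any built-in function.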
Executing workflows with Pig
- Developing Pig Latin scripts to consolidate
workflows
- Integrating Pig queries with Java (sketched after this list)
- Interacting with data through the Grunt
console
- Extending Pig with User-Defined Functions
(UDF)
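As a sketch of driving Pig from Java, the classic PigServer API can run a small workflow; the input and output paths and the log schema are assumptions:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Runs a two-step Pig Latin workflow from Java in local mode.
public class PigFromJava {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery(
            "logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray);");
        pig.registerQuery("grouped = GROUP logs BY ip;");
        pig.registerQuery(
            "counts = FOREACH grouped GENERATE group, COUNT(logs);");
        pig.store("counts", "hits_by_ip"); // placeholder output directory
    }
}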
Managing and Deploying Big Data Solutions (2.75 Hrs)
Testing and debugging Hadoop code
- Logging significant events for auditing and
debugging
- Debugging in local mode
- Validating requirements with MRUnit (sketched below)
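A minimal MRUnit test might look like the following sketch, assuming a word-count style mapper such as the WordCountMapper sketched earlier:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

// Verifies a mapper in isolation, with no cluster required.
public class WordCountMapperTest {
    @Test
    public void emitsOneCountPerWord() throws Exception {
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("big data"))
                .withOutput(new Text("big"), new IntWritable(1))
                .withOutput(new Text("data"), new IntWritable(1))
                .runTest();
    }
}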
Deploying, monitoring and tuning performance
- Deploying to a production cluster
- Optimizing performance with administrative
tools
- Monitoring cluster health with Hadoop User
Experience (HUE) and Ganglia
Who should attend?
This course is for
developers, architects and testers who desire hands-on experience writing code
for Hadoop. It can be helpful to technical managers interested in the
development process.
More Info About the Course
What is this course about?
The availability of large
data sets presents new opportunities and challenges to organizations of all
sizes. This course provides the hands-on programming skills to leverage the
Apache Hadoop platform to efficiently process a variety of Big Data.
Additionally, you learn to test and deploy Big Data solutions on commodity
clusters. This course also covers Pig, Hive, HBase and other components of the
Hadoop ecosystem. Further, this course teaches testing, deployment and best
practices to architect and develop a complete Big Data solution.
Who will benefit from this course?
This course is for
developers, architects and testers who desire hands-on experience writing code
for Hadoop. It can also be helpful to technical managers interested in the
development process.
What background do I need?
You should have Java
experience at the level of Course 471, Java Programming Introduction: Hands-On, or equivalent experience. Exposure to SQL is helpful.
Will there be any programming in the course?
Yes! Approximately 40 percent
of the course time is devoted to hands-on programming.
What tools and platforms are used?
The platform is Java
running on RedHat Linux. The tools used include Eclipse and various text
editors.
Which Big Data products does this course use?
The course covers a
number of Big Data products including Apache Hadoop, MapReduce, Hadoop
Distributed File System (HDFS), HBase, Hive, and Pig. Additional parts of the
Hadoop ecosystem will be covered such as Sqoop, Oozie, Impala and MRUnit. Other
datastores will be mentioned for comparison.
What is Big Data?
Big Data is a term used
to define data sets that have the potential to rapidly grow so large that they
become unmanageable. The Big Data movement includes new tools and ways of
storing information that allow efficient processing and analysis for informed
business decision-making.
What is Hadoop?
Hadoop is an open source
implementation of MapReduce by the Apache group and is the most widely used
platform on which to solve problems in processing large, complex data sets that
would otherwise be intractable using conventional means. It is a high
performance distributed storage and processing system. Hadoop fills the gap in
the market by effectively storing and providing computational capabilities for
substantial amounts of data. There is commercial support from multiple vendors
and prepackaged cloud solutions.
What is MapReduce?
MapReduce is a parallel programming model that enables distributed processing
of large data sets on a cluster of computers. MapReduce was originally
implemented by Google as part of its searching and indexing of the Internet. It
has since grown in popularity and is being adopted quickly across most industries.
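Complementing that description, a minimal Java driver (assuming mapper and reducer classes like those sketched earlier) wires the two phases into a runnable job; the input and output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures and submits a word-count job, then waits for completion.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // cuts shuffle traffic
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}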
How are Hadoop programs developed?
Programs are primarily written in Java, although Hadoop has facilities to
handle programs written in other languages like C++, Python, and .NET. Programs
can also be written in scripting languages like Pig Latin. Data in HDFS can be
queried using a SQL-like syntax with Hive.
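As a sketch of the Hive option, HDFS-resident data can be queried from Java over plain JDBC; the connection URL, credentials and table name below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Queries data through HiveServer2 using the Hive JDBC driver.
public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT ip, COUNT(*) FROM access_logs GROUP BY ip")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}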
What are the advantages of using Hadoop?
- Hadoop provides the ability to process and
analyze more data than was previously possible at a lower cost
- It runs on scalable commodity clusters
- It has self-healing capabilities to survive
hardware failures
- It operates on various types of data and adapts
to meet varying degrees of structure
- HDFS automatically provides robustness and
redundancy for performance and reliability
- There are many associated projects that
enhance the Hadoop ecosystem and ease development