A big data engineer is an information technology (IT) professional who is responsible for designing, building,
testing and maintaining complex data processing systems that work with large data sets. This type of data
specialist aggregates, cleanses, transforms and enriches different forms of data so that downstream data
consumers — such as business analysts and data scientists — can systematically extract information.
Big data is a label that describes massive volumes of customer, product and operational data, typically in the
terabyte and petabyte ranges. Big data analytics can be used to optimize key business and operational use cases,
mitigate compliance and regulatory risks and create net-new revenue streams.
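As a rough illustration of the aggregate, cleanse, transform and enrich steps named above, here is a minimal sketch in plain Python. The record layout, the two sources (`WEB_ORDERS`, `STORE_ORDERS`), the `CUSTOMER_SEGMENTS` lookup and all function names are assumptions made up for this example. A real system would typically run such steps on a distributed engine rather than on in-memory lists.

```python
from itertools import chain

# Hypothetical raw records from two sources; field names are illustrative only.
WEB_ORDERS = [
    {"customer_id": "c1", "amount": "19.99", "country": "us"},
    {"customer_id": "c2", "amount": "", "country": "DE"},        # missing amount
]
STORE_ORDERS = [
    {"customer_id": "c1", "amount": "5.01", "country": "US "},
    {"customer_id": None, "amount": "7.50", "country": "FR"},    # missing customer
]
# Hypothetical reference data used for enrichment.
CUSTOMER_SEGMENTS = {"c1": "retail", "c2": "wholesale"}


def aggregate(*sources):
    """Collect records from several sources into one list."""
    return list(chain(*sources))


def cleanse(records):
    """Drop records that lack a customer_id or an amount."""
    return [r for r in records if r["customer_id"] and r["amount"]]


def transform(records):
    """Cast amounts to float and normalize country codes."""
    return [
        {
            "customer_id": r["customer_id"],
            "amount": float(r["amount"]),
            "country": r["country"].strip().upper(),
        }
        for r in records
    ]


def enrich(records, segments):
    """Attach a customer segment from the lookup table to each record."""
    return [
        {**r, "segment": segments.get(r["customer_id"], "unknown")}
        for r in records
    ]


def run_pipeline(segments, *sources):
    """Chain the four steps in the order the text lists them."""
    return enrich(transform(cleanse(aggregate(*sources))), segments)


if __name__ == "__main__":
    for row in run_pipeline(CUSTOMER_SEGMENTS, WEB_ORDERS, STORE_ORDERS):
        print(row)
```

Keeping each step as a small function that takes and returns records makes the steps easy to test separately and to reorder, which is one common way to structure such code, not the only one.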
- Fundamentals of Big Data
- Hadoop Fundamentals
- MapReduce In Depth
- Ingesting Data into Hadoop
- Intro to Python Programming
- Intro to Apache Spark
- Project 1: Data Ingestion from RDBMS into HIVE (ORC File)
- Project 2: Data Ingestion Streaming Data Using Kafka into Hadoop/Hive
- Project 3: Data Analysis & Stream Processing using Spark 1
- Project 4: Data Analysis & Stream Processing using Spark & Hive 2
- Project 5: Data Visualization Using Apache
Responsibilities of a big data engineer typically include the following:
- Design, construct and maintain large-scale data processing systems that collect data from various sources, whether structured or unstructured.
- Store data in a data warehouse or data lake repository.
- Handle raw data using data processing transformations and algorithms to create predefined data structures. Deposit the results into a data warehouse or data lake for downstream processing.
- Transform and integrate various data into a scalable data repository (such as a data warehouse, data lake or cloud storage).
- Understand different data transformation tools, techniques and algorithms.
- Implement technical processes and business logic to transform collected data into meaningful and valuable information. To be considered trustworthy, this data should meet the quality, governance and compliance requirements for operational and business use.
- Understand operational and management options, as well as the differences between data repository structures, massively parallel processing (MPP) databases and hybrid cloud.
- Evaluate, compare and improve data pipelines. This includes design pattern innovation, data lifecycle design, data ontology alignment, annotated data sets and elastic search approaches.
- Prepare automated data pipelines to transform and feed the data into dev, QA and production environments.
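
Two of the responsibilities above, checking data quality before use and feeding data into dev, QA and production environments, can be sketched as a small validate-then-load step. This is a hedged sketch under stated assumptions: the `TARGETS` mapping, the `orders` table, the quality rules and the function names are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse or data lake.

```python
import sqlite3

# Hypothetical per-environment targets. ":memory:" keeps the sketch
# self-contained; a real setup would point at separate repositories.
TARGETS = {"dev": ":memory:", "qa": ":memory:", "prod": ":memory:"}


def validate(rows):
    """Split rows into (valid, rejected) using simple, assumed quality rules."""
    valid, rejected = [], []
    for row in rows:
        ok = (
            isinstance(row.get("id"), int)
            and isinstance(row.get("amount"), (int, float))
            and row["amount"] >= 0
        )
        (valid if ok else rejected).append(row)
    return valid, rejected


def load(rows, env):
    """Load rows into the target for the given environment."""
    conn = sqlite3.connect(TARGETS[env])
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (id, amount) VALUES (:id, :amount)", rows
    )
    conn.commit()
    return conn


def run(rows, env):
    """Validate, load only the valid rows, and report what happened."""
    valid, rejected = validate(rows)
    conn = load(valid, env)
    summary = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
    conn.close()
    return summary, rejected


if __name__ == "__main__":
    sample = [
        {"id": 1, "amount": 10.0},
        {"id": 2, "amount": -5.0},   # fails the non-negative rule
        {"id": "x", "amount": 3.0},  # fails the integer id rule
        {"id": 3, "amount": 2.5},
    ]
    print(run(sample, "dev"))
```

Returning the rejected rows instead of silently dropping them leaves room for a quarantine or review step, which is one way to support the governance and compliance considerations mentioned in the list.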