Part IV, Spark Streaming Programming Guide (1), discussed the execution mechanism of Spark Streaming, Transformations and Output Operations, and Spark Streaming data sources and sinks. This article continues from there and mainly covers the following:
Time-based window operations
Posted by predhtz on Tue, 24 May 2022 22:24:57 +0300
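As a rough illustration of the time-based window operation mentioned in the entry above, the following plain-Python sketch counts events per sliding window. It does not use the Spark Streaming API. The function name `windowed_counts`, its parameters, the half-open boundary convention, and the sample events are all assumptions chosen only to show how a window length and a slide interval interact.

```python
def windowed_counts(events, window_length, slide_interval, end_time):
    """Count events per sliding window.

    events: list of (timestamp_seconds, value) pairs (hypothetical sample data).
    A window ending at time t covers timestamps in (t - window_length, t];
    this boundary convention is an assumption made for the sketch.
    Returns a dict mapping each window end time to its event count.
    """
    results = {}
    t = slide_interval
    while t <= end_time:
        start = t - window_length
        results[t] = sum(1 for ts, _ in events if start < ts <= t)
        t += slide_interval
    return results


# Six events; each window is 4 seconds long and slides forward every 2 seconds.
events = [(1, "a"), (2, "b"), (3, "c"), (5, "d"), (8, "e"), (9, "f")]
counts = windowed_counts(events, window_length=4, slide_interval=2, end_time=10)
print(counts)
```

Because the window is longer than the slide interval, consecutive windows overlap, so a single event can be counted in more than one window.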
Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine. In Hive, Hive SQL is converted into MapReduce jobs and then submitted to the cluster for execution, which greatly reduces the complexity of writing MapReduce programs, bec ...
Posted by SoaringUSAEagle on Mon, 23 May 2022 18:39:54 +0300
This article summarizes the use of Hudi Spark SQL. I still use Hudi version 0.9.0 as the example, and some changes in the latest version are also mentioned briefly. Hudi has supported Spark SQL since version 0.9.0. The feature was contributed by pengzhiwei of Alibaba. Pengzhiwei is no longer responsible for Hudi, but is instead in the c ...
Posted by TheBentinel.com on Thu, 19 May 2022 00:31:54 +0300
Many articles already explain the principles of Spark Streaming, so they are not covered here. This article mainly introduces the programming model, coding practice, and some optimization notes for using Kafka as the data source.
Posted by Adam_28 on Mon, 16 May 2022 21:59:57 +0300
Data skew is the most common problem in data development, and it is also a question that is bound to come up in interviews. So why does data become skewed? When does data skew occur? And how can it be solved?
What is data skew? The essence of data skew is uneven data distribution. Some tasks process a large amount of data, which leads to a longe ...
Posted by Paris! on Fri, 13 May 2022 21:01:37 +0300
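To make the idea of uneven data distribution in the entry above concrete, here is a minimal plain-Python sketch. It is not Spark code: the function `partition_sizes`, the modulo-based partitioning rule, and the sample keys are assumptions used only to show how a single hot key can overload one task.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Return the number of records assigned to each partition.

    Integer keys are assigned by `key % num_partitions`, a simplified
    stand-in for hash partitioning (an assumption, not Spark's partitioner).
    """
    counts = Counter(key % num_partitions for key in keys)
    return [counts[p] for p in range(num_partitions)]


# Hypothetical dataset: key 0 appears 900 times, keys 1..100 appear once each.
keys = [0] * 900 + list(range(1, 101))
sizes = partition_sizes(keys, num_partitions=4)
print(sizes)  # one partition holds most of the records
```

All records with the same key go to the same partition, so the task that handles the hot key processes far more data than the others and finishes last.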
Currently, there are two main ways to run Spark applications on k8s:
Spark's native support for Spark on k8s
The k8s-based operator, spark on k8s operator
The former is an implementation of the k8s client introduced by the Spark community to support k8s as a resource management framework. The latter is an operator developed by the k8s ...
Posted by cyball on Wed, 11 May 2022 04:22:32 +0300
Spark core components
Before explaining the Spark architecture, let's first look at several core components of Spark and their functions. 1. Application: a Spark application is a user program built on Spark, including the Driver code and the code running in the Executors on each node of the cluster. It consists of one or more Jobs. As show ...
Posted by Someone789 on Sun, 08 May 2022 23:43:36 +0300
JOIN is a very common data processing operation. As a unified big data processing engine, Spark supports a rich set of JOIN scenarios. This article introduces the five JOIN strategies provided by Spark and mainly covers the following:
Factors affecting the JOIN operation
Five strategies implemented ...
Posted by Helminthophobe on Sat, 07 May 2022 15:14:37 +0300
Abstract: Real-time computing technology has been applied to various fields such as advertising, e-commerce, games, and entertainment. For example, e-commerce websites analyze user attributes in real time, and push relevant products to customers based on the analysis results; online games analyze player data in real time, and then analyze game ...
Posted by alvinho on Fri, 06 May 2022 03:10:34 +0300
Casual SQL users may only know that, after constructing the SparkSession object in pyspark (with enableHiveSupport, of course), you write an SQL statement:
spark.sql("write a SQL string here");
Spark then carries out the operations described by that SQL, such as selecting and querying data, or inserting and overwriting data in the result table. ...
Posted by Fearless_Fish on Wed, 04 May 2022 16:16:50 +0300