
Configuring Kylin and Building a Cube with Spark

HDP Version: 2.6.4.0

Kylin Version: 2.5.1

Machines: three CentOS 7 nodes, 8 GB memory

Besides MapReduce, Kylin also supports the faster Spark as a cube build engine. This article uses kylin_sales_cube, the sample cube that ships with Kylin, to test how fast Spark builds a cube.

I. Kylin's Spark-related configuration parameters

Before running Spark cubing, I recommend reviewing these settings and adapting them to your cluster. The following is a recommended configuration (note that the listing turns on the external shuffle service that Spark dynamic resource allocation requires, but does not itself set spark.dynamicAllocation.enabled):

## Spark conf (default is in spark/conf/spark-defaults.conf)
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.executor.instances=40
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
#kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
#
#### Spark conf for specific job
#kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
#kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
#
## manually upload spark-assembly jar to HDFS and then set this property will avoid repeatedly uploading jar
## at runtime
kylin.engine.spark-conf.spark.yarn.archive=hdfs://node71.data:8020/kylin/spark/spark-libs.jar
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
#
## If you are running HDP, uncomment the following three lines
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current
kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=current
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current
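After editing the file, it can help to double-check which Spark overrides are actually active (that is, not commented out). The helper below is only a sketch: the function name list_spark_conf is made up for illustration, and the path in the usage comment assumes the settings live in $KYLIN_HOME/conf/kylin.properties.

```shell
#!/bin/sh
# list_spark_conf FILE
# Hypothetical helper: print the active (uncommented) global
# kylin.engine.spark-conf.* entries of a properties file with the
# Kylin prefix stripped, leaving plain Spark property names.
list_spark_conf() {
    grep '^kylin\.engine\.spark-conf\.' "$1" \
        | sed 's/^kylin\.engine\.spark-conf\.//'
}

# Usage (the path is an assumption; adjust it to your installation):
#   list_spark_conf "$KYLIN_HOME/conf/kylin.properties"
```

Commented-out lines and job-specific entries such as kylin.engine.spark-conf-mergedict.* are deliberately skipped by the pattern.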

The kylin.engine.spark-conf.spark.yarn.archive setting points to the jar that the Kylin engine runs with. You have to generate this jar yourself and upload it to HDFS. Because I run the Kylin service as the kylin user, first switch to that user and then continue. The commands are as follows:

su - kylin
cd /usr/hdp/2.6.4.0-91/kylin
# Generate the spark-libs.jar file
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ ./
# Upload it to the target directory on HDFS
hadoop fs -mkdir -p /kylin/spark/
hadoop fs -put spark-libs.jar /kylin/spark/

II. Modify the Cube configuration

With Kylin's Spark-related parameters configured, the next step is to change the cube's build engine to Spark, as follows:

First run the cube-generation script that ships with Kylin, sh ${KYLIN_HOME}/bin/sample.sh; it loads two sample cubes into the Kylin Web page.

Then open the Kylin Web UI and click Model -> Action -> Edit:

Go to step five, Advanced Setting, scroll down the page, and change the Cube Engine type from MapReduce to Spark. Then save the configuration change.

Click “Next” to go to the “Configuration Overwrites” page, then click “+Property” to add the property “kylin.engine.spark.rdd-partition-cut-mb” with a value of “500”, for the following reason:

The sample cube has two memory-hungry measures: “COUNT DISTINCT” and “TOPN(100)”. When the source data is small, their size estimate is inaccurate: the estimated size is much larger than the real size, so the RDD is cut into far more partitions than necessary, which slows the build down. 500 is a more reasonable value here. Click “Next” and then “Save” to save the cube.

For cubes without “COUNT DISTINCT” and “TOPN” measures, keep the default configuration.
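The effect of the cut size can be illustrated with rough arithmetic. The sketch below assumes, as a simplification, that the partition count is the estimated cuboid size divided by kylin.engine.spark.rdd-partition-cut-mb, rounded up; this is not Kylin's exact logic, the 2000 MB estimate is a made-up figure, and the function name estimate_partitions is hypothetical.

```shell
#!/bin/sh
# estimate_partitions ESTIMATED_MB CUT_MB
# Simplified model (an assumption, not Kylin's exact logic):
# partitions = ceil(estimated size / cut size), and at least 1.
estimate_partitions() {
    est_mb=$1
    cut_mb=$2
    parts=$(( (est_mb + cut_mb - 1) / cut_mb ))
    if [ "$parts" -lt 1 ]; then
        parts=1
    fi
    echo "$parts"
}

# Hypothetical over-estimated cuboid size of 2000 MB:
estimate_partitions 2000 10    # small cut size  -> prints 200
estimate_partitions 2000 500   # value used here -> prints 4
```

Under this model, raising the cut size from 10 to 500 shrinks the same over-estimated cuboid from 200 partitions to 4, which is why a larger value suits measures whose size is over-estimated.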

III. Build the Cube

After saving the modified cube configuration, click Action -> Build and select the start time of the build (make sure the selected time range contains data, otherwise building the cube is pointless), then start the build.

During the build you can open the Yarn ResourceManager UI to check the status of the job. When the build reaches step 7, you can open the Spark UI, which shows the progress of each stage along with detailed information.

Kylin uses its own bundled Spark, so we also need to start a separate Spark History Server.

${KYLIN_HOME}/spark/sbin/start-history-server.sh hdfs://:8020/kylin/spark-history

Visit http://ip:18080/ to see the details of the Spark cube build job. This information is a great help for troubleshooting and performance tuning.

IV. FAQ

While building cubes with Spark I ran into two errors, both of which are now resolved. I record them here for everyone's reference.

1. Adjusting the Spark on Yarn configuration

Error message:

Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (4096+1024 MB) is above the max threshold (4096 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

Problem analysis:

According to the error log, the memory required by the executor (4096 + 1024 MB) is above the maximum threshold of this cluster. You can adjust either the Yarn configuration or the memory settings of the Spark job.

The required memory (4096 + 1024 MB) for the Spark job comes from:

  • kylin.engine.spark-conf.spark.executor.memory=4G
  • kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024

Related Yarn configuration:

    yarn.nodemanager.resource.memory-mb: The NodeManager is the per-node agent in YARN; it interacts with the application's ApplicationMaster and the cluster's ResourceManager. This property is the total amount of physical memory on the node that is available to Yarn.

    yarn.scheduler.maximum-allocation-mb: The maximum amount of physical memory that a single container request can ask for. This value should not be greater than the value of yarn.nodemanager.resource.memory-mb.

Solution:

Adjust the Yarn configuration, for example by increasing yarn.scheduler.maximum-allocation-mb. Because it depends on yarn.nodemanager.resource.memory-mb, set both to a value larger than the required memory (4096 + 1024 = 5120 MB), for example 5888 MB.
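The check behind this error is simple arithmetic, so it can be reproduced before submitting a build. The sketch below is illustrative only: the function name fits_in_yarn is made up, and the numbers are the ones used in this article.

```shell
#!/bin/sh
# fits_in_yarn EXECUTOR_MB OVERHEAD_MB MAX_ALLOCATION_MB
# Hypothetical helper mirroring the check in the error message:
# executor memory plus memory overhead must not exceed
# yarn.scheduler.maximum-allocation-mb.
fits_in_yarn() {
    required=$(( $1 + $2 ))
    if [ "$required" -le "$3" ]; then
        echo "OK: $required MB <= $3 MB"
    else
        echo "TOO LARGE: $required MB > $3 MB"
        return 1
    fi
}

fits_in_yarn 4096 1024 4096 || true   # the failing setup from the error
fits_in_yarn 4096 1024 5888           # after raising the Yarn limits
```

The first call reports that 5120 MB exceeds the 4096 MB limit; the second passes once the limit is raised to 5888 MB. Lowering spark.executor.memory instead would satisfy the same check from the Spark side.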

2. Step 8 of the cube build, Convert Cuboid Data to HFile, fails

Error message:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.io.hfile.HFile

Problem analysis:

The spark-libs.jar file that kylin.engine.spark-conf.spark.yarn.archive points to is missing the HBase-related class files.

Solution:

Quite a few HBase-related class files were missing, and after applying the solution given on the official Kylin website the build still reported classes that could not be found, so I added all the HBase-related jars to spark-libs.jar. If you have already generated spark-libs.jar and uploaded it to HDFS, you need to repackage and upload it again. The steps are as follows:

su - kylin
cd /usr/hdp/2.6.4.0-91/kylin
cp -r /usr/hdp/2.6.4.0-91/hbase/lib/hbase* /usr/hdp/2.6.4.0-91/kylin/spark/jars/
rm -rf spark-libs.jar
jar cv0f spark-libs.jar -C spark/jars/ ./
hadoop fs -rm -r /kylin/spark/spark-libs.jar
hadoop fs -put spark-libs.jar /kylin/spark/
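Before resuming the build, it may be worth confirming that the repackaged archive really contains the HBase jars. This is a minimal sketch, assuming the archive listing comes from jar tf; the function name count_hbase_jars is made up for illustration.

```shell
#!/bin/sh
# count_hbase_jars
# Hypothetical helper: read an archive listing on stdin (one entry per
# line) and print how many entries are hbase*.jar files.
count_hbase_jars() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the helper's exit status at 0.
    grep -c -E '(^|/)hbase[^/]*\.jar$' || true
}

# Usage against the real archive (run in the directory holding the jar):
#   jar tf spark-libs.jar | count_hbase_jars
```

A result of 0 means the HBase jars were not picked up and the archive should be rebuilt before uploading.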

Then return to the Kylin Web page and resume the cube build.

V. Spark vs. MapReduce

Building the cube with Spark took about 7 minutes in total.

Building the cube with MapReduce took about 15 minutes in total.

Spark is clearly the faster way to build the cube here: it took roughly half the time.

VI. Summary

This article introduces:

    How to configure Kylin's Spark-related parameters

    How to change a cube's build engine

    How to generate the spark-libs.jar package and upload it to HDFS

    Errors encountered during Spark cube builds and how to fix them

    A speed comparison between Spark and MapReduce cube builds

Reference links to this article:

  • http://kylin.apache.org/cn/docs/tutorial/cube_spark.html
  • https://community.cloudera.com/t5/Support-Questions/Apache-Kylin-with-Spark/m-p/241590

Recommended reading:

  • https://841809077.github.io/categories/Kylin/
