I'm trying to run Spark on EMR using the SDK for Java, but I'm having trouble getting spark-submit to use a JAR that I have stored on S3. Here is the relevant code:
public String launchCluster() throws Exception {
StepFactory stepFactory = new StepFactory();
// Creates a cluster flow step for debugging
StepConfig enableDebugging = new StepConfig().withName("Enable debugging")
.withActionOnFailure("TERMINATE_JOB_FLOW")
.withHadoopJarStep(stepFactory.newEnableDebuggingStep());
// Here is the original code before I tried command-runner.jar.
// When using this, I get a ClassNotFoundException for
// org.apache.spark.SparkConf. This is because for some reason,
// the super-jar that I'm generating doesn't include apache spark.
// Even so, I believe EMR should already have Spark installed if
// I configure this correctly...
// HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
// .withJar(JAR_LOCATION)
// .withMainClass(MAIN_CLASS);
HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
.withJar("command-runner.jar")
.withArgs(
"spark-submit",
"--master", "yarn",
"--deploy-mode", "cluster",
"--class", SOME_MAIN_CLASS,
SOME_S3_PATH_TO_SUPERJAR,
"-useSparkLocal", "false"
);
StepConfig customExampleStep = new StepConfig().withName("Example Step")
.withActionOnFailure("TERMINATE_JOB_FLOW")
.withHadoopJarStep(runExampleConfig);
// Create Applications so that the request knows to launch
// the cluster with support for Hadoop and Spark.
// Unsure if Hadoop is necessary...
Application hadoopApp = new Application().withName("Hadoop");
Application sparkApp = new Application().withName("Spark");
RunJobFlowRequest request = new RunJobFlowRequest().withName("spark-cluster")
.withReleaseLabel("emr-5.15.0")
.withSteps(enableDebugging, customExampleStep)
.withApplications(hadoopApp, sparkApp)
.withLogUri(LOG_URI)
.withServiceRole("EMR_DefaultRole")
.withJobFlowRole("EMR_EC2_DefaultRole")
.withVisibleToAllUsers(true)
.withInstances(new JobFlowInstancesConfig()
.withInstanceCount(3)
.withKeepJobFlowAliveWhenNoSteps(true)
.withMasterInstanceType("m3.xlarge")
.withSlaveInstanceType("m3.xlarge")
);
// Submit the job flow. (Assumption: 'emr' is an AmazonElasticMapReduce
// client created elsewhere in this class.)
RunJobFlowResult result = emr.runJobFlow(request);
return result.getJobFlowId();
}
These steps complete without any errors, but nothing is actually output... When I checked the logs, stderr contained the following:
Warning: Skip remote jar s3://somebucket/myservice-1.0-super.jar.
and
18/07/17 22:08:31 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Based on the logs, I'm not sure what the problem is. I believe Spark is installed correctly on the cluster. For some additional context: when I use withJar directly with the super-JAR stored on S3 instead of command-runner.jar (and no withArgs), it picks up the JAR correctly, but then Spark isn't available: I get a ClassNotFoundException for SparkConf (or for JavaSparkContext, depending on which one my Spark job code tries to create first).
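For reference, the direct-JAR variant is essentially the commented-out config from earlier, roughly (a sketch using the same placeholders as above):

// Sketch of the direct-JAR approach. EMR runs this as a custom JAR step and
// invokes the JAR's main class directly, so Spark classes are only on the
// classpath if the super-jar bundles them itself.
HadoopJarStepConfig directJarConfig = new HadoopJarStepConfig()
        .withJar(SOME_S3_PATH_TO_SUPERJAR)
        .withMainClass(SOME_MAIN_CLASS);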
Any suggestions would be greatly appreciated!
Posted on 2018-10-13 19:40:16
I think that if you are using a recent EMR release (e.g. emr-5.17.0), the --master argument in the runExampleConfig statement should be yarn-cluster instead of yarn. I had the same problem, and after this change it worked fine for me.
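Applied to the step config from the question, that change would look roughly like this (a sketch; yarn-cluster is the older combined form of the flag, so the separate --deploy-mode pair is dropped):

// Sketch of the suggested change: "yarn-cluster" selects YARN with cluster
// deploy mode in a single flag, replacing "--master", "yarn" plus
// "--deploy-mode", "cluster" from the question.
HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        .withArgs(
                "spark-submit",
                "--master", "yarn-cluster",
                "--class", SOME_MAIN_CLASS,
                SOME_S3_PATH_TO_SUPERJAR,
                "-useSparkLocal", "false"
        );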
https://stackoverflow.com/questions/51391911