Learning Big Data from Scratch: An Introduction to Java Spark Programming with a Hands-On Project
This article walks through big data programming with Java and Spark via a worked example. It is shared for your reference; the details follow.
The previous installment set up the Eclipse Spark development environment.
When testing Spark programs written in Scala or Java, everything ran fine inside Eclipse, but once exported as a JAR and submitted with spark-submit, nothing would execute. The root cause turned out to be a version mismatch: the Spark version you debug against in Eclipse must match the Spark version that spark-submit runs, and the Scala versions must match as well, before the program runs normally.
The following walks through running a Java Spark program.
1. Create a new Maven project SparkApps
Pay attention to the spark-core version in pom.xml.
I was originally debugging against:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>2.4.0</version>
</dependency>
Packaging this into a JAR and submitting it with spark-submit kept failing, because the installed Spark was the spark-1.6.0-cdh5.16.0 release, and some API usages differ between it and the Spark 2.4.0 used in Eclipse.
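To avoid the mismatch, the pom.xml dependency should point at the same release as the cluster. A sketch of what that could look like for the CDH build named above (Spark 1.6 was built against Scala 2.10; the exact artifact coordinates and the Cloudera repository URL below are assumptions you may need to adjust for your setup):

```xml
<!-- Assumed coordinates matching the spark-1.6.0-cdh5.16.0 release mentioned above -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.0-cdh5.16.0</version>
</dependency>

<!-- CDH artifacts are served from Cloudera's repository, not Maven Central;
     this goes inside the <repositories> section of pom.xml -->
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
```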
2. Create a new class JavaWordCount in the project
package com.linbin.SparkApps;

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

public class JavaWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: JavaWordCount <file>");
            System.exit(1);
        }
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
        // setMaster is only needed when running inside Eclipse;
        // leave it unset when packaging the jar for spark-submit
        sparkConf.setMaster("local[2]");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = ctx.textFile(args[0], 1);
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            /* Spark 2.x signature:
            @Override
            public Iterator<String> call(String s) {
                return Arrays.asList(SPACE.split(s)).iterator();
            }
            */
            // Spark 1.x signature:
            @Override
            public Iterable<String> call(String s) throws Exception {
                return Arrays.asList(SPACE.split(s));
            }
        });
        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });
        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ":" + tuple._2());
        }
        ctx.stop();
        ctx.close();
    }
}
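The flatMap → mapToPair → reduceByKey pipeline above is, at its core, just per-word counting. As a sanity check of that logic, here is the same computation in plain JDK Java with no Spark dependency (the class and method names here are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class LocalWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    // Mirrors flatMap(split) + mapToPair(word -> (word, 1)) + reduceByKey(i1 + i2)
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : SPACE.split(line)) {
                // merge plays the role of reduceByKey: insert 1, or add to the existing count
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(new String[] {"the cat", "the dog"});
        counts.forEach((w, c) -> System.out.println(w + ":" + c));
        // prints: the:2, cat:1, dog:1 (insertion order)
    }
}
```

The difference in Spark is only that the split runs in parallel across partitions and the summation happens after a shuffle that groups equal keys together.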
3. Run it in Eclipse as "Java Application"
The results print normally.
4. Export the project from Eclipse as sparkapps.jar
5. Submit it to Spark for execution
[root@centos7 bin]# ./spark-submit --master spark://centos7:7077 --class com.linbin.SparkApps.JavaWordCount /home/linbin/workspace/sparkapps.jar hdfs://centos7:8020/hello.txt
6. Execution result: the output is normal
[root@centos7 bin]# ./spark-submit --master spark://centos7:7077 --class com.linbin.SparkApps.JavaWordCount /home/linbin/workspace/sparkapps.jar hdfs://centos7:8020/hello.txt
18/11/29 14:37:38 INFO spark.SparkContext: Running Spark version 1.6.0
18/11/29 14:37:39 INFO spark.SecurityManager: Changing view acls to: root
18/11/29 14:37:39 INFO spark.SecurityManager: Changing modify acls to: root
18/11/29 14:37:39 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
18/11/29 14:37:39 INFO util.Utils: Successfully started service 'sparkDriver' on port 40507.
18/11/29 14:37:39 INFO slf4j.Slf4jLogger: Slf4jLogger started
18/11/29 14:37:39 INFO Remoting: Starting remoting
18/11/29 14:37:39 INFO Remoting: Remoting started; listening on addresses: [akka.tcp://sparkDriverActorSystem@172.16.48.71:35776]
18/11/29 14:37:39 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriverActorSystem@172.16.48.71:35776]
18/11/29 14:37:39 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 35776.
18/11/29 14:37:39 INFO spark.SparkEnv: Registering MapOutputTracker
18/11/29 14:37:39 INFO spark.SparkEnv: Registering BlockManagerMaster
18/11/29 14:37:39 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-dd9c0da7-1d22-45ba-9f9d-05d027801ccc
18/11/29 14:37:39 INFO storage.MemoryStore: MemoryStore started with capacity 530.0 MB
18/11/29 14:37:39 INFO spark.SparkEnv: Registering OutputCommitCoordinator
18/11/29 14:37:39 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/11/29 14:37:39 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
18/11/29 14:37:39 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
18/11/29 14:37:39 INFO ui.SparkUI: Started SparkUI at http://172.16.48.71:4040
18/11/29 14:37:39 INFO spark.SparkContext: Added JAR file:/home/linbin/workspace/sparkapps.jar at spark://172.16.48.71:40507/jars/sparkapps.jar with timestamp 1543473459974
18/11/29 14:37:40 INFO client.AppClient$ClientEndpoint: Connecting to master spark://centos7:7077...
18/11/29 14:37:40 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20181129143740-0003
18/11/29 14:37:40 INFO client.AppClient$ClientEndpoint: Executor added: app-20181129143740-0003/0 on worker-20181129113634-172.16.48.71-34880 (172.16.48.71:34880) with 2 cores
18/11/29 14:37:40 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20181129143740-0003/0 on hostPort 172.16.48.71:34880 with 2 cores, 1024.0 MB RAM
18/11/29 14:37:40 INFO client.AppClient$ClientEndpoint: Executor updated: app-20181129143740-0003/0 is now RUNNING
18/11/29 14:37:40 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40438.
18/11/29 14:37:40 INFO netty.NettyBlockTransferService: Server created on 40438
18/11/29 14:37:40 INFO storage.BlockManagerMaster: Trying to register BlockManager
18/11/29 14:37:40 INFO storage.BlockManagerMasterEndpoint: Registering block manager 172.16.48.71:40438 with 530.0 MB RAM, BlockManagerId(driver, 172.16.48.71, 40438)
18/11/29 14:37:40 INFO storage.BlockManagerMaster: Registered BlockManager
18/11/29 14:37:40 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/11/29 14:37:40 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 156.5 KB, free 529.9 MB)
18/11/29 14:37:40 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.5 KB, free 529.8 MB)
18/11/29 14:37:40 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.48.71:40438 (size: 16.5 KB, free: 530.0 MB)
18/11/29 14:37:40 INFO spark.SparkContext: Created broadcast 0 from textFile at JavaWordCount.java:45
18/11/29 14:37:41 INFO mapred.FileInputFormat: Total input paths to process : 1
18/11/29 14:37:41 INFO spark.SparkContext: Starting job: collect at JavaWordCount.java:103
18/11/29 14:37:41 INFO scheduler.DAGScheduler: Registering RDD 3 (mapToPair at JavaWordCount.java:73)
18/11/29 14:37:41 INFO scheduler.DAGScheduler: Got job 0 (collect at JavaWordCount.java:103) with 1 output partitions
18/11/29 14:37:41 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at JavaWordCount.java:103)
18/11/29 14:37:41 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/11/29 14:37:41 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/11/29 14:37:41 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at JavaWordCount.java:73), which has no missing parents
18/11/29 14:37:41 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 529.8 MB)
18/11/29 14:37:41 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.7 KB, free 529.8 MB)
18/11/29 14:37:41 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.16.48.71:40438 (size: 2.7 KB, free: 530.0 MB)
18/11/29 14:37:41 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1004
18/11/29 14:37:41 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at JavaWordCount.java:73) (first 15 tasks are for partitions Vector(0))
18/11/29 14:37:41 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
18/11/29 14:37:41 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (centos7:35702) with ID 0
18/11/29 14:37:41 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, centos7, executor 0, partition 0, NODE_LOCAL, 2175 bytes)
18/11/29 14:37:41 INFO storage.BlockManagerMasterEndpoint: Registering block manager centos7:34022 with 530.0 MB RAM, BlockManagerId(0, centos7, 34022)
18/11/29 14:37:42 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on centos7:34022 (size: 2.7 KB, free: 530.0 MB)
18/11/29 14:37:42 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on centos7:34022 (size: 16.5 KB, free: 530.0 MB)
18/11/29 14:37:42 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1146 ms on centos7 (executor 0) (1/1)
18/11/29 14:37:42 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/11/29 14:37:42 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (mapToPair at JavaWordCount.java:73) finished in 1.445 s
18/11/29 14:37:42 INFO scheduler.DAGScheduler: looking for newly runnable stages
18/11/29 14:37:42 INFO scheduler.DAGScheduler: running: Set()
18/11/29 14:37:42 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
18/11/29 14:37:42 INFO scheduler.DAGScheduler: failed: Set()
18/11/29 14:37:42 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at JavaWordCount.java:90), which has no missing parents
18/11/29 14:37:42 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 529.8 MB)
18/11/29 14:37:42 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1754.0 B, free 529.8 MB)
18/11/29 14:37:42 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.16.48.71:40438 (size: 1754.0 B, free: 530.0 MB)
18/11/29 14:37:42 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1004
18/11/29 14:37:42 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at JavaWordCount.java:90) (first 15 tasks are for partitions Vector(0))
18/11/29 14:37:42 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
18/11/29 14:37:42 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, centos7, executor 0, partition 0, NODE_LOCAL, 1949 bytes)
18/11/29 14:37:42 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on centos7:34022 (size: 1754.0 B, free: 530.0 MB)
18/11/29 14:37:42 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to centos7:35702
18/11/29 14:37:42 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 137 bytes
18/11/29 14:37:42 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 70 ms on centos7 (executor 0) (1/1)
18/11/29 14:37:42 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/11/29 14:37:42 INFO scheduler.DAGScheduler: ResultStage 1 (collect at JavaWordCount.java:103) finished in 0.074 s
18/11/29 14:37:42 INFO scheduler.DAGScheduler: Job 0 finished: collect at JavaWordCount.java:103, took 1.603764 s
went:1
driver:1
The:3
hitting:1
road,:1
avoid:1
colorful:1
had:1
highway,:1
basket:1
across:1
guilty:1
A:1
blissfully:1
Easter:1
he:1
in:1
eggs:1
dead.:1
side:1
cry.:1
over:2
Bunny,:1
Much:1
along:1
unfortunately:1
man:2
what:1
out:1
felt:1
lover,:1
swerved2:1
well:1
road.:1
the:12
got:1
his:2
He:1
hit.:1
began:1
animal:1
was:3
front:1
a:1
rabbit:1
when:1
sensitive:1
pulled:1
car:1
all:1
carrying:1
to:5
driver,:1
as:2
:1
hopping1:1
see:1
of:5
driving:1
become:1
basket.:1
an:1
place.:1
saw:1
but:1
jumped:1
and:3
Bunny:3
middle:1
flying:1
being:1
dismay,:1
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
18/11/29 14:37:42 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
18/11/29 14:37:43 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
18/11/29 14:37:43 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
18/11/29 14:37:43 INFO ui.SparkUI: Stopped Spark web UI at http://172.16.48.71:4040
18/11/29 14:37:43 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
18/11/29 14:37:43 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
18/11/29 14:37:43 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
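Notice that the words above print in shuffle/hash order, not sorted. If sorted output is wanted, one simple option is to sort the collected pairs on the driver after collect(). A plain-Java sketch of that idea using Map.Entry in place of Tuple2 (the class name and sample data here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortCounts {
    // Sort (word, count) pairs by count descending, breaking ties alphabetically
    public static List<Map.Entry<String, Integer>> sorted(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
        list.sort(Comparator.<Map.Entry<String, Integer>, Integer>comparing(Map.Entry::getValue)
                .reversed()
                .thenComparing(Map.Entry::getKey));
        return list;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 12);
        counts.put("to", 5);
        counts.put("of", 5);
        for (Map.Entry<String, Integer> e : sorted(counts)) {
            System.out.println(e.getKey() + ":" + e.getValue());
        }
        // prints: the:12, of:5, to:5
    }
}
```

Within Spark itself, calling counts.sortByKey() before collecting would instead return the pairs sorted alphabetically by word.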
7. The job record can be viewed in the browser via the Spark web UI
I hope this article is helpful to readers programming in Java.