hdfs dfs -ls /user/littlesuccess/AdvancedAnalysisWithSparkhdfs dfs -mkdir /user/littlesuccess/AdvancedAnalysisWithSpark/ch11hdfs dfs -put fish.py /user/littlesuccess/AdvancedAnalysisWithSpark/ch11
做好上述准备工作之后,就可以运行pyspark代码了:
raw_data = sc.textFile(‘hdfs://172.31.25.243:8020/user/littlesuccess/AdvancedAnalysisWithSpark/ch11/fish.py‘)data = (raw_data
.filter(lambda x: x.startswith("#"))
.map(lambda x: map(float, x.split(‘,‘)))) data.take(5)
运行过程中发现了一个错误:
>>> data.take(5) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/spark/python/pyspark/rdd.py", line 1081, in take totalParts = self._jrdd.partitions().size() File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o31.partitions. : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87) at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1713) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1322) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3974) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:813) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:502) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:815) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
发现原因在于我的集群设置了NameNode HA,而我的脚本中的hdfs用的是StandBy NameNode的地址,这个问题就解决了。
重新运行命令,又发现如下错误:
时间: 2024-10-10 20:58:51