I've got a big RDD (1 GB) in a YARN cluster. On the local machine that uses this cluster I have only 512 MB. I'd like to iterate over the values of the RDD on my local machine. I can't use collect(), because it would create an array locally that is bigger than my heap. I need some iterative way. There is the method iterator(), but it requires additional information that I can't provide. UPD: a toLocalIterator method has been committed.
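For reference, a minimal sketch of how toLocalIterator is used (assuming a live SparkContext named sc; the input path is invented):

    // toLocalIterator pulls one partition at a time to the driver, so the local
    // heap only needs to hold a single partition rather than the whole RDD.
    val rdd = sc.textFile("hdfs:///path/to/1gb/data")   // hypothetical path
    for (record <- rdd.toLocalIterator) {
      println(record)   // process each record on the local machine
    }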
5 Answers
Accepted answer (26 votes):
Update: TL;DR: use the toLocalIterator method mentioned in the question's update. The original answer below gives a rough idea of how it works. First of all, get the array of partition indexes:
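A minimal sketch, assuming the large RDD from the question is called rdd (the answer's original snippet did not survive extraction):

    val parts = rdd.partitions   // Array[Partition]; each partition knows its index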
Then create smaller RDDs by filtering out everything but a single partition. Collect the data from the smaller RDDs and iterate over the values of one partition at a time:
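Again a sketch rather than the answer's original code; preservesPartitioning = true avoids reshuffling the data:

    for (p <- parts) {
      val idx = p.index
      // Keep only the records that live in partition `idx`.
      val partRdd = rdd.mapPartitionsWithIndex(
        (index, it) => if (index == idx) it else Iterator.empty,
        preservesPartitioning = true)
      // Only one partition's worth of data reaches the driver at a time.
      val data = partRdd.collect()
      data.foreach(println)   // iterate over the values of this partition
    }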
I didn't try this code, but it should work. Please write a comment if it doesn't compile. Of course, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions, for example:
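The exact call was lost in extraction; either repartition or coalesce with shuffling enabled can raise the partition count (2000 below is just an example):

    // Both of these return an RDD with more, and therefore smaller, partitions.
    val finerRdd  = rdd.repartition(2000)
    // Equivalent: coalesce with shuffle enabled can also grow the partition count.
    val finerRdd2 = rdd.coalesce(2000, shuffle = true)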
Answer (10 votes):
Wildfire's answer seems semantically correct, but I'm sure you could be vastly more efficient by using the Spark API. If you want to process each partition in turn, I don't see why you can't use something like this:
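The answer's snippet did not survive extraction; a plausible sketch using mapPartitions, where the closure runs on the executors and only a small per-partition summary comes back to the driver:

    val perPartitionSummaries = rdd.mapPartitions { iter =>
      // Placeholder work: compute something small for this partition.
      val size = iter.size
      Iterator(size)            // one value per partition
    }.collect()                 // tiny array: one element per partition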
Or this:
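Again only a sketch: when nothing needs to come back to the driver, foreachPartition does the same partition-at-a-time processing as a pure side effect:

    rdd.foreachPartition { iter =>
      // Runs once per partition, on the executor that owns the partition.
      iter.foreach(record => println(record))
    }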
Answer (5 votes):
Here is the same approach as suggested by @Wildfire, but written in pyspark. The nice thing about this approach is that it lets the user access records in the RDD in order. I'm using this code to feed data from an RDD into the STDIN of a machine-learning tool's process.
Produces output:
Answer (1 vote):
Map/filter/reduce using Spark and download the results afterwards? I think the usual Hadoop approach will work. The API says there are map, filter and saveAsTextFile operations: https://spark.incubator.apache.org/docs/0.8.1/scala-programming-guide.html#transformations
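A hedged sketch of that idea; the paths and the filtering predicate are invented for illustration:

    // Shrink the data on the cluster, write the small result back to HDFS,
    // then pull it down to the local machine.
    sc.textFile("hdfs:///path/to/1gb/data")
      .filter(line => line.contains("ERROR"))        // keep only what you need
      .map(line => line.trim)                        // any further transformation
      .saveAsTextFile("hdfs:///path/to/small/result")
    // Afterwards fetch the (now small) result locally, e.g.:
    //   hdfs dfs -get /path/to/small/result ./result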
Answer (1 vote):
For Spark 1.3.1, the format is as follows:
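The answer's code block was lost; here is a sketch of the same partition-at-a-time loop written against the Spark 1.3.1 signature of mapPartitionsWithIndex (the element type (String, String, Float) and the sample data are only examples):

    import org.apache.spark.rdd.RDD

    // Assumes a live SparkContext `sc`; replace the sample data with the real RDD.
    val rdd: RDD[(String, String, Float)] =
      sc.parallelize(Seq(("a", "x", 1.0f), ("b", "y", 2.0f)), numSlices = 4)

    for (p <- rdd.partitions) {
      val idx = p.index
      val partRdd = rdd.mapPartitionsWithIndex {
        case (index: Int, value: Iterator[(String, String, Float)]) =>
          if (index == idx) value else Iterator()
      }
      val dataPartitioned = partRdd.collect()   // one partition at a time
      // apply further processing to dataPartitioned here
    }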