51工具盒子

依楼听风雨
笑看云卷云舒,淡观潮起潮落

How does reduceByKey() in pyspark knows which column is key and which one is value?

英文:

How does reduceByKey() in pyspark knows which column is key and which one is value?

问题 {#heading}

我是一个对Pyspark新手,正在阅读这个

reduceByKey()如何知道它应该将第一列视为键,第二列视为值,还是反过来呢?

我在reduceByKey()的代码中没有看到任何列名或索引的提及。reduceByKey()默认将第一列视为键,第二列视为值吗?

如果数据框中有多列,如何执行reduceByKey()

我知道可以使用df.select(col1, col2).reduceByKey()。我只是想知道是否还有其他方法。 英文:

I'm a newbie to Pyspark and going through this.

How does reduceByKey() know whether it should consider the first column as and second as value or vice-versa.

I don't see any column name or index mentioned in the code in reduceByKey(). Does reduceByKey() by default considers the first column as key and second as value?

How to perform reduceByKey() if there are multiple columns in the dataframe?

I'm aware of df.select(col1, col2).reduceByKey(). I'm just looking if there is any other way.

答案1 {#1}

得分: 1

我不确定你使用的是哪个版本的Spark,但这并不太重要。我假设你使用的是最新版本3.4.1。

如果我们查看reduceByKey函数在源代码中的函数签名,可以看到如下内容:

    def reduceByKey(
        self: "RDD[Tuple[K, V]]",
        func: Callable[[V, V], V],
        numPartitions: Optional[int] = None,
        partitionFunc: Callable[[K], int] = portable_hash,
    ) -> "RDD[Tuple[K, V]]":

因此,这个函数期望你的RDD的类型是Tuple[K, V],其中K代表键(key),V代表值(value)。

如果你要对具有多列的RDD执行reduceByKey操作,你可以将它们转换为一个单值列,该列本身是值的元组。

以你提供的网站数据为例,我们在数据中添加了一列:

data = [
    ("Project", 1, 2),
    ("Gutenberg's", 1, 2),
    ("Alice's", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg's", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg's", 1, 2),
]

让我们将rdd转换为具有正确形状的RDD(2列,一个键列和一个值列):

rdd2 = rdd.map(lambda x: (x[0], (x[1], x[2])))

rdd2.collect() [('Project', (1, 2)), ('Gutenberg's', (1, 2)), ('Alice's', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg's', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg's', (1, 2))]

假设我们想要对这两列进行归约,对于第一列我们想要求和(就像你提供的示例网站),对于第二列我们想要求乘积:

rdd3 = rdd2.reduceByKey(lambda a, b: (a[0] + b[0], a[1] * b[1]))

rdd3.collect() [('Alice's', (1, 2)), ('in', (2, 4)), ('Project', (3, 8)), ('Gutenberg's', (3, 8)), ('Adventures', (2, 4)), ('Wonderland', (2, 4))]

英文:

I'm not sure of which version of Spark you are on, but it should not matter that much. I'll assume you are on version 3.4.1, which is the latest version as of the time of writing this.

If we take a look at the function signature of reduceByKey in the source code, we see this:

    def reduceByKey(
        self: "RDD[Tuple[K, V]]",
        func: Callable[[V, V], V],
        numPartitions: Optional[int] = None,
        partitionFunc: Callable[[K], int] = portable_hash,
    ) -> "RDD[Tuple[K, V]]":

So indeed, this function expects your RDD to be of type Tuple[K, V] where K stands for key and V stands for value.

Now, if you perform reduceByKey where you have more columns, you can just turn them into a single value column that is of itself a tuple of values.

An example would be your example website's data, where we add an extra column:

data = [
    ("Project", 1, 2),
    ("Gutenberg's", 1, 2),
    ("Alice's", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg's", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg's", 1, 2),
]

Let's turn rdd into an RDD with the correct shape (2 columns, a key and a value column):

rdd2 = rdd.map(lambda x: (x[0], (x[1], x[2])))

rdd2.collect() [('Project', (1, 2)), ('Gutenberg's', (1, 2)), ('Alice's', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg's', (1, 2)), ('Adventures', (1, 2)), ('in', (1 , 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg's', (1, 2))]

And let's say we want to reduce the 2 columns, but for the first column we want to do a sum (like in your example site) and the second one we want to multiply:

rdd3 = rdd2.reduceByKey(lambda a, b: (a[0] + b[0], a[1] * b[1]))

rdd3.collect() [('Alice's', (1, 2)), ('in', (2, 4)), ('Project', (3, 8)), ('Gutenberg's', (3, 8)), ('Adventures', (2, 4)), ('Wonderland', (2, 4))]


赞(1)
未经允许不得转载:工具盒子 » How does reduceByKey() in pyspark knows which column is key and which one is value?