Reusing the result of a select expression in the "GROUP BY" clause?

2021-11-14 00:00:00 apache-spark scala apache-spark-sql mysql spark-dataframe

In MySQL, I can have a query like this:

select cast(from_unixtime(t.time, '%Y-%m-%d %H:00') as datetime) as timeHour, ...
from some_table t
group by timeHour, ...
order by timeHour, ...

where timeHour in the GROUP BY is the result of a select expression.

But I just tried a similar query in Spark SQL, and I got this error:

Error: org.apache.spark.sql.AnalysisException: cannot resolve '`timeHour`' given input columns: ...

My query for Spark SQL was this:

select cast(t.unixTime as timestamp) as timeHour, ...
from another_table as t
group by timeHour, ...
order by timeHour, ...

Is this construct possible in Spark SQL?

Recommended answer

Is this construct possible in Spark SQL?

Yes, it is. There are two ways to make it work in Spark SQL so that the new column can be used in the GROUP BY and ORDER BY clauses.

Approach 1, using a subquery:

-- aggregate in the outer query, where timeHour is already an ordinary column
SELECT timeHour, sum(...) AS someThing
FROM (
  SELECT *, from_unixtime(starttime/1000) AS timeHour
  FROM some_table
) AS t
WHERE starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
  AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY timeHour
ORDER BY timeHour
LIMIT 10;

Approach 2, using WITH (the more elegant way):

-- create alias
WITH table_aliase AS (
  SELECT *, from_unixtime(starttime/1000) AS timeHour
  FROM some_table
)
-- use the same alias as a table; aggregate here, where timeHour is an ordinary column
SELECT timeHour, sum(...) AS someThing
FROM table_aliase
WHERE starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
  AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY timeHour
ORDER BY timeHour
LIMIT 10;
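To try the WITH approach end to end, here is a minimal sketch in Scala. The sample rows, the `value` column, the `with_hour` alias, and the object and method names are all made up for illustration; only `from_unixtime`, `starttime`, `timeHour` and `someThing` come from the answer. The explicit `CAST(... AS BIGINT)` is an addition so the argument to `from_unixtime` is an integer number of seconds, and the session time zone is pinned to UTC so the formatted strings do not depend on the machine.

```scala
import org.apache.spark.sql.SparkSession

object GroupByAliasWithCte {
  // Returns (timeHour, someThing) pairs ordered by timeHour.
  def totals(spark: SparkSession): Seq[(String, Long)] = {
    import spark.implicits._

    // Hypothetical sample data: starttime is epoch milliseconds, value is the column to sum.
    Seq(
      (1505520000000L, 1L), // 2017-09-16 00:00:00 UTC
      (1505520000500L, 2L), // same second, 500 ms later
      (1505523600000L, 5L)  // 2017-09-16 01:00:00 UTC
    ).toDF("starttime", "value").createOrReplaceTempView("some_table")

    // The CTE turns the expression into a real column, so the outer
    // query can group and order by timeHour without repeating it.
    spark.sql(
      """WITH with_hour AS (
        |  SELECT from_unixtime(CAST(starttime / 1000 AS BIGINT)) AS timeHour, value
        |  FROM some_table
        |)
        |SELECT timeHour, sum(value) AS someThing
        |FROM with_hour
        |GROUP BY timeHour
        |ORDER BY timeHour""".stripMargin
    ).as[(String, Long)].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("group-by-alias-cte")
      .config("spark.sql.session.timeZone", "UTC") // assumption: fixed zone for reproducible strings
      .getOrCreate()
    try totals(spark).foreach(println)
    finally spark.stop()
  }
}
```

With its default format, `from_unixtime` produces one string per distinct second, so the first two sample rows fall into the same group and the third into its own.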

Alternative using the Spark DataFrame API (without SQL) in Scala:

// This code may need additional imports to work well
import org.apache.spark.sql.functions._
import spark.implicits._ // enables the $"col" syntax; assumes a SparkSession named `spark`

val df = .... // load the actual table as df

// add timeHour as a real column first, then group and order by it
df.withColumn("timeHour", from_unixtime($"starttime" / 1000))
  .groupBy($"timeHour")
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()

// another way, as per eliasah's comment: alias the grouping expression directly
df.groupBy(from_unixtime($"starttime" / 1000).as("timeHour"))
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()
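For a version of the DataFrame variant that can be run as-is, the following sketch builds a tiny in-memory table and groups by the aliased expression. The sample rows, the `value` column, and the object and method names are hypothetical; the explicit `.cast("long")` and the UTC session time zone are additions made here so the result does not depend on implicit casts or on the local time zone.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_unixtime, sum}

object GroupByAliasDataFrame {
  // Returns (timeHour, someThing) pairs ordered by timeHour.
  def totals(spark: SparkSession): Seq[(String, Long)] = {
    import spark.implicits._

    // Hypothetical sample data: starttime is epoch milliseconds, value is the column to sum.
    val df = Seq(
      (1505520000000L, 1L), // 2017-09-16 00:00:00 UTC
      (1505520000500L, 2L), // same second, 500 ms later
      (1505523600000L, 5L)  // 2017-09-16 01:00:00 UTC
    ).toDF("starttime", "value")

    // Alias the grouping expression; the alias is then an ordinary
    // column of the aggregated result and can be used in orderBy.
    df.groupBy(from_unixtime(($"starttime" / 1000).cast("long")).as("timeHour"))
      .agg(sum($"value").as("someThing"))
      .orderBy($"timeHour")
      .as[(String, Long)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("group-by-alias-dataframe")
      .config("spark.sql.session.timeZone", "UTC") // assumption: fixed zone for reproducible strings
      .getOrCreate()
    try totals(spark).foreach(println)
    finally spark.stop()
  }
}
```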
