스택큐힙리스트

How can I change column types in a Spark SQL DataFrame?

스택큐힙리스트 2023. 11. 30. 00:26

Suppose I'm doing something like this:

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
 |-- year: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)
df.show()
year make  model comment              blank
2012 Tesla S     No comment
1997 Ford  E350  Go get one now th...

But I really wanted year as Int (and perhaps to transform some other columns as well).

The best I could come up with was:

df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]

This is a bit convoluted.

I'm coming from R, and I'm used to being able to write something like this:

df2 <- df %>%
   mutate (year = year %>% as.integer,
           make = make %>% toupper)

I'm likely missing something. There should be a better way to do this in Spark/Scala...

Answer 1

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset@withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
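
As a minimal sketch of the `withColumn` approach that link points to: when the name passed to `withColumn` already exists, the column is replaced in place, so no temporary column is needed. The local `SparkSession` and the inline sample rows are assumptions made for illustration; they stand in for the `cars.csv` data from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.IntegerType

// Assumption: a local session for experimentation, not part of the original post.
val spark = SparkSession.builder().master("local[*]").appName("cast-example").getOrCreate()
import spark.implicits._

// Hypothetical sample rows mirroring the question's all-string schema.
val df = Seq(
  ("2012", "Tesla", "S", "No comment", ""),
  ("1997", "Ford", "E350", "Go get one now", "")
).toDF("year", "make", "model", "comment", "blank")

// Reusing an existing column name replaces that column and keeps the column order.
val df2 = df
  .withColumn("year", col("year").cast(IntegerType))
  .withColumn("make", upper(col("make")))

df2.printSchema()
```

This is roughly the Scala counterpart of the R `mutate` call in the question: each `withColumn` plays the role of one assignment inside `mutate`.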

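Oldest answer

Since Spark version 1.4 you can apply the cast method with a DataType on the column:

import org.apache.spark.sql.types.IntegerType
// In Scala a column is selected with df("year"); df.year is PySpark syntax.
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
    .drop("year")
    .withColumnRenamed("yearTmp", "year")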

If you are using SQL expressions you can also do:

val df2 = df.selectExpr("cast(year as int) year",
                        "make",
                        "model",
                        "comment",
                        "blank")

For more information, check the docs:
http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
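
When several columns need converting, the same `cast` call can be applied across the whole column list in a single `select`. The sketch below is one way to do that. The `castColumns` helper, the name-to-type map, and the `price` column are all hypothetical names introduced here; none of them come from the Spark API or from the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("cast-many").getOrCreate()
import spark.implicits._

// Hypothetical helper: cast every column named in `types`, pass the rest through unchanged.
def castColumns(df: DataFrame, types: Map[String, DataType]): DataFrame =
  df.select(df.columns.map { name =>
    types.get(name) match {
      case Some(dataType) => col(name).cast(dataType).as(name)
      case None           => col(name)
    }
  }: _*)

// Made-up sample data; "price" is an illustrative column only.
val raw = Seq(("2012", "Tesla", "79900.5"), ("1997", "Ford", "3000")).toDF("year", "make", "price")
val typed = castColumns(raw, Map("year" -> IntegerType, "price" -> DoubleType))
typed.printSchema()
```

Doing everything in a single `select` keeps the original column order and avoids chaining one `withColumn` per column.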

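Answer 2

There are several ways to change a column's type in a Spark SQL DataFrame. A few of them are described briefly here.
First, the simplest way is to combine the `withColumn` method with the `cast` function. `cast` converts the column to the new type, and `withColumn` returns a new DataFrame containing the result; the original DataFrame is left unchanged. For example, to convert an integer column to a double:
```scala
import org.apache.spark.sql.functions.col, org.apache.spark.sql.types.DoubleType
val newDF = oldDF.withColumn("newColumn", col("oldColumn").cast(DoubleType))
```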
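Second, you can specify the column types when creating a DataFrame with an explicit schema. This is useful when you want the correct types from the moment the DataFrame is created. For example, to create a DataFrame with a double-precision `salary` column:
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false),
  StructField("salary", DoubleType, nullable = false)
))
// createDataFrame accepts an RDD[Row] plus a StructType; toDF only takes column names.
val rows = Seq(Row("John", 25, 50000.0), Row("Lisa", 30, 70000.0))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
```
This gives the `salary` column the type double.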
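
For data loaded from a file, as in the question, the schema can also be handed to the reader so that the columns arrive with the desired types and no later cast is needed. This is a sketch that assumes Spark 2.x or later with the built-in CSV source. The temporary file and its contents are stand-ins for `cars.csv`, and `carSchema` is a name chosen here for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("csv-schema").getOrCreate()

// Assumption: a small temporary file standing in for cars.csv.
val path = Files.createTempFile("cars", ".csv")
Files.write(path, "year,make,model\n2012,Tesla,S\n1997,Ford,E350\n".getBytes("UTF-8"))

val carSchema = StructType(Seq(
  StructField("year", IntegerType, nullable = true),
  StructField("make", StringType, nullable = true),
  StructField("model", StringType, nullable = true)
))

// The reader parses each field into the declared type instead of defaulting to string.
val cars = spark.read
  .option("header", "true")
  .schema(carSchema)
  .csv(path.toString)

cars.printSchema()
```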
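Column type changes are limited by data type compatibility. When a value cannot be converted to the target type, the outcome depends on the Spark version and the `spark.sql.ansi.enabled` setting: the value either becomes null or the query fails at runtime. It is therefore a good idea to inspect the data before changing a column's type and to handle values that do not convert.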
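
One way to check the data before relying on a cast is to count the rows where the source value is present but the cast result is null. This sketch assumes ANSI mode is disabled (it is switched off explicitly below), in which case an unparseable string casts to null rather than raising an error. The sample values are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("cast-check").getOrCreate()
import spark.implicits._

// Assumption: non-ANSI behaviour, where an invalid cast yields null instead of an error.
spark.conf.set("spark.sql.ansi.enabled", "false")

// Made-up values; "unknown" cannot be parsed as an integer.
val years = Seq("2012", "1997", "unknown").toDF("year")
val casted = years.withColumn("yearInt", col("year").cast("int"))

// Rows that had a value before the cast but lost it afterwards.
val failed = casted.filter(col("year").isNotNull && col("yearInt").isNull)
println(s"values that could not be cast: ${failed.count()}")
```

Comparing the null counts before and after the cast separates values that were already missing from values that the cast could not convert.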
That covers a few ways to change column types in a Spark SQL DataFrame: casting an existing column with `withColumn` and `cast`, or declaring the types in a schema when the DataFrame is created.
