Scraping HTML tables into R data frames using the XML package

showcode 2023. 6. 14. 22:00

How do I scrape HTML tables using the XML package?

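Take, for example, the Wikipedia page on the Brazilian soccer team. I would like to read it into R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a data frame. How can I do this?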

…or a shorter try:

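library(XML)
library(RCurl)
library(rlist)
# Download the page source as a single string
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
# Parse every table on the page into a list of data frames
tables <- readHTMLTable(theurl)
# Drop the entries that came back as NULL
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# Count the rows of each remaining table
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))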

The table picked is the longest one on the page.

tables[[which.max(n.rows)]]
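
To try the "longest table" idea without a network connection, the same steps can be run on a small inline page. This is only a sketch: the HTML string is a made-up stand-in for the downloaded page (the first table is hypothetical, the second reuses three rows from the Brazil table), and base R's Filter() is used here in place of rlist::list.clean.

```r
library(XML)

# Hypothetical stand-in for a downloaded page: two tables of different length
html <- paste0(
  "<html><body>",
  "<table><tr><th>Key</th><th>Value</th></tr>",
  "<tr><td>x</td><td>1</td></tr></table>",
  "<table class='wikitable sortable'>",
  "<tr><th>Opponent</th><th>Played</th><th>Won</th></tr>",
  "<tr><td>Argentina</td><td>94</td><td>36</td></tr>",
  "<tr><td>Paraguay</td><td>72</td><td>44</td></tr>",
  "<tr><td>Uruguay</td><td>72</td><td>33</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML content rather than as a file name or URL
doc <- htmlParse(html, asText = TRUE)

# One data frame per table found in the document
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Drop any NULL entries, then keep the table with the most rows
tables <- Filter(Negate(is.null), tables)
n.rows <- vapply(tables, nrow, integer(1))
longest <- tables[[which.max(n.rows)]]
print(longest)
```

Picking by row count is a heuristic. It works when the table of interest happens to be the longest one, and it needs revisiting if the page gains a longer table.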

library(RCurl)
library(XML)

# Download page using RCurl
# You may need to set proxy details, etc.,  in the call to getURL
theurl <- "https://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})

# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body 
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]

#In this case, the required table is the only one with class "wikitable sortable"  
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable  <- tables[which(tableclasses=="wikitable sortable")]$table

#Get columns headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))

# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
   tablerow <- thetable$children[[i]]$children
   opponent <- tablerow[[1]]$children[[2]]$children$text$value
   others <- unname(sapply(tablerow[-1], function(x) x$children$text$value)) 
   content <- rbind(content, c(opponent, others))
}

# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)

Edit to add:

Sample output

                     Opponent Played Won Drawn Lost Goals for Goals against  % Won
    1               Argentina     94  36    24   34       148           150  38.3%
    2                Paraguay     72  44    17   11       160            61  61.1%
    3                 Uruguay     72  33    19   20       127            93  45.8%
    ...

rvest, together with xml2, is another package used for parsing HTML web pages.

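library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file <- read_html(theurl)
tables <- html_nodes(file, "table")
# tables[4] is a one-element node set, so table1 is a list holding one data frame;
# the index 4 depends on the page layout
table1 <- html_table(tables[4], fill = TRUE)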

The syntax is easier to use than that of the XML package, and for most web pages rvest provides all the options you need.
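
Selecting a table by position breaks when the page layout changes, so one alternative is to select it by CSS class instead. The sketch below assumes the target table carries the class "wikitable sortable" (as another answer here notes) and runs on a made-up inline page rather than the live site. The first table in the string is hypothetical.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page
html <- paste0(
  "<html><body>",
  "<table><tr><th>Key</th><th>Value</th></tr>",
  "<tr><td>x</td><td>1</td></tr></table>",
  "<table class='wikitable sortable'>",
  "<tr><th>Opponent</th><th>Played</th><th>Won</th></tr>",
  "<tr><td>Argentina</td><td>94</td><td>36</td></tr>",
  "<tr><td>Paraguay</td><td>72</td><td>44</td></tr>",
  "<tr><td>Uruguay</td><td>72</td><td>33</td></tr>",
  "</table></body></html>"
)

# read_html() accepts literal HTML as well as a URL or file path
page <- read_html(html)

# Select tables carrying both classes, then take the first match as a single node
node <- html_nodes(page, "table.wikitable.sortable")[[1]]

# A single node gives a single data frame (not a list)
df <- html_table(node)
print(df)
```

Passing one node to html_table() also avoids the extra list layer that a node set produces.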

Another option, using XPath:

library(RCurl)
library(XML)

theurl <- "https://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)

# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))

# Clean up the results
content[,1] <- gsub("Â ", "", content[,1])
tablehead <- gsub("Â ", "", tablehead)
names(content) <- tablehead

This produces the following result:

> head(content)
   Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina     94  36    24   34       148           150 38.3%
2  Paraguay     72  44    17   11       160            61 61.1%
3   Uruguay     72  33    19   20       127            93 45.8%
4     Chile     64  45    12    7       147            53 70.3%
5      Peru     39  27     9    3        83            27 69.2%
6    Mexico     36  21     6    9        69            34 58.3%
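
The XPath approach can be made a little less brittle in two ways: matching th and td cells at any depth below the table (so it works whether or not the rows sit inside a tbody element), and deriving the column count from the header instead of hard-coding it. The following is a sketch under those assumptions, run on a made-up inline table that reuses three rows of the output above.

```r
library(XML)

# Hypothetical stand-in for the downloaded page
html <- paste0(
  "<html><body><table class='wikitable sortable'>",
  "<tr><th>Opponent</th><th>Played</th><th>Won</th></tr>",
  "<tr><td>Argentina</td><td>94</td><td>36</td></tr>",
  "<tr><td>Paraguay</td><td>72</td><td>44</td></tr>",
  "<tr><td>Uruguay</td><td>72</td><td>33</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# '//' below the table matches cells with or without an intervening tbody
tablehead <- xpathSApply(doc, "//table[@class='wikitable sortable']//th", xmlValue)
cells <- xpathSApply(doc, "//table[@class='wikitable sortable']//td", xmlValue)

# Column count comes from the header, not from a hard-coded number
content <- as.data.frame(
  matrix(cells, ncol = length(tablehead), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(content) <- tablehead

# Cells arrive as text; convert numeric columns explicitly
content$Played <- as.integer(content$Played)
content$Won <- as.integer(content$Won)
print(content)
```

Reshaping a flat vector of cells this way assumes every row has the same number of td cells. Tables with merged cells need a row-by-row approach instead.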

Reference URL: https://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package
