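I am trying to parse a Hebrew HTML web page and ran into a problem when using the RCurl tools. I have been reading the following posts:
I used the following R code: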
library(XML)
library(RCurl)
url_get<-"http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
download.file(url_get, "codes/tmp.html")
txt <- readLines("codes/tmp.html", encoding="UTF-8")
pagetree <- htmlParse(txt, useInternalNodes = TRUE, encoding="UTF-8")
While readLines() produces proper Hebrew (בעלי מקצוע):
txt[345]
[1] "<a id=\"professionals\" href=\"/texts/midrag.asp?parameter=\" target=\"_blank\" title=\"בעלי מקצוע\">"
htmlParse() messes it up (׳·׳-׳-׳₪׳׳™׳™׳“׳׳™׳™׳”׳)׳׳™׳“׳׳׳׳™׳):
<a href="http://shlah.agora.co.il/financial/financial1.html">׳׳¦׳׳× ׳׳”׳׳™׳ ׳•׳¡</a><br><br><span class="linkWords">׳׳•׳— ׳—׳₪׳¦׳™ ׳™׳“ ׳©׳ ׳™׳” ׳׳׳¡׳™׳¨׳” ׳‘׳—׳™׳ ׳ ׳‘׳׳‘׳“ -
Any ideas?
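The garbled output has a recognizable shape: pairs such as ׳™ and ׳“ are what UTF-8 encoded Hebrew letters look like when their bytes are read as Windows-1255, which is the code page of the Hebrew_Israel.1255 locale reported in the session info. The following is only a minimal sketch of that mechanism in base R, not a fix for htmlParse(). It assumes the local iconv build recognizes the name "CP1255", and the variable names are made up for illustration.

```r
# Two Hebrew letters (yod, dalet), written as Unicode escapes so the
# script does not depend on the encoding of the source file.
hebrew <- "\u05d9\u05d3"

# Their UTF-8 representation is four bytes: d7 99 d7 93.
utf8_bytes <- charToRaw(enc2utf8(hebrew))

# Misinterpret those bytes as Windows-1255 and convert the result to UTF-8.
# Each Hebrew letter becomes two unrelated characters.
garbled <- iconv(rawToChar(utf8_bytes), from = "CP1255", to = "UTF-8")

# Reverse the mistake: encode the garbled text back to Windows-1255 bytes,
# then declare that those bytes are really UTF-8.
restored <- iconv(garbled, from = "UTF-8", to = "CP1255")
Encoding(restored) <- "UTF-8"

print(utf8_bytes)
print(charToRaw(restored))
```

If the two printed byte sequences match, the round trip recovered the original text, which supports reading the symptom as UTF-8 bytes being decoded with the locale's code page somewhere between readLines() and htmlParse().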
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255 LC_MONETARY=Hebrew_Israel.1255
[4] LC_NUMERIC=C LC_TIME=Hebrew_Israel.1255
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.95-4.3 bitops_1.0-6 XML_3.98-1.1
loaded via a namespace (and not attached):
[1] tools_3.1.1
Posted on 2014-08-25 14:46:37
I cannot reproduce your problem. Here are the steps I took:
https://stackoverflow.com/questions/25446141