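I have a list of addresses that are not fully formatted. Most share the same basic structure, but about one in five was entered incorrectly.
df1 contains 24 addresses, each a single string. My goal is to find the addresses that appear to be missing a word or number and to add the missing element to each string where it most likely belongs.
My approach is to count how many times each unique word/number appears in the data frame. Words that appear in 80% or more of the rows are identified as words that should be present in every address. Any missing word needs to be added in the "correct" position, based on the format of the addresses that contain all of the address elements.
I can identify the words that need to be added, but I have not found a way to add them to each string where they are absent, nor a way to make sure they are added at the correct position in the string. This is further complicated because in my real dataset the address format is not fixed across areas. In this example, the building number and road name should be the 3rd and 4th address elements; elsewhere they are the 1st and 2nd, the 2nd and 3rd, and so on. So the solution I have been trying to develop also needs to be dynamic.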
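The "80% or more of the rows" rule can be sketched as below. This is only an illustration on a made-up mini sample (the `addresses` vector and the names `row_counts`, `row_share` and `frequent_words` are hypothetical, not part of my real data). It counts each word once per row, so a number that happens to appear twice in one address is not counted twice.

```r
# Hypothetical mini sample, not the real data.
addresses <- c("apt 23 5 roadname cityville",
               "apt 5 5 roadname cityville",
               "apt 24 roadname cityville",
               "85 5 roadname cityville",
               "apt 7 5 roadname cityville")
tokens <- strsplit(addresses, " ", fixed = TRUE)

# unique() keeps one copy of each word per row before tallying,
# so the counts are "number of rows containing the word".
row_counts <- table(unlist(lapply(tokens, unique)))
row_share <- as.vector(row_counts) / length(addresses)
names(row_share) <- names(row_counts)

# Words present in at least 80% of the rows.
frequent_words <- names(row_share)[row_share >= 0.8]
```

Here "5" occurs five times in total but in only four of the five rows, so its row share is 0.8.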
Here is my sample dataset:
df1 <- data.frame(V1=c("apt 23 5 roadname cityville b11abc", "apt 47 5 roadname cityville b11abc", "apt 24 roadname cityville b11abc", "apt 3 roadname cityville b11abc", "apt 44 5 roadname cityville b11abc", "apt 88 5 roadname cityville b11abc", "apt 7 5 roadname cityville b11abc", "apt 41 5 roadname cityville b11abc", "apt 55 5 roadname cityville b11abc", "apt 19 5 roadname cityville b11abc", "85 5 roadname cityville b11abc", "apt 12 roadname cityville b11abc", "apt 452 5 roadname cityville b11abc", "apt 1 5 roadname cityville b11abc", "99 5 roadname cityville b11abc", "apt 73 5 roadname cityville b11abc", "74 roadname cityville b11abc", "apt 75 5 roadname cityville b11abc", "apt 63 5 roadname cityville b11abc", "apt 48 5 roadname cityville b11abc", "apt 123 5 roadname cityville b11abc", "apt 56 5 roadname cityville b11abc", "6 5 roadname cityville b11abc", "apt 2 6 roadname cityville b11abc"), stringsAsFactors = FALSE)
This is how I identify the words that need to be added:
df1_words <- as.data.frame(table(unlist(strsplit(df1$V1, " "))))
df1_words_80 <- subset(df1_words, Freq >= round(nrow(df1)/100*80))
This is my desired output:
df2 <- data.frame(V1=c("apt 23 5 roadname cityville b11abc", "apt 47 5 roadname cityville b11abc", "apt 24 5 roadname cityville b11abc", "apt 3 5 roadname cityville b11abc", "apt 44 5 roadname cityville b11abc", "apt 88 5 roadname cityville b11abc", "apt 7 5 roadname cityville b11abc", "apt 41 5 roadname cityville b11abc", "apt 55 5 roadname cityville b11abc", "apt 19 5 roadname cityville b11abc", "apt 85 5 roadname cityville b11abc", "apt 12 5 roadname cityville b11abc", "apt 452 5 roadname cityville b11abc", "apt 1 5 roadname cityville b11abc", "apt 99 5 roadname cityville b11abc", "apt 73 5 roadname cityville b11abc", "apt 74 5 roadname cityville b11abc", "apt 75 5 roadname cityville b11abc", "apt 63 5 roadname cityville b11abc", "apt 48 5 roadname cityville b11abc", "apt 123 5 roadname cityville b11abc", "apt 56 5 roadname cityville b11abc", "apt 6 5 roadname cityville b11abc", "apt 2 6 roadname cityville b11abc"), stringsAsFactors = FALSE)
Edit: after applying ikop's solution to my real dataset, I ran into a problem when the list contains addresses of different lengths. I think the problem is that for some short addresses (e.g. 5 words long) the code tries to insert frequent words that normally sit at positions 6, 7, 8, 9 and so on, which is impossible and raises an error. I can think of two solutions: count positions backwards from the end of the string rather than forwards, or, probably the simpler option (and the one that I think best suits my particular needs), ignore rows that contain unusually short strings.
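The first option, counting from the end, might look roughly like this. It is a sketch only: `offset_from_end` and `insert_from_end` are hypothetical helper names, and the three addresses are a cut-down sample in the style of my data. It would not help for words anchored at the start of the string (such as "apt"), and it does not address rows that merely contain a different building number.

```r
# Hypothetical helper: most common distance of `word` from the end of a row
# (0 = last token), computed over the rows that contain it.
offset_from_end <- function(tokens, word) {
  offsets <- unlist(lapply(tokens, function(x) length(x) - which(x == word)))
  as.integer(names(which.max(table(offsets))))
}

# Hypothetical helper: insert `word` so that `offset` tokens follow it.
# Rows that already contain the word, or that have fewer than `offset`
# tokens, are returned unchanged.
insert_from_end <- function(x, word, offset) {
  if (word %in% x || length(x) < offset) return(x)
  append(x, word, after = length(x) - offset)
}

tokens <- strsplit(c("apt really long name 23 5 roadname cityville b11abc",
                     "apt 75 5 roadname cityville b11abc",
                     "apt 3 roadname cityville b11abc"), " ", fixed = TRUE)

off <- offset_from_end(tokens, "5")
fixed_row <- paste(insert_from_end(tokens[[3]], "5", off), collapse = " ")
```

In both rows that contain "5", three tokens follow it, so the short third row gets "5" placed three tokens from the end regardless of how long the part before it is.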
The problem I ran into can be reproduced by using df3 with ikop's solution:
df3 <- data.frame(V1=c("apt really long name 23 5 roadname cityville b11abc", "apt really long name 47 5 roadname cityville b11abc", "apt really long name 24 roadname cityville b11abc", "apt 3 roadname cityville b11abc", "apt really long name 44 5 roadname cityville b11abc", "apt really long name 88 5 roadname cityville b11abc", "apt really long name 7 5 roadname cityville b11abc", "apt really long name 41 5 roadname cityville b11abc", "apt really long name 55 5 roadname cityville b11abc", "apt really long name 19 5 roadname cityville b11abc", "85 5 roadname cityville b11abc", "apt really long name 12 roadname cityville b11abc", "apt really long name 452 5 roadname cityville b11abc", "apt really long name 1 5 roadname cityville b11abc", "99 5 roadname cityville b11abc", "apt really long name 73 5 roadname cityville b11abc", "74 roadname cityville b11abc", "apt 75 5 roadname cityville b11abc", "apt really long name 63 5 roadname cityville b11abc", "apt really long name 48 5 roadname cityville b11abc", "apt really long name 123 5 roadname cityville b11abc", "apt really long name 56 5 roadname cityville b11abc", "6 5 roadname cityville b11abc", "apt really long name 2 6 roadname cityville b11abc"), stringsAsFactors = FALSE)
Answered 2017-05-09 05:43:49
Here is a nasty solution that will get you most of the way there.
## For each word that appears in at least 80% of the rows compute
## the most frequent position it appears in:
library(dplyr)
splitList <- strsplit(df1$V1, " ")
wordVec <- unique(unlist(splitList))
wordFrequencyDf <- lapply(wordVec, function(theWord){
  freqWord <- sum(unlist(splitList) == theWord)
  posVec <- unlist(lapply(splitList, function(x) which(x == theWord)))
  mostFreqPos <- sort(table(posVec), decreasing = TRUE)[1] %>% names %>% as.numeric
  data.frame(theWord, freqWord, mostFreqPos)
}) %>%
  do.call('rbind', .) %>%
  dplyr::mutate(theWord = as.character(theWord)) %>%
  dplyr::filter(freqWord >= round(nrow(df1)*0.8)) %>%
  dplyr::arrange(mostFreqPos)
## Now loop over those words and insert the word in the relevant
## position if necessary:
for (ii in seq_along(wordFrequencyDf$theWord)){
  splitList <- lapply(splitList, function(x){
    relPos <- wordFrequencyDf$mostFreqPos[ii]
    if (x[relPos] != wordFrequencyDf$theWord[ii]){
      if (relPos == 1){
        strBefore <- NULL
      } else {
        strBefore <- x[1:(relPos-1)]
      }
      if (relPos > length(x)){
        strAfter <- NULL
      } else {
        strAfter <- x[relPos:length(x)]
      }
      x <- c(strBefore, wordFrequencyDf$theWord[ii], strAfter)
    }
    x
  })
}
## Paste list together into a single string again:
df2 <- data.frame(V1 = sapply(splitList, function(x) paste(x, collapse = " ")))
Result:
df2
# V1
# 1 apt 23 5 roadname cityville b11abc
# 2 apt 47 5 roadname cityville b11abc
# 3 apt 24 5 roadname cityville b11abc
# 4 apt 3 5 roadname cityville b11abc
# 5 apt 44 5 roadname cityville b11abc
# 6 apt 88 5 roadname cityville b11abc
# 7 apt 7 5 roadname cityville b11abc
# 8 apt 41 5 roadname cityville b11abc
# 9 apt 55 5 roadname cityville b11abc
# 10 apt 19 5 roadname cityville b11abc
# 11 apt 85 5 roadname cityville b11abc
# 12 apt 12 5 roadname cityville b11abc
# 13 apt 452 5 roadname cityville b11abc
# 14 apt 1 5 roadname cityville b11abc
# 15 apt 99 5 roadname cityville b11abc
# 16 apt 73 5 roadname cityville b11abc
# 17 apt 74 5 roadname cityville b11abc
# 18 apt 75 5 roadname cityville b11abc
# 19 apt 63 5 roadname cityville b11abc
# 20 apt 48 5 roadname cityville b11abc
# 21 apt 123 5 roadname cityville b11abc
# 22 apt 56 5 roadname cityville b11abc
# 23 apt 6 5 roadname cityville b11abc
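# 24 apt 2 5 roadname cityville b11abc 6 roadname cityville b11abc
As you can see, the approach fails on the last row. There, the original row does not have a "5" in position 3 (as the code expects). The catch is that the building number is not actually missing; the string simply contains a different building number. The code nevertheless interprets this as a missing building number and inserts "5" at position 3. That insertion shifts the remaining tokens by one, so the checks for "roadname", "cityville" and "b11abc" then fail as well and those words are inserted a second time.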
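One possible way around both this and the short-row error is sketched below. It is not a definitive fix, just an illustration under a stated assumption: a row is only filled in when adding its missing frequent words would bring it to exactly the most common length of the rows that already contain all of them. The function name `fill_frequent_words`, its `threshold` argument and the ten sample addresses are hypothetical.

```r
fill_frequent_words <- function(addresses, threshold = 0.8) {
  tokens <- strsplit(addresses, " ", fixed = TRUE)

  # Words present in at least round(n * threshold) rows (each row counted once).
  row_counts <- table(unlist(lapply(tokens, unique)))
  cutoff <- round(length(addresses) * threshold)
  frequent <- names(row_counts)[as.vector(row_counts) >= cutoff]

  # Template rows: rows that already contain every frequent word.
  complete <- Filter(function(x) all(frequent %in% x), tokens)
  if (length(frequent) == 0 || length(complete) == 0) return(addresses)

  most_common <- function(v) as.integer(names(which.max(table(v))))
  template_len <- most_common(lengths(complete))
  # Most common position of each frequent word within the template rows.
  positions <- sort(vapply(frequent, function(w) {
    most_common(vapply(complete, function(x) match(w, x), integer(1)))
  }, integer(1)))

  fixed <- lapply(tokens, function(x) {
    missing_words <- names(positions)[!(names(positions) %in% x)]
    # Assumption: only fill rows that would end up at the template length;
    # anything else (different number, unusually short row) is left alone.
    if (length(missing_words) == 0 ||
        length(x) + length(missing_words) != template_len) return(x)
    for (w in missing_words) x <- append(x, w, after = positions[[w]] - 1L)
    x
  })
  vapply(fixed, paste, character(1), collapse = " ")
}

# Hypothetical cut-down sample in the style of df1.
sample_addresses <- c("apt 23 5 roadname cityville b11abc", "apt 47 5 roadname cityville b11abc",
                      "apt 24 roadname cityville b11abc", "apt 44 5 roadname cityville b11abc",
                      "apt 88 5 roadname cityville b11abc", "85 5 roadname cityville b11abc",
                      "apt 7 5 roadname cityville b11abc", "apt 41 5 roadname cityville b11abc",
                      "apt 55 5 roadname cityville b11abc", "apt 2 6 roadname cityville b11abc")
filled <- fill_frequent_words(sample_addresses)
```

On this sample, rows 3 and 6 gain the missing "5" and "apt", while the last row, which is already six tokens long, is left as it is. Rows that are too short to reach the template length are skipped rather than raising an error, which corresponds to the "ignore unusually short rows" option from the question.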
https://stackoverflow.com/questions/43856683