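I have a variable "bio_sentences" which, as the name suggests, holds the four to five biography sentences of each individual (extracted from the "bio" variable and split into sentences). I am trying to determine a person's gender with this logic: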
Femalew <- c("She", "Her")
Check <- str_extract_all(bio,Femalew)
Check <- Check[Check != "character(0)"]
Gender <- vector("character")
if(length(Check) > 0){
Gender[1] <- "Female"
}else{
Gender[1] <- "Male"
}
for(i in 1:length(bio_sentences)){
Gender[i] <- Gender[1]
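}

I get a fairly good result (most people in my dataset are male), but there are a few misses: some females are not detected even though their sentences contain "She" or "Her". Can I improve the accuracy of this logic, or use a different function such as grepl?

Edit: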
data1.Gender A B C D E data1.Description
1 Female 0 0 0 0 0 Ranjit Singh President of Boparan Holdings Limited Ranjit is President of Boparan Holdings Limited.
2 Female 0 0 0 NA NA He founded the business in 1993 and has more than 25 years’ experience in the food industry.
3 Female 0 0 0 NA NA Ranjit is particularly skilled at growing businesses, both organically and through acquisition.
4 Female 0 0 0 NA NA Notable acquisitions include Northern Foods and Brookes Avana in 2011.
5 Female 0 0 0 NA NA Ranjit and his wife Baljinder Boparan are the sole shareholders of Boparan Holdings, the holding company for 2 Sisters Food Group.
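6 Female 0 0 0 NA NA s

Above is one person from the data. My requirement is that the code reads all rows of "data1.Description" (in my code this is a for loop, so it reads every sentence of each person). As you can see, one of the sentences contains an obvious "He", yet the logic I wrote above classifies this person as "Female".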
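Posted on 2018-10-02 12:17:30
As @Merijn van Tilborg stated, your sentences need to be very clear, because as soon as a sentence contains more than one pronoun your approach cannot give the expected result.
However, these cases can be handled too. We can try the dplyr and tidytext packages, but the data has to be cleaned up a little first: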
# define the gender words explicitly
female <- c("She", "Her")
male <- c("He", "His")
# your data, with examples of several cases
df <- data.frame(
  line = c(1, 2, 3, 4, 5, 6),
  text = c("She is happy",          # female
           "Her dog is happy",      # female (although the subject, the dog, is not female)
           "He is happy",           # male
           "His dog is happy",      # male
           "It is happy",           # ?
           "She and he are happy"), # both!
  stringsAsFactors = FALSE)         # life saver

Now we can try something like this:
library(tidytext)
library(dplyr)
df %>%
  unnest_tokens(word, text) %>%    # put words in rows
  mutate(gender = ifelse(word %in% tolower(female), 'female',
                  ifelse(word %in% tolower(male), 'male', 'unknown'))) %>%  # detect male and female, remember tolower!
  filter(gender != 'unknown') %>%  # remove the unknown
  right_join(df) %>%               # join with the original sentences, keeping all of them
  select(-word)                    # remove useless column
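  line gender                 text
1    1 female         She is happy
2    2 female     Her dog is happy
3    3   male          He is happy
4    4   male     His dog is happy
5    5   <NA>          It is happy
6    6 female She and he are happy
7    6   male She and he are happy

You can see that sentences 1, 2, 3 and 4 match your criteria and that "it" gets no gender. When a sentence contains both a male and a female word, its row is duplicated, one row per match, so you can see why.
Finally, you can collapse the result to one row per sentence by adding this to the dplyr chain: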
%>% group_by(text, line) %>% summarise(gender = paste(gender, collapse = ','))
# A tibble: 6 x 3
# Groups:   text [?]
  text                  line gender
  <chr>                <dbl> <chr>
1 He is happy              3 male
2 Her dog is happy         2 female
3 His dog is happy         4 male
4 It is happy              5 NA
5 She and he are happy     6 female,male
6 She is happy             1 female

Edit: let's try it with your data:
data1 <- read.table(text="
data1.Gender A B C D E data1.Description
1 Female 0 0 0 0 0 'Ranjit Singh President of Boparan Holdings Limited Ranjit is President of Boparan Holdings Limited.'
2 Female 0 0 0 NA NA 'He founded the business in 1993 and has more than 25 years’ experience in the food industry.'
3 Female 0 0 0 NA NA 'Ranjit is particularly skilled at growing businesses, both organically and through acquisition.'
4 Female 0 0 0 NA NA 'Notable acquisitions include Northern Foods and Brookes Avana in 2011.'
5 Female 0 0 0 NA NA 'Ranjit and his wife Baljinder Boparan are the sole shareholders of Boparan Holdings, the holding company for 2 Sisters Food Group.'
6 Female 0 0 0 NA NA 's'",stringsAsFactors = FALSE)
# define the gender words explicitly; in this case I have also added the names
female <- c("She", "Her", "Baljinder")
male <- c("He", "His", "Ranjit")
# clean the data
df <- data.frame(
  line = rownames(data1),
  text = data1$data1.Description,
  stringsAsFactors = FALSE)
library(tidytext)
library(dplyr)
df %>%
  unnest_tokens(word, text) %>%    # put words in rows
  mutate(gender = ifelse(word %in% tolower(female), 'female',
                  ifelse(word %in% tolower(male), 'male', 'unknown'))) %>%  # detect male and female, remember tolower!
  filter(gender != 'unknown') %>%  # remove the unknown
  right_join(df) %>%               # join with the original sentences, keeping all of them
  select(-word) %>%
  group_by(text, line) %>%
  summarise(gender = paste(gender, collapse = ','))

The result:
Joining, by = "line"
# A tibble: 6 x 3
# Groups:   text [?]
  text                                                            line  gender
  <chr>                                                           <chr> <chr>
1 He founded the business in 1993 and has more than 25 years’ ex~ 2     male
2 Notable acquisitions include Northern Foods and Brookes Avana ~ 4     NA
3 Ranjit and his wife Baljinder Boparan are the sole shareholder~ 5     male,male,fe~
4 Ranjit is particularly skilled at growing businesses, both org~ 3     male
5 Ranjit Singh President of Boparan Holdings Limited Ranjit is P~ 1     male,male
6 s                                                               6     NA

The real game is defining all the words that you would consider "male" or "female".
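To get one label per person rather than one per sentence, the matches can be pooled over all of that person's sentences. Below is a minimal base R sketch of that idea, not the code from the answer above. It assumes the sentences sit in a character vector like bio_sentences from the question; count_hits is a hypothetical helper, and the shortened sentences and the simple majority rule are only an illustration. The `\\b` word boundaries stop "he" from matching inside "the" or "shareholders".

```r
# Hypothetical helper: count whole-word, case-insensitive matches
# of any word in `words` across all elements of `text`
count_hits <- function(text, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  # gregexpr returns -1 for "no match", so only positive positions count
  sum(vapply(hits, function(m) sum(m > 0), integer(1)))
}

# Shortened versions of the sentences from the question (one person)
bio_sentences <- c(
  "Ranjit is President of Boparan Holdings Limited.",
  "He founded the business in 1993.",
  "Ranjit and his wife Baljinder Boparan are the sole shareholders."
)

female <- c("she", "her")
male   <- c("he", "his", "him")

f <- count_hits(bio_sentences, female)
m <- count_hits(bio_sentences, male)

# Assumed decision rule: simple majority, ties stay undecided
gender <- if (f > m) "Female" else if (m > f) "Male" else "Unknown"
gender
```

Wrapping the last part in a function and calling it once per person gives one decision per person; a tie or zero matches is better left as "Unknown" than forced into a default.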
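Posted on 2018-10-02 11:49:17
This is a lot more complicated, because context is key here. Look at the following three phrases...
Susan has a great professor, his name is Adam. He taught his favorite student... (by this logic Susan is not female but male)
Susan has a great professor, his name is Adam. He taught her everything... (OK, now we have a "she", but also a "he")
Susan has a great professor called Adam. Adam taught her everything... (OK, we have a female)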
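Running the three phrases through a plain pronoun count makes the problem concrete. This is a minimal base R sketch: label_phrase is a hypothetical helper, the English wording of the phrases is approximate, and the names are deliberately left out of the word lists.

```r
# Approximate English renderings of the three phrases above
phrases <- c(
  "Susan has a great professor, his name is Adam. He taught his favorite student.",
  "Susan has a great professor, his name is Adam. He taught her everything.",
  "Susan has a great professor called Adam. Adam taught her everything."
)

female <- c("she", "her")
male   <- c("he", "his", "him")

# Hypothetical helper: label a phrase by comparing pronoun counts.
# Whom each pronoun refers to is ignored entirely.
label_phrase <- function(phrase) {
  words <- strsplit(tolower(phrase), "[^a-z]+")[[1]]
  f <- sum(words %in% female)
  m <- sum(words %in% male)
  if (f > m) "female" else if (m > f) "male" else "tie"
}

labels <- vapply(phrases, label_phrase, character(1), USE.NAMES = FALSE)
labels
```

All three phrases are about Susan, yet the first two are labelled "male" and only the third "female": the count picks up the professor's pronouns, not the subject's.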
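Posted on 2018-10-03 09:14:46
In addition to the answers already given, I would strongly suggest adding the most common female names to the list. They are easy to find online, for example as the 100 most popular female names in a country. I believe that even adding roughly the 500 most common names to the female list gives you a decent start. Do the same for males.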
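One thing to watch when merging name lists is that some names can appear on both the female and the male list. Here is a small base R sketch that drops such names before building the indicator vectors. The names are made-up examples; the real lists would come from whatever source you pick.

```r
# Made-up example lists; "alex" is deliberately on both
female_names <- c("susan", "sandra", "alex")
male_names   <- c("steve", "adam", "alex")

# Names found on both lists carry no gender signal, so set them aside
ambiguous <- intersect(female_names, male_names)

f.indicators <- c("she", "her", setdiff(female_names, ambiguous))
m.indicators <- c("he", "him", "his", setdiff(male_names, ambiguous))

f.indicators
m.indicators
```

Keeping everything lowercase matches the lowercasing done on the text before comparison.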
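I have also put together an example with some decision rules for how likely it is that a text is about a female or a male. One approach is simply to count the occurrences and compute a ratio. Based on that ratio you can make your own decisions. My cut-offs are just an arbitrary example, and I wrote each decision as its own line (it could be coded more efficiently).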
library(data.table) ## just my personal preference over dplyr
library(stringr)    ## just my personal favorite when I deal with strings
df = data.table(text = c("Because Sandra is a female name and we talk a few times about her, she is most likely a female he says.",
                         "Sandra is mentioned and the only references are about how she did everything to achieve her goals.",
                         "Nothing is mentioned that reveals a gender.",
                         "She talks about him and he talks about her.",
                         "Sandra says: he is nice and she is nice too.",
                         "Adam is a male and we only talk about him"))
f.indicators = c("she", "her", "susan", "sandra")
m.indicators = c("he", "him", "his", "steve", "adam")
## count the female and male indicator words in each text
df[, f.count := sum(str_split(str_to_lower(text), "[[:space:]]|[[:punct:]]")[[1]] %in% f.indicators, na.rm = TRUE), by = text]
df[, m.count := sum(str_split(str_to_lower(text), "[[:space:]]|[[:punct:]]")[[1]] %in% m.indicators, na.rm = TRUE), by = text]
## share of female indicators; stays NA when nothing was found at all
df[f.count != 0 | m.count != 0, gender_ratio_female := f.count / (f.count + m.count)]
df[, decision := "Unknown"]
df[gender_ratio_female == 1, decision := "Female, no male indications"]
df[gender_ratio_female == 0, decision := "Male, no female indicators"]
## boundaries 0.4 and 0.6 are included here so that no ratio falls through to "Unknown"
df[gender_ratio_female >= 0.4 & gender_ratio_female <= 0.6, decision := "Gender should be checked"]
df[gender_ratio_female > 0.6 & gender_ratio_female < 1, decision := "Probably a Female"]
df[gender_ratio_female > 0 & gender_ratio_female < 0.4, decision := "Probably a Male"]

Sorry, I had trouble formatting the output table here, I am new.
   text                                                                                                    f.count m.count gender_ratio_female decision
1: Because Sandra is a female name and we talk a few times about her, she is most likely a female he says.       3       1              0.7500 Probably a Female
2: Sandra is mentioned and the only references are about how she did everything to achieve her goals.            3       0              1.0000 Female, no male indications
3: Nothing is mentioned that reveals a gender.                                                                   0       0                  NA Unknown
4: She talks about him and he talks about her.                                                                   2       2              0.5000 Gender should be checked
5: Sandra says: he is nice and she is nice too.                                                                  2       1              0.6667 Probably a Female
6: Adam is a male and we only talk about him                                                                     0       2              0.0000 Male, no female indicators

https://stackoverflow.com/questions/52607494