I have two dataframes, as follows:
df1
chr_number start end strand
0 chr1 111478338 111478339 +
1 chr1 111478370 111478371 +
2 chr1 111478372 111478373 +
3 chr1 157123306 157123307 -
4 chr1 157123307 157123308 -
5 chr1 212619741 212619742 +
6 chr1 212619742 212619743 +
df2
Chromosome Start End Log2 Fold Change Strand Gene \
0 chr1 111478330 111478444 3.036912 + C1orf162
1 chr1 157123300 157123338 3.293174 - ETV3
2 chr1 207079296 207079412 3.916122 + PFKFB2
3 chr1 212619736 212619771 3.880546 + ATF3
Ensembl ID Feature
0 ENSG00000143110.11 3' UTR
1 ENSG00000117036.12 3' UTR
2 ENSG00000123836.15 3' UTR
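3 ENSG00000162772.17 3' UTR
I need to check whether the start and end in df1 fall between Start and End in df2. If they do, I want a new dataframe that contains the start value from df1 together with the corresponding row from df2.
Here is an example of what I need for each start value in df1: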
CrossLink Chromosome Start End Log2 Fold Change Strand \
1 111478338 chr1 111478330.0 111478444.0 3.036912 +
Gene Ensembl ID Feature
1 C1orf162 ENSG00000143110.11 3' UTR
I wrote this code:
df3 = pd.DataFrame([])
df3["CrossLink"] = np.nan
for v in df1["start"]:
    df4 = df2[(df2["Start"] <= v) & (df2["End"] > v)]
    df3 = df3.append(df4)
df3["CrossLink"] = df1["start"]
and I got this output:
CrossLink Chromosome Start End Log2 Fold Change Strand \
0 111478338 chr1 111478330.0 111478444.0 3.036912 +
0 111478338 chr1 111478330.0 111478444.0 3.036912 +
0 111478338 chr1 111478330.0 111478444.0 3.036912 +
1 111478370 chr1 157123300.0 157123338.0 3.293174 -
1 111478370 chr1 157123300.0 157123338.0 3.293174 -
3 157123306 chr1 212619736.0 212619771.0 3.880546 +
3 157123306 chr1 212619736.0 212619771.0 3.880546 +
Gene Ensembl ID Feature
0 C1orf162 ENSG00000143110.11 3' UTR
0 C1orf162 ENSG00000143110.11 3' UTR
0 C1orf162 ENSG00000143110.11 3' UTR
1 ETV3 ENSG00000117036.12 3' UTR
1 ETV3 ENSG00000117036.12 3' UTR
3 ATF3 ENSG00000162772.17 3' UTR
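3 ATF3 ENSG00000162772.17 3' UTR
It does not contain all the start values from df1, and it gives me duplicates. I am very new to Python and pandas; I have searched a lot but cannot figure it out.
Thanks in advance for your help!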
Answered 2022-02-15 00:02:00
A solution in two steps:
Suppose we have
df = pd.DataFrame({'chr_number':['chr1', 'chr2'], 'start':[3, 5],})
df2 = pd.DataFrame({'index': ['chr1', 'chr3'], 'col': ['a', 'b'], 'start': [1, 2], 'end':[4, 5]})
print(df)
print(df2)
chr_number start
0 chr1 3
1 chr2 5
index col start end
0 chr1 a 1 4
1 chr3 b 2 5
Then we can apply an aggregation and explode to get the desired output.
df2.start = df2.apply(lambda x: df.loc[(x['start'] <= df.start) & (df.start <= x['end'])].start.agg(list), axis=1)
print(df2.explode('start'))
index col start end
0 chr1 a 3 4
1 chr3 b 3 5
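1 chr3 b 5 5
Edit: I realized I had been incorrectly comparing against the df2 values rather than the df values. The edited code now replaces df2.start with the df.start values that lie between df2.start and df2.end, for each row of df2.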
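For the column names used in the question, the same idea might also be sketched with a merge followed by a filter, instead of apply and explode. This is only an illustration under a few assumptions: the sample values are copied from the question, df2 is trimmed to a few columns, sites are matched on chromosome only (strand is ignored), and the bounds Start <= start < End are taken from the question's loop. The names `pairs` and `result` are made up for this example.

```python
import pandas as pd

# Sample data copied from the question (df2 trimmed to a few columns).
df1 = pd.DataFrame({
    "chr_number": ["chr1"] * 7,
    "start": [111478338, 111478370, 111478372, 157123306,
              157123307, 212619741, 212619742],
})
df2 = pd.DataFrame({
    "Chromosome": ["chr1"] * 4,
    "Start": [111478330, 157123300, 207079296, 212619736],
    "End": [111478444, 157123338, 207079412, 212619771],
    "Strand": ["+", "-", "+", "+"],
    "Gene": ["C1orf162", "ETV3", "PFKFB2", "ATF3"],
})

# Step 1: pair every df1 site with every df2 interval on the same chromosome.
pairs = df1.rename(columns={"start": "CrossLink"}).merge(
    df2, left_on="chr_number", right_on="Chromosome"
)

# Step 2: keep only the pairs where the site lies inside the interval
# (assumed bounds: Start <= CrossLink < End, as in the question's loop).
inside = (pairs["Start"] <= pairs["CrossLink"]) & (pairs["CrossLink"] < pairs["End"])
result = pairs[inside].drop(columns="chr_number").reset_index(drop=True)

print(result)
```

Because each df1 row carries its own start value through the merge, CrossLink no longer depends on index alignment, which appears to be what mislabels the rows in the question's output. The intermediate `pairs` frame holds one row per site/interval combination on a chromosome, so this sketch may use a lot of memory on large inputs.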
https://stackoverflow.com/questions/71119447