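I have two datasets, df1 and df2. df1 contains the correct data and is used to check the data in df2.
I want to find the latitudes and longitudes in df2 that do not match the city names in df1.
I also want to find the cities in df2 that are "located" in the wrong country.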
df1:
id          city             lat       lng       country
1036323110  Katherine        -14.4667  132.2667  Australia
1840015979  South Pasadena   27.7526   -82.7394  United States
1124755118  Beaconsfield     45.4333   -73.8667  Canada
1250921305  Ferney-Voltaire  46.2558   6.1081    France
1156346497  Jiangshan        28.7412   118.6225  China
1231393325  Dīla             6.4104    38.3100   Ethiopia
1192391794  Gibara           21.1072   -76.1367  Cuba
1840054954  Hampstead        42.8821   -71.1709  United States
1840005111  West Islip       40.7097   -73.2971  United States
1076327352  Paulínia         -22.7611  -47.1542  Brazil
df2:
id             location              city                   country
16620625-5686  45.5333, -73.2833     Saint-Basile-le-Grand  Canada
16310427-5502  52.0000, 84.9833      Belokurikha            Russia
16501010-4957  -14.4667, 136.2667    Katherine              Australia
16110430-8679  40.5626, -74.5743     Finderne               United States
16990624-4174  27.7526, -90.7394     South Pasadena         China
16790311-9092  35.98157, -160.41182  Jiangshan              United States
16650927-9151  44.7667, 39.8667      West Islip             Russia
16530328-2221  -22.8858, -48.4450    Botucatu               Brazil
16411229-7314  42.8821, -71.1709     Hampstead              United States
16060229-4175  -7.7296, 38.9500      Kibiti                 Tanzania
My code so far:
city_df = pd.merge(df1, df2, on='city', how='left')
Posted on 2022-04-10 08:03:52
First, add lat and lng columns to df2:
df2[['lat', 'lng']] = df2['location'].str.split(', ', expand=True)
df2[['lat', 'lng']] = df2[['lat', 'lng']].astype(float)

Then merge df1 with df2 on city:
city_df = pd.merge(df1[['lat', 'lng', 'city', 'country']], df2, on='city', how='right', suffixes=('_correct', ''))

Find the cities in df2 that are "located" in the wrong country:
m = ~((city_df['country_correct'] == city_df['country']) | city_df['country_correct'].isna())
print(city_df[m])

   lat_correct  lng_correct  city            country_correct  id             location              country        lat       lng
4  27.7526      -82.7394     South Pasadena  United States    16990624-4174  27.7526, -90.7394     China          27.75260  -90.73940
5  28.7412      118.6225     Jiangshan       China            16790311-9092  35.98157, -160.41182  United States  35.98157  -160.41182
6  40.7097      -73.2971     West Islip      United States    16650927-9151  44.7667, 39.8667      Russia         44.76670  39.86670

Posted on 2022-04-10 09:50:30
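The merge-based approach above stops at the country check. As a minimal sketch (not part of the original answer), the same merged frame could also flag coordinates in df2 that disagree with df1. The rows are a small subset of the question's tables, and the names `TOL`, `known`, `close` and `wrong_coords` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# A few rows taken from the question's df1 and df2.
df1 = pd.DataFrame({
    "city": ["Katherine", "South Pasadena", "Hampstead"],
    "lat": [-14.4667, 27.7526, 42.8821],
    "lng": [132.2667, -82.7394, -71.1709],
    "country": ["Australia", "United States", "United States"],
})
df2 = pd.DataFrame({
    "id": ["16501010-4957", "16990624-4174", "16411229-7314", "16060229-4175"],
    "location": ["-14.4667, 136.2667", "27.7526, -90.7394",
                 "42.8821, -71.1709", "-7.7296, 38.9500"],
    "city": ["Katherine", "South Pasadena", "Hampstead", "Kibiti"],
    "country": ["Australia", "China", "United States", "Tanzania"],
})

# Same preparation and merge as in the answer.
df2[["lat", "lng"]] = df2["location"].str.split(", ", expand=True).astype(float)
city_df = pd.merge(df1[["lat", "lng", "city", "country"]], df2,
                   on="city", how="right", suffixes=("_correct", ""))

TOL = 1e-4  # assumed tolerance in degrees; pick one that suits your data

# Cities missing from df1 have NaN reference values and cannot be checked.
known = city_df["lat_correct"].notna()

# A coordinate pair counts as matching only if BOTH lat and lng are close.
close = (np.isclose(city_df["lat"], city_df["lat_correct"], rtol=0, atol=TOL)
         & np.isclose(city_df["lng"], city_df["lng_correct"], rtol=0, atol=TOL))

wrong_coords = city_df[known & ~close]
print(wrong_coords[["city", "lat_correct", "lng_correct", "lat", "lng"]])
```

Comparing with a tolerance rather than `!=` is a design choice here: it avoids flagging rows that differ only by rounding in the source data.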
To compare the two dataframes, it is easier to first get df1 and df2 into a similar format. For example, df1 should look like this:
                 lat       lng       country
city
Katherine        -14.4667  132.2667  Australia
South Pasadena   27.7526   -82.7394  United States
Beaconsfield     45.4333   -73.8667  Canada
Ferney-Voltaire  46.2558   6.1081    France
Jiangshan        28.7412   118.6225  China
Dīla             6.4104    38.3100   Ethiopia
Gibara           21.1072   -76.1367  Cuba
Hampstead        42.8821   -71.1709  United States
West Islip       40.7097   -73.2971  United States
Paulínia         -22.7611  -47.1542  Brazil

and df2:
                       country2       lng2        lat2
city
Saint-Basile-le-Grand  Canada         -73.2833    45.5333
Belokurikha            Russia         84.9833     52.0000
Katherine              Australia      132.2667    -14.4667
Finderne               United States  -74.5743    40.5626
South Pasadena         United States  -82.7394    27.7526
West Islip             United States  -160.41182  35.98157
Belorechensk           Russia         39.8667     44.7667
Botucatu               Brazil         -48.4450    -22.8858
Hampstead              United States  -71.1709    42.8821
Kibiti                 Tanzania       38.9500     -7.7296

Then you can use pd.concat with axis=1, as follows:
df3 = pd.concat([df1, df2], axis=1)

This gives the following df:
                       lat       lng       country        country2       lng2        lat2
city
Katherine              -14.4667  132.2667  Australia      Australia      132.2667    -14.4667
South Pasadena         27.7526   -82.7394  United States  United States  -82.7394    27.7526
Beaconsfield           45.4333   -73.8667  Canada         NaN            NaN         NaN
Ferney-Voltaire        46.2558   6.1081    France         NaN            NaN         NaN
Jiangshan              28.7412   118.6225  China          NaN            NaN         NaN
Dīla                   6.4104    38.3100   Ethiopia       NaN            NaN         NaN
Gibara                 21.1072   -76.1367  Cuba           NaN            NaN         NaN
Hampstead              42.8821   -71.1709  United States  United States  -71.1709    42.8821
West Islip             40.7097   -73.2971  United States  United States  -160.41182  35.98157
Paulínia               -22.7611  -47.1542  Brazil         NaN            NaN         NaN
Saint-Basile-le-Grand  NaN       NaN       NaN            Canada         -73.2833    45.5333
Belokurikha            NaN       NaN       NaN            Russia         84.9833     52.0000
Finderne               NaN       NaN       NaN            United States  -74.5743    40.5626
Belorechensk           NaN       NaN       NaN            Russia         39.8667     44.7667
Botucatu               NaN       NaN       NaN            Brazil         -48.4450    -22.8858
Kibiti                 NaN       NaN       NaN            Tanzania       38.9500     -7.7296

Finally, from the concatenated df3 you can get the rows where the latitude and longitude in df2 do not match the city name in df1:
df3[(df3['lat'] != df3['lat2']) & (df3['lng'] != df3['lng2'])].dropna()

            lat      lng       country        country2       lng2        lat2
city
West Islip  40.7097  -73.2971  United States  United States  -160.41182  35.98157

To find the cities in df2 that are "located" in the wrong country:
df3[df3['country'] != df3['country2']]

https://stackoverflow.com/questions/71814281
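Two details of the concat approach are worth checking on your own data. A `!=` comparison involving NaN evaluates to True, so cities present in only one frame pass the country filter unless they are dropped first. Combining the lat and lng tests with `&` only flags rows where both values differ. The sketch below is one possible variant, not the original author's code. It puts a few of the question's rows into the city-indexed layout (Hampstead as shown in the answer's df2) and uses `|` after dropping unmatched cities. The names `both`, `wrong_coords` and `wrong_country` are illustrative.

```python
import pandas as pd

# Reference data (df1), indexed by city.
df1 = pd.DataFrame(
    {
        "lat": [-14.4667, 27.7526, 42.8821, 40.7097],
        "lng": [132.2667, -82.7394, -71.1709, -73.2971],
        "country": ["Australia", "United States", "United States", "United States"],
    },
    index=pd.Index(["Katherine", "South Pasadena", "Hampstead", "West Islip"], name="city"),
)

# Data to check (df2), same layout with suffixed column names.
df2 = pd.DataFrame(
    {
        "country2": ["Australia", "China", "Russia", "United States", "Tanzania"],
        "lng2": [136.2667, -90.7394, 39.8667, -71.1709, 38.9500],
        "lat2": [-14.4667, 27.7526, 44.7667, 42.8821, -7.7296],
    },
    index=pd.Index(
        ["Katherine", "South Pasadena", "West Islip", "Hampstead", "Kibiti"], name="city"
    ),
)

df3 = pd.concat([df1, df2], axis=1)

# Keep only cities present in both frames, so NaN never enters a comparison.
both = df3.dropna()

# A coordinate pair is wrong if EITHER lat or lng differs.
wrong_coords = both[(both["lat"] != both["lat2"]) | (both["lng"] != both["lng2"])]

# Country mismatches among cities that exist in both frames.
wrong_country = both[both["country"] != both["country2"]]

print(wrong_coords)
print(wrong_country)
```

With these rows, Katherine differs only in longitude, so it is reported by the `|` version but would be missed by `&`.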