I have a file that looks like this:
#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-10,3,London,London,Manchester,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London1,1
2020-09-07T00:00:03.230+02:00,ID-30,3,Madrid,Sevila,Sevilla,1,1,1
2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00,ID-40,GGG,Madrid,Barcelona,1,1,1,1
2020-09-07T00:00:03.230+02:00
2020-09-07T00:00:03.230+02:00
Index [2] of each row shows how many cities that particular row contains. So the first row has the value 3 at index [2], meaning London, Manchester, London.
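I am trying to do the following:
For each row, I need to check whether row[3] plus the cities that follow it (based on the number of cities) appear in cities_to_filter. However, this only needs to be done when row[2] is a number. I also need to handle the fact that some rows contain fewer than 2 items.
This is my code: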
import csv

path = r'c:\data\ELK\Desktop\test_data_countries.txt'
cities_to_filter = ['Sevilla', 'Manchester']

def filter_row(row):
    if row[2].isdigit():
        amount_of_cities = int(row[2]) if len(row) > 2 else True
        cities_to_check = row[3:3+amount_of_cities]
        condition_1 = any(city in cities_to_check for city in cities_to_filter)
    return condition_1

with open(path, 'r') as output_file:
    reader = csv.reader(output_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)

Now I am getting the following error:
UnboundLocalError: local variable 'condition_1' referenced before assignment
Posted on 2021-02-02 23:00:31
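The error can be reproduced in isolation. The sketch below assumes the `return` sits outside the `if` block, as in the question's code: when `row[2]` is not a digit string, the branch that assigns `condition_1` is skipped, so the `return` refers to a name that was never bound. The name `broken_filter` and the shortened sample rows are hypothetical, used only for illustration.

```python
def broken_filter(row, cities_to_filter=('Sevilla', 'Manchester')):
    # Hypothetical reduction of the question's filter_row
    if row[2].isdigit():
        count = int(row[2])
        condition_1 = any(city in row[3:3 + count] for city in cities_to_filter)
    # condition_1 was never assigned if the branch above was skipped
    return condition_1


ok_row = ['ts', 'ID-10', '3', 'London', 'Manchester', 'London', '1', '1', '1']
bad_row = ['ts', 'ID-30', 'GGG', 'Madrid', 'Sevilla', 'Madrid', '1']

print(broken_filter(ok_row))  # the digit branch runs, so a value is returned
try:
    broken_filter(bad_row)    # 'GGG' is not a digit string
except UnboundLocalError as exc:
    print(f'UnboundLocalError: {exc}')
```

Note that a row with fewer than three fields would fail even earlier, with an IndexError on `row[2]`, which is why the length check has to come first.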
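You could do it like this:
import csv
import sys

def filter_row(row):
    '''Returns True if the row should be removed'''
    if len(row) > 2:
        if row[2].isdigit():
            amount_of_cities = int(row[2])
            cities_to_check = row[3:3+amount_of_cities]
        else:
            # don't have valid city count, just try the rest of the row
            cities_to_check = row[3:]
        return any(city in cities_to_check for city in cities_to_filter)
    print(f'Invalid row: {row}', file=sys.stderr)
    return True

# path and cities_to_filter are defined as in the question
with open(path, 'r') as input_file:
    reader = csv.reader(input_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)

In filter_row(), the row length is checked first to make sure that a possible city count exists in row[2]. If the count is a number, it is used to compute the upper bound of the slice that extracts the cities to check. Otherwise, the row is processed from index 3 to the end of the row, which also includes the trailing numeric values, but those are unlikely to be city names.
If there are too few fields, the row is filtered by returning True, and an error message is printed.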
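To see the three cases side by side, here is a minimal self-contained sketch that runs the same logic over a few of the question's rows. It reads from an in-memory `io.StringIO` instead of the real file, and the `sample` and `results` names are assumptions made for illustration only.

```python
import csv
import io
import sys

cities_to_filter = ['Sevilla', 'Manchester']


def filter_row(row):
    """Return True if the row matches a filtered city or is too short."""
    if len(row) > 2:
        if row[2].isdigit():
            # Valid count: only look at that many fields after index 2
            cities_to_check = row[3:3 + int(row[2])]
        else:
            # No valid count: fall back to the rest of the row
            cities_to_check = row[3:]
        return any(city in cities_to_check for city in cities_to_filter)
    print(f'Invalid row: {row}', file=sys.stderr)
    return True


# Hypothetical in-memory stand-in for the data file
sample = io.StringIO(
    "#This is TEST-data\n"
    "2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1\n"
    "2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1\n"
    "2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1\n"
    "2020-09-07T00:00:03.230+02:00\n"
)
reader = csv.reader(sample, delimiter=',')
next(reader)  # skip the comment line
results = [(row[1] if len(row) > 1 else None, filter_row(row)) for row in reader]
print(results)
# [('ID-10', True), ('ID-20', False), ('ID-30', True), (None, True)]
```

The last row has only a timestamp, so it takes the "too few fields" path: it is reported on stderr and still returns True.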
Posted on 2021-02-02 23:31:45
I suggest you optimize everything before filtering. Here is the start of a path you should explore:
import numpy as np
import pandas as pd

test_data = pd.DataFrame({
    'ID': ['ID-10', 'ID-10', 'ID-20', 'ID-20', 'ID-30', 'ID-30', 'ID-40'],
    'id': [3, 3, 2, 2, 3, 'GGG', 'GGG'],
    'cities': [
        ['London', 'Manchester', 'London', 1, 1, 1],
        ['London', 'Manchester', 'London', 1, 1],
        ['London', 'London', 1, 1],
        ['London', 'London', 1, 1],
        ['Madrid', 'Sevilla', 'Sevilla', 1, 1, 1],
        ['Madrid', 'Sevilla', 'Sevilla', 1],
        ['Madrid', 'Barcelona', 1],
    ],
})
cities_to_filter = ['Sevilla', 'Manchester']
# .str.isnumeric() yields NaN for non-string values, so "!= False" keeps the integer ids
is_number = test_data.id.str.isnumeric() != False
numeric_rows = test_data[is_number]
_condition1 = test_data.index.isin(numeric_rows[numeric_rows.id > 2].index)
test_data['results'] = np.where(_condition1, 1, 0)
print(test_data)

Here results is 1 for rows whose id is a number greater than 2, and 0 for all other rows.

Then you can apply an any() to filter the cities, but there are many ways to do this.
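One possible way to apply `any()` per row is sketched below with `Series.apply`. The small `df` frame and the `has_filtered_city` name are hypothetical, and this is only one of the many approaches the answer alludes to.

```python
import pandas as pd

cities_to_filter = ['Sevilla', 'Manchester']

# Hypothetical miniature of the answer's frame
df = pd.DataFrame({
    'ID': ['ID-10', 'ID-20', 'ID-30'],
    'cities': [
        ['London', 'Manchester', 'London', 1, 1, 1],
        ['London', 'London', 1, 1],
        ['Madrid', 'Sevilla', 'Sevilla', 1],
    ],
})

# For each row's list, check whether any filtered city is present
has_filtered_city = df['cities'].apply(
    lambda cities: any(city in cities for city in cities_to_filter)
)
matches = df[has_filtered_city]
print(matches['ID'].tolist())  # ['ID-10', 'ID-30']
```

Because each cell holds a Python list, the membership test runs row by row rather than as a vectorized operation, which is fine for small files but may be slow on large ones.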
https://stackoverflow.com/questions/66010510