I have a file like this:
#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-10,3,London,London,Manchester,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London1,1
2020-09-07T00:00:03.230+02:00,ID-30,3,Madrid,Sevila,Sevilla,1,1,1
2020-09-07T00:00:03.230+02:00,ID-30,3,Madrid,Sevilla,Madrid,1
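2020-09-07T00:00:03.230+02:00,ID-40,2,Madrid,Barcelona,1,1,1,1

Index [2] in each row shows how many cities that particular row contains. So the first row has the value 3 at index [2], and its cities are London, Manchester, London.
I want to do the following:
For each row, I need to check whether the city at row[3] and the cities after it (as many as the city count says) are present in cities_to_filter.
This is my code so far: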
import csv

path = r'c:\data\ELK\Desktop\test_data_countries.txt'
cities_to_filter = ['Sevilla', 'Manchester']

def filter_row(row):
    # amount_of_cities = row[2]
    condition_1 = any(city in row for city in cities_to_filter)
    return condition_1

with open(path, 'r') as output_file:
    reader = csv.reader(output_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)

For this dataset my code works fine, but it is a bit risky because it looks at every column, even those that I know are not cities. I need my code to check only the columns that hold cities, based on the number of cities each row contains.
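Posted on 2021-02-02 11:52:21
The "list" of cities always starts at the same offset, and its length is known from row[2]. So just slice it out and use an any() expression to check whether any of the cities to filter is in it. You could also use set operations, but that is probably overkill: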
import csv

path = r'c:\data\ELK\Desktop\test_data_countries.txt'
cities_to_filter = ['Sevilla', 'Manchester']

def filter_row(row):
    count = int(row[2])          # number of cities in this row
    cities = row[3:3 + count]    # only the columns that hold cities
    return any(city in cities for city in cities_to_filter)

with open(path, 'r') as input_file:
    reader = csv.reader(input_file, delimiter=',')
    next(reader)  # skip the "#This is TEST-data" header line
    for row in reader:
        if filter_row(row):
            print(row)

Also, I renamed output_file to input_file, since the file is being read, not written.
Output
['2020-09-07T00:00:03.230+02:00', 'ID-10', '3', 'London', 'Manchester', 'London', '1', '1', '1']
['2020-09-07T00:00:03.230+02:00', 'ID-10', '3', 'London', 'London', 'Manchester', '1', '1']
['2020-09-07T00:00:03.230+02:00', 'ID-30', '3', 'Madrid', 'Sevila', 'Sevilla', '1', '1', '1']
['2020-09-07T00:00:03.230+02:00', 'ID-30', '3', 'Madrid', 'Sevilla', 'Madrid', '1']

https://stackoverflow.com/questions/66008974
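For completeness, the set-based alternative mentioned in the answer could look roughly like the sketch below. It is only an illustration, not the answerer's code: the name CITIES_TO_FILTER, the in-memory sample (a few rows copied from the question and read through io.StringIO instead of the real file), and the matches list are assumptions made so that the snippet runs on its own.

```python
import csv
import io

# Assumption: the filter is kept as a set so that isdisjoint() can be used.
CITIES_TO_FILTER = {'Sevilla', 'Manchester'}

def filter_row(row):
    """Return True if any city column of the row is in CITIES_TO_FILTER."""
    count = int(row[2])          # number of cities in this row
    cities = row[3:3 + count]    # only the columns that hold cities
    # isdisjoint() is True when nothing is shared, so negate it.
    return not CITIES_TO_FILTER.isdisjoint(cities)

# A few rows from the question, held in memory instead of a file.
sample = """#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-30,3,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00,ID-40,2,Madrid,Barcelona,1,1,1,1
"""

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header line
matches = [row for row in reader if filter_row(row)]

for row in matches:
    print(row[1], row[3:3 + int(row[2])])
```

With this sample only the ID-10 and ID-30 rows are kept. For a two-item filter the any() version is just as readable, which is presumably why the answer calls set operations overkill; a set mostly pays off when the list of cities to filter grows large.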