我有一个描述数据中心项目的模型(例如,机架,PDU,补丁面板,开关,服务器,刀片,.)。数据以CSV的形式定期导出(每月一次),该导出目前有近5000行和每行54个字段。
对于我们的Django应用程序,大约80%的行是相关的:它们的设备类型必须在预定义的类型列表中,并且项目必须有与其关联的条形码(内部资产编号)。其余的行被跳过。
模型(见下文)捕获了所有可用字段,即使它们未被使用。这样做的原因是,一旦应用程序变得富有成效,它将有几个相互关联的模型,但来自不同的数据源。到目前为止,它们都是完全不相关的,目标是检测各种数据库中的线头,并随着时间的推移,清理数据。也许在将来,一些数据源甚至可能被声明为权威的,因此当主表示发生变化时,可以使用它们的数据为辅助数据库自动生成更新。
在应用程序中,我们目前通过表单上传CSV文件,下面是CSV (匿名数据)的示例:
"Name";"Serial Number";"Barcode";"Installation Date";"Model Name";"Description";"Manufacturer";"Part Number";"Manufacturer's Nameplate (W)";"Adjusted Nameplate (W)";"Location";"Room";"Row";"Floor Coordinates";"Rack";"U-Position";"U-Height";"Max Power Ports";"Product family";"Bay type";"Bay count";"Module type";"Weight";"Zero-based numbering";"Power supply";"Breaker Type";"Circuit Number";"Breaker Phase";"Breaker Panel";"Equipment type";"Port Description";"IP Address";"Order";"GeraeteSubTyp";"Datum";"Address";"RfC-Abbau";"Lodger";"GeraeteTyp";"SerialNr";"Techniker";"Bemerkung";"URL";"System-Status";"InventarNr";"SerialNrMotherboardCisco";"AssetID";"Connected to";"RfC-Aufbau";"Device Description";"Parent System";"Group";"Device Location";"Device Name";
"hostname1";"";"123456789";"01/01/2015";"MI-3230";"MyCorp Infrastructure 3230";"MyCorp";"";"300";"0";"U-16/Rack07/Row01/Room23/Provider2/Location1/Country1/";"23";"01";"";"07";"16";"2";;"";"";"";"";"17 kg";"";"";"";"";"";"";"Layer2 Network Gear";"";"";"";"";"01.01.2015";"";"";"";"Net";"3230545CZM4523";"TN2";"";"";"Active";"";"";"123456789";"987654321";"";"";"";"";"";"";
"hostname2";"";"987654321";"26/9/2014";"MI-3330";"MyCorp Infrastructure 3330";"MyCorp";"";"750";"750";"U-27/Rack16/Row13/Room05/Provider1/Location1/Country2/";"05";"13";"";"16";"27";"2";;"";"";"";"";"37.8 kg";"";"";"";"";"";"";"Generic Rack-mount Equipment";"";"";"";"";"2014.09.26";"";"";"";"Storage";"3330978BJI1037";"TN1";"2";"";"ReUse";"";"";"987654321";"123456789";"";"";"";"";"";"";数据一旦进入数据库,通常不会有太大的变化。更改通常发生在生命周期边界(硬件从主动使用中被拆除),或者如果被重用或替换,则可能会被移动--比如说,每个月大约有10次更改。
向数据库中添加新条目并识别现有条目(目前我正在跳过它们)就像一种魅力--但是它花费了很长时间(目前只有5到10分钟,我还没有说更新!)到目前为止,我的看法如下:
def update_sw_db(request):
extra_content = {}
if request.method == 'POST':
form = UpdateStruxurewareDatabaseForm(request.POST, request.FILES)
if form.is_valid():
result_dict = {}
add_list = []
upd_list = []
skip_list = []
err_list = []
all_fields = SWRecord().get_all_fields()
field_list = []
type_list = ['Blade', 'Blade Enclosure', 'Generic Rack-mount Equipment', 'Layer2 Network Gear', 'Layer3 Network Gear', 'Switch Enclosure', 'Switch Module']
re_en_old = re.compile('\d{1,2}/\d{1,2}/\d{1,2}')
re_de_old = re.compile('\d{1,4}\.\d{1,2}\.\d{1,4}')
re_en_new = re.compile('\d{4}-\d{2}-\d{2}')
for field in all_fields:
field_list.append(field['name'])
decoded_csv = codecs.iterdecode(form.cleaned_data['csv_file'], 'utf-8')
# skip the first three lines (header lines)
next(decoded_csv)
next(decoded_csv)
next(decoded_csv)
reader = csv.reader(decoded_csv, delimiter=';', quotechar='"')
for row in reader:
sw_record = SWRecord()
err_message = ''
if len(field_list) +1 != len(row):
# each row has a trailing empty element so it's off by one
# compared to the field list. If this condition is not met
# that's an error. Deal with it!
err_message = err_message + 'Error! row has %s elements (should be %s)! Offending row:\n%S\n' % (len(row), len(field_list) +1, row)
else:
has_error = False
for col_idx in range(len(field_list)):
value = row[col_idx]
if col_idx == 3 or col_idx == 34:
date_items = ['', '', '']
split_char = ''
if value == '':
value = None
continue
elif re_en_old.search(value):
split_char = '/'
date_items = value.split(split_char)
elif re_de_old.search(value):
split_char = '.'
date_items = value.split(split_char)
elif re_en_new.search(value):
split_char = '-'
date_items = value.split(split_char)
else:
err_message = err_message + 'Error: strange date encountered: "%s"\n' % value
has_error = True
if not has_error:
for date_item in range(len(date_items)):
date_items[date_item] = date_items[date_item].strip()
if len(date_items[0]) == 4 and len(date_items[2]) <= 2:
year = date_items[0]
month = date_items[1].zfill(2)
day = date_items[2].zfill(2)
elif len(date_items[2]) == 4 and len(date_items[0]) <= 2:
year = date_items[2]
month = date_items[1].zfill(2)
day = date_items[0].zfill(2)
elif len(date_items[2]) <= 2 and len(date_items[0]) <= 2:
year = '20' + date_items[2].zfill(2)
if split_char == '/':
month = date_items[0].zfill(2)
day = date_items[1].zfill(2)
elif split_char == '.':
month = date_items[1].zfill(2)
day = date_items[0].zfill(2)
else:
err_message = err_message + 'Error: unknown split char: "%s", raw value="%s"\n' % (split_char, row[col_idx])
has_error = True
elif date_items[2] == '207':
year = '2017'
month = date_items[1].zfill(2)
day = date_items[0].zfill(2)
else:
err_message = err_message + 'Error: strange date encountered: "%s"\n' % value
has_error = True
# basic sanity check...
int_year = int(year)
int_month = int(month)
int_day = int(day)
try:
new_date = datetime.datetime(year=int_year,month=int_month,day=int_day)
except ValueError as e:
err_message = err_message + 'Error: invalid date: raw value="%s" decoded to: year="%s" month="%s day="%s"\n%s\n' % (row[col_idx], year, month, day, e)
has_error = True
value = year + '-' + month + '-' + day
setattr(sw_record, field_list[col_idx], value)
if not has_error:
if sw_record.equipment_type in type_list and sw_record.barcode != '':
existing_set = SWRecord.objects.filter(barcode=sw_record.barcode)
if existing_set.count() == 0:
try:
add_list.append(sw_record)
except:
sw_record.remark = sw_record.remark + 'Error: could not save SWRecord "%s", content: "%s"' % (sw_record, sw_record.to_dict())
err_list.append(sw_record)
raise
elif existing_set.count() == 1:
#TODO: add updating code...
existing_record = SWRecord.objects.get(barcode=sw_record.barcode)
update_fields = []
print('Updated fields: "%s"' % update_fields)
sw_record.remark = sw_record.remark + 'Skipped: Entry already existing: sw_record="%s", content="%s"' % (sw_record, sw_record.to_dict())
skip_list.append(sw_record)
elif existing_set.count() > 1:
sw_record.remark = sw_record.remark + 'Error: duplicate AID: occurrences="%s"; SWRecord="%s", existing_set="%s"' % (existing_set.count(), sw_record, existing_set)
err_list.append(sw_record)
else:
sw_record.remark = sw_record.remark + 'Skipped: Entry not relevant: sw_record="%s", content="%s"' % (sw_record, sw_record.to_dict())
skip_list.append(sw_record)
else:
sw_record.remark = sw_record.remark + err_message
err_list.append(sw_record)
with transaction.atomic():
# Use a list of indices (integers) to record any failed save attempts.
# After adding remove those erred entry from add_list
erred_indices = []
# Loop over each result and invoke save() on each entry
for index, add_result in enumerate(add_list):
try:
# save() method called on each member to create record
add_result.save()
except ValueError as e:
sw_record.remark = sw_record.remark + 'ValueError encountered while trying to save SWRecord "%s", content: "%s"\n%s\n' % (sw_record, sw_record.to_dict(), e)
erred_indices.append(index)
err_list.append(sw_record)
except ValidationError as e:
sw_record.remark = sw_record.remark + 'ValidationError encountered while trying to save SWRecord "%s", content: "%s"\n%s\n' % (sw_record, sw_record.to_dict(), e)
erred_indices.append(index)
err_list.append(sw_record)
except:
sw_record.remark = sw_record.remark + 'Unexcpected error encountered while trying to save SWRecord "%s", content: "%s"\n%s\n' % (sw_record, sw_record.to_dict(), sys.exc_info()[0])
erred_indices.append(index)
err_list.append(sw_record)
raise
for index in reversed(erred_indices):
del add_list[erred_indices[index]]
for k, v in {
'with errors': err_list,
'added': add_list,
'updated': upd_list,
'skipped': skip_list,
}.items():
if v:
result_dict.update({k: v})
extra_content['result_dict'] = result_dict
else:
extra_content['form'] = form
else:
form = UpdateStruxurewareDatabaseForm()
extra_content['form'] = form
return render(request, 'damagecontrol/update_sw_db.html', extra_content)即使我再次使用相同的文件,也不会花费更少的时间,因为我正在遍历所有的行和所有的字段--记住,这里有将近5000行和54个字段,所以我们在这里讨论的是大约27万个步骤。我对Python & Django相当陌生(我已经使用Python&Django约3个月了),我一直在搜索'net‘,发现了各种各样有用的东西--但不幸的是,没有什么能回答我的问题:
你能告诉我(我坚信至少有一个)一个更好的方法来上传CSV文件,检查更改并将它们应用到数据库中吗?更好的是,在这种情况下,更简单、更快,但仍然可靠(看看我正在做的“日期”转换--不幸的是,源数据库中的字段是自由文本,过去它被“自由地”使用.)
由于上传将在现实生活中只发生一次每月,性能不是首要的-如果它需要10或15分钟,它需要10或15分钟。对我来说,最重要的问题是:如果我认识到从CSV行创建的SWRecord已经在数据库中(条形码可以在这里安全地假定为pk ),我如何比较新对象的内容和从数据库获取的内容,如果有差异,如何更新数据库?
我们可以假设CSV文件总是正确的。这并不漂亮,但这是事实。
这里,为了完整起见,摘自models.py的相关内容如下:
class PrintableModel(models.Model):
# define global fields
created = models.DateTimeField(editable=False)
modified = models.DateTimeField(blank=True)
# Override save() to allow automatic saving of created / modified times
def save(self, *args, **kwargs):
''' On save, update timestamps '''
if not self.id:
self.created = timezone.now()
self.modified = timezone.now()
return super(PrintableModel, self).save(*args, **kwargs)
def __repr__(self):
return str(self.to_dict())
def to_dict(self):
opts = self._meta
data = {}
for f in opts.concrete_fields + opts.many_to_many:
if isinstance(f, ManyToManyField):
if self.pk is None:
data[f.name] = []
else:
data[f.name] = list(f.value_from_object(self).values_list('pk', flat=True))
else:
data[f.name] = f.value_from_object(self)
return data
# get all fields of the model for iterations in templates
def get_all_fields(self, exclude_list):
fields = []
for f in self._meta.fields:
fname = f.name
# resolve picklists/choices, with get_xyz_display() function
get_choice = 'get_'+fname+'_display'
if hasattr( self, get_choice):
value = getattr( self, get_choice)()
else:
try :
value = getattr(self, fname)
except TCRecord.DoesNotExist:
value = None
# only display fields with values and skip some fields entirely
if f.editable and f.name not in exclude_list:
fields.append(
{
'label':f.verbose_name,
'name':f.name,
'value':value,
}
)
return fields
class Meta:
abstract = True
class SWRecord(PrintableModel):
class Meta:
verbose_name = 'StruxureWare Record'
verbose_name_plural = 'StruxureWare Records'
ordering = ('barcode', 'asset_id')
def __str__(self):
return '%s %s (%s)' % (self.manufacturer, self.model_name, self.barcode)
def get_all_fields(self):
return super(SWRecord, self).get_all_fields(['id', 'created', 'modified'])
# general management fields
# CSV fields
name = models.CharField(max_length=40, verbose_name='Name')
serial_number2 = models.CharField(max_length=40, verbose_name='Serial Number', blank=True)
barcode = models.CharField(max_length=40, verbose_name='Barcode', blank=True)
installation_date = models.DateField(verbose_name='Installation Date') # Date Field?
model_name = models.CharField(max_length=40, verbose_name='Model Name')
description = models.CharField(max_length=40, verbose_name='Description', blank=True)
manufacturer = models.CharField(max_length=40, verbose_name='Manufacturer')
part_number = models.CharField(max_length=40, verbose_name='Part Number', blank=True)
powerconsumption_manufacturer = models.CharField(max_length=40, verbose_name='Manufacturer\'s Nameplate (W)', blank=True)
powerconsumption_adjusted = models.CharField(max_length=40, verbose_name='Adjusted Nameplate (W)', blank=True)
location = models.CharField(max_length=40, verbose_name='Location')
room = models.CharField(max_length=40, verbose_name='Room')
row = models.CharField(max_length=40, verbose_name='Row', blank=True)
floor_coordinates = models.CharField(max_length=40, verbose_name='Floor Coordinates', blank=True)
rack = models.CharField(max_length=40, verbose_name='Rack', blank=True)
u_position = models.CharField(max_length=40, verbose_name='U-Position', blank=True)
u_height = models.CharField(max_length=40, verbose_name='U-Height', blank=True)
max_powerports = models.CharField(max_length=40, verbose_name='Max Power Ports', blank=True)
product_family = models.CharField(max_length=40, verbose_name='Product family', blank=True)
bay_type = models.CharField(max_length=40, verbose_name='Bay type', blank=True)
bay_count = models.CharField(max_length=40, verbose_name='Bay count', blank=True)
module_type = models.CharField(max_length=40, verbose_name='Module type', blank=True)
weight = models.CharField(max_length=40, verbose_name='Weight')
zerobased_numbering = models.CharField(max_length=40, verbose_name='Zero-based numbering', blank=True)
power_supply = models.CharField(max_length=40, verbose_name='Power supply', blank=True)
breaker_type = models.CharField(max_length=40, verbose_name='Breaker Type', blank=True)
circuit_number = models.CharField(max_length=40, verbose_name='Circuit Number', blank=True)
breaker_phase = models.CharField(max_length=40, verbose_name='Breaker Phase', blank=True)
breaker_panel = models.CharField(max_length=40, verbose_name='Breaker Panel', blank=True)
equipment_type = models.CharField(max_length=40, verbose_name='Equipment type')
port_description = models.CharField(max_length=40, verbose_name='Port Description', blank=True)
ip_address = models.GenericIPAddressField(protocol='IPv4', verbose_name='IP Address', blank=True, null=True)
order = models.CharField(max_length=40, verbose_name='Order', blank=True)
device_subtype = models.CharField(max_length=40, verbose_name='GeraeteSubTyp', blank=True)
date = models.DateField(max_length=40, verbose_name='Datum', blank=True, null=True) # Beware conversion!
address = models.CharField(max_length=40, verbose_name='Address', blank=True) # apparently never used...
rfc_dismantling = models.CharField(max_length=40, verbose_name='RfC-Abbau', blank=True)
lodger = models.CharField(max_length=40, verbose_name='Lodger', blank=True) # for customer-owned property
device_type = models.CharField(max_length=40, verbose_name='GeraeteTyp', blank=True)
serial_number = models.CharField(max_length=40, verbose_name='SerialNr', blank=True)
technician = models.CharField(max_length=40, verbose_name='Techniker', blank=True)
remark = models.TextField(verbose_name='Bemerkung', blank=True)
url = models.CharField(max_length=40, verbose_name='URL', blank=True)
system_status = models.CharField(max_length=40, verbose_name='System-Status', blank=True)
inventory_number = models.CharField(max_length=40, verbose_name='InventarNr', blank=True)
sn_ciscomb = models.CharField(max_length=40, verbose_name='SerialNrMotherboardCisco', blank=True)
asset_id = models.CharField(max_length=40, verbose_name='AssetID', blank=True)
connected_to = models.CharField(max_length=40, verbose_name='Connected to', blank=True)
rfc_installation = models.CharField(max_length=40, verbose_name='RfC-Aufbau', blank=True)
device_description = models.CharField(max_length=40, verbose_name='Device Description', blank=True)
parent_system = models.CharField(max_length=40, verbose_name='Parent System', blank=True)
group = models.CharField(max_length=40, verbose_name='Group', blank=True)
device_location = models.CharField(max_length=40, verbose_name='Device Location', blank=True)
device_name = models.CharField(max_length=40, verbose_name='Device Name', blank=True)谢谢你抽出时间阅读这一切..。
发布于 2018-02-07 19:33:59
如果您只是将条形码与SWRecord.objects.filter(barcode=sw_record.barcode)进行比较,那么在迭代行并检查该列表中每一行的条形码计数之前,创建一个显示在db 中的条形码列表不是更明智吗?有点像这样:
barcode_list = SWRecord.objects.values_list('barcode', flat=True)
for row in reader:
barcode_count = barcodes_list.count(row['barcode'])
if barcode_count==0:
add_list.append(create_SWRecord_from_row(row)) # creates unsaved SWRecord
elif barcode_count==1:
update_list(update_SWRecord_from_row(row)) # prepares the update_data for an update
else:
deal_with_it(row) # oh god duplicate primary keys
with transaction.atomic():
SWRecord.objects.bulk_create(add_list) # if you do not need the save() method
for pk, data in update_list.items():
SWRecord.objects.filter(pk=pk).update(**data) # save() isn't called here either在保存之前,这将准确地击中数据库一次,而不是每次行通过时都没有错误。bulk_create不调用save()方法,只访问数据库一次,因此lot速度更快。
https://stackoverflow.com/questions/48669326
复制相似问题