首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何用一个大的CSV文件(54个字段,~5000行)更新Django模型?

如何用一个大的CSV文件(54个字段,~5000行)更新Django模型?
EN

Stack Overflow用户
提问于 2018-02-07 16:59:28
回答 1查看 1K关注 0票数 0

我有一个描述数据中心项目的模型(例如,机架,PDU,补丁面板,开关,服务器,刀片,.)。数据以CSV的形式定期导出(每月一次),该导出目前有近5000行和每行54个字段。

对于我们的Django应用程序,大约80%的行是相关的:它们的设备类型必须在预定义的类型列表中,并且项目必须有与其关联的条形码(内部资产编号)。其余的行被跳过。

模型(见下文)捕获了所有可用字段,即使它们未被使用。这样做的原因是,一旦应用程序变得富有成效,它将有几个相互关联的模型,但来自不同的数据源。到目前为止,它们都是完全不相关的,目标是检测各种数据库中的线头,并随着时间的推移,清理数据。也许在将来,一些数据源甚至可能被声明为权威的,因此当主表示发生变化时,可以使用它们的数据为辅助数据库自动生成更新。

在应用程序中,我们目前通过表单上传CSV文件,下面是CSV (匿名数据)的示例:

代码语言:javascript
复制
"Name";"Serial Number";"Barcode";"Installation Date";"Model Name";"Description";"Manufacturer";"Part Number";"Manufacturer's Nameplate (W)";"Adjusted Nameplate (W)";"Location";"Room";"Row";"Floor Coordinates";"Rack";"U-Position";"U-Height";"Max Power Ports";"Product family";"Bay type";"Bay count";"Module type";"Weight";"Zero-based numbering";"Power supply";"Breaker Type";"Circuit Number";"Breaker Phase";"Breaker Panel";"Equipment type";"Port Description";"IP Address";"Order";"GeraeteSubTyp";"Datum";"Address";"RfC-Abbau";"Lodger";"GeraeteTyp";"SerialNr";"Techniker";"Bemerkung";"URL";"System-Status";"InventarNr";"SerialNrMotherboardCisco";"AssetID";"Connected to";"RfC-Aufbau";"Device Description";"Parent System";"Group";"Device Location";"Device Name";
"hostname1";"";"123456789";"01/01/2015";"MI-3230";"MyCorp Infrastructure 3230";"MyCorp";"";"300";"0";"U-16/Rack07/Row01/Room23/Provider2/Location1/Country1/";"23";"01";"";"07";"16";"2";;"";"";"";"";"17 kg";"";"";"";"";"";"";"Layer2 Network Gear";"";"";"";"";"01.01.2015";"";"";"";"Net";"3230545CZM4523";"TN2";"";"";"Active";"";"";"123456789";"987654321";"";"";"";"";"";"";
"hostname2";"";"987654321";"26/9/2014";"MI-3330";"MyCorp Infrastructure 3330";"MyCorp";"";"750";"750";"U-27/Rack16/Row13/Room05/Provider1/Location1/Country2/";"05";"13";"";"16";"27";"2";;"";"";"";"";"37.8 kg";"";"";"";"";"";"";"Generic Rack-mount Equipment";"";"";"";"";"2014.09.26";"";"";"";"Storage";"3330978BJI1037";"TN1";"2";"";"ReUse";"";"";"987654321";"123456789";"";"";"";"";"";"";

数据一旦进入数据库,通常不会有太大的变化。更改通常发生在生命周期边界(硬件从主动使用中被拆除),或者如果被重用或替换,则可能会被移动--比如说,每个月大约有10次更改。

向数据库中添加新条目并识别现有条目(目前我正在跳过它们)就像一种魅力--但是它花费了很长时间(目前只有5到10分钟,我还没有说更新!)到目前为止,我的看法如下:

代码语言:javascript
复制
def update_sw_db(request):
    extra_content = {}
    if request.method == 'POST':
        form = UpdateStruxurewareDatabaseForm(request.POST, request.FILES)
        if form.is_valid():
            result_dict = {}
            add_list = []
            upd_list = []
            skip_list = []
            err_list = []
            all_fields = SWRecord().get_all_fields()
            field_list = []
            type_list = ['Blade', 'Blade Enclosure', 'Generic Rack-mount Equipment', 'Layer2 Network Gear', 'Layer3 Network Gear', 'Switch Enclosure', 'Switch Module']
            re_en_old = re.compile('\d{1,2}/\d{1,2}/\d{1,2}')
            re_de_old = re.compile('\d{1,4}\.\d{1,2}\.\d{1,4}')
            re_en_new = re.compile('\d{4}-\d{2}-\d{2}')
            for field in all_fields:
                field_list.append(field['name'])
            decoded_csv = codecs.iterdecode(form.cleaned_data['csv_file'], 'utf-8')
            # skip the first three lines (header lines)
            next(decoded_csv)
            next(decoded_csv)
            next(decoded_csv)
            reader = csv.reader(decoded_csv, delimiter=';', quotechar='"')
            for row in reader:
                sw_record = SWRecord()
                err_message = ''
                if len(field_list) +1 != len(row):
                    # each row has a trailing empty element so it's off by one
                    # compared to the field list. If this condition is not met
                    # that's an error. Deal with it!
                    err_message = err_message + 'Error! row has %s elements (should be %s)! Offending row:\n%S\n' % (len(row), len(field_list) +1, row)
                else:
                    has_error = False
                    for col_idx in range(len(field_list)):
                        value = row[col_idx]
                        if col_idx == 3 or col_idx == 34:
                            date_items = ['', '', '']
                            split_char = ''
                            if value == '':
                                value = None
                                continue
                            elif re_en_old.search(value):
                                split_char = '/'
                                date_items = value.split(split_char)
                            elif re_de_old.search(value):
                                split_char = '.'
                                date_items = value.split(split_char)
                            elif re_en_new.search(value):
                                split_char = '-'
                                date_items = value.split(split_char)
                            else:
                                err_message = err_message + 'Error: strange date encountered: "%s"\n' % value
                                has_error = True
                            if not has_error:
                                for date_item in range(len(date_items)):
                                    date_items[date_item] = date_items[date_item].strip()
                                if len(date_items[0]) == 4 and len(date_items[2]) <= 2:
                                    year = date_items[0]
                                    month = date_items[1].zfill(2)
                                    day = date_items[2].zfill(2)
                                elif len(date_items[2]) == 4 and len(date_items[0]) <= 2:
                                    year = date_items[2]
                                    month = date_items[1].zfill(2)
                                    day = date_items[0].zfill(2)
                                elif len(date_items[2]) <= 2 and len(date_items[0]) <= 2:
                                    year = '20' + date_items[2].zfill(2)
                                    if split_char == '/':
                                        month = date_items[0].zfill(2)
                                        day = date_items[1].zfill(2)
                                    elif split_char == '.':
                                        month = date_items[1].zfill(2)
                                        day = date_items[0].zfill(2)
                                    else:
                                        err_message = err_message + 'Error: unknown split char: "%s", raw value="%s"\n' % (split_char, row[col_idx])
                                        has_error = True
                                elif date_items[2] == '207':
                                    year = '2017'
                                    month = date_items[1].zfill(2)
                                    day = date_items[0].zfill(2)
                                else:
                                    err_message = err_message + 'Error: strange date encountered: "%s"\n' % value
                                    has_error = True
                                # basic sanity check...
                                int_year = int(year)
                                int_month = int(month)
                                int_day = int(day)
                                try:
                                    new_date = datetime.datetime(year=int_year,month=int_month,day=int_day)
                                except ValueError as e:
                                    err_message = err_message + 'Error: invalid date: raw value="%s" decoded to: year="%s" month="%s day="%s"\n%s\n' % (row[col_idx], year, month, day, e)
                                    has_error = True
                                value = year + '-' + month + '-' + day
                        setattr(sw_record, field_list[col_idx], value)
                    if not has_error:
                        if sw_record.equipment_type in type_list and sw_record.barcode != '':
                            existing_set = SWRecord.objects.filter(barcode=sw_record.barcode)
                            if existing_set.count() == 0:
                                try:
                                    add_list.append(sw_record)
                                except:
                                    sw_record.remark = sw_record.remark + 'Error: could not save SWRecord "%s", content: "%s"' % (sw_record, sw_record.to_dict())
                                    err_list.append(sw_record)
                                    raise
                            elif existing_set.count() == 1:
                                    #TODO: add updating code...
                                    existing_record = SWRecord.objects.get(barcode=sw_record.barcode)
                                    update_fields = []
                                    print('Updated fields: "%s"' % update_fields)
                                    sw_record.remark = sw_record.remark + 'Skipped: Entry already existing: sw_record="%s", content="%s"' % (sw_record, sw_record.to_dict())
                                    skip_list.append(sw_record)
                            elif existing_set.count() > 1:
                                    sw_record.remark = sw_record.remark + 'Error: duplicate AID: occurrences="%s"; SWRecord="%s", existing_set="%s"' % (existing_set.count(), sw_record, existing_set)
                                    err_list.append(sw_record)
                        else:
                            sw_record.remark = sw_record.remark + 'Skipped: Entry not relevant: sw_record="%s", content="%s"' % (sw_record, sw_record.to_dict())
                            skip_list.append(sw_record)
                    else:
                        sw_record.remark = sw_record.remark + err_message
                        err_list.append(sw_record)
            with transaction.atomic():
                # Use a list of indices (integers) to record any failed save attempts.
                # After adding remove those erred entry from add_list
                erred_indices = []
                # Loop over each result and invoke save() on each entry
                for index, add_result in enumerate(add_list):
                    try:
                        # save() method called on each member to create record
                        add_result.save()
                    except ValueError as e:
                        sw_record.remark = sw_record.remark + 'ValueError encountered while trying to save SWRecord "%s", content: "%s"\n%s\n' % (sw_record, sw_record.to_dict(), e)
                        erred_indices.append(index)
                        err_list.append(sw_record)
                    except ValidationError as e:
                        sw_record.remark = sw_record.remark + 'ValidationError encountered while trying to save SWRecord "%s", content: "%s"\n%s\n' % (sw_record, sw_record.to_dict(), e)
                        erred_indices.append(index)
                        err_list.append(sw_record)
                    except:
                        sw_record.remark = sw_record.remark + 'Unexcpected error encountered while trying to save SWRecord "%s", content: "%s"\n%s\n' % (sw_record, sw_record.to_dict(), sys.exc_info()[0])
                        erred_indices.append(index)
                        err_list.append(sw_record)
                        raise
                for index in reversed(erred_indices):
                    del add_list[erred_indices[index]]
            for k, v in {
                'with errors': err_list,
                'added': add_list,
                'updated': upd_list,
                'skipped': skip_list,
            }.items():
                if v:
                    result_dict.update({k: v})
            extra_content['result_dict'] = result_dict
        else:
            extra_content['form'] = form
    else:
        form = UpdateStruxurewareDatabaseForm()
        extra_content['form'] = form
    return render(request, 'damagecontrol/update_sw_db.html', extra_content)

即使我再次使用相同的文件,也不会花费更少的时间,因为我正在遍历所有的行和所有的字段--记住,这里有将近5000行和54个字段,所以我们在这里讨论的是大约27万个步骤。我对Python & Django相当陌生(我已经使用Python&Django约3个月了),我一直在搜索'net‘,发现了各种各样有用的东西--但不幸的是,没有什么能回答我的问题:

你能告诉我(我坚信至少有一个)一个更好的方法来上传CSV文件,检查更改并将它们应用到数据库中吗?更好的是,在这种情况下,更简单、更快,但仍然可靠(看看我正在做的“日期”转换--不幸的是,源数据库中的字段是自由文本,过去它被“自由地”使用.)

由于上传将在现实生活中只发生一次每月,性能不是首要的-如果它需要10或15分钟,它需要10或15分钟。对我来说,最重要的问题是:如果我认识到从CSV行创建的SWRecord已经在数据库中(条形码可以在这里安全地假定为pk ),我如何比较新对象的内容和从数据库获取的内容,如果有差异,如何更新数据库?

我们可以假设CSV文件总是正确的。这并不漂亮,但这是事实。

这里,为了完整起见,摘自models.py的相关内容如下:

代码语言:javascript
复制
class PrintableModel(models.Model):
    # define global fields
    created     = models.DateTimeField(editable=False)
    modified    = models.DateTimeField(blank=True)

    # Override save() to allow automatic saving of created / modified times
    def save(self, *args, **kwargs):
        ''' On save, update timestamps '''
        if not self.id:
            self.created = timezone.now()
        self.modified = timezone.now()
        return super(PrintableModel, self).save(*args, **kwargs)

    def __repr__(self):
        return str(self.to_dict())

    def to_dict(self):
        opts = self._meta
        data = {}
        for f in opts.concrete_fields + opts.many_to_many:
            if isinstance(f, ManyToManyField):
                if self.pk is None:
                    data[f.name] = []
                else:
                    data[f.name] = list(f.value_from_object(self).values_list('pk', flat=True))
            else:
                data[f.name] = f.value_from_object(self)
        return data

    # get all fields of the model for iterations in templates
    def get_all_fields(self, exclude_list):
        fields = []
        for f in self._meta.fields:
            fname = f.name
            # resolve picklists/choices, with get_xyz_display() function
            get_choice = 'get_'+fname+'_display'
            if hasattr( self, get_choice):
                value = getattr( self, get_choice)()
            else:
                try :
                    value = getattr(self, fname)
                except TCRecord.DoesNotExist:
                    value = None
            # only display fields with values and skip some fields entirely
            if f.editable and f.name not in exclude_list:
                fields.append(
                    {
                    'label':f.verbose_name,
                    'name':f.name,
                    'value':value,
                    }
                )
        return fields

    class Meta:
        abstract = True

class SWRecord(PrintableModel):

    class Meta:
        verbose_name = 'StruxureWare Record'
        verbose_name_plural = 'StruxureWare Records'
        ordering = ('barcode', 'asset_id')

    def __str__(self):
        return '%s %s (%s)' % (self.manufacturer, self.model_name, self.barcode)

    def get_all_fields(self):
        return super(SWRecord, self).get_all_fields(['id', 'created', 'modified'])

    # general management fields
    # CSV fields
    name = models.CharField(max_length=40, verbose_name='Name')
    serial_number2 = models.CharField(max_length=40, verbose_name='Serial Number', blank=True)
    barcode = models.CharField(max_length=40, verbose_name='Barcode', blank=True)
    installation_date = models.DateField(verbose_name='Installation Date') # Date Field?
    model_name = models.CharField(max_length=40, verbose_name='Model Name')
    description = models.CharField(max_length=40, verbose_name='Description', blank=True)
    manufacturer = models.CharField(max_length=40, verbose_name='Manufacturer')
    part_number = models.CharField(max_length=40, verbose_name='Part Number', blank=True)
    powerconsumption_manufacturer = models.CharField(max_length=40, verbose_name='Manufacturer\'s Nameplate (W)', blank=True)
    powerconsumption_adjusted = models.CharField(max_length=40, verbose_name='Adjusted Nameplate (W)', blank=True)
    location = models.CharField(max_length=40, verbose_name='Location')
    room = models.CharField(max_length=40, verbose_name='Room')
    row = models.CharField(max_length=40, verbose_name='Row', blank=True)
    floor_coordinates = models.CharField(max_length=40, verbose_name='Floor Coordinates', blank=True)
    rack = models.CharField(max_length=40, verbose_name='Rack', blank=True)
    u_position = models.CharField(max_length=40, verbose_name='U-Position', blank=True)
    u_height = models.CharField(max_length=40, verbose_name='U-Height', blank=True)
    max_powerports = models.CharField(max_length=40, verbose_name='Max Power Ports', blank=True)
    product_family = models.CharField(max_length=40, verbose_name='Product family', blank=True)
    bay_type = models.CharField(max_length=40, verbose_name='Bay type', blank=True)
    bay_count = models.CharField(max_length=40, verbose_name='Bay count', blank=True)
    module_type = models.CharField(max_length=40, verbose_name='Module type', blank=True)
    weight = models.CharField(max_length=40, verbose_name='Weight')
    zerobased_numbering = models.CharField(max_length=40, verbose_name='Zero-based numbering', blank=True)
    power_supply = models.CharField(max_length=40, verbose_name='Power supply', blank=True)
    breaker_type = models.CharField(max_length=40, verbose_name='Breaker Type', blank=True)
    circuit_number = models.CharField(max_length=40, verbose_name='Circuit Number', blank=True)
    breaker_phase = models.CharField(max_length=40, verbose_name='Breaker Phase', blank=True)
    breaker_panel = models.CharField(max_length=40, verbose_name='Breaker Panel', blank=True)
    equipment_type = models.CharField(max_length=40, verbose_name='Equipment type')
    port_description = models.CharField(max_length=40, verbose_name='Port Description', blank=True)
    ip_address = models.GenericIPAddressField(protocol='IPv4', verbose_name='IP Address', blank=True, null=True)
    order = models.CharField(max_length=40, verbose_name='Order', blank=True)
    device_subtype = models.CharField(max_length=40, verbose_name='GeraeteSubTyp', blank=True)
    date = models.DateField(max_length=40, verbose_name='Datum', blank=True, null=True) # Beware conversion!
    address = models.CharField(max_length=40, verbose_name='Address', blank=True) # apparently never used...
    rfc_dismantling = models.CharField(max_length=40, verbose_name='RfC-Abbau', blank=True)
    lodger = models.CharField(max_length=40, verbose_name='Lodger', blank=True) # for customer-owned property
    device_type = models.CharField(max_length=40, verbose_name='GeraeteTyp', blank=True)
    serial_number = models.CharField(max_length=40, verbose_name='SerialNr', blank=True)
    technician = models.CharField(max_length=40, verbose_name='Techniker', blank=True)
    remark = models.TextField(verbose_name='Bemerkung', blank=True)
    url = models.CharField(max_length=40, verbose_name='URL', blank=True)
    system_status = models.CharField(max_length=40, verbose_name='System-Status', blank=True)
    inventory_number = models.CharField(max_length=40, verbose_name='InventarNr', blank=True)
    sn_ciscomb = models.CharField(max_length=40, verbose_name='SerialNrMotherboardCisco', blank=True)
    asset_id = models.CharField(max_length=40, verbose_name='AssetID', blank=True)
    connected_to = models.CharField(max_length=40, verbose_name='Connected to', blank=True)
    rfc_installation = models.CharField(max_length=40, verbose_name='RfC-Aufbau', blank=True)
    device_description = models.CharField(max_length=40, verbose_name='Device Description', blank=True)
    parent_system = models.CharField(max_length=40, verbose_name='Parent System', blank=True)
    group = models.CharField(max_length=40, verbose_name='Group', blank=True)
    device_location = models.CharField(max_length=40, verbose_name='Device Location', blank=True)
    device_name = models.CharField(max_length=40, verbose_name='Device Name', blank=True)

谢谢你抽出时间阅读这一切..。

EN

回答 1

Stack Overflow用户

发布于 2018-02-07 19:33:59

如果您只是将条形码与SWRecord.objects.filter(barcode=sw_record.barcode)进行比较,那么在迭代行并检查该列表中每一行的条形码计数之前,创建一个显示在db 中的条形码列表不是更明智吗?有点像这样:

代码语言:javascript
复制
barcode_list = SWRecord.objects.values_list('barcode', flat=True)
for row in reader:
    barcode_count = barcodes_list.count(row['barcode'])
    if barcode_count==0:
        add_list.append(create_SWRecord_from_row(row)) # creates unsaved SWRecord
    elif barcode_count==1:
        update_list(update_SWRecord_from_row(row)) # prepares the update_data for an update
    else:
        deal_with_it(row) # oh god duplicate primary keys

with transaction.atomic():
    SWRecord.objects.bulk_create(add_list) # if you do not need the save() method
for pk, data in update_list.items():
    SWRecord.objects.filter(pk=pk).update(**data) # save() isn't called here either

在保存之前,这将准确地击中数据库一次,而不是每次行通过时都没有错误。bulk_create不调用save()方法,只访问数据库一次,因此lot速度更快。

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/48669326

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档