Q: Parsing unstructured data in Python

Stack Overflow user

Asked 2021-09-19 02:14:49

Answers: 2 | Views: 231 | Followers: 0 | Votes: 0

I have a file containing the following data, which I am trying to parse.

08/23/21 04:00:05 AM


/* ----------------- data1----------------- */ 

make: honda   model: civic
year: 2019
trim: "lx"
owner: phillip

/* ----------------- data2----------------- */ 

make: toyota  model: highlander
year: 2021
trim: "Platinum"

I would like the data to look like this:

Make, Model, Year, trim, Owner
Honda, civic, 2019, lx, phillip
toyota, highlander, 2021, platinum, Rex

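Here is my code. I tried to build a dictionary and then load it into a pandas DataFrame, but I don't think I'm heading in the right direction.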

def fix_line(record):
    # split every field and value into a separate line
    results = []
    mini_collection = []
    if not record.startswith("/*"):
        #for data in record.rstrip('\n').strip().split('   '):
        for data in record.rstrip('\n').split('   '):
            if ':' not in data:
                mini_collection.append(data)
            else:
                results.append(data)
    return results
                    
def create_dictionary(data):   
    record = {}                
    for line in fix_line(data):
        line = line.strip()
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record

Tags: pandas, python

2 Answers

Stack Overflow user

Answered 2021-09-19 04:06:46

Here is one approach:

import re
import yaml #python -m pip install pyyaml
import pandas as pd 

s = """08/23/21 04:00:05 AM


/* ----------------- data1----------------- */ 

make: honda
model: civic
year: 2019
trim: lx
owner: phillip

/* ----------------- data2----------------- */ 

make: toyota
model: highlander
year: 2021
trim: Platinum
owner: Rex
"""

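# Split on the /* ... */ section markers; each remaining chunk is valid YAML.
blocks = re.split(r"/\*.*?\*/", s)
# safe_load avoids the Loader argument that yaml.load requires in PyYAML 6+.
records = [yaml.safe_load(block) for block in blocks if "make:" in block]
df = pd.DataFrame(records)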

Output:

     make       model  year      trim    owner
0   honda       civic  2019        lx  phillip
1  toyota  highlander  2021  Platinum      Rex
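
The sample text in the answer above differs from the file in the question: each field sits on its own line, the quotes around trim are gone, and an owner line has been added to data2. As a rough sketch only, the file as originally posted (two fields on one line, quoted values, a missing owner) could be handled with the standard library alone. The names RAW, FIELDS, parse_records, and to_csv are illustrative and not part of either answer, and the sketch assumes every value is a single token without spaces.

```python
import csv
import io
import re

# The file contents exactly as shown in the question.
RAW = '''08/23/21 04:00:05 AM


/* ----------------- data1----------------- */

make: honda   model: civic
year: 2019
trim: "lx"
owner: phillip

/* ----------------- data2----------------- */

make: toyota  model: highlander
year: 2021
trim: "Platinum"
'''

FIELDS = ["make", "model", "year", "trim", "owner"]


def parse_records(text):
    """Return one dict per /* ... */ section of the text."""
    # Everything before the first marker is the timestamp header, so skip it.
    sections = re.split(r"/\*.*?\*/", text)[1:]
    records = []
    for section in sections:
        # key: value pairs; the value is one token, optionally wrapped in quotes.
        pairs = re.findall(r'(\w+):[ \t]*"?([^\s"]+)"?', section)
        if pairs:
            records.append(dict(pairs))
    return records


def to_csv(records):
    """Render the records as CSV text, leaving missing fields blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = parse_records(RAW)
print(to_csv(records))
```

Because data2 has no owner line in the posted file, its owner column comes out empty here. Passing the same list to `pd.DataFrame(records)` would give a DataFrame with NaN in that cell.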

Votes: 2

Stack Overflow user

Answered 2021-09-19 03:02:52

Try using re.finditer with the pattern below to build dictionaries from the named groups, then load them into a DataFrame.

import re
import pandas as pd

# `data` is assumed to hold the full text of the input file as one string.
pattern = r"""
    (?P<make>(?<=(make:\ ))\w+) #use lookbehind regex to get make
    (\s + model: \ )            #Skip to model
    (?P<model>\w+)              #Get Model
    (\s year: \ )               #Skip to year
    (?P<year>\d+)               #Get year
    (\s + trim: \ ")            #Skip to trim
    (?P<trim>\w+)               #Get trim
    ("\s)                       #Skip to owner
    (?P<owner>.*)               #Get owner
"""

df = pd.DataFrame([item.groupdict() for item in re.finditer(pattern, data, re.VERBOSE)])
df["owner"] = df["owner"].str.replace("owner: ", "")
df
Out[563]: 
     make       model  year      trim    owner
0   honda       civic  2019        lx  phillip
1  toyota  highlander  2021  Platinum
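
One possible variation on this regex approach, offered as a sketch and not as the answerer's own method, is to make the owner field an optional group. That way a record without an owner still matches, and the "owner: " prefix does not need to be stripped afterwards. DATA and RECORD are illustrative names, and the pattern assumes the field order shown in the question.

```python
import re

# The file contents exactly as shown in the question.
DATA = '''08/23/21 04:00:05 AM


/* ----------------- data1----------------- */

make: honda   model: civic
year: 2019
trim: "lx"
owner: phillip

/* ----------------- data2----------------- */

make: toyota  model: highlander
year: 2021
trim: "Platinum"
'''

# In re.VERBOSE mode unescaped whitespace is ignored, so literal spaces are
# written as an escaped space.
RECORD = re.compile(r'''
    make:\ (?P<make>\w+)               # make
    \s+model:\ (?P<model>\w+)          # model, on the same line as make
    \s+year:\ (?P<year>\d+)            # year, on the next line
    \s+trim:\ "(?P<trim>\w+)"          # trim, without its quotes
    (?:\s+owner:\ (?P<owner>\w+))?     # owner, captured only when present
''', re.VERBOSE)

rows = [match.groupdict() for match in RECORD.finditer(DATA)]
for row in rows:
    print(row)
```

A record with no owner line gets `None` for that key, which pandas would show as a missing value if the list is passed to `pd.DataFrame(rows)`.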

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/69239802
