
Python: append() and extend()

Stack Overflow user
Asked on 2019-07-11 11:30:18
Answers: 1 · Views: 67 · Followers: 0 · Votes: 2

I have a .txt file with three million lines. The file contains data that looks like this:

# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com

I am trying to parse it so that it looks like this:

+--------------------+--------------+-------------+-----------------------------------------------------+
|    domain_name     | period_count | parsed_code |                      raw_code                       |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test               |            0 | 127.0.1.2   |  :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com       |            2 | 127.0.1.2   |  :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com       |            2 | 127.0.1.2   |  :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn       |            2 | 127.0.1.2   |  :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml           |            2 | 127.0.1.4   |  :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com  |            2 | 127.0.1.4   |  :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com  |            2 | 127.0.1.4   |  :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+

To do this, I came up with the following:

rows = []
raw_code = None
parsed_code = None
with open('dbl-sr-2019-06-02T23_38_27Z.txt', 'r') as f: # assumes the file name is input.txt
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(':127'):
            raw_code = line
            parsed_code = re.split(":", line)[1]
            continue
        if line.startswith('#'):
            continue
        rows.append((line, parsed_code))
#         rows.append((raw_code))
#         rows.extend((line, parsed_code, raw_code))
#         rows.extend((raw_code))

import pandas as pd 
df = pd.DataFrame(rows, columns=['domain_name', "parsed_code" 'raw_spamhaus_return_code'])
print(df)

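The commented-out lines in the code block above either do not produce the output I want or raise an error. I am struggling to build a pandas DataFrame with more than two columns: I can get domain_name and one other column, but no more. I can't seem to get the code right with .append() or .extend(). Can anyone point me in the right direction?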


1 Answer

Stack Overflow user

Accepted answer

Posted on 2019-07-11 11:39:50

The likely source of the problem is a missing comma.

This:

df = pd.DataFrame(rows, columns=[
    'domain_name', 'parsed_code', 'raw_spamhaus_return_code'])

is not the same as:

df = pd.DataFrame(rows, columns=[
    'domain_name', "parsed_code" 'raw_spamhaus_return_code'])

because (note the missing comma):

"parsed_code" 'raw_spamhaus_return_code'

becomes a single string, so the list holds only two column names instead of three.
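A small sketch of that behavior, using only the column names from the question: Python joins adjacent string literals into one string, so the list with the missing comma has two elements, not three.

```python
# Adjacent string literals are concatenated by Python, so the missing
# comma silently produces one long column name instead of two.
broken = ['domain_name', "parsed_code" 'raw_spamhaus_return_code']
fixed = ['domain_name', 'parsed_code', 'raw_spamhaus_return_code']

print(len(broken))  # 2
print(broken[1])    # parsed_coderaw_spamhaus_return_code
print(len(fixed))   # 3
```

With only two column names, a DataFrame built from two-element tuples appears to work, which is why the mistake only surfaces once a third value is added to each row.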

Test code:

import re

data = [x.strip() for x in """
    # RSYNC: 0 1 1 0 512 0
    #$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
    # random_number_ofspaces_before_this text $TTL 60s
    #more random information
    :127.0.1.2:https://www.spamhaus.org/query/domain/$
    test
    :127.0.1.2:https://www.spamhaus.org/query/domain/$
    .0-0m5tk.com
    .0-1-hub.com
    .zzzy1129.cn
    :127.0.1.4:https://www.spamhaus.org/query/domain/$
    .0-il.ml
    .005verf-desj.com
    .01accesfunds.com
""".split('\n')[1:-1]]

rows = []
raw_code = None
parsed_code = None
for line in data:
    line = line.rstrip('\n')
    if line.startswith(':127'):
        raw_code = line
        parsed_code = re.split(":", line)[1]
        continue
    if line.startswith('#'):
        continue
    rows.append((line, line.count('.'), parsed_code, raw_code))

import pandas as pd

df = pd.DataFrame(rows, columns=[
    'domain_name', 'period_count', 'parsed_code',
    'raw_spamhaus_return_code'])
print(df)

Result:

         domain_name  period_count  parsed_code  \
0               test              0   127.0.1.2   
1       .0-0m5tk.com              2   127.0.1.2   
2       .0-1-hub.com              2   127.0.1.2   
3       .zzzy1129.cn              2   127.0.1.2   
4           .0-il.ml              2   127.0.1.4   
5  .005verf-desj.com              2   127.0.1.4   
6  .01accesfunds.com              2   127.0.1.4   

                            raw_spamhaus_return_code  
0  :127.0.1.2:https://www.spamhaus.org/query/doma...  
1  :127.0.1.2:https://www.spamhaus.org/query/doma...  
2  :127.0.1.2:https://www.spamhaus.org/query/doma...  
3  :127.0.1.2:https://www.spamhaus.org/query/doma...  
4  :127.0.1.4:https://www.spamhaus.org/query/doma...  
5  :127.0.1.4:https://www.spamhaus.org/query/doma...  
6  :127.0.1.4:https://www.spamhaus.org/query/doma...  
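As for the commented-out lines in the question, the following is a minimal sketch of how append() and extend() differ on a plain list. The sample values are taken from the question's data; the list names `appended`, `extended`, and `chars` are only illustrative.

```python
line = '.0-il.ml'
parsed_code = '127.0.1.4'
raw_code = ':127.0.1.4:https://www.spamhaus.org/query/domain/$'

# append() adds its argument as ONE element: one tuple, hence one row.
appended = []
appended.append((line, parsed_code, raw_code))

# extend() iterates over its argument and adds each item separately,
# so the tuple is flattened into three independent strings.
extended = []
extended.extend((line, parsed_code, raw_code))

print(len(appended))  # 1
print(len(extended))  # 3

# (raw_code) is just a parenthesised string, not a tuple, so extend()
# walks over it character by character.
chars = []
chars.extend((raw_code))
print(len(chars) == len(raw_code))  # True
```

For building DataFrame rows, append() with one tuple per row is the form that matches a list of column names.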
Votes: 5
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/56981256
