Split a string into a list on whitespace, but do not split on a single space unless the next character is a dash

Stack Overflow user

Asked 2020-12-16 05:13:19

2 answers · 406 views · 0 followers · 2 votes

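I am scraping a website that has a table of satellite values (https://planet4589.org/space/gcat/data/cat/satcat.html).

Because the entries are separated only by whitespace, I need a way to split the string for one data row into an array.

However, the .split() function does not suit my needs: some data entries contain spaces themselves (e.g. Able 3), so I can't simply split on every space.

It gets trickier, though. Where no data is available, a dash ("-") is used. If two data entries are separated by only a single space and one of them is a dash, I don't want them merged into one entry.

For example, say we have the two entries "Able 3" and "-", separated by a single space. In the file they appear as "Able 3 -". I want to split this string into the separate data entries "Able 3" and "-" (as a list, that would be ["Able 3", "-"]).

Another example: "data1 -" needs to be split into ["data1", "-"].

Basically, I need to split a string into a list of words separated by whitespace, except that a single space between two words should not split them unless one of them is a dash.

Also, as you can see, the table is massive. I thought about looping over every character, but that would be too slow, and I need to run this thousands of times.

Here is a sample from the beginning of the file: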

JCAT         Satcat Piece          Type         Name                         PLName                       LDate        Parent        SDate              Primary       DDate              Status Dest           Owner        State        Manufacturer       Bus              Motor        Mass          DryMass      TotMass       Length    Diamete  Span       Shape                            ODate              Perigee   Apogee    Inc     OpOrbitOQU AltNames
S00001       00001  1957 ALP 1     R2           8K71A M1-10                  8K71A M1-10 (M1-1PS)         1957 Oct  4  -             1957 Oct  4 1933   Earth         1957 Dec  1 1000?  R      -              OKB1         SU           OKB1               Blok-A           -                7790         7790          7800    ?   28.0      2.6       28.0    Cyl                              1957 Oct  4             214       938   65.10  LLEO/I -
S00002       00002  1957 ALP 2     P            1-y ISZ                      PS-1                         1957 Oct  4  S00001        1957 Oct  4 1933   Earth         1958 Jan  4?       R      -              OKB1         SU           OKB1               PS               -                  84           84            84         0.6      0.6        2.9    Sphere + Ant                     1957 Oct  4             214       938   65.10  LLEO/I -
S00003       00003  1957 BET 1     P A          2-y ISZ                      PS-2                         1957 Nov  3  A00002        1957 Nov  3 0235   Earth         1958 Apr 14 0200?  AR     -              OKB1         SU           OKB1               PS               -                 508          508          8308    ?    2.0      1.0        2.0    Cone                             1957 Nov  3             211      1659   65.33  LEO/I  -
S00004       00004  1958 ALP       P A          Explorer 1                   Explorer 1                   1958 Feb  1  A00004        1958 Feb  1 0355   Earth         1970 Mar 31 1045?  AR     -              ABMA/JPL     US           JPL                Explorer         -                   8            8            14         0.8      0.1        0.8    Cyl                              1958 Feb  1             359      2542   33.18  LEO/I  -
S00005       00005  1958 BET 2     P            Vanguard I                   Vanguard Test Satellite      1958 Mar 17  S00016        1958 Mar 17 1224   Earth         -                  O      -              NRL          US           NRL                NRL 6"           -                   2            2             2         0.1      0.1        0.1    Sphere                           1959 May 23             657      3935   34.25  MEO    -
S00006       00006  1958 GAM       P A          Explorer 3                   Explorer 3                   1958 Mar 26  A00005        1958 Mar 26 1745   Earth         1958 Jun 28        AR     -              ABMA/JPL     US           JPL                Explorer         -                   8            8            14         0.8      0.1        0.8    Cyl                              1958 Mar 26             195      2810   33.38  LEO/I  -
S00007       00007  1958 DEL 1     R2           8K74A                        8K74A                        1958 May 15  -             1958 May 15 0705   Earth         1958 Dec  3        R      -              OKB1         SU           OKB1               Blok-A           -                7790         7790          7820    ?   28.0      2.6       28.0    Cyl                              1958 May 15             214      1860   65.18  LEO/I  -
S00008       00008  1958 DEL 2     P            3-y Sovetskiy ISZ            D-1 No. 2                    1958 May 15  S00007        1958 May 15 0706   Earth         1960 Apr  6        R      -              OKB1         SU           OKB1               Object D         -                1327         1327          1327         3.6      1.7        3.6    Cone                             1959 May  7             207      1247   65.12  LEO/I  -
S00009       00009  1958 EPS       P A          Explorer 4                   Explorer 4                   1958 Jul 26  A00009        1958 Jul 26 1507   Earth         1959 Oct 23        AR     -              ABMA/JPL     US           JPL                Explorer         -                  12           12            17         0.8      0.1        0.8    Cyl                              1959 Apr 24             258      2233   50.40  LEO/I  -
S00010       00010  1958 ZET       P A          SCORE                        SCORE                        1958 Dec 18  A00015        1958 Dec 18 2306   Earth         1959 Jan 21        AR     -              ARPA/SRDL    US           SRDL               SCORE            -                  68           68          3718         2.5  ?   1.5  ?     2.5    Cone                             1958 Dec 30             159      1187   32.29  LEO/I  -
S00011       00011  1959 ALP 1     P            Vanguard II                  Cloud cover satellite        1959 Feb 17  S00012        1959 Feb 17 1605   Earth         -                  O      -              BSC          US           NRL                NRL 20"          -                  10           10            10         0.5      0.5        0.5    Sphere                           1959 May 15             564      3304   32.88  MEO    -
S00012       00012  1959 ALP 2     R3           GRC 33-KS-2800               GRC 33-KS-2800 175-15-21     1959 Feb 17  R02749        1959 Feb 17 1604   Earth         -                  O      -              BSC          US           GCR                33-KS-2800       -                 195           22            22         1.5      0.7        1.5    Cyl                              1959 Apr 28             564      3679   32.88  MEO    -
S00013       00013  1959 BET       P A          Discoverer 1                 CORONA Test Vehicle 2        1959 Feb 28  A00017        1959 Feb 28 2156   Earth         1959 Mar  5        AR     -              ARPA/CIA     US           LMSD               CORONA           -                  78    ?      78    ?      668    ?    2.0      1.5        2.0    Cone                             1959 Feb 28             163?      968?  89.70  LLEO/P -
S00014       00014  1959 GAM       P A          Discoverer 2                 CORONA BIO 1                 1959 Apr 13  A00021        1959 Apr 13 2126   Earth         1959 Apr 26        AR     -              ARPA/CIA     US           LMSD               CORONA           -                 110    ?     110    ?      788         1.3      1.5        1.3    Frust                            1959 Apr 13             239       346   89.90  LLEO/P -
S00015       00015  1959 DEL 1     P            Explorer 6                   NASA S-2                     1959 Aug  7  S00017        1959 Aug  7 1430   Earth         1961 Jul  1        R?     -              GSFC         US           TRW                Able Probe       ARC 420            40           40            42    ?    0.7      0.7        2.2    Sphere + 4 Pan                   1959 Sep  8             250     42327   46.95  HEO    -   Able 3
S00016       00016  1958 BET 1     R3           GRC 33-KS-2800               GRC 33-KS-2800  144-79-22    1958 Mar 17  R02064        1958 Mar 17 1223   Earth         -                  O      -              NRL          US           GCR                33-KS-2800       -                 195           22            22         1.5      0.7        1.5    Cyl                              1959 Sep 30             653      4324   34.28  MEO    -
S00017       00017  1959 DEL 2     R3           Altair                       Altair X-248                 1959 Aug  7  A00024        1959 Aug  7 1428   Earth         1961 Jun 30        R?     -              USAF         US           ABL                Altair           -                  24           24            24         1.5      0.5        1.5    Cyl                              1961 Jan  8             197     40214   47.10  GTO    -
S00018       00018  1959 EPS 1     P A          Discoverer 5                 CORONA C-2                   1959 Aug 13  A00028        1959 Aug 13 1906   Earth         1959 Sep 28        AR     -              ARPA/CIA     US           LMSD               CORONA           -                 140          140           730         1.3      1.5        1.3    Frust                            1959 Aug 14             215       732   80.00  LLEO/I -   NRO Mission 9002

python

python-3.x

string

list

split

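2 Answers

Stack Overflow user

Accepted answer

Posted 2020-12-16 06:24:27

A less hackish approach is to interpret the headers on the first line as column indicators, and to split each row at those widths.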

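import re
import sys


def col_widths(s):
    # Shamelessly adapted from https://stackoverflow.com/a/33090071/874188
    # Each match is a header name plus the whitespace that follows it.
    cols = re.findall(r'\S+\s+', s)
    return [len(col) for col in cols]


# The first line of input is the header; it defines the column boundaries.
widths = col_widths(next(sys.stdin))

for line in sys.stdin:
    line = line.rstrip('\n')
    fields = []
    # Slice off every column except the last at its fixed width.
    for col_max in widths[:-1]:
        fields.append(line[:col_max].strip())
        line = line[col_max:]
    # Whatever remains is the last column.
    fields.append(line)
    print(fields)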

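Demo: https://ideone.com/ASANjn

This seems to give a better interpretation of, for example, the LDate column, where the dates are sometimes padded with more than one space. The penultimate column keeps the final dash as part of the column value; this seems more consistent with the apparent intent of the original table's author, but if that's not to your liking, you could split it off separately for that specific column.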
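If you do want the dash separated from that column, one possible sketch is below. The helper name `split_trailing_dash` is made up for illustration, and it assumes the dash is always the last space-separated token of the value, as in the sample rows.

```python
def split_trailing_dash(value):
    """Split a value such as 'LLEO/I -' into ('LLEO/I', '-').

    Hypothetical helper: returns (value, None) when there is no trailing dash.
    """
    head, sep, tail = value.rpartition(' ')
    if sep and tail == '-':
        # Drop any padding that sat between the value and the dash.
        return head.rstrip(), tail
    return value, None


print(split_trailing_dash('LLEO/I -'))
print(split_trailing_dash('HEO'))
```

You would apply this only to the penultimate field of each row, leaving the other columns untouched.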

If you don't want to read from sys.stdin, just wrap this in with open(filename) as handle: and replace sys.stdin with handle everywhere.
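A rough sketch of that file-based variant, under the same assumption that the header line defines the column widths. The function names `parse_lines` and `read_table` are invented here, and the file name is whatever you saved the table as.

```python
import re


def col_widths(header):
    # Width of each column: the header name plus the whitespace after it.
    return [len(col) for col in re.findall(r'\S+\s+', header)]


def parse_lines(lines):
    """Yield a list of fields for each data line; the first line is the header."""
    lines = iter(lines)
    widths = col_widths(next(lines))
    for line in lines:
        line = line.rstrip('\n')
        fields = []
        for col_max in widths[:-1]:
            fields.append(line[:col_max].strip())
            line = line[col_max:]
        fields.append(line)  # the remainder is the last column
        yield fields


def read_table(filename):
    # Hypothetical wrapper: parse an opened file instead of sys.stdin.
    with open(filename) as handle:
        return list(parse_lines(handle))
```

Keeping the parsing in `parse_lines` means it accepts any iterable of lines, so it can be tried on a short list of strings before pointing `read_table` at the full file.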

Votes: 5

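Stack Overflow user

Posted 2020-12-16 06:04:56

One way is to use pandas.read_fwf(), which reads text files in a fixed-width format. The function returns pandas DataFrames, which are useful for handling large datasets.

As a quick taster, here is what this simple bit of code does: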

import pandas as pd

data = pd.read_fwf("data.txt")
print(data.columns)  # Prints an index of all columns.
print()
print(data.head(5))  # Prints the top 5 rows.

# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate',
#        'Unnamed: 7', 'Parent', 'SDate', 'Unnamed: 10', 'Unnamed: 11',
#        'Primary', 'DDate', 'Unnamed: 14', 'Status', 'Dest', 'Owner', 'State',
#        'Manufacturer', 'Bus', 'Motor', 'Mass', 'Unnamed: 23', 'DryMass',
#        'Unnamed: 25', 'TotMass', 'Unnamed: 27', 'Length', 'Unnamed: 29',
#        'Diamete', 'Span', 'Unnamed: 32', 'Shape', 'ODate', 'Unnamed: 35',
#        'Perigee', 'Apogee', 'Inc', 'OpOrbitOQU', 'AltNames'],
#       dtype='object')
# 
#      JCAT  Satcat       Piece Type  ... Apogee    Inc OpOrbitOQU  AltNames
# 0  S00001       1  1957 ALP 1   R2  ...    938  65.10   LLEO/I -       NaN
# 1  S00002       2  1957 ALP 2    P  ...    938  65.10   LLEO/I -       NaN
# 2  S00003       3  1957 BET 1  P A  ...   1659  65.33   LEO/I  -       NaN
# 3  S00004       4    1958 ALP  P A  ...   2542  33.18   LEO/I  -       NaN
# 4  S00005       5  1958 BET 2    P  ...   3935  34.25   MEO    -       NaN

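You will notice that some of the columns are unnamed. We can fix this by working out the file's field widths and using them to guide read_fwf()'s parsing. We do that by reading the first line of the file and iterating over it.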

# Read the header line; "data.txt" is the same file passed to read_fwf() above.
with open("data.txt") as f:
    first_line = f.readline().rstrip("\n")

field_widths = []  # We'll append column widths into this list.
last_i = 0
new_field = False
for i, x in enumerate(first_line):
    if x != ' ' and new_field:
        # Register a new field.
        new_field = False
        field_widths.append(i - last_i)  # Get the field width by subtracting
                                         #   the index from previous field's index.
        last_i = i  # Set the new field index.

    elif not new_field and x == ' ':
        # We've encountered a space.
        new_field = True # Set true so that the next
                         #   non-space encountered is
                         #   recognised as a new field

# Append the last field. Set to a high number,
#   so that all strings are eventually read.
field_widths.append(64)

Just a simple loop. Nothing fancy.
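To see what the loop produces, here is the same logic wrapped in a function and run on a shortened header. The name `header_widths` and the `last_width` parameter are made up for this sketch; 64 is the same arbitrary large width used for the last field above.

```python
def header_widths(first_line, last_width=64):
    """Return column widths, each measured from the start of one header name to the next."""
    widths = []
    last_i = 0
    in_gap = False  # True once a space has been seen after a header name
    for i, ch in enumerate(first_line):
        if ch != ' ' and in_gap:
            # First character of the next header name: close the previous field.
            widths.append(i - last_i)
            last_i = i
            in_gap = False
        elif ch == ' ':
            in_gap = True
    widths.append(last_width)  # the last field has no following name to measure to
    return widths


print(header_widths("JCAT   Satcat Piece"))  # [7, 7, 64]
```

Each width includes the padding after the header name, which is what lets multi-word values such as "1957 ALP 1" stay in one field.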

All that's left is to pass the field_widths list through the widths= keyword argument:

data = pd.read_fwf("data.txt", widths=field_widths)
print(data.columns)

# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate', 'Parent',
#        'SDate', 'Primary', 'DDate', 'Status', 'Dest', 'Owner', 'State',
#        'Manufacturer', 'Bus', 'Motor', 'Mass', 'DryMass', 'TotMass', 'Length',
#        'Diamete', 'Span', 'Shape', 'ODate', 'Perigee', 'Apogee', 'Inc',
#        'OpOrbitOQU'],
#       dtype='object')

data is a DataFrame, but with a little work you can convert it to a list of lists or a list of dicts. Or you can work with the DataFrame directly.

So say you want the first row. Then you could do

datalist = data.values.tolist()
print(datalist[0])

# ['S00001', 1, '1957 ALP 1', 'R2', '8K71A M1-10', '8K71A M1-10 (M1-1PS)', '1957 Oct  4', '-', '1957 Oct  4 1933', 'Earth', '1957 Dec  1 1000?', 'R', '-', 'OKB1', 'SU', 'OKB1', 'Blok-A', '-', '7790', '7790', '7800    ?', '28.0', '2.6', '28.0', 'Cyl', '1957 Oct  4', '214', '938', '65.10', 'LLEO/I -']
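For the list-of-dicts form, DataFrame.to_dict(orient="records") gives one dict per row. A small sketch follows; the toy frame stands in for the result of read_fwf() and only copies a few values from the sample rows.

```python
import pandas as pd

# Toy frame standing in for the DataFrame returned by pd.read_fwf().
data = pd.DataFrame({
    "JCAT": ["S00001", "S00002"],
    "Satcat": [1, 2],
    "Name": ["8K71A M1-10", "1-y ISZ"],
})

# One dict per row, keyed by column name.
records = data.to_dict(orient="records")
print(records[0]["Name"])
```

This keeps the column names attached to each value, which is usually easier to work with than positional indexes into a list of lists.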

Votes: 1

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/65317704
