Why does a pandas 2-minute bucket print NaN even though all my row values are numeric (not NaN)?

Stack Overflow user

Asked on 2018-06-26 07:39:33

1 answer · 182 views · 0 followers · 1 vote

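I know there are no NaN values in the response_bytes column of my data, because when I run data[data.response_bytes.isna()].count(), the result is 0.

When I compute the mean over 2-minute buckets and then call head(), I get NaN: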

print(data.reset_index().set_index('time').resample('2min').mean(numeric_only=True).head())

                     index  identity  user  http_code  response_bytes  unknown
time                                                                          
2018-01-31 09:26:00    0.5       NaN   NaN      200.0           264.0      NaN
2018-01-31 09:28:00    NaN       NaN   NaN        NaN             NaN      NaN
2018-01-31 09:30:00    NaN       NaN   NaN        NaN             NaN      NaN
2018-01-31 09:32:00    NaN       NaN   NaN        NaN             NaN      NaN
2018-01-31 09:34:00    NaN       NaN   NaN        NaN             NaN      NaN

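Why does the 2-minute bucket mean of response_bytes contain NaN values?

I wanted to experiment and learn how time buckets work in pandas. So I used the log file http://www.cs.tufts.edu/comp/116/access.log as input data, loaded it into a DataFrame, applied a 2-minute time bucket (for the first time in my life), and ran mean(). Since none of the values are NaN, I did not expect to see any NaN in the response_bytes column.

Here is my full code: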

import urllib.request
import pandas as pd
import re
from datetime import datetime
import pytz

pd.set_option('display.max_columns', 10)

def parse_str(x):
    """
    Returns the string delimited by two characters.

    Example:
        `>>> parse_str('[my string]')`
        `'my string'`
    """
    return x[1:-1]

def parse_datetime(x):
    '''
    Parses datetime with timezone formatted as:
        `[day/month/year:hour:minute:second zone]`

    Example:
        `>>> parse_datetime('[13/Nov/2015:11:45:42 +0000]')`
        `datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)`

    Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
    timezone will be obtained using the `pytz` library.
    '''
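    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    # Apply the sign to hours and minutes alike, so that -0530 gives -330 minutes
    sign = -1 if x[-6] == '-' else 1
    dt_tz = sign * (int(x[-5:-3]) * 60 + int(x[-3:-1]))
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))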

# data = pd.read_csv(StringIO(accesslog))
url = "http://www.cs.tufts.edu/comp/116/access.log"
accesslog =  urllib.request.urlopen(url).read().decode('utf-8')
fields = ['host', 'identity', 'user', 'time_part1', 'time_part2', 'cmd_path_proto', 
          'http_code', 'response_bytes', 'referer', 'user_agent', 'unknown']

data = pd.read_csv(url, sep=' ', header=None, names=fields, na_values=['-'])

# With sep=' ', read_csv splits the bracketed timestamp into two columns, so we must concatenate them
time = data.time_part1 + data.time_part2
time_trimmed = time.map(lambda s: re.split('[-+]', s.strip('[]'))[0]) # Drop the timezone for simplicity
data['time'] = pd.to_datetime(time_trimmed, format='%d/%b/%Y:%H:%M:%S')

data.head()

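# numeric_only=True skips text columns such as host, which recent pandas versions refuse to average
print(data.reset_index().set_index('time').resample('2min').mean(numeric_only=True).head())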

I expected that the mean of the response_bytes column would not be NaN.
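As a side note on the timestamp handling in the code above: the parse_datetime docstring mentions trouble with `%z`, but in Python 3 `datetime.strptime` does accept it. Here is a minimal sketch, assuming a bracketed timestamp in the same layout as the log (the sample value `raw` is made up for illustration, and `%b` assumes an English locale for the month abbreviation):

```python
from datetime import datetime, timedelta

# Assumed sample value in the access-log timestamp layout, brackets included.
raw = "[13/Nov/2015:11:45:42 +0000]"

# %z consumes the "+0000" offset; the literal brackets are matched as-is.
parsed = datetime.strptime(raw, "[%d/%b/%Y:%H:%M:%S %z]")

print(parsed.isoformat())
```

This keeps the UTC offset on the parsed value instead of dropping it, which the question's code does only for simplicity.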

python

pandas

1 Answer

Stack Overflow user

Accepted answer

Answered on 2018-06-26 07:45:29

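This is expected behaviour: resampling converts the data to a regular time interval, so any interval that contains no samples gets NaN.

In other words, some of the 2-minute intervals contain no datetimes at all, for example the one between 2018-01-31 09:28:00 and 2018-01-31 09:30:00, so mean has nothing to average there and returns NaN.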

print (data[data['time'].between('2018-01-31 09:28:00','2018-01-31 09:30:00')])
Empty DataFrame
Columns: [host, identity, user, time_part1, time_part2, cmd_path_proto,
          http_code, response_bytes, referer, user_agent, unknown, time]
Index: []

[0 rows x 12 columns]
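To see the same effect without downloading the log, here is a small sketch on made-up data (the timestamps and byte counts below are hypothetical, not taken from access.log). It shows that empty buckets produce NaN for mean and 0 for count, and one possible way to keep only the buckets that contain rows:

```python
import pandas as pd

# Hypothetical data: two requests in the first 2-minute bucket, then a gap,
# then one more request several minutes later.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2018-01-31 09:26:10", "2018-01-31 09:27:30", "2018-01-31 09:33:05"]
        ),
        "response_bytes": [200, 328, 512],
    }
).set_index("time")

buckets = df["response_bytes"].resample("2min")

means = buckets.mean()    # NaN for buckets that contain no rows
counts = buckets.count()  # 0 for those same buckets

print(pd.DataFrame({"mean": means, "count": counts}))

# Keep only the buckets that actually contain data.
non_empty = means.dropna()
print(non_empty)
```

Whether to drop the empty buckets or fill them (for example with 0) depends on what the averages are used for; dropping is just one option.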

1 vote

Page content originally provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/51037433
