Question: Preprocessing tweets in a JSON file

Stack Overflow user

Asked 2017-06-27 16:38:30

2 answers, 1.8K views, 0 followers, score 3

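I am following an article called "Mining Twitter Data with Python".

I am currently on part 2, text pre-processing. This is the article's example for tokenizing the text of a tweet.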

import re
import json

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
regex_str = [
    emoticons_str,
    r'<[^>]+>',  # HTML Tags
    r'(?:@[\w_]+)',  # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words with - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\S)'  # anything else
]

tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)


def tokenize(s):
    return tokens_re.findall(s)


def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

When I pass in a string directly, like this, it works fine:

tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(preprocess(tweet))

However, as soon as I try to read a JSON file and tokenize the text of every tweet in it, I get an error.

This is how it is supposed to work:

with open('tweets.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])

This is the error that is shown:

Traceback (most recent call last):
  File "C:/Users/fmigg/PycharmProjects/untitled/Data Mining/tweetTextProcessing.py", line 43, in <module>
    tweet = json.loads(line)
  File "C:\Program Files\Anaconda3\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

Finally, this is the JSON file, named tweets.json, that holds the tweets (there are quite a lot of them, so I am including only one tweet to show its structure).

{"created_at":"Tue Jun 27 16:05:01 +0000 2017","id":879732307992739840,"id_str":"879732307992739840","text":"RT @PythonQnA: Python List Comprehension Vs. Map #python #list-comprehension #map-function https:\/\/t.co\/YtxeSt64pd","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":704974573985525760,"id_str":"704974573985525760","name":"UNIVERSAL TGSI","screen_name":"universaltgsi","location":"Magny-le-Hongre, France, SM","url":"http:\/\/www.tgsi.eu","description":"Find everything you want to know about business Technology by ONE TGSI","protected":false,"verified":false,"followers_count":424,"friends_count":343,"listed_count":273,"favourites_count":4250,"statuses_count":2958,"created_at":"Wed Mar 02 10:20:11 +0000 2016","utc_offset":7200,"time_zone":"Paris","geo_enabled":false,"lang":"fr","contributors_enabled":false,"is_translator":false,"profile_background_color":"1B95E0","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/705020861909225472\/psLvMIAP.jpg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/705020861909225472\/psLvMIAP.jpg","profile_background_tile":true,"profile_link_color":"0084B9","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/866410987880099840\/HT8fZKLO_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/866410987880099840\/HT8fZKLO_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/704974573985525760\/1495404137","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Tue Jun 27 08:24:00 +0000 2017","id":879616290700263424,"id_str":"879616290700263424","text":"Python List Comprehension Vs. Map #python #list-comprehension #map-function https:\/\/t.co\/YtxeSt64pd","source":"\u003ca href=\"http:\/\/jarvis.ratankumar.org\/\" rel=\"nofollow\"\u003ePythonQnA\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":747460774998605825,"id_str":"747460774998605825","name":"PythonQnA","screen_name":"PythonQnA","location":"Bengaluru, India","url":null,"description":"I tweet Python questions from stackoverflow.","protected":false,"verified":false,"followers_count":632,"friends_count":64,"listed_count":277,"favourites_count":0,"statuses_count":85791,"created_at":"Mon Jun 27 16:05:10 +0000 2016","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"F5F8FA","profile_background_image_url":"","profile_background_image_url_https":"","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/747461193653092352\/Mz9NjeE__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/747461193653092352\/Mz9NjeE__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/747460774998605825\/1467044067","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":2,"favorite_count":1,"entities":{"hashtags":[{"text":"python","indices":[34,41]},{"text":"list","indices":[42,47]},{"text":"map","indices":[62,66]}],"urls":[{"url":"https:\/\/t.co\/YtxeSt64pd","expanded_url":"https:\/\/goo.gl\/OZxWIC","display_url":"goo.gl\/OZxWIC","indices":[76,99]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"python","indices":[49,56]},{"text":"list","indices":[57,62]},{"text":"map","indices":[77,81]}],"urls":[{"url":"https:\/\/t.co\/YtxeSt64pd","expanded_url":"https:\/\/goo.gl\/OZxWIC","display_url":"goo.gl\/OZxWIC","indices":[91,114]}],"user_mentions":[{"screen_name":"PythonQnA","name":"PythonQnA","id":747460774998605825,"id_str":"747460774998605825","indices":[3,13]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":true,"filter_level":"low","lang":"en","timestamp_ms":"1498579501518"}

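I would like to know why this happens. Thanks a lot, everyone!

Here is the link to the article: Mining Twitter Data with Python (Part 2: Text Pre-processing)

Update:

I tried the code with a JSON file containing a single simple tweet, and with one containing two simple tweets, and it worked. So the problem seems to appear only when I open the whole file with all the tweets.

If anyone needs the file, you can download or view it on my Microsoft OneDrive: https://1drv.ms/f/s!AjHPHWCBEuf7ux3uLmSVEaSCPWIE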

Tags: tweepy, python, json, python-3.x, twitter

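2 Answers

Stack Overflow user

Accepted answer

Answered 2017-06-29 15:33:26

As @balki said, this happens because every JSON object in the file is followed by an empty line, in this pattern: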

1 JSON Object
2 empty line
3 JSON Object
4 empty line
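
The effect of such an empty line can be reproduced in isolation. The snippet below is a minimal sketch, not part of the original answer: it feeds json.loads a string holding only a newline, which is what the loop in the question receives on every second line. The variable names blank_line and message are only illustrative.

```python
import json

# What the reading loop sees on an empty line of the file: just a newline.
blank_line = "\n"

try:
    json.loads(blank_line)
    message = None
except json.JSONDecodeError as exc:
    # There is no JSON value to decode after the leading whitespace.
    message = str(exc)

print(message)  # Expecting value: line 2 column 1 (char 1)
```

The message is the same one that ends the traceback in the question, which fits the explanation that the parser is being handed a blank line rather than a malformed tweet.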

So I took the solution from the question "Deleting a specific line in a file (python)" and changed it to erase the empty lines, like this:

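def erase_empty_lines(file_name):
    # Read every line first so the file can be rewritten in place.
    with open(file_name, 'r') as file:
        lines = file.readlines()

    # Write back only the lines that contain something other than whitespace.
    with open(file_name, 'w') as file:
        for line in lines:
            if line.strip():
                file.write(line)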
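
If rewriting the file is not desirable, the empty lines can also be skipped while reading. The following is only a sketch under the assumption that the file holds one JSON object per line with blank lines in between; iter_tweets is a hypothetical helper name, and the UTF-8 encoding is an assumption about how the file was saved.

```python
import json


def iter_tweets(file_name):
    """Yield one parsed tweet per non-blank line (hypothetical helper)."""
    # Assumption: the file is UTF-8 encoded, one JSON object per line.
    with open(file_name, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # Skip the empty separator lines instead of parsing them.
                continue
            yield json.loads(line)


# Possible usage with the preprocess() function from the question:
# for tweet in iter_tweets('tweets.json'):
#     tokens = preprocess(tweet['text'])
```

Because this is a generator, the tweets are parsed one at a time and the original file is left untouched.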

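Score: 1

Stack Overflow user

Answered 2017-06-27 18:24:05

Your JSON file probably consists of a single line that contains the entire JSON string. In that case there is no need to iterate over the lines of the file. Instead, load the contents of the file with tweets = json.load(fp). Assuming the individual tweets are stored in a list, you can iterate over them like this: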

with open('tweets.json') as fp:
    tweets = json.load(fp)

for tweet in tweets:
    tokens = preprocess(tweet['text'])
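
Whether a whole-file parse works depends on how the file is laid out: several JSON objects written one after another are not a single JSON document, and json.load rejects them with an "Extra data" error. The sketch below is an assumption-laden illustration rather than part of the original answer. The hypothetical helper parse_tweets first tries to parse the text as one document and falls back to line-by-line parsing when that fails.

```python
import json


def parse_tweets(raw_text):
    """Return a list of tweets from a JSON array or from line-delimited JSON.

    Hypothetical helper; it assumes each tweet is a JSON object.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        # Not a single JSON document: parse each non-blank line separately.
        return [json.loads(line) for line in raw_text.splitlines() if line.strip()]
    # A single document: either a list of tweets or one tweet object.
    return data if isinstance(data, list) else [data]


# Possible usage:
# with open('tweets.json') as fp:
#     tweets = parse_tweets(fp.read())
```

This reads the whole file into memory, so it suits small files better than large collections of tweets.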

Score: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/44785611
