首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >JSON文件中的预处理tweet

JSON文件中的预处理tweet
EN

Stack Overflow用户
提问于 2017-06-27 16:38:30
回答 2查看 1.8K关注 0票数 3

下面是一篇名为:用Python挖掘Twitter数据的小文章

实际上,我在第二部分,也就是文本预处理。这是标记tweet文本的示例。

代码语言:javascript
复制
import re
import json

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
regex_str = [
    emoticons_str,
    r'<[^>]+>',  # HTML Tags
    r'(?:@[\w_]+)',  # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&amp;+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words with - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\S)'  # anything else
]

tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)


def tokenize(s):
    return tokens_re.findall(s)


def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

现在,当您直接插入如下字符串时,它会正常工作:

代码语言:javascript
复制
tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(preprocess(tweet))

但是,一旦我试图导入一个JSON来标记文件中的所有tweets文本,它就会出现一个错误。

它应该是这样工作的。

代码语言:javascript
复制
with open('tweets.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])

这是显示的错误:

代码语言:javascript
复制
Traceback (most recent call last):
  File "C:/Users/fmigg/PycharmProjects/untitled/Data Mining/tweetTextProcessing.py", line 43, in <module>
    tweet = json.loads(line)
  File "C:\Program Files\Anaconda3\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

最后,这是一个名为tweets.json的JSON文件,上面有Tweets ( Tweet的数量有点大,所以我只放一个Tweet来分析它的结构)。

代码语言:javascript
复制
{"created_at":"Tue Jun 27 16:05:01 +0000 2017","id":879732307992739840,"id_str":"879732307992739840","text":"RT @PythonQnA: Python List Comprehension Vs. Map #python #list-comprehension #map-function https:\/\/t.co\/YtxeSt64pd","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":704974573985525760,"id_str":"704974573985525760","name":"UNIVERSAL TGSI","screen_name":"universaltgsi","location":"Magny-le-Hongre, France, SM","url":"http:\/\/www.tgsi.eu","description":"Find everything you want to know about business Technology by ONE TGSI","protected":false,"verified":false,"followers_count":424,"friends_count":343,"listed_count":273,"favourites_count":4250,"statuses_count":2958,"created_at":"Wed Mar 02 10:20:11 +0000 2016","utc_offset":7200,"time_zone":"Paris","geo_enabled":false,"lang":"fr","contributors_enabled":false,"is_translator":false,"profile_background_color":"1B95E0","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/705020861909225472\/psLvMIAP.jpg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/705020861909225472\/psLvMIAP.jpg","profile_background_tile":true,"profile_link_color":"0084B9","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/866410987880099840\/HT8fZKLO_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/866410987880099840\/HT8fZKLO_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/704974573985525760\/1495404137","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Tue Jun 27 08:24:00 +0000 2017","id":879616290700263424,"id_str":"879616290700263424","text":"Python List Comprehension Vs. Map #python #list-comprehension #map-function https:\/\/t.co\/YtxeSt64pd","source":"\u003ca href=\"http:\/\/jarvis.ratankumar.org\/\" rel=\"nofollow\"\u003ePythonQnA\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":747460774998605825,"id_str":"747460774998605825","name":"PythonQnA","screen_name":"PythonQnA","location":"Bengaluru, India","url":null,"description":"I tweet Python questions from stackoverflow.","protected":false,"verified":false,"followers_count":632,"friends_count":64,"listed_count":277,"favourites_count":0,"statuses_count":85791,"created_at":"Mon Jun 27 16:05:10 +0000 2016","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"F5F8FA","profile_background_image_url":"","profile_background_image_url_https":"","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/747461193653092352\/Mz9NjeE__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/747461193653092352\/Mz9NjeE__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/747460774998605825\/1467044067","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":2,"favorite_count":1,"entities":{"hashtags":[{"text":"python","indices":[34,41]},{"text":"list","indices":[42,47]},{"text":"map","indices":[62,66]}],"urls":[{"url":"https:\/\/t.co\/YtxeSt64pd","expanded_url":"https:\/\/goo.gl\/OZxWIC","display_url":"goo.gl\/OZxWIC","indices":[76,99]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"python","indices":[49,56]},{"text":"list","indices":[57,62]},{"text":"map","indices":[77,81]}],"urls":[{"url":"https:\/\/t.co\/YtxeSt64pd","expanded_url":"https:\/\/goo.gl\/OZxWIC","display_url":"goo.gl\/OZxWIC","indices":[91,114]}],"user_mentions":[{"screen_name":"PythonQnA","name":"PythonQnA","id":747460774998605825,"id_str":"747460774998605825","indices":[3,13]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":true,"filter_level":"low","lang":"en","timestamp_ms":"1498579501518"}

我想知道为什么会发生这种事。非常感谢大家!

这是文章的链接:用Python挖掘Twitter数据(第2部分:文本预处理)

更新:

我在一个JSON文件中使用了一个简单的JSON tweet和两个简单的JSON tweet来尝试代码,它起了作用。因此,问题似乎是当我打开包含所有Tweets的整个文件时。

如果有人需要该文件,您可以在我的Microsoft中下载或观看它。https://1drv.ms/f/s!AjHPHWCBEuf7ux3uLmSVEaSCPWIE

EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2017-06-29 15:33:26

正如@balki所说,这是因为在这种模式中,每个JSON对象都有空行:

代码语言:javascript
复制
1 JSON Object
2 empty line
3 JSON Object
4 empty line

因此,我从以下问题删除文件中的特定行(python)中提取了解决方案,并将其更改为擦除空行,如下所示:

代码语言:javascript
复制
def erase_empty_lines(file_name):
    file = open(file_name, 'r')
    lines = file.readlines()
    file.close()

    file = open(file_name, 'w')
    for line in lines:
        if line != '\n':
            file.write(line)
    file.close()
票数 1
EN

Stack Overflow用户

发布于 2017-06-27 18:24:05

您的json文件可能只有一行,其中包含整个json字符串。因此,没有必要对文件的行进行迭代。相反,您希望通过tweets = json.load(f)加载json文件的内容。假设不同的tweet存储在列表中,您可以这样迭代它们:

代码语言:javascript
复制
with open('tweets.json') as fp:
    tweets = json.load(fp)

for tweet in tweets:
    tokens = preprocess(tweet['text'])
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/44785611

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档