
有没有weka JSONLoader的用法示例?

Stack Overflow user
Asked on 2018-05-23 02:21:36
1 answer · 370 views · 1 vote

I'm using the GUI program, the Weka Knowledge Flow (knowledge explorer), to set up my pipeline for training a classification model. Below is a small sample of my data. The only attribute will be the value of the text field. Since this is supervised learning, each tweet/document has a label/category there.

[
  {
    "id": 8.7361726140328e+17,
    "text": "The Joki's on you! Unless you take advantage of 25% off Scarlet Court Chests - on sale now! https:\/\/t.co\/vc1ttPxJWm",
    "category": [
      "dont_care"
    ]
  },
  {
    "id": 8.7329941695388e+17,
    "text": "Don't be a drag - dress like a queen! Scarlet Court Chest Rolls are 25% off! https:\/\/t.co\/O0Ig5bEZdD",
    "category": [
      "dont_care"
    ]
  },
  {
    "id": 8.7328034547452e+17,
    "text": "Join @Inukii and @MezmoreyezTV for Top 5 Console Plays! https:\/\/t.co\/3JmreXSTWp",
    "category": [
      "dont_care"
    ]
  }
]

The exception I get in the log:

11:16:12: [Low] FlowRunner$1697181913|FlowRunner: Launching start point: JSONLoader
11:16:12: [Basic] JSONLoader$17081058|Loading /home/j/_Github-Projects/GameMediaBot/SmiteGame_classified_data.json
11:16:12: [ERROR] JSONLoader$17081058|java.lang.Exception: Can't recover from previous error(s)
weka.core.WekaException: java.lang.Exception: Can't recover from previous error(s)
    at weka.knowledgeflow.steps.Loader.start(Loader.java:178)
    at weka.knowledgeflow.StepManagerImpl.startStep(StepManagerImpl.java:1035)
    at weka.knowledgeflow.BaseExecutionEnvironment$3.run(BaseExecutionEnvironment.java:440)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.Exception: Can't recover from previous error(s)
    at weka.core.converters.JSONLoader.getStructure(JSONLoader.java:242)
    at weka.core.converters.JSONLoader.getDataSet(JSONLoader.java:267)
    at weka.knowledgeflow.steps.Loader.start(Loader.java:172)
    ... 7 more
Caused by: java.lang.Exception: Can't recover from previous error(s)
    at java_cup.runtime.lr_parser.report_fatal_error(lr_parser.java:392)
    at java_cup.runtime.lr_parser.unrecovered_syntax_error(lr_parser.java:539)
    at java_cup.runtime.lr_parser.parse(lr_parser.java:731)
    at weka.core.json.JSONNode.read(JSONNode.java:634)
    at weka.core.converters.JSONLoader.getStructure(JSONLoader.java:234)
    ... 9 more

11:16:12: [Low] JSONLoader$17081058|Interrupted
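The parse failure above is consistent with Weka's JSONLoader expecting its own JSON Instances layout (the format written by weka.core.converters.JSONSaver, with a header/data split), rather than arbitrary JSON like the tweet array. A rough sketch of that layout follows; the field names are recalled from JSONSaver output and may differ by Weka version, so the reliable check is to save a small ARFF file with JSONSaver and inspect the result:

```json
{
  "header": {
    "relation": "tweets",
    "attributes": [
      {"name": "text",  "type": "string",  "class": false, "weight": 1.0},
      {"name": "class", "type": "nominal", "class": true,  "weight": 1.0,
       "labels": ["dont_care", "care"]}
    ]
  },
  "data": [
    {"sparse": false, "weight": 1.0, "values": ["some tweet text", "dont_care"]}
  ]
}
```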

My pipeline:


1 Answer

Stack Overflow user
Posted on 2018-05-23 03:51:14

For what it's worth, I just wrote a script to convert my data (JSON) to ARFF. I'm not sure what the convention is for choosing attributes from text data; I simply used the top 40 most frequent words from the tweets in the categories I care about. At the end I added an attribute named class, which works like an enum; that seems to be the convention for training a model.

See the code on GitHub: https://github.com/jtara1/GameMediaBot/blob/master/transform_to_arff.py

import re
import json
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import arff
import click


@click.command()
@click.argument('file')
@click.option('--dont-care-category',
              type=click.STRING,
              default='dont_care')
@click.option('-a',
              type=click.INT,
              default=40,
              help='Number of attributes. Attrs are the most frequent words '
                   'in the text of the target category')
def transform(file, dont_care_category, a):
    """input example
    [ {id: 123, text: "this is text body", category: ["dont_care"]} ]
    output example
    @relation game_media_bot

    @attribute

    :return:
    """
    classes = set()
    with open(file, 'r') as f:
        data = json.load(f)

    master_vector = Counter()

    for tweet in data:
        classes.add(tweet['category'][0])
        if tweet['category'][0] != dont_care_category:
            master_vector += get_word_vector(tweet)

    print(master_vector)

    # most common words in the text of the target category
    attrs = [(word, 'INTEGER') for word, _ in master_vector.most_common(a)]
    attrs.append(('class', [value for value in classes]))

    arff_data = {
        'attributes': attrs,
        'data': [],
        'description': '',
        'relation': '{}'.format(dont_care_category)
    }

    for tweet in data:
        word_vector = get_word_vector(tweet)
        tweet_data = [word_vector[attr[0]] for attr in attrs[:-1]]
        tweet_data.append(tweet['category'][0])
        arff_data['data'].append(tweet_data)

    out_file = file.replace('.json', '.arff')
    data = arff.dumps(arff_data)
    with open(out_file, 'w') as f:
        f.write(data)


def get_word_vector(tweet):
    stop_words = stopwords.words('english')
    stop_words += ['!', ':', ',', '-', 'https', '/', '\u2026', "'s", "n't",
                   '#', '.', ';', ')', '(', "'re", '&', '?', '%', '@', "'",
                   '...']

    uri = re.compile(r'(https)?:?//t\.co/.*')

    # remove whitespace characters and put each word in a list
    words = word_tokenize(tweet['text'])

    # make each word lowercase
    words = [word.lower() for word in words]

    words = list(
        filter(
            lambda word: word not in stop_words and not uri.match(word),
            words
        )
    )

    return Counter(words)


if __name__ == '__main__':
    transform()
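The core of the transformation above (count word frequencies across the target categories, take the top N words as integer attributes, append a nominal class attribute, then turn each tweet into a row of counts plus its label) can be shown in isolation. This is a simplified sketch using plain str.split instead of NLTK tokenization, with made-up sample tweets:

```python
from collections import Counter

# hypothetical sample data in the same shape as the question's JSON
tweets = [
    {"text": "free skins free chests today", "category": ["promo"]},
    {"text": "patch notes and new skins", "category": ["promo"]},
    {"text": "server maintenance today", "category": ["dont_care"]},
]

# accumulate word counts over tweets in the target (non-"dont_care") categories
master = Counter()
for tweet in tweets:
    if tweet["category"][0] != "dont_care":
        master += Counter(tweet["text"].lower().split())

# top-3 words become the numeric ARFF attributes, plus a nominal class attribute
attrs = [(word, "INTEGER") for word, _ in master.most_common(3)]
attrs.append(("class", ["promo", "dont_care"]))

# each tweet becomes a row of per-attribute counts followed by its class label
rows = []
for tweet in tweets:
    vec = Counter(tweet["text"].lower().split())
    rows.append([vec[w] for w, _ in attrs[:-1]] + [tweet["category"][0]])
```

Feeding attrs and rows into a dict with relation/description keys and calling arff.dumps, as the full script does, yields the ARFF text.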
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50474169
