Q: Extracting metadata that contains numbers with TensorFlow

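Stack Overflow user

Asked on 2020-02-26 15:57:34

Answers: 1, Views: 312, Followers: 0, Votes: 0

I am new to ML/DL, but I am looking for a way to extract metadata from text, and I think ML might be a good solution.

Goal

Input: sentences containing a field descriptor and one or more values, for example: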

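Non-current assets 5 675 5 512 4 789 4 586

Cash and cash equivalents 909 861 912 630

Inventories, trade and other receivables and other current assets 3 756 2 998 2 864 2 834

Total assets 10 340 9 372 8 565 8 051

Equity 5 649 4 560 2 365 1 969

Non-current liabilities 2 438 2 403 3 270 2 407

Current liabilities 2 253 2 409 2 931 3 675

I have done some research and know that words need to be embedded (using Word2Vec or something similar). But how are numbers handled?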

Output: Tuple {field: value}

{non_current_assets: 5675}

{cash_and_cash_equivalents: 909}

{total_assets: 10340}

{equity: 5649}

{non_current_liabilities: 2438}

{current_liabilities: 2253}

{inventories: 3756}

Questions

Can this be solved with ML? If so:
1. How should I format the input data?
2. Which algorithm is best suited for this?

python

tensorflow

machine-learning

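1 Answer

Stack Overflow user

Accepted answer

Answered on 2020-03-05 15:42:14

Your question is not entirely clear.

If all you have are strings made of text followed by numbers, and you want {text: number}, you should simply split each string at the first digit, without any ML. If instead your strings are embedded in longer text documents, a more complete example would make it easier to help.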
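The no-ML route can be sketched with a regular expression. This is only a minimal sketch: `split_label_and_values`, `to_field_name` and the `n_columns=4` default are hypothetical names and assumptions, not something from the question. In particular it assumes every line carries the same number of value columns and that all values on a line use the same number of space-separated digit groups, because "909 861" is otherwise ambiguous (two values or one).

```python
import re

# Label, optional whitespace, then a trailing run of space-separated digit groups.
LINE_RE = re.compile(r"^(.*?)\s*(\d+(?:\s+\d+)*)$")


def split_label_and_values(line, n_columns=4):
    """Split 'label 5 675 5 512 ...' into (label, [5675, 5512, ...]).

    Assumption: each line holds exactly n_columns values of equal width
    (same number of digit groups per value). Returns None otherwise.
    """
    m = LINE_RE.match(line.strip())
    if m is None or not m.group(1):
        return None
    label, groups = m.group(1), m.group(2).split()
    if len(groups) % n_columns != 0:
        return None  # cannot group evenly under the stated assumption
    size = len(groups) // n_columns
    values = [int("".join(groups[i:i + size]))
              for i in range(0, len(groups), size)]
    return label, values


def to_field_name(label):
    """Turn 'Non-current assets' into 'non_current_assets'."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


lines = [
    "Non-current assets 5 675 5 512 4 789 4 586",
    "Cash and cash equivalents 909 861 912 630",
    "Total assets 10 340 9 372 8 565 8 051",
]
for line in lines:
    parsed = split_label_and_values(line)
    if parsed is not None:
        label, values = parsed
        # Keep only the first column, as in the desired output.
        print({to_field_name(label): values[0]})
```

Lines whose values have mixed widths (for example one value above and one below a thousand) are rejected rather than guessed; handling them would need extra knowledge about the table layout.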

If your sentences sit inside a longer text, for example:

text = " If your import is failing due to a missing package, you can use pip. Non-current assets 5 675 5 512 4 789 4 586. We also expect cash equivalents 909 861 912 630 in toal"

You can use part-of-speech tagging and chunking to detect the noun groups that precede the numbers.

Something like this: https://medium.com/@acrosson/extracting-names-emails-and-phone-numbers-5d576354baa

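import nltk
from nltk.corpus import stopwords

# One-time downloads of the NLTK resources used below
# (newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Drop English stop words (the match is case-sensitive, so 'If' and 'We' are kept)
stop = stopwords.words('english')
document = ' '.join([i for i in text.split() if i not in stop])

# Split into sentences, tokenize each sentence, then POS-tag the tokens
sentences = nltk.sent_tokenize(document)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]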

sentences is then:

[[('If', 'IN'),
  ('import', 'NN'),
  ('failing', 'VBG'),
  ('due', 'JJ'),
  ('missing', 'VBG'),
  ('package', 'NN'),
  (',', ','),
  ('use', 'NN'),
  ('pip', 'NN'),
  ('.', '.')],
[('Non-current', 'JJ'),
  ('assets', 'NNS'),
  ('5', 'CD'),
  ('675', 'CD'),
  ('5', 'CD'),
  ('512', 'CD'),
  ('4', 'CD'),
  ('789', 'CD'),
  ('4', 'CD'),
  ('586', 'CD'),
  ('.', '.')],
[('We', 'PRP'),
  ('also', 'RB'),
  ('expect', 'VBP'),
  ('cash', 'NN'),
  ('equivalents', 'NNS'),
  ('909', 'CD'),
  ('861', 'CD'),
  ('912', 'CD'),
  ('630', 'CD'),
  ('toal', 'NN')]]

You can then define a regex-based chunk grammar to detect nominal groups followed by numbers, for example:

grammar = """MATCH:{<JJ><NNS><CD>}""" #grammar would need to be completed
cp = nltk.RegexpParser(grammar)
for sentence in sentences:
  print(cp.parse(sentence))

(S
  If/IN
  import/NN
  failing/VBG
  due/JJ
  missing/VBG
  package/NN
  ,/,
  use/NN
  pip/NN
  ./.)
(S
  (MATCH Non-current/JJ assets/NNS 5/CD)
  675/CD
  5/CD
  512/CD
  4/CD
  789/CD
  4/CD
  586/CD
  ./.)
(S
  We/PRP
  also/RB
  expect/VBP
  cash/NN
  equivalents/NNS
  909/CD
  861/CD
  912/CD
  630/CD
  toal/NN)
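As the comment in the grammar says, `<JJ><NNS><CD>` is incomplete: it only catches the first digit group of the first sentence and misses "cash equivalents" entirely. One possible way to complete it is sketched below. The grammar `{<JJ>*<NN.*>+<CD>+}`, the helper `extract_fields` and the `n_columns=4` default are assumptions made for illustration, not part of the original answer. The sketch also assumes that all values in a chunk use the same number of digit groups, since the tagger alone cannot tell "909 861" apart from "909861".

```python
import nltk

# Hypothetical grammar: optional adjectives, one or more nouns, then a run of numbers.
GRAMMAR = "MATCH: {<JJ>*<NN.*>+<CD>+}"


def extract_fields(tagged_sentence, n_columns=4):
    """Return {field_name: [values]} for each MATCH chunk in a POS-tagged sentence.

    Assumption: each chunk holds n_columns values of equal width.
    """
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged_sentence)
    fields = {}
    for chunk in tree.subtrees(filter=lambda t: t.label() == "MATCH"):
        leaves = chunk.leaves()
        words = [w for w, tag in leaves if tag != "CD"]
        digits = [w for w, tag in leaves if tag == "CD" and w.isdigit()]
        if not digits or len(digits) % n_columns != 0:
            continue  # grouping is ambiguous under the stated assumption
        size = len(digits) // n_columns
        values = [int("".join(digits[i:i + size]))
                  for i in range(0, len(digits), size)]
        name = "_".join(words).lower().replace("-", "_")
        fields[name] = values
    return fields


# Tagged tokens copied from the output shown earlier, so no tagger download is needed here.
tagged = [('We', 'PRP'), ('also', 'RB'), ('expect', 'VBP'), ('cash', 'NN'),
          ('equivalents', 'NNS'), ('909', 'CD'), ('861', 'CD'), ('912', 'CD'),
          ('630', 'CD'), ('toal', 'NN')]
print(extract_fields(tagged))
```

In a full pipeline each element of `sentences` would be passed to `extract_fields`. Real reports will likely need a richer grammar (prepositions, conjunctions such as "cash and cash equivalents", commas inside labels).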

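Doing this end to end with TensorFlow is much harder if you are not an expert.

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/60417835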
