Extracting numbers from a string in Google BigQuery using regular expressions

Stack Overflow user
Asked on 2015-12-15 21:38:08
Answers: 2  Views: 9.8K  Followers: 0  Votes: 2

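I'd like to know whether it is possible to use a regular expression in BigQuery to extract all the numbers from a string.

I thought the query below would work, but it only returns the first hit. Is there a way to extract all of the hits?

My use case is that I basically want to get the largest number from a URL, since that is most likely the post_id I need to join on.

Here is an example of what I'm talking about: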

SELECT
  mystr,
  REGEXP_EXTRACT(mystr, r'(\d+)') AS nums
FROM
  (SELECT 'this is a string with some 666 numbers 999 in it 333' AS mystr),
  (SELECT 'just one number 123 in this one ' AS mystr),
  (SELECT '99' AS mystr),
  (SELECT 'another -2 example 99' AS mystr),
  (SELECT 'another-8766 example 99' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview' AS mystr)

The result I get from this is:

[
  {
    "mystr": "this is a string with some 666 numbers 999 in it 333",
    "nums": "666"
  },
  {
    "mystr": "just one number 123 in this one ",
    "nums": "123"
  },
  {
    "mystr": "99",
    "nums": "99"
  },
  {
    "mystr": "another -2 example 99",
    "nums": "2"
  },
  {
    "mystr": "another-8766 example 99",
    "nums": "8766"
  },
  {
    "mystr": "http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999",
    "nums": "2015"
  },
  {
    "mystr": "http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001",
    "nums": "2015"
  },
  {
    "mystr": "http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview",
    "nums": "2015"
  }
]
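
The difference between getting only the first hit and getting every hit can be illustrated outside BigQuery. The snippet below is a JavaScript analogy only, not a statement about how BigQuery evaluates REGEXP_EXTRACT; the variable names are made up for the illustration.

```javascript
var s = 'this is a string with some 666 numbers 999 in it 333';

// Without the global flag, match() stops at the first hit.
var first = s.match(/\d+/)[0];

// With the global flag, match() returns an array of every hit.
var all = s.match(/\d+/g);

console.log(first);
console.log(all.join(','));
```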

2 Answers

Stack Overflow user

Accepted answer

Posted on 2015-12-15 22:02:36

After some digging, I ended up with this solution:

SELECT
  mystr,
  GROUP_CONCAT(SPLIT(REGEXP_REPLACE(mystr, r'[^\d]+', ','))) AS nums
FROM
  (SELECT 'this is a string with some 666 numbers 999 in it 333' AS mystr),
  (SELECT 'just one number 123 in this one ' AS mystr),
  (SELECT '99' AS mystr),
  (SELECT 'another -2 example 99' AS mystr),
  (SELECT 'another-8766 example 99' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview' AS mystr)

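How it works:

  • I first use a regular expression to match anything that is not a digit and replace it with a comma,
  • then I use SPLIT to get the results,
  • empty results are discarded,
  • GROUP_CONCAT is only there to display the results.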
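
The steps above can be sketched outside BigQuery as follows. This is a minimal JavaScript illustration rather than BigQuery code: the helper name `extractNumbers` is hypothetical, and the explicit `filter` call stands in for the discarding of empty results that the answer describes.

```javascript
// Hypothetical helper mirroring the steps described in the answer.
function extractNumbers(str) {
  return str
    .replace(/[^\d]+/g, ',')  // 1. replace every run of non-digits with a comma
    .split(',')               // 2. split on the commas
    .filter(function (piece) { return piece !== ''; });  // 3. drop empty pieces
}

console.log(extractNumbers('another-8766 example 99').join(','));
```

Joining the array with commas plays the same display-only role that GROUP_CONCAT plays in the query.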

Votes: 9

Stack Overflow user

Posted on 2015-12-16 08:13:57

As you use regular expressions in BigQuery more and more, you will realize that, as of now, the implementation is quite limited:

BigQuery Regular expression functions

re2 Syntax

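So most likely you will soon end up doing something like the query below.

Please note: for your current specific example, the code below has absolutely no benefit over the simple solution provided by @Cybril.

This solution is aimed more at your potential needs in the near future.

It uses a JavaScript UDF, which gives you the full power of JavaScript's regexp implementation.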

BigQuery User-Defined Functions

SELECT mystr, MAX(number) as max_number FROM JS(
  // input table
  (SELECT mystr FROM
    (SELECT 'this is a string with some 666 numbers 999 in it 333' AS mystr),
    (SELECT 'just one number 123 in this one ' AS mystr),
    (SELECT '99' AS mystr),
    (SELECT 'another -2 example 99' AS mystr),
    (SELECT 'another-8766 example 99' AS mystr),
    (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999' AS mystr),
    (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' AS mystr),
    (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview' AS mystr)
  ) ,
  // input columns
    mystr,
  // output schema
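  "[
  {name: 'mystr', type: 'string'},
  {name: 'number', type: 'integer'}
  ]",
  // function
  "function(r, emit){
    /* match() returns null when the string has no digits, so fall back to an empty array */
    var numbers = r.mystr.match(/\d+/g) || [];
    for (var i = 0; i < numbers.length; i++) {
      emit({
        mystr: r.mystr,
        /* emit an integer so that MAX compares numerically rather than as text */
        number: parseInt(numbers[i], 10)
      });
    }
  }"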
)
GROUP BY 1

Of course, you could also move the logic that determines the maximum into the UDF to eliminate the extra grouping.
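
One way this might look, as a sketch only: the function below has the same `(r, emit)` shape as the UDF in the query but emits a single row per input string, carrying the largest number found. The name `maxNumberUdf` and the output field `max_number` are assumptions made for illustration, and the small `run` harness merely simulates `emit` locally; it is not part of BigQuery.

```javascript
// Hypothetical UDF body: emits one row per input string with its largest number.
function maxNumberUdf(r, emit) {
  // match() returns null when no digits are found, so fall back to an empty array
  var numbers = r.mystr.match(/\d+/g) || [];
  if (numbers.length === 0) {
    return;  // no numbers in this string: emit nothing
  }
  var max = numbers
    .map(function (n) { return parseInt(n, 10); })  // compare numerically, not as text
    .reduce(function (a, b) { return Math.max(a, b); });
  emit({ mystr: r.mystr, max_number: max });
}

// Local stand-in for BigQuery's emit, for trying the function outside BigQuery.
function run(mystr) {
  var rows = [];
  maxNumberUdf({ mystr: mystr }, function (row) { rows.push(row); });
  return rows;
}

console.log(run('another-8766 example 99'));
```

If a function like this were used in the query, the output schema would presumably need a `max_number` field in place of `number`, and the outer `MAX` and `GROUP BY` would no longer be needed.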

Votes: 1

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/34290723
