I'm trying to load some data into Pig:
Records:
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
Script used:
loadMoviesDs = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
USING PigStorage(':')
AS (Movieid:long, dummy1, Title:chararray, dummy2, Genere:chararray);
Output:
11,,American President, The (1995),,Comedy|Drama|Romance
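12,,Dracula,, Dead and Loving It (1995)
How do I handle the colon (:) after "Dracula"?
The colon splits the second column into two, and because we have only 3 columns in total, the last column of record 12 (Comedy|Horror) is not loaded.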
Posted on 2015-12-13 09:57:38
You can use REGEX_EXTRACT_ALL to achieve this.
Here is the code that does it:
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
AS (f1:chararray);
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(f1, '(.*)::(.*)::(.*)');
C = FOREACH B GENERATE FLATTEN($0);
D = FOREACH C GENERATE $0 AS (MovieID:long), $1 AS (Title:chararray), $2 AS (Genre:chararray);
DUMP D;
I got the following output (each record is a tuple). The ":" after "Dracula" is preserved.
(11,American President, The (1995),Comedy|Drama|Romance)
(12,Dracula: Dead and Loving It (1995),Comedy|Horror)
https://stackoverflow.com/questions/34248335
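The same idea can probably be expressed without a regular expression by splitting each line on the two-character separator. The sketch below is not from the original answer. It assumes that no title contains "::", that Pig's built-in STRSPLIT and TextLoader are available in your version, and it uses relation and field names (raw, parts, movies, line, MovieIdStr) that are only illustrative.

```pig
-- Read each record as one unsplit line of text.
raw = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
      USING TextLoader() AS (line:chararray);

-- Split on '::' into at most 3 parts, so a single ':' inside the title is left alone.
parts = FOREACH raw GENERATE
        FLATTEN(STRSPLIT(line, '::', 3))
        AS (MovieIdStr:chararray, Title:chararray, Genre:chararray);

-- Cast the id explicitly instead of relying on the schema declaration.
movies = FOREACH parts GENERATE (long)MovieIdStr AS MovieID, Title, Genre;

DUMP movies;
```

The limit of 3 is what keeps the split predictable: even if a genre field ever contained "::", it would stay in the last part rather than create extra fields.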