You've misinterpreted what he says in this presentation.
What he means is that the implementation of "average" in PIG is 300 lines of Java code, versus the 5 lines of Cascalog implemented with the predicate macro functionality. He wants to emphasize the power of composition.

P.S. Sorry for my bad English, I'm learning ;-)

I don't think it would be 300 lines of code in PIG. PIG already has a FILTER construct and an AVG eval function. The code in PIG would be something like:

<pre><code>A = LOAD 'student.txt' AS (name:chararray, age:int);
G = GROUP A ALL;
C = FOREACH G GENERATE AVG(A.age) AS avg_age;
B = FILTER A BY age &gt; C.avg_age;
</code></pre>

NOTE: I haven't tried this code as I don't have PIG set up on my machine.

In regular SQL it is trivial: select count(*) from TableName where age > (select avg(age) from TableName)
But it requires the underlying engine to detect that the inner select is an independent subquery (otherwise it will take forever).
It should be trivial to split it into two operators: one that selects the average age, and a second that counts the rows above it.
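
As a minimal sketch of both variants, the following uses Python's built-in sqlite3 module with an in-memory table. The function name `count_above_average` and the sample rows are made up for illustration; only the table name `TableName` and the `age` column come from the query above.

```python
import sqlite3


def count_above_average(rows):
    """Count rows whose age is above the average age, computed two ways."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TableName (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO TableName VALUES (?, ?)", rows)

    # Single statement: the inner select does not reference the outer row,
    # so it is an independent subquery.
    (one_shot,) = conn.execute(
        "SELECT count(*) FROM TableName "
        "WHERE age > (SELECT avg(age) FROM TableName)"
    ).fetchone()

    # Two operators: first select the average age, then count the rows above it.
    (avg_age,) = conn.execute("SELECT avg(age) FROM TableName").fetchone()
    (two_step,) = conn.execute(
        "SELECT count(*) FROM TableName WHERE age > ?", (avg_age,)
    ).fetchone()

    conn.close()
    return one_shot, two_step


# Hypothetical sample data: the average age is 31, so two rows qualify.
people = [("alice", 28), ("bob", 33), ("carol", 41), ("dave", 22)]
print(count_above_average(people))  # prints (2, 2)
```

Both forms return the same count; the two-step form just makes the "compute the average once" step explicit instead of relying on the engine to notice it.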

Choosing an aggregate operation which is already implemented in PIG probably confused the message.

One theme of those slides, as @marivas11 pointed out, is that composability of predicates is a powerful alternative to the approach of user-defined functions (UDFs) which are popular in other Hadoop abstractions.
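
To give a rough feel for what composition means here, the sketch below is plain Python, not Cascalog, and every name in it (`divide_aggs`, `average`, `older_than_average`) is hypothetical. It only assumes that an "average" can be assembled from existing sum and count building blocks rather than written as a dedicated function.

```python
from typing import Callable, Iterable, List, Tuple

Aggregator = Callable[[Iterable[float]], float]


def count(values: Iterable[float]) -> float:
    """Count the values in a stream."""
    return float(sum(1 for _ in values))


def total(values: Iterable[float]) -> float:
    """Sum the values in a stream."""
    return float(sum(values))


def divide_aggs(numerator: Aggregator, denominator: Aggregator) -> Aggregator:
    """Build a new aggregator by dividing the results of two existing ones."""
    def combined(values: Iterable[float]) -> float:
        items = list(values)  # materialize so both aggregators see every value
        return numerator(items) / denominator(items)
    return combined


# "average" exists only as a composition of total and count.
# It raises ZeroDivisionError on empty input.
average = divide_aggs(total, count)


def older_than_average(people: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return the (name, age) pairs whose age exceeds the average age."""
    avg_age = average(age for _, age in people)
    return [(name, age) for name, age in people if age > avg_age]


people = [("alice", 28), ("bob", 33), ("carol", 41), ("dave", 22)]
print(older_than_average(people))  # prints [('bob', 33), ('carol', 41)]
```

The point of the sketch is that `average` needed no new low-level code: swapping `total` or `count` for another aggregator yields a different derived operation through the same `divide_aggs` helper.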

The benefits of composability extend far beyond a relative difference in code volume: 

<ol>
<li>composability of predicates reduces "accidental complexity" as defined in Moseley/Marks 2006, which lowers software engineering costs</li>
<li>the concise code which results is also quite close to stated requirements; this follows almost directly from the practice of test-driven development (TDD) since Cascalog subqueries effectively become test statements -- the <a href="http://sritchie.github.com/2012/01/22/cascalog-testing-20.html" rel="nofollow">Cascalog-Midje</a> addition of facts and mocks by Sam Ritchie is quite good </li>
<li>getting rid of UDFs relieves a very troublesome problem on Data teams which must develop complex workflows: crossing a language boundary from Java to Pig's DML and back to Java implies that exception handling, notifications, and other instrumentation become significantly more difficult -- especially for large-scale apps, which are difficult to troubleshoot anyway on a large cluster... in Cascalog, all the extensions stay within the same language (even the Leiningen build script is in Clojure) so the compiler has a complete view of the workflow and can infer problems earlier than PIG.</li>
</ol>

The last point is subtle but translates to $$ in practice. In PIG, you won't discover a number of problems until your app is running on the cluster. For a large-scale app, that implies burning money to test for bugs which could have been inferred at compile time or on the Hadoop client prior to submission.

In <a href="http://www.slideshare.net/nathanmarz/cascalog-at-strange-loop" rel="nofollow">this presentation</a>, on slides 36 and 37, the author of Cascalog asserts that, given a data set of names and ages like:
[name age]
the query to return all the results that are greater than the average age is 300 lines of PIG.

Is this a valid assertion? How many lines of PIG is it really?

Or is the problem he's describing bigger than what I've described?

(Disclaimer - I'm a big fan of Nathan's work, of Clojure and Cascalog - I'm just trying to get some facts straight).

Clojure Hadoop - 5 Lines of Cascalog equivalent to 300 lines of PIG?
