
Extract whole text from different tags and from outside tags

Stack Overflow user
Asked 2022-08-24 15:16:43
1 answer · 69 views · 0 followers · 0 votes

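I want to extract all the text from README files that I have already scraped from GitHub. Some of the text sits between HTML tags, but a lot of it is outside any tag. The tags vary from file to file because these are different READMEs, and the authors follow no particular convention. I want to extract the text inside the tags and also the text outside them.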

Example:

/*
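* Note: this line of code is too long, so the system commented it out automatically and does not highlight it. One-click copy removes the system comment.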
* <p align="center"> <a href=" target="_blank" rel="noopener"><img alt="Black Hat Rust logo" src="./black_hat_rust_cover.png" height="300" /></a> <h1 align="center">Black Hat Rust</h1> <h3 align="center">Applied offensive security with the Rust programming language</h3> <h3 align="center"> <a href=" the book now!</a> </h3> </p> While the [Rust Book]( does an excellent job teaching **What is** Rust, a book about **Why** and **How** to Rust was missing. ## Summary Whether in movies or mainstream media, hackers are often romanticized: they are painted as black magic wizards, nasty criminals, or, in the worst cases, as thieves with a hood and a crowbar. In reality, the spectrum of the profile of the attackers is extremely large, from the bored teenager exploring the internet to sovereign State's armies as well as the unhappy former employee. What are the motivations of the attackers? How can they break seemingly so easily into any network? What do they do to their victims? We will put on our black hat and explore the world of offensive security, whether it be cyber attacks, cybercrimes, or cyberwar. Scanners, exploits, phishing toolkit, implants... From theory to practice, we will explore the arcane of offensive security and build our own offensive tools with the Rust programming language, Stack Overflow's most loved language for five years in a row. Which programming language allows to craft shellcodes, build servers, create phishing pages? Before Rust, none! Rust is the long-awaited one-size-fits-all programming language meeting all those requirements thanks to its unparalleled guarantees and feature set. Here is why. <!-- The security programming field is defined by its extremely large scope (from shellcodes to servers and web apps). Rust is the long-awaited one-size-fits-all programming language meeting all those requirements thanks to its unparalleled guarantees and feature set. Here is why. 
Rust is turning a new page in the history of programming languages by providing unparalleled guarantees and features, whether it be for defensive or offensive security. I will venture to say that Rust is the long awaited one-size-fits-all programming language. Here is why. --> Free Updates and DRM Free, of course :) ## Who this book is for This is NOT a 1000th tutorial about sqlmap and Metasploit, nor will it teach you the fundamentals of programming. Instead, it's a from-theory-to-practice guide and you may enjoy it if any of the following: - You keep screaming "show me the code!" when reading about cyber attacks and malwares - You are a developer and want to learn security - You are a security engineer and want to learn Rust programming - You want to learn real-world and idiomatic rust practices - You believe that the best defense is thinking like an attacker - You learn by building and love to look under the hood - You value simplicity and pragmatism - You develop your own tools and exploits with Python, Ruby, C, Java... - You want to learn real-world offensive security, not just pentesting - You want to start making money with bug bounty programs - You prefer getting things done over analysis paralysis But I repeat, this book is NOT a computer science book. <h3> <a href=" the book now!</a> </h3> ## Table of contents #### 1 - Introduction ### Part I: Reconnaissance #### 2 - Multi-threaded attack surface discovery How to perform effective reconnaissance? In this chapter, we will build a multi-threaded scanner in order to automate the mapping of the target. #### 3 - Going full speed with async Unfortunately, when a program spends most of its time in I/O operations, multi-threading is not a panacea. We will learn how async makes Rust code really, really fast and refactor our scanner to async code. 
#### 4 - Adding modules with Trait objects We will add more heterogeneous modules to our scanner and will learn how Rust's type system helps create properly designed large software projects. #### 5 - Crawling the web for OSINT Leveraging all we learned previously, we will build an extremely fast web crawler to help us find the needles in the haystack the web is. ### Part II: Exploitation #### 6 - Finding vulnerabilities Once the external reconnaissance performed, it's time to find entry points. In this chapter we will learn how automated fuzzing can help us to find vulnerabilities that can be exploited to then gain access to our target's systems. #### 7 - Exploit development Rust may not be as fast as python when it comes to iterating on quick scripts such as exploits, but as we will see, its powerful type and modules system make it nonetheless a weapon of choice. #### 8 - Writing shellcodes in Rust Shellcode development is an ungrateful task. Writing assembly by hand is definitely not sexy. Fortunately for us, Rust, one more time, got our back! In this chapter we will learn how to write shellcodes in plain Rust with no_std. #### 9 - Phishing with WebAssembly When they can't find exploitable hardware or software vulnerability, attackers usually fall back to what is often the weakest link in the chain: Humans. Again, Rust comes handy and will let us create advanced phishing pages by compiling to WebAssembly. ### Part III: Implant development #### 10 - A modern RAT A RAT (for Remote Access Tool), also known as implant or beacon, is a kind of software used to perform offensive operations on a target's machines. In this chapter we will build our own RAT communicating to a remote server and database. #### 11 - Securing communications with end-to-end encryption The consequences of our own infrastructure being compromised or seized can be disastrous. 
We will add end-to-end encryption to our RAT's communication in order to secure its communications and avoid leaving traces on our servers. #### 12 - Going multi-platforms Today's computing landscape is extremely fragmented. From Windows to macOS, we can't target only one Operating System to ensure the success of our operations. In this section we will see how Rust's ecosystem is extremely useful when it comes to cross-compilation. #### 13 - Turning into a worm to increase reach Once the initial targets compromised, we will capitalize on Rust's excellent reusability to incorporate some parts of our initial scanner to turn our RAT into a worm and reach more targets only accessible from the target's internal network. #### 14 Conclusion Now it's **your** turn to get things done! <h3> <a href=" the book now!</a> </h3> ## FAQ ### Can I pay with PayPal, Apple Pay or Google Pay? Yes! You can now buy Black Hat Rust with PayPal, Apple Pay or Google Pay. [Go Here to proceed]( <!-- ### The book is too expensive! Black Hat Rust is designed to save you a lot of time in your learning journey of Rust and offensive security. The maths are simple: if the book saves you 20 hours, and you are paid 25$ / hour, you just saved 25 * 18 = 450$ Of course, I expect that the book will save you even more time! --> ### What to do if I don't have a VAT number? A European VAT number is optional, and you can skip the field or leave it empty if asked. <!-- ## Getting started **Knowledge has no value if you don't practice!** Where to start? I've got you covered! I've extracted the security scanner we build in the book from chapters 2, 3, 4, and 7 into [phaser]( an automated attack surface mapper and vulnerability scanner. You can then contribute to your first Rust security project or participate in your first bug bounty program. --> ## Community Hey! Welcome you to the Black Hat Rustaceans gang! 
If you think something in the book or the code can be improved, please [open an issue]( Pull requests are also welcome :) ## Newsletter Want to stay updated? I'll write you once a week about avoiding complexity, hacking, and entrepreneurship. ** *I hate spam even more than you do. I'll never share your email, and you can unsubscribe at anytime. Also, there is no tracking or ads.* ## Changelog You'll find all the updates in the Changelog:
*/

I want to extract all of the text, so that I get:

While the [Rust Book]( does an excellent job teaching **What is** Rust, a book about **Why** and **How** to Rust was missing. ## Summary Whether in movies or mainstream media, hackers are often romanticized: they are painted as black magic wizards, nasty criminals, or, in the worst cases, as thieves with a hood and a crowbar. In reality, the spectrum of the profile of the attackers is extremely large, from the bored teenager exploring the internet to sovereign State's armies as well as the unhappy former employee. What are the motivations of the attackers? How can they break seemingly so easily into any network? What do they do to their victims? We will put on our black hat and explore the world of offensive security, whether it be cyber attacks, cybercrimes, or cyberwar. Scanners, exploits, phishing toolkit, implants... From theory to practice, we will explore the arcane of offensive security and build our own offensive tools with the Rust programming language, Stack Overflow's most loved language for five years in a row. Which programming language allows to craft shellcodes, build servers, create phishing pages? Before Rust, none! Rust is the long-awaited one-size-fits-all programming language meeting all those requirements thanks to its unparalleled guarantees and feature set. Here is why. The security programming field is defined by its extremely large scope (from shellcodes to servers and web apps). Rust is the long-awaited one-size-fits-all programming language meeting all those requirements thanks to its unparalleled guarantees and feature set. Here is why. Rust is turning a new page in the history of programming languages by providing unparalleled guarantees and features, whether it be for defensive or offensive security. I will venture to say that Rust is the long awaited one-size-fits-all programming language. Here is why. --> Free Updates and DRM Free, of course :)

And so on. I tried BeautifulSoup with get_text(), or just soup.text:

from bs4 import BeautifulSoup

def preprocess(text_all):
    soup = BeautifulSoup(text_all, "lxml")
    # text = soup.get_text()  # doesn't work
    text = soup.text  # doesn't work either

    # keep only alphabetic characters
    # text = re.sub("[^a-zA-Z]", " ", text)

    # split on whitespace and rejoin to collapse runs of whitespace
    tokens = text.split()

    return " ".join(tokens)

But it doesn't work; I only get the text inside the tags:

Black Hat Rust Applied offensive security with the Rust programming language FAQ Can I pay with PayPal Apple Pay or Google Pay Yes You can now buy Black Hat Rust with PayPal Apple Pay or Google Pay Go Here to proceed What to do if I don t have a VAT number A European VAT number is optional and you can skip the field or leave it empty if asked Community Hey Welcome you to the Black Hat Rustaceans gang If you think something in the book or the code can be improved please open an issue Pull requests are also welcome Newsletter Want to stay updated I ll write you once a week about avoiding complexity hacking and entrepreneurship I hate spam even more than you do I ll never share your email and you can unsubscribe at anytime Also there is no tracking or ads Changelog You ll find all the updates in the Changelog

1 Answer

Stack Overflow user

Accepted answer

Answered 2022-08-24 16:45:47

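Unlike .text, .get_text() accepts the parameters separator and strip. If you also need the text inside comments, you can strip the <!-- and --> markers before parsing: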

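# str.replace() returns a new string, so the result must be assigned back
html = html.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'lxml')
# join the text nodes with a space, then collapse any remaining whitespace runs
text = ' '.join(soup.get_text(' ', strip=True).split())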
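To make the difference between .text and .get_text(separator, strip=True) concrete, here is a minimal sketch on a made-up three-part snippet. The sample markup and the variable names are illustrative assumptions, not taken from the README, and it uses the built-in "html.parser" backend instead of "lxml" only so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two tags followed by loose Markdown text outside any tag.
sample = "<h1>Title</h1><h3>Subtitle</h3>\nLoose **markdown** text"

soup = BeautifulSoup(sample, "html.parser")

# .text concatenates the text nodes with no separator at all.
plain = soup.text

# get_text(" ", strip=True) strips each text node and joins them with a space.
spaced = soup.get_text(" ", strip=True)

print(repr(plain))   # 'TitleSubtitle\nLoose **markdown** text'
print(repr(spaced))  # 'Title Subtitle Loose **markdown** text'
```

In both cases the text outside the tags is present; the separator only decides whether neighbouring text nodes run together.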

Example

from bs4 import BeautifulSoup
html='''<p align="center">
  <a href="https://kerkour.com/black-hat-rust" target="_blank" rel="noopener"><img alt="Black Hat Rust logo" src="./black_hat_rust_cover.png" height="300" /></a>
  <h1 align="center">Black Hat Rust</h1>
  <h3 align="center">Applied offensive security with the Rust programming language</h3>
  <h3 align="center">
    <a href="https://kerkour.com/black-hat-rust">Buy the book now!</a>
  </h3>
</p>

While the [Rust Book](https://doc.rust-lang.org/book/) does an excellent job teaching **What is** Rust, a book about **Why** and **How** to Rust was missing.


## Summary

Whether in movies or mainstream media, hackers are often romanticized: they are painted as black magic wizards, nasty criminals, or, in the worst cases, as thieves with a hood and a crowbar.
In reality, the spectrum of the profile of the attackers is extremely large, from the bored teenager exploring the internet to sovereign State's armies as well as the unhappy former employee.

What are the motivations of the attackers? How can they break seemingly so easily into any network? What do they do to their victims?
We will put on our black hat and explore the world of offensive security, whether it be cyber attacks, cybercrimes, or cyberwar.
Scanners, exploits, phishing toolkit, implants... From theory to practice, we will explore the arcane of offensive security and build our own offensive tools with the Rust programming language, Stack Overflow's most loved language for five years in a row.


Which programming language allows to craft shellcodes, build servers, create phishing pages? Before Rust, none! Rust is the long-awaited one-size-fits-all programming language meeting all those requirements thanks to its unparalleled guarantees and feature set. Here is why.

<!--
The security programming field is defined by its extremely large scope (from shellcodes to servers and web apps). Rust is the long-awaited one-size-fits-all programming language meeting all those requirements thanks to its unparalleled guarantees and feature set. Here is why.


Rust is turning a new page in the history of programming languages by providing unparalleled guarantees and features, whether it be for defensive or offensive security. I will venture to say that Rust is the long awaited one-size-fits-all programming language. Here is why. -->

Free Updates and DRM Free, of course :)
'''

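# str.replace() returns a new string, so the result must be assigned back
html = html.replace('<!--', '').replace('-->', '')

soup = BeautifulSoup(html, 'lxml')

print(' '.join(soup.get_text(' ', strip=True).split()))

Output when the html.replace() line is skipped (with it, the two commented-out paragraphs also appear, between "Here is why." and "Free Updates"):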

Black Hat Rust Applied offensive security with the Rust programming language Buy the book now! While the [Rust Book](https://doc.rust-lang.org/book/) does an excellent job teaching **What is** Rust, a book about **Why** and **How** to Rust was missing. ## Summary Whether in movies or mainstream media, hackers are often romanticized: they are painted as black magic wizards, nasty criminals, or, in the worst cases, as thieves with a hood and a crowbar. In reality, the spectrum of the profile of the attackers is extremely large, from the bored teenager exploring the internet to sovereign State's armies as well as the unhappy former employee. What are the motivations of the attackers? How can they break seemingly so easily into any network? What do they do to their victims? We will put on our black hat and explore the world of offensive security, whether it be cyber attacks, cybercrimes, or cyberwar. Scanners, exploits, phishing toolkit, implants... From theory to practice, we will explore the arcane of offensive security and build our own offensive tools with the Rust programming language, Stack Overflow's most loved language for five years in a row. Which programming language allows to craft shellcodes, build servers, create phishing pages? Before Rust, none! Rust is the long-awaited one-size-fits-all programming language meeting all those requirements thanks to its unparalleled guarantees and feature set. Here is why. Free Updates and DRM Free, of course :)
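As an alternative to editing the raw string, the comment handling can be sketched with BeautifulSoup's own Comment node type. This is a different technique from the string replacement in the answer, not the answerer's method. The function name extract_text, its keep_comments flag, and the sample markup are hypothetical, and "html.parser" is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def extract_text(markup, keep_comments=True):
    """Return all text in `markup`, optionally including HTML comment bodies.

    Hypothetical helper for illustration; not part of the original answer.
    """
    soup = BeautifulSoup(markup, "html.parser")

    # find_all returns a list, so the tree can be modified while looping over it.
    for node in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if keep_comments:
            # Turn the comment into an ordinary text node so get_text() sees it.
            node.replace_with(NavigableString(str(node)))
        else:
            # Drop the comment entirely.
            node.extract()

    # Join text nodes with a space, then collapse leftover whitespace runs.
    return " ".join(soup.get_text(" ", strip=True).split())


sample = "A <!-- hidden note --> B <b>C</b>"
print(extract_text(sample, keep_comments=True))   # A hidden note B C
print(extract_text(sample, keep_comments=False))  # A B C
```

Working on parsed nodes avoids touching any literal "<!--" or "-->" that might appear in ordinary text, at the cost of a few more lines than the string replacement.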
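The original page content is provided by Stack Overflow; translation support is provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link: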

https://stackoverflow.com/questions/73475721
