
Arabic PDF text extractor

Stack Overflow user

Asked 2018-06-05 23:14:12

1 answer, 1.2K views, 0 followers, score 1

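Is there a PDF text extractor API that can extract Arabic text from a PDF?

I am using itextpdf. It works fine when extracting English, but it does not extract Arabic text.

Here is my code for extracting text from the PDF: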

private String extractPDF(String path) throws IOException {

        String parsedText = "";
        PdfReader reader = new PdfReader(path);
        int n = reader.getNumberOfPages();
        for (int page = 0; page < n; page++) {
            parsedText = parsedText + PdfTextExtractor.getTextFromPage(reader, page + 1).trim() + "\n"; //Extracting the content from the different pages
        }
        reader.close();

        return parsedText;
}

This is the input PDF: arabic.pdf

Update:

I am now able to extract the Arabic text, but the order of the lines is not preserved. Here is my code:

private String extractPDF(String name) throws IOException {

    PdfReader reader = new PdfReader(name);
    StringBuilder text = new StringBuilder();
    for (int i=1;i<=reader.getNumberOfPages();i++){
        String data = PdfTextExtractor.getTextFromPage(reader,i,new SimpleTextExtractionStrategy());
        text.append(Bidi.BidiText(data,1).getText());
    }
    return text.toString();
}

The text in the PDF is:

بسماللهالرحمنالرحيم

السلامعليكمورحمةاللهوبركاته

سبحانالله

The output is:

سبحانالله

السلامعليكمورحمةاللهوبركاته

بسماللهالرحمنالرحيم

Here is the code of the BidiText method:

public static BidiResult BidiText(String str, int startLevel)
{
    boolean isLtr = true;
    int strLength = str.length();
    if (strLength == 0)
    {
        return new BidiResult(str, false);
    }

    // get types, fill arrays

    char[] chars = new char[strLength];
    String[] types = new String[strLength];
    String[] oldtypes = new String[strLength];
    int numBidi = 0;

    for (int i = 0; i < strLength; ++i)
    {
        chars[i] = str.charAt(i);

        char charCode = str.charAt(i);
        String charType = "L";
        if (charCode <= 0x00ff)
        {
            charType = BaseTypes[charCode];
        }
        else if (0x0590 <= charCode && charCode <= 0x05f4)
        {
            charType = "R";
        }
        else if (0x0600 <= charCode && charCode <= 0x06ff)
        {
            charType = ArabicTypes[charCode & 0xff];
        }
        else if (0x0700 <= charCode && charCode <= 0x08AC)
        {
            charType = "AL";
        }

        if (charType.equals("R") || charType.equals("AL") || charType.equals("AN"))
        {
            numBidi++;
        }

        oldtypes[i] = types[i] = charType;
    }

    if (numBidi == 0)
    {
        return new BidiResult(str, true);
    }

    if (startLevel == -1)
    {
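        // Fraction of RTL characters; floating-point division is needed because
        // the integer quotient strLength / numBidi can never be below 0.3.
        if (((double) numBidi / strLength) < 0.3)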
        {
            startLevel = 0;
        }
        else
        {
            isLtr = false;
            startLevel = 1;
        }
    }

    int[] levels = new int[strLength];

    for (int i = 0; i < strLength; ++i)
    {
        levels[i] = startLevel;
    }



    String e = IsOdd(startLevel) ? "R" : "L";
    String sor = e;
    String eor = sor;


    String lastType = sor;
    for (int i = 0; i < strLength; ++i)
    {
        if (types[i].equals("NSM"))
        {
            types[i] = lastType;
        }
        else
        {
            lastType = types[i];
        }
    }

    lastType = sor;
    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (t.equals("EN"))
        {
            types[i] = (lastType.equals("AL")) ? "AN" : "EN";
        }
        else if (t.equals("R") || t.equals("L") || t.equals("AL"))
        {
            lastType = t;
        }
    }



    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (t.equals("AL"))
        {
            types[i] = "R";
        }
    }



    for (int i = 1; i < strLength - 1; ++i)
    {
        if (types[i].equals("ES") && types[i - 1].equals("EN") && types[i + 1].equals("EN"))
        {
            types[i] = "EN";
        }
        if (types[i].equals("CS") && (types[i - 1].equals("EN") || types[i - 1].equals("AN")) && types[i + 1].equals(types[i - 1]))
        {
            types[i] = types[i - 1];
        }
    }



    for (int i = 0; i < strLength; ++i)
    {
        if (types[i].equals("EN"))
        {
            // do before
            for (int j = i - 1; j >= 0; --j)
            {
                if (!types[j].equals("ET"))
                {
                    break;
                }
                types[j] = "EN";
            }
            // do after
            for (int j = i + 1; j < strLength; ++j)
            {
                if (!types[j].equals("ET"))
                {
                    break;
                }
                types[j] = "EN";
            }
        }
    }



    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (t.equals("WS") || t.equals("ES") || t.equals("ET") || t.equals("CS"))
        {
            types[i] = "ON";
        }
    }


    lastType = sor;
    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (t.equals("EN"))
        {
            types[i] = (lastType.equals("L")) ? "L" : "EN";
        }
        else if (t.equals("R") || t.equals("L"))
        {
            lastType = t;
        }
    }


    for (int i = 0; i < strLength; ++i)
    {
        if (types[i].equals("ON"))
        {

            int end = FindUnequal(types, i + 1, "ON");

            String before = sor;
            if (i > 0)
            {
                before = types[i - 1];
            }

            String after = eor;
            if (end + 1 < strLength)
            {
                after = types[end + 1];
            }
            if (!before.equals("L"))
            {
                before = "R";
            }
            if (!after.equals("L"))
            {
                after = "R";
            }
            if (before.equals(after))
            {
                SetValues(types, i, end, before);
            }
            i = end - 1; // reset to end (-1 so next iteration is ok)
        }
    }



    for (int i = 0; i < strLength; ++i)
    {
        if (types[i].equals("ON"))
        {
            types[i] = e;
        }
    }



    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (IsEven(levels[i]))
        {
            if (t.equals("R"))
            {
                levels[i] += 1;
            }
            else if (t.equals("AN") || t.equals("EN"))
            {
                levels[i] += 2;
            }
        }
        else
        { 
            if (t.equals("L") || t.equals("AN") || t.equals("EN"))
            {
                levels[i] += 1;
            }
        }
    }


    int highestLevel = -1;
    int lowestOddLevel = 99;
    int ii = levels.length;
    for (int i = 0; i < ii; ++i)
    {

        int level = levels[i];
        if (highestLevel < level)
        {
            highestLevel = level;
        }
        if (lowestOddLevel > level && IsOdd(level))
        {
            lowestOddLevel = level;
        }
    }



    for (int level = highestLevel; level >= lowestOddLevel; --level)
    {

        int start = -1;
        ii = levels.length;
        for (int i = 0; i < ii; ++i)
        {
            if (levels[i] < level)
            {
                if (start >= 0)
                {
                    chars = ReverseValues(chars, start, i);
                    start = -1;
                }
            }
            else if (start < 0)
            {
                start = i;
            }
        }
        if (start >= 0)
        {
            chars = ReverseValues(chars, start, levels.length);
        }
    }


    String result = "";
    ii = chars.length;
    for (int i = 0; i < ii; ++i)
    {

        char ch = chars[i];
        if (ch != '<' && ch != '>')
        {
            result += ch;
        }
    }

    return new BidiResult(result, isLtr);
}

android

itext

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-06-07 14:27:23

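Your sample PDF does not contain any text at all; it merely contains an embedded bitmap image of text.

When people talk about "extracting text from a PDF" (and about "text extractor APIs", the PdfTextExtractor class, and so on), they usually mean finding the text drawing instructions in the PDF (which a PDF viewer renders using a font program that is either embedded in the PDF or available on the current system) and determining their text content from their string arguments and the font encoding definitions.

In your sample there are no such text drawing instructions, only a bitmap drawing instruction and the bitmap itself, so text extraction from the document returns an empty string.

To retrieve the text shown in your document, you have to look for an OCR (optical character recognition) solution. A PDF library such as iText can help you extract the embedded bitmap image and forward it to the OCR solution, in case that solution does not support PDF directly but only bitmap formats.

If you also have PDF documents that display Arabic text using text drawing instructions with sufficient encoding information rather than bitmaps, you may need to post-process iText's text extraction output, as Amedee pointed out in a comment on your question, using a method like the one proposed in this answer. (Yes, it is written in C#, but porting it to Java is easy.)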
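One plausible reading of the reversed line order in the question's update is that the whole page string, line breaks included, goes through the reordering step at once, so a right-to-left reversal also flips the order of the lines. Below is a minimal sketch of post-processing the extraction output line by line instead. The class `LineWiseReorder` and both method names are hypothetical, and the plain character reversal in `reorderLine` is only a stand-in for whatever per-line reordering is actually used (for example the asker's `BidiText` method). It assumes the extractor returned each line in visual order.

```java
import java.text.Bidi;

final class LineWiseReorder {

    /** Reorders each line on its own, so the order of the lines is preserved. */
    static String reorderLines(String pageText) {
        // limit -1 keeps trailing empty lines instead of dropping them
        String[] lines = pageText.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(reorderLine(lines[i]));
        }
        return out.toString();
    }

    /**
     * Stand-in for a real bidi reordering step (an assumption, not iText API):
     * reverses a line only if it contains right-to-left characters.
     */
    static String reorderLine(String line) {
        char[] chars = line.toCharArray();
        if (chars.length == 0 || !Bidi.requiresBidi(chars, 0, chars.length)) {
            return line; // purely left-to-right text is left untouched
        }
        return new StringBuilder(line).reverse().toString();
    }
}
```

In the asker's loop, this would amount to passing the result of `PdfTextExtractor.getTextFromPage(reader, i, new SimpleTextExtractionStrategy())` to `LineWiseReorder.reorderLines(...)` rather than reordering the whole page string in one call.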

Score: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/50710156
