Java: co-occurrence of words in sentences

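I have a large set of sentences (10,000) stored in a file, one sentence per line. Across the whole set, I want to find which words occur together in a sentence, and how often.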

Example sentences:

"Proposal 201 has been accepted by the Chief today.", 
"Proposal 214 and 221 are accepted, as per recent Chief decision",     
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",     
"Proposal 214, ValueMania, has been accepted by the Chief."};

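I want to produce the following output. I should be able to pass three starting words as arguments to the program: "Chief, accepted, Proposal"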

Chief accepted Proposal            5
Chief accepted Proposal has        3
Chief accepted Proposal has been   3

... 
...
for all combinations.

I know the number of combinations could be very large.

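I have searched online but could not find anything. I have written some code, but I cannot get my head around the problem. Maybe someone who knows this domain can help.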

ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();

try {
    String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
    for (String t : tmp) {
        String[] keys = t.split(" ");
        String[] uniqueKeys;
        int count = 0;
        System.out.println(t);
        uniqueKeys = getUniqueKeys(keys);
        for (String key : uniqueKeys) {
            if (null == key) {
                break;
            }
            for (String s : keys) {
                if (key.equals(s)) {
                    count++;
                }
            }
            System.out.println("Count of [" + key + "] is : " + count);
            count = 0;
        }
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

private static String[] getUniqueKeys(String[] keys) {
    String[] uniqueKeys = new String[keys.length];

    uniqueKeys[0] = keys[0];
    int uniqueKeyIndex = 1;
    boolean keyAlreadyExists = false;

    for (int i = 1; i < keys.length; i++) {
        // Compare only against the slots that have been filled so far.
        for (int j = 0; j < uniqueKeyIndex; j++) {
            if (keys[i].equals(uniqueKeys[j])) {
                keyAlreadyExists = true;
            }
        }

        if (!keyAlreadyExists) {
            uniqueKeys[uniqueKeyIndex] = keys[i];
            uniqueKeyIndex++;
        }
        keyAlreadyExists = false;
    }
    return uniqueKeys;
}

Can anyone help me code this?


1 answer

  1. # Answer 1

    You can apply standard information retrieval data structures, in particular an inverted index. Here is how to do it.

    Consider your original sentences. Number them with integer identifiers, like so:

    1. "Proposal 201 has been accepted by the Chief today.",
    2. "Proposal 214 and 221 are accepted, as per recent Chief decision",
    3. "This proposal has been accepted by the Chief.",
    4. "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
    5. "Proposal 214, ValueMania, has been accepted by the Chief."

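    For every pair of words encountered in a sentence, add the pair to an inverted index that maps it to a set (a collection of unique items) of sentence identifiers. For a sentence of N words there are N-choose-2, i.e. N(N-1)/2, pairs.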

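    A suitable Java data structure would be Map<String, Map<String, Set<Integer>>>. Order the words of each pair alphabetically, so that the pair "has" and "Proposal" appears only as ("has", "Proposal") and never as ("Proposal", "has").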

    This map will contain the following:

    "has", "Proposal"  > Set(1, 5)
    "accepted", "Proposal"  > Set(1, 2, 5)
    "accepted", "has"  > Set(1, 3, 5)
    etc.
    

    For example, the word pair "has" and "Proposal" maps to the set (1, 5), which means the two words occur together in sentences 1 and 5.
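As a rough illustration of the index described above, here is a minimal sketch. The class name `PairIndexSketch`, the `tokenize` helper, and the comparator `ORDER` are made up for this example, and two details are assumptions rather than part of the answer: words are split on anything that is not a letter or digit, and case is preserved ("Proposal" and "proposal" count as different words, which is what the sets listed above imply). Pairs are ordered alphabetically ignoring case, because Java's natural String order would sort "Proposal" before "has".

```java
import java.util.*;

class PairIndexSketch {
    // Alphabetical order ignoring case; natural order only breaks ties
    // so that "Proposal" and "proposal" remain distinct words.
    static final Comparator<String> ORDER =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.naturalOrder());

    // The five example sentences; sentence ids are 1-based positions in this list.
    static final List<String> SENTENCES = Arrays.asList(
            "Proposal 201 has been accepted by the Chief today.",
            "Proposal 214 and 221 are accepted, as per recent Chief decision",
            "This proposal has been accepted by the Chief.",
            "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
            "Proposal 214, ValueMania, has been accepted by the Chief.");

    // Assumed tokenizer: split on runs of non-letter, non-digit characters.
    static List<String> tokenize(String sentence) {
        List<String> words = new ArrayList<>();
        for (String w : sentence.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // first word -> second word -> ids of the sentences containing both.
    static Map<String, Map<String, Set<Integer>>> buildIndex(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        for (int i = 0; i < sentences.size(); i++) {
            int id = i + 1;
            // Distinct words of this sentence, in pair order.
            TreeSet<String> distinct = new TreeSet<>(ORDER);
            distinct.addAll(tokenize(sentences.get(i)));
            List<String> words = new ArrayList<>(distinct);
            for (int a = 0; a < words.size(); a++) {
                for (int b = a + 1; b < words.size(); b++) {
                    index.computeIfAbsent(words.get(a), k -> new HashMap<>())
                         .computeIfAbsent(words.get(b), k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Set<Integer>>> index = buildIndex(SENTENCES);
        System.out.println(index.get("has").get("Proposal"));      // [1, 5]
        System.out.println(index.get("accepted").get("Proposal")); // [1, 2, 5]
        System.out.println(index.get("accepted").get("has"));      // [1, 3, 5]
    }
}
```

If the counts should ignore case, as the question's expected output of 5 for "Chief accepted Proposal" suggests, lower-casing each word in `tokenize` would be one way to do it.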

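    Now suppose you want the number of co-occurrences of the words in the list "accepted", "has", and "Proposal". Generate all pairs from this list and intersect their respective sets (using Java's Set.retainAll()). The final result here is the set (1, 5). Its size is 2, which means two sentences contain "accepted", "has", and "Proposal".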
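The intersection step can be sketched as follows. To keep the example short, the pair index is filled by hand with just the three entries listed earlier. The names `CoOccurrenceSketch`, `sampleIndex`, `put`, and `sentencesContainingAll` are hypothetical, and the sketch assumes the query holds at least two words (with fewer it returns an empty set).

```java
import java.util.*;

class CoOccurrenceSketch {
    // Same pair order as the index: alphabetical ignoring case, ties by natural order.
    static final Comparator<String> ORDER =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.naturalOrder());

    // Adds one (first, second) -> sentence ids entry to the index.
    static void put(Map<String, Map<String, Set<Integer>>> index,
                    String first, String second, Integer... ids) {
        index.computeIfAbsent(first, k -> new HashMap<>())
             .computeIfAbsent(second, k -> new TreeSet<>())
             .addAll(Arrays.asList(ids));
    }

    // Hand-built index holding only the three pairs from the example.
    static Map<String, Map<String, Set<Integer>>> sampleIndex() {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        put(index, "has", "Proposal", 1, 5);
        put(index, "accepted", "Proposal", 1, 2, 5);
        put(index, "accepted", "has", 1, 3, 5);
        return index;
    }

    // Intersects the id sets of every pair that can be formed from the query words.
    static Set<Integer> sentencesContainingAll(
            Map<String, Map<String, Set<Integer>>> index, List<String> query) {
        TreeSet<String> distinct = new TreeSet<>(ORDER);
        distinct.addAll(query);
        List<String> words = new ArrayList<>(distinct);
        Set<Integer> result = null;
        for (int a = 0; a < words.size(); a++) {
            for (int b = a + 1; b < words.size(); b++) {
                Set<Integer> ids = index
                        .getOrDefault(words.get(a), Collections.emptyMap())
                        .getOrDefault(words.get(b), Collections.emptySet());
                if (result == null) {
                    result = new TreeSet<>(ids); // copy, so the index is not modified
                } else {
                    result.retainAll(ids);       // set intersection
                }
            }
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        Set<Integer> ids = sentencesContainingAll(
                sampleIndex(), Arrays.asList("Proposal", "accepted", "has"));
        System.out.println(ids + " -> " + ids.size() + " sentences"); // [1, 5] -> 2 sentences
    }
}
```

Copying the first set before calling `retainAll` matters, because `retainAll` modifies the set it is called on and would otherwise corrupt the index.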

    To generate all pairs, simply iterate over the map as needed. To generate all word tuples of size N, iterate and use recursion as needed.
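One possible shape for that recursion is sketched below: it lists every tuple of a given size from a word list that is already in pair order. `TupleSketch`, `tuples`, and `extend` are names invented for this illustration, and the four-word vocabulary is only sample input.

```java
import java.util.*;

class TupleSketch {
    // Returns every combination of `size` words, preserving the input order.
    static List<List<String>> tuples(List<String> words, int size) {
        List<List<String>> out = new ArrayList<>();
        extend(words, size, 0, new ArrayList<>(), out);
        return out;
    }

    // Grows `current` by one word taken from position `start` onward, then recurses.
    private static void extend(List<String> words, int size, int start,
                               List<String> current, List<List<String>> out) {
        if (current.size() == size) {
            out.add(new ArrayList<>(current)); // store a copy of the finished tuple
            return;
        }
        for (int i = start; i < words.size(); i++) {
            current.add(words.get(i));
            extend(words, size, i + 1, current, out);
            current.remove(current.size() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("accepted", "been", "has", "Proposal");
        for (List<String> tuple : tuples(words, 3)) {
            System.out.println(tuple);
        }
        // [accepted, been, has]
        // [accepted, been, Proposal]
        // [accepted, has, Proposal]
        // [been, has, Proposal]
    }
}
```

Each generated tuple can then be counted by intersecting the sets of its pairs. A vocabulary of V words yields V-choose-N tuples, so for large inputs it is probably worth abandoning a branch as soon as its intersection becomes empty.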