Filtering strings efficiently in Java

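I am trying to build a mini search engine. My goal is to index a set of files in a hash map, but first I need to perform a few operations: converting upper-case letters to lower case, removing all unnecessary words (stop words), and removing every character other than a-z/A-Z. My current implementation looks like this: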

```
String article = "";
for (File file : dir.listFiles()) { //for each file (001.txt, 002.txt...)
    Scanner s = null;
    try {
        s = new Scanner(file);
        while (s.hasNext())
            article += s.next().toLowerCase(Locale.ROOT) + " "; //converting all characters to lower case
        article = currentWord.replaceAll(delimiters.get()," "); //removing punctuations (?, -, !, * etc...)
        String splittedWords = article.split(" "); //splitting each word into a string array
        for(int i = 0; i < splittedWords.length; i++) {
            s = new Scanner(stopwords);
            boolean flag = true;
            //comparing each word with all the stop words (words like a, the, already, these etc...)
            //taken from another big txt file and removing them, because we dont need to fill our map
            //with unnecessary words, to provide faster search times later on
            while(s.hasNextLine())
                if (splittedWords[i].equals(s.nextLine())) {
                    flag = false;
                    break;
                }
            //if current word in splittedWords array does not match any stop word, put it in the hashmap
            if(flag) map.put(splittedWords[i], file.getName());
        }
        s.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    s.close();
    System.out.println(file);
}
```

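This is just one block of my code and it may have missing parts; I have roughly explained my algorithm in the comments. Using the .contains method to check whether stopWords contains the currentWord is faster, but then a word like "death" is not mapped, because it contains "at", which is in the stop-word list. I am doing my best to make this more efficient, but I have not made much progress. Each file contains about 300 words and takes about 3 seconds to index, which is not ideal given that I have tens of thousands of files. Any ideas on how to improve the algorithm so that it runs faster?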

2 answers

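# Answer 1
A few things can be improved:

First, please do not use the new Scanner(File) constructor, because it reads the file in small chunks rather than through a large buffer. Small disk reads (especially on hard disks) are very inefficient. Instead, use for example a BufferedInputStream with a 64 KB (65536-byte) buffer: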
```
try (Scanner s = new Scanner(new BufferedInputStream(new FileInputStream(f), 65536))) {
    // your code
}
```
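Second: your computer most likely has a multi-core CPU, so you can scan several files in parallel. To do that, you must make sure the map you use is thread-safe. Change the definition of the map to: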
```
Map<String,String> map = new ConcurrentHashMap<>();
```
You can then use the following code:
```
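// Files.list keeps the directory open, so close the stream with try-with-resources;
// it also throws IOException, which the enclosing method must handle or declare.
try (Stream<Path> files = Files.list(dir.toPath())) {
    files.parallel().forEach(f -> {
        try (Scanner s = new Scanner(new BufferedInputStream(Files.newInputStream(f), 65536))) {
            // your code
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
}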
```
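Depending on the number of CPU cores in your system, this processes several files at the same time. Especially when you have a large number of files, this will greatly reduce the run time of your program.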
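
To make the parallel version concrete, here is a minimal sketch that fills in the "your code" part. It rests on a few assumptions that are not in the question: the class and method names are made up, every run of non-letters is treated as a delimiter, and the map stores a set of file names per word instead of a single String, since a plain put would keep only one file name per word.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

public class ParallelIndexer {

    /** Maps each lower-cased word to the names of the files that contain it. */
    static Map<String, Set<String>> index(Path dir) throws IOException {
        Map<String, Set<String>> index = new ConcurrentHashMap<>();
        // Close the directory stream when all files have been processed
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(Files::isRegularFile)
                 .parallel()
                 .forEach(f -> indexFile(f, index));
        }
        return index;
    }

    static void indexFile(Path f, Map<String, Set<String>> index) {
        String name = f.getFileName().toString();
        try (Scanner s = new Scanner(new BufferedInputStream(Files.newInputStream(f), 65536))) {
            s.useDelimiter("[^A-Za-z]+"); // assumption: any run of non-letters separates words
            while (s.hasNext()) {
                String word = s.next().toLowerCase(Locale.ROOT);
                // computeIfAbsent is atomic on ConcurrentHashMap; newKeySet() is a thread-safe set
                index.computeIfAbsent(word, k -> ConcurrentHashMap.newKeySet()).add(name);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("docs"); // throwaway demo directory
        Files.write(dir.resolve("001.txt"), "Death and taxes!".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("002.txt"), "Taxes, again?".getBytes(StandardCharsets.UTF_8));
        System.out.println(index(dir).get("taxes")); // both file names, in no particular order
    }
}
```

Stop-word filtering is left out here to keep the sketch short; it would go just before the computeIfAbsent call.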

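Finally, your implementation is rather complicated. You build a new string from the Scanner's output and then split that string again. It is better to configure the Scanner to use the delimiters you want directly: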
```
try (Scanner s = new Scanner(....).useDelimiter("[ ,\\!\\-\\.\\?\\*]")) {
```
You can then use the tokens produced by the Scanner directly, without building the article string and splitting it again.
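
As a small illustration of reading tokens straight from the Scanner, the sketch below tokenizes an in-memory string. The class name is hypothetical, and the delimiter differs slightly from the one above: a trailing + is added so that a run such as ", " counts as one delimiter (a single-character delimiter can yield empty tokens), and \s is used so that tabs and line breaks also separate words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

public class DelimiterTokens {

    /** Splits text into lower-cased tokens using the Scanner's delimiter pattern. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        // Whitespace and the listed punctuation marks, one or more in a row, form one delimiter
        try (Scanner s = new Scanner(text).useDelimiter("[\\s,!\\-.?*]+")) {
            while (s.hasNext()) {
                out.add(s.next().toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Death, taxes - and the REST!"));
        // prints [death, taxes, and, the, rest]
    }
}
```

Each token can be lower-cased and checked as soon as it is read, so no intermediate article string or split array is needed.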
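
# Answer 2
Why are you implementing a search engine yourself?

For production I would recommend an existing solution, Apache Lucene, which fits your task exactly.

If you are just practising, there are a few standard points where your code can be improved:
1. Avoid string concatenation in a loop, such as article +=. It is better to create a regexp for words and pass it to the Scanner: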
```
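    Pattern p = Pattern.compile("[A-Za-z]+");
    try (Scanner s = new Scanner(file)) {
        String word;
        // findWithinHorizon(p, 0) returns the next match anywhere in the input, or null at the end.
        // hasNext(p) would stop at the first token that is not purely letters (e.g. "word,").
        while ((word = s.findWithinHorizon(p, 0)) != null) {
            word = word.toLowerCase(Locale.ROOT);
            ...
        }
    }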
```
2. Put all the stop words into a hash map and check each new word with the containsKey method
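
The stop-word check from the last point could look roughly like the sketch below. It uses a HashSet with contains rather than a HashMap with containsKey, which is the same idea when only membership matters; the class name, method name and sample words are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilter {

    /** Keeps only the words that are not exact matches of a stop word. */
    static List<String> withoutStopWords(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            // A hash lookup compares whole words, so "death" is not rejected because of "at"
            if (!stopWords.contains(w)) {
                kept.add(w);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // In a real program, load these once from the stop-word file before indexing starts
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "at", "the", "these"));
        List<String> words = Arrays.asList("the", "death", "at", "dawn");
        System.out.println(withoutStopWords(words, stopWords)); // prints [death, dawn]
    }
}
```

Loading the set once, for example with new HashSet<>(Files.readAllLines(path)), replaces the rescan of the stop-word file for every single word, and each lookup then takes roughly constant time.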
