Traversing all subdirectories of a given directory and counting word occurrences in a document

package javaClassHomework;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.Comparator;
import java.util.Scanner;
import java.util.TreeMap;
import java.util.TreeSet;

public class test3
{
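    // Example root directory: C:\Users\Administrator\Desktop\hhh
    public static void main(String[] args) throws IOException
    {
        Scanner sc = new Scanner(System.in);
        // Ask for a directory, then list every .txt file below it.
        File file = getfile();
        bianli(file);
        System.out.println("Enter the path of the document to analyze:");
        // nextLine() rather than next(), so that a path containing spaces is read whole.
        String name = sc.nextLine().trim();
        display(name);
    }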
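    // Recursively prints the path of every .txt file under the given directory.
    public static void bianli(File file) {
        File[] yue = file.listFiles();
        // listFiles() returns null when the directory cannot be read.
        if (yue == null) {
            return;
        }
        for (File fi : yue) {
            if (fi.isDirectory()) {
                bianli(fi);
            } else if (fi.getName().endsWith(".txt")) {
                System.out.println(fi.getPath());
            }
        }
    }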
    // Counts the words in the file at the given path and prints the most frequent ones.
    public static void display(String path) throws IOException {
        Scanner sc = new Scanner(System.in);
        TreeMap<String, Integer> hm = new TreeMap<>();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Every run of non-letter characters separates two words.
                for (String word : line.split("[^a-zA-Z]+")) {
                    if (!word.isEmpty()) {
                        hm.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        // Total number of words, used as the denominator for the percentages.
        int sum = 0;
        for (int n : hm.values()) {
            sum += n;
        }
        // Order words by descending count. Ties fall back to alphabetical order, so the
        // comparator returns 0 only for identical words and no word is dropped from the set.
        TreeSet<String> ts = new TreeSet<>(new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                int num = hm.get(b).compareTo(hm.get(a));
                return num != 0 ? num : a.compareTo(b);
            }
        });
        ts.addAll(hm.keySet());
        DecimalFormat df = new DecimalFormat("0.00%");
        System.out.println("Enter how many words to show:");
        int count = sc.nextInt();
        int q = 0;
        for (String s : ts) {
            if (q == count) {
                break;
            }
            q++;
            float bai = (float) hm.get(s) / sum;
            System.out.println(s + " " + hm.get(s) + " " + df.format(bai));
        }
        // Total word count of the document.
        System.out.println(sum);
    }
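    // Keeps prompting until the user enters the path of an existing directory.
    public static File getfile() {
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter the directory to scan:");
        while (true) {
            String line = sc.nextLine();
            File kk = new File(line);
            if (!kk.exists()) {
                System.out.println("That path does not exist; please try again.");
            } else if (kk.isFile()) {
                System.out.println("That path is a file, not a directory; please try again.");
            } else {
                return kk;
            }
        }
    }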
}
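
The counting and ranking done in display can also be separated from the console input, which makes the logic easier to check. The following is only a minimal sketch under that assumption: the class name WordFrequencySketch and the methods countWords and topN are hypothetical names introduced here, not part of the original post. It keeps the same word definition (runs of ASCII letters, case-sensitive) and the same ordering (most frequent first, ties alphabetical).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration; not taken from the original post.
class WordFrequencySketch {

    // Counts each run of ASCII letters in the given lines (case-sensitive).
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("[^a-zA-Z]+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns up to n entries, most frequent first; ties are broken alphabetically.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.min(Math.max(n, 0), entries.size());
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(
                Arrays.asList("the cat and the dog", "a dog, a cat; the end"));
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        // Prints word, count and share of all words for the three most frequent words.
        for (Map.Entry<String, Integer> e : topN(counts, 3)) {
            System.out.printf("%s %d %.2f%%%n",
                    e.getKey(), e.getValue(), 100.0 * e.getValue() / total);
        }
    }
}
```

With the sample lines in main, the three words come out in the order the, a, cat. Sorting a list of map entries avoids the need for a TreeSet comparator that looks up counts in the map.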

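Approach:

Three methods: 1. getfile locates the directory from the path the user enters.

2. bianli recursively traverses the subdirectories and finds the documents with the .txt suffix.

3. display counts the words, reading the file through an input stream and printing the results.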
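
As a possible alternative to the hand-written recursion in step 2, the standard java.nio.file API can walk the directory tree. This is a sketch of a swapped-in technique, not the method the post uses: the class name TxtFinderSketch and the method findTxtFiles are hypothetical, and the demo builds its own temporary folder rather than relying on any real path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class for illustration; not taken from the original post.
class TxtFinderSketch {

    // Returns every regular file under root, at any depth, whose name ends with ".txt".
    static List<Path> findTxtFiles(Path root) throws IOException {
        // Files.walk keeps directory handles open, so the stream must be closed.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small temporary tree so the sketch runs without touching real folders.
        Path root = Files.createTempDirectory("txt-finder");
        Path sub = Files.createDirectories(root.resolve("a").resolve("b"));
        Files.write(root.resolve("top.txt"), "hello".getBytes());
        Files.write(sub.resolve("deep.txt"), "world".getBytes());
        Files.write(sub.resolve("skip.md"), "ignored".getBytes());

        // Lists the two .txt files and leaves out skip.md.
        for (Path p : findTxtFiles(root)) {
            System.out.println(p);
        }
    }
}
```

Returning the paths as a list, instead of printing them inside the traversal, lets the caller choose a file from the results rather than retyping its path.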

Original article: https://www.cnblogs.com/yanwenhui/p/11794805.html

Time: 2024-07-31 02:25:17
