Using Lucene's CharFilter to strip script blocks and HTML tags from text

1. Load the text to be cleaned from the database into the content field.



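    // lists and content are assumed to be fields of the enclosing class; test2() reads content.
    public void init() {
        SQLService sqlService = new SQLService();
        BaseDao bd = new BaseDao();
        // Fetch the rows whose title matches the pattern.
        String sql = "select * from t where title like '% 每天读一遍,舌头更无敌%'";
        lists = bd.getList(sql);
        // Take the content column of the first row as the text to clean.
        content = lists.get(0).get("content").toString();
//        System.out.println(content);
    }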


2. Use the character filters HTMLStripCharFilter and MappingCharFilter. Because these filters all extend Reader, they can be consumed just like any other Reader.


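    // Classes used: org.apache.lucene.analysis.CharFilter,
    //   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter,
    //   org.apache.lucene.analysis.charfilter.MappingCharFilter,
    //   org.apache.lucene.analysis.charfilter.NormalizeCharMap,
    //   java.io.IOException, java.io.StringReader
    public void test2() throws IOException {

        StringBuilder sb = new StringBuilder();
        // Strip HTML markup
        HTMLStripCharFilter htmlscript = new HTMLStripCharFilter(new StringReader(content));

        // Add a mapping filter, mainly to remove line breaks
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("\r", ""); // carriage return
        builder.add("\t", ""); // horizontal tab
        builder.add("\n", ""); // line feed
        CharFilter cs = new MappingCharFilter(builder.build(), htmlscript);

        // Read the filtered text in chunks, exactly as with any other Reader
        char[] buffer = new char[10240];
        int count;
        while ((count = cs.read(buffer)) != -1) {
            sb.append(buffer, 0, count);
        }
        cs.close();

//        String keywords = HanLP.extractKeyword(sb.toString(), 20).toString();
//        System.out.println(keywords);
    }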
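
The idea that a character filter is just a Reader wrapped around another Reader can be tried out without Lucene on the classpath. The following is a minimal JDK-only sketch, not Lucene's implementation: DropCharsReader and stripChars are hypothetical names made up for this illustration, and the class only drops single characters, whereas MappingCharFilter maps whole strings.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// JDK-only illustration; DropCharsReader is a hypothetical class, not part of Lucene.
class ReaderFilterSketch {

    /** Wraps another Reader and drops every character contained in `unwanted`. */
    static class DropCharsReader extends FilterReader {
        private final String unwanted;

        DropCharsReader(Reader in, String unwanted) {
            super(in);
            this.unwanted = unwanted;
        }

        @Override
        public int read() throws IOException {
            int c;
            // Skip unwanted characters until a wanted one or end of stream is reached.
            while ((c = in.read()) != -1 && unwanted.indexOf(c) >= 0) {
                // keep reading
            }
            return c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            int n = 0;
            while (n < len) {
                int c = read();
                if (c == -1) {
                    break;
                }
                buf[off + n++] = (char) c;
            }
            // Follow the Reader contract: -1 signals end of stream.
            return n == 0 ? -1 : n;
        }
    }

    /** Reads `input` through the filtering reader with the same chunked loop as test2(). */
    static String stripChars(String input, String unwanted) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = new DropCharsReader(new StringReader(input), unwanted)) {
            int count;
            while ((count = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "line oneline two"
        System.out.println(stripChars("line one\r\n\tline two\n", "\r\n\t"));
    }
}
```

Because the wrapper is itself a Reader, it can be stacked on top of another filtering Reader in the same way MappingCharFilter is stacked on HTMLStripCharFilter above.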


Time: 2025-01-13 10:11:13

