How to remove duplicate lines in a large text file?

How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent; assume the file is 10 times bigger than RAM.

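A straightforward solution is to store each line of input.txt in a HashSet. A set holds each value only once, so when reading a line, check whether it is already present in the set, and write it to output.txt only if it is not. Note that the set keeps every distinct line in memory, so this approach works only when the distinct lines fit in RAM, even if the file as a whole does not.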
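The check-then-store step can be done in a single call, because `Set.add` returns `false` when the value is already in the set. The following is a minimal sketch of that idea on an in-memory list; the class name `SeenSetDemo` and the method `keepFirstOccurrences` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SeenSetDemo {
    // Returns the lines in first-seen order, dropping later repeats.
    static List<String> keepFirstOccurrences(List<String> lines) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // add() returns false when the set already contains the line
            if (seen.add(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("alpha", "beta", "alpha", "gamma", "beta");
        System.out.println(keepFirstOccurrences(lines)); // prints [alpha, beta, gamma]
    }
}
```

Only the first occurrence of each line is kept, and the order of first occurrences is preserved.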

Java:

// Java program to remove duplicate
// lines from input.txt and
// save the output to output.txt

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.Set;

public class FileOperation
{
    public static void main(String[] args) throws IOException
    {
        // set of lines seen so far; all distinct lines must fit in memory
        Set<String> seen = new HashSet<String>();

        // try-with-resources flushes and closes both files,
        // even if an exception is thrown while reading
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             PrintWriter pw = new PrintWriter("output.txt"))
        {
            String line;

            // loop over each line of input.txt
            while ((line = br.readLine()) != null)
            {
                // add() returns true only if the line
                // was not already present in the set
                if (seen.add(line))
                    pw.println(line);
            }
        }

        System.out.println("File operation performed successfully");
    }
}
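The program above keeps every distinct line in memory, so it can fail when the distinct lines alone exceed RAM, which is possible for a file 10 times bigger than RAM. One common workaround, which is not part of the original solution, is to first split the file into bucket files by line hash and then deduplicate each bucket separately. The sketch below assumes UTF-8 text, that each bucket's distinct lines fit in memory, and that the output does not need to preserve the original line order; the class name `PartitionedDedup` and the bucket count of 64 are illustrative choices.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

class PartitionedDedup {
    // Hypothetical helper: splits the input into `buckets` temporary files
    // by line hash, then removes duplicates inside each bucket.
    static void dedupe(Path input, Path output, int buckets) throws IOException {
        Path tmpDir = Files.createTempDirectory("dedup");
        Path[] parts = new Path[buckets];
        BufferedWriter[] writers = new BufferedWriter[buckets];
        try {
            for (int i = 0; i < buckets; i++) {
                parts[i] = tmpDir.resolve("part-" + i + ".txt");
                writers[i] = Files.newBufferedWriter(parts[i], StandardCharsets.UTF_8);
            }
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Equal lines have equal hash codes, so they share a bucket
                    int b = Math.floorMod(line.hashCode(), buckets);
                    writers[b].write(line);
                    writers[b].newLine();
                }
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) w.close();
            }
        }

        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path part : parts) {
                // A fresh set per bucket: only one bucket's lines are in memory at a time
                Set<String> seen = new HashSet<>();
                try (BufferedReader in = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (seen.add(line)) {
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
                Files.delete(part);
            }
        }
        Files.delete(tmpDir);
    }

    public static void main(String[] args) throws IOException {
        dedupe(Paths.get("input.txt"), Paths.get("output.txt"), 64);
    }
}
```

Because all copies of a line land in the same bucket, deduplicating each bucket independently is enough. The cost is one extra pass over the data on disk, and the output lines come out grouped by bucket rather than in input order.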

  

Original source: https://www.cnblogs.com/lightwindy/p/9650718.html

Time: 2024-11-05 20:30:20
