Question
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Solution -- Bit Manipulation
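The straightforward idea is to store every 10-letter substring in a set and report any substring that is already present. Time complexity is O(n) and space complexity is O(n). Looking at the space cost in more detail: a Java char is 2 bytes, so each substring needs at least 20 bytes of character data, which gives roughly 20n bytes overall.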
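As a point of comparison, the set-of-substrings idea described above might look like the following minimal sketch. The class name NaiveRepeatedDna and the method name find are hypothetical, chosen only for this illustration; a LinkedHashSet is used so results come out in the order their first repeat is found.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class NaiveRepeatedDna {
    static List<String> find(String s) {
        Set<String> seen = new HashSet<>();            // every window seen so far
        Set<String> repeated = new LinkedHashSet<>();  // windows seen at least twice
        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            // Set.add returns false when the element was already present.
            if (!seen.add(window)) {
                repeated.add(window);
            }
        }
        return new ArrayList<>(repeated);
    }

    public static void main(String[] args) {
        // prints [AAAAACCCCC, CCCCCAAAAA]
        System.out.println(find("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

Every window is stored as a full String here, which is where the roughly 20 bytes per entry comes from.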
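If we instead represent each 10-letter substring as an integer, the space drops to about 4n bytes. Each of A, C, G, and T fits in 2 bits, so ten letters fit in 20 bits, which is well within a 4-byte int. (Both figures count only the raw data; String and boxed Integer objects add their own overhead.)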
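To make the integer representation concrete, here is a small sketch that packs a sequence into an int at 2 bits per letter and unpacks it again. The class DnaCodec and its methods are hypothetical names introduced for this example; the codes A=0, C=1, G=2, T=3 match the ones used in the solution.

```java
// Hypothetical helper class, for illustration only.
class DnaCodec {
    // 2-bit code per nucleotide: A=0, C=1, G=2, T=3.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Packs up to 16 letters into one int, first letter in the highest bits.
    static int encode(String seq) {
        int value = 0;
        for (int i = 0; i < seq.length(); i++) {
            value = (value << 2) | code(seq.charAt(i));
        }
        return value;
    }

    // Reverses encode for a sequence of the given length.
    static String decode(int value, int length) {
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = "ACGT".charAt(value & 3); // low 2 bits hold the last letter
            value >>>= 2;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(encode("AAAAACCCCC"));  // prints 341
        System.out.println(decode(341, 10));       // prints AAAAACCCCC
    }
}
```

Because the mapping is reversible, two different 10-letter windows never share the same integer, so comparing integers is as exact as comparing the strings.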
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> result = new ArrayList<String>();

        int len = s.length();
        if (len < 10) {
            return result;
        }

        // Each nucleotide gets a 2-bit code.
        Map<Character, Integer> map = new HashMap<Character, Integer>();
        map.put('A', 0);
        map.put('C', 1);
        map.put('G', 2);
        map.put('T', 3);

        Set<Integer> temp = new HashSet<Integer>();   // codes of windows seen so far
        Set<Integer> added = new HashSet<Integer>();  // codes already put in result

        int hash = 0;
        for (int i = 0; i < len; i++) {
            // Each of A, C, G, T fits in 2 bits, so shift left by 2 and append.
            hash = (hash << 2) + map.get(s.charAt(i));

            if (i >= 9) {
                // Keep only the low 20 bits, i.e. the last 10 letters.
                hash = hash & ((1 << 20) - 1);

                if (temp.contains(hash) && !added.contains(hash)) {
                    result.add(s.substring(i - 9, i + 1));
                    added.add(hash); // track added so each sequence is reported once
                } else {
                    temp.add(hash);
                }
            }
        }

        return result;
    }
}
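The shift, add, and mask step in the loop is what lets the window slide in constant time. The sketch below, using hypothetical names (RollingWindowDemo, slide, encode), checks on one example that updating the previous window's value gives the same result as encoding the next window from scratch. It assumes the input contains only A, C, G, and T.

```java
// Hypothetical demo class, for illustration only.
class RollingWindowDemo {
    static final int MASK = (1 << 20) - 1; // low 20 bits = 10 nucleotides

    // A=0, C=1, G=2, T=3; assumes c is one of these four letters.
    static int code(char c) {
        return "ACGT".indexOf(c);
    }

    // Encodes a whole sequence from scratch, 2 bits per letter.
    static int encode(String seq) {
        int value = 0;
        for (int i = 0; i < seq.length(); i++) {
            value = (value << 2) + code(seq.charAt(i));
        }
        return value;
    }

    // Slides a 10-letter window one position: append the new letter,
    // then mask away the 2 bits of the letter that left the window.
    static int slide(int hash, char next) {
        return ((hash << 2) + code(next)) & MASK;
    }

    public static void main(String[] args) {
        String s = "CCCCCAAAAAG";
        int first = encode(s.substring(0, 10));   // window "CCCCCAAAAA"
        int second = slide(first, s.charAt(10));  // window "CCCCAAAAAG"
        System.out.println(second == encode(s.substring(1, 11))); // prints true
    }
}
```

Without the mask, the bits of the departed letter would stay in the value and two identical windows at different positions could end up with different integers.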
Time: 2024-12-24 10:39:24