LeetCode: Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

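Naive approach: use two nested loops, the outer for(int i=0; i<=s.length()-10; i++) and the inner for(int j=i+1; j<=s.length()-10; j++), and compare s.substring(i, i+10) with s.substring(j, j+10) using equals(). If they are equal, add the substring to the result. The two loops visit O(N^2) pairs, and each substring() plus equals() costs time proportional to the window length (10), so the total is about 10 * N^2 character operations. That is quadratic in N, and the submission gets TLE (Time Limit Exceeded).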

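Method 2: a better approach uses a HashSet. Scan the string once in O(N) time, taking each 10-character substring; if it has already been seen, add it to the result. This needs O(N) extra space, however, so a large input string gets MLE (Memory Limit Exceeded).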
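Method 2 could be sketched roughly as follows. This is only an illustration under assumptions: the class name RepeatedDnaWithStringSet is hypothetical, and a second HashSet is used here to keep a sequence that occurs three or more times from being reported twice.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class name, used only for this sketch of method 2.
final class RepeatedDnaWithStringSet {
    static List<String> findRepeatedDnaSequences(String s) {
        Set<String> seen = new HashSet<>();      // every window met so far
        Set<String> repeated = new HashSet<>();  // windows met at least twice
        if (s == null) return new ArrayList<>();
        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            // add() returns false when the window is already in the set
            if (!seen.add(window)) {
                repeated.add(window);
            }
        }
        return new ArrayList<>(repeated);
    }

    public static void main(String[] args) {
        System.out.println(
            findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

Both sets store whole 10-character strings, which is where the O(N) memory cost mentioned above comes from. The order of the returned list is not specified, because it comes from a HashSet.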

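Optimal solution: build on method 2 with bit operations. The idea is to map each substring to an integer, then use shifts and bitwise AND on that integer to move from one substring to the next. Bit operations are cheap, so this saves running time, and the set now stores small integers instead of 10-character strings.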

First, give A, C, G, and T a binary encoding:

A -> 00

C -> 01

G -> 10

T -> 11

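Under this encoding, every 10-letter string corresponds to a single number that is 20 bits wide. An int is normally 4 bytes (32 bits), so one int is enough to represent a 10-letter string. For example: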

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000
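The mapping in the two examples above can be reproduced with a small helper. This is a sketch only: DnaEncoding, code, encode, and toBits are hypothetical names, and encode assumes the input contains only A, C, G, and T and is short enough to fit in an int (10 letters here).

```java
// Hypothetical helper class, used only to illustrate the 2-bit encoding.
final class DnaEncoding {
    // 2-bit code for one nucleotide; any other character is rejected.
    static int code(char c) {
        switch (c) {
            case 'A': return 0; // 00
            case 'C': return 1; // 01
            case 'G': return 2; // 10
            case 'T': return 3; // 11
            default:
                throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Packs the letters into an int, two bits per letter, first letter highest.
    static int encode(String seq) {
        int value = 0;
        for (int i = 0; i < seq.length(); i++) {
            value = (value << 2) | code(seq.charAt(i));
        }
        return value;
    }

    // Binary text of value, left-padded with zeros to the given width.
    static String toBits(int value, int width) {
        String bits = Integer.toBinaryString(value);
        StringBuilder sb = new StringBuilder();
        for (int i = bits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(bits).toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits(encode("ACGTACGTAC"), 20)); // prints 00011011000110110001
        System.out.println(toBits(encode("AAAAAAAAAA"), 20)); // prints 00000000000000000000
    }
}
```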

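Moving the window one character to the right corresponds to shifting the string's int value left by 2 bits, setting its lowest 2 bits to the code of the new character, and finally clearing the 2 bits that were pushed above the 20-bit window (bits 20 and 21), so that only the low 20 bits remain.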
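One step of this sliding update might look like the sketch below. The names RollingDnaWindow, roll, encode, and MASK are hypothetical, and "ACGT".indexOf is used as a compact stand-in for the A=0, C=1, G=2, T=3 table, which assumes the input contains no other characters.

```java
// Hypothetical class, used only to illustrate the shift-and-mask update.
final class RollingDnaWindow {
    // Twenty 1-bits: keeps exactly ten 2-bit codes.
    static final int MASK = (1 << 20) - 1;

    // Shift left by 2, put the new code in the low 2 bits, drop the overflow.
    static int roll(int window, char next) {
        int code = "ACGT".indexOf(next); // A=0, C=1, G=2, T=3
        return ((window << 2) | code) & MASK;
    }

    // Value of the last 10 letters of seq, built by rolling from 0.
    static int encode(String seq) {
        int window = 0;
        for (int i = 0; i < seq.length(); i++) {
            window = roll(window, seq.charAt(i));
        }
        return window;
    }

    public static void main(String[] args) {
        int w = encode("ACGTACGTAC");
        // Sliding one step to the right over ACGTACGTACG drops the leading A.
        System.out.println(roll(w, 'G') == encode("CGTACGTACG")); // prints true
    }
}
```

Because of the mask, each update costs a constant number of operations no matter how long the input is, which is what keeps the overall scan at O(N).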

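Cost analysis:

Time complexity: O(N)

Space complexity: a 20-bit binary number has at most 2^20 combinations, so the HashSet holds at most 2^20 = 1024 * 1024 entries, which is O(1)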

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        ArrayList<String> res = new ArrayList<String>();
        // a string of length <= 10 has at most one window, so nothing can repeat
        if (s==null || s.length()<=10) return res;
        // 2-bit code for each nucleotide
        HashMap<Character, Integer> dict = new HashMap<Character, Integer>();
        dict.put('A', 0);
        dict.put('C', 1);
        dict.put('G', 2);
        dict.put('T', 3);
        HashSet<Integer> set = new HashSet<Integer>(); // codes of the windows seen so far
        HashSet<String> result = new HashSet<String>(); // adding directly to the ArrayList could store duplicates, so collect in a HashSet first
        int hashcode = 0;
        for (int i=0; i<s.length(); i++) {
            if (i < 9) {
                // still filling the first window
                hashcode = (hashcode<<2) + dict.get(s.charAt(i));
            }
            else {
                hashcode = (hashcode<<2) + dict.get(s.charAt(i));
                hashcode &= (1<<20) - 1; // keep only the low 20 bits
                if (!set.contains(hashcode)) {
                    set.add(hashcode);
                }
                else {
                    // duplicate hashcode: take the matching substring and add it to result
                    String temp = s.substring(i-9, i+1);
                    result.add(temp);
                }
            }
        }
        for (String item : result) {
            res.add(item);
        }
        return res;
    }
}
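Since a window code is always below 2^20, the constant space bound can be made explicit by replacing the hash sets with fixed-size bit sets. The following is a sketch of that variation, not the solution above: the class name RepeatedDnaWithBitSet and the second bit set named reported are assumptions made for illustration, and "ACGT".indexOf replaces the HashMap lookup, which assumes the input contains only A, C, G, and T.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical variant: two bit sets of 2^20 bits each (128 KiB apiece).
final class RepeatedDnaWithBitSet {
    private static final int WINDOW = 10;
    private static final int MASK = (1 << (2 * WINDOW)) - 1;

    static List<String> findRepeatedDnaSequences(String s) {
        List<String> res = new ArrayList<>();
        if (s == null || s.length() <= WINDOW) return res;
        BitSet seen = new BitSet(1 << (2 * WINDOW));     // codes met at least once
        BitSet reported = new BitSet(1 << (2 * WINDOW)); // codes already in res
        int code = 0;
        for (int i = 0; i < s.length(); i++) {
            // shift in the next nucleotide and keep the low 20 bits
            code = ((code << 2) | "ACGT".indexOf(s.charAt(i))) & MASK;
            if (i < WINDOW - 1) continue; // first window not complete yet
            if (!seen.get(code)) {
                seen.set(code);
            } else if (!reported.get(code)) {
                reported.set(code);
                res.add(s.substring(i - WINDOW + 1, i + 1));
            }
        }
        return res;
    }

    public static void main(String[] args) {
        // prints [AAAAACCCCC, CCCCCAAAAA]
        System.out.println(
            findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

In this variant the result order is deterministic: each sequence is added at the moment its second occurrence is found.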

Naive approach:

import java.util.ArrayList;
import java.util.List;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        ArrayList<String> res = new ArrayList<String>();
        if (s==null || s.length()<=10) return res;
        for (int i=0; i<=s.length()-10; i++) {
            String cur = s.substring(i, i+10);
            // look for a later occurrence of the same 10-letter window
            for (int j=i+1; j<=s.length()-10; j++) {
                String comp = s.substring(j, j+10);
                if (cur.equals(comp)) {
                    // a sequence occurring 3+ times would otherwise be added more than once
                    if (!res.contains(cur)) res.add(cur);
                    break;
                }
            }
        }
        return res;
    }
}
Time: 2024-10-25 19:55:13
