[LeetCode] 187. Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]

In short: DNA is written as a string over the letters A, C, G, and T, and the task is to return every 10-letter substring that occurs more than once.

Solution 1: hash table + hash set

Solution 2: hash set

Solution 3: hash table + bit manipulation
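The bit-manipulation approach relies on the alphabet having only four letters, so each letter fits in 2 bits and a 10-letter window fits in 20 bits of an ordinary integer. The following is a minimal sketch of that encoding; the helper names `encode` and `decode` and the particular letter-to-code mapping are assumptions made for illustration and do not come from the original post.

```python
# Hypothetical helpers illustrating the 2-bit-per-letter encoding.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTER = "ACGT"  # LETTER[code] is the inverse of CODE


def encode(window):
    """Pack a DNA string into an integer, 2 bits per letter."""
    value = 0
    for ch in window:
        value = (value << 2) | CODE[ch]
    return value


def decode(value, length=10):
    """Unpack `length` letters from an integer produced by encode."""
    letters = []
    for _ in range(length):
        letters.append(LETTER[value & 3])  # lowest 2 bits hold the last letter
        value >>= 2
    return "".join(reversed(letters))
```

Because two windows are equal exactly when their 20-bit codes are equal, a set of integers can stand in for a set of 10-character strings, which is usually cheaper to hash and store.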

Java:

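// Requires java.util.ArrayList, java.util.HashSet, java.util.List, java.util.Set.
public List<String> findRepeatedDnaSequences(String s) {
    // Repeats go into a set so each sequence is reported only once.
    Set<String> result = new HashSet<>();
    // A string shorter than 10 letters has no 10-letter window.
    if (s == null || s.length() < 10)
        return new ArrayList<>();
    Set<String> temp = new HashSet<>();
    for (int i = 0; i < s.length() - 9; i++) {
        String x = s.substring(i, i + 10);
        if (temp.contains(x)) {
            result.add(x);
        } else {
            temp.add(x);
        }
    }
    return new ArrayList<>(result);
}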

Java:

public List<String> findRepeatedDnaSequences(String s) {
    Set<String> seen = new HashSet<>(), repeated = new HashSet<>();
    for (int i = 0; i + 9 < s.length(); i++) {
        String ten = s.substring(i, i + 10);
        if (!seen.add(ten))
            repeated.add(ten);
    }
    return new ArrayList(repeated);
}

Java: hash set + bit manipulation

public List<String> findRepeatedDnaSequences(String s) {
    Set<Integer> words = new HashSet<>();
    Set<Integer> doubleWords = new HashSet<>();
    List<String> rv = new ArrayList<>();
    // Map each letter to a 2-bit code: A = 0, C = 1, G = 2, T = 3.
    int[] map = new int[26];
    //map['A' - 'A'] = 0;
    map['C' - 'A'] = 1;
    map['G' - 'A'] = 2;
    map['T' - 'A'] = 3;

    for (int i = 0; i < s.length() - 9; i++) {
        // Pack the 10-letter window into 20 bits.
        int v = 0;
        for (int j = i; j < i + 10; j++) {
            v <<= 2;
            v |= map[s.charAt(j) - 'A'];
        }
        // Report a window the first time it is seen again.
        if (!words.add(v) && doubleWords.add(v)) {
            rv.add(s.substring(i, i + 10));
        }
    }
    return rv;
}
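The Java version above rebuilds the 20-bit code from scratch for every window, which costs 10 steps per position. A common variation keeps a running value instead: shift in the new letter and mask off the bits of the letter that left the window. The sketch below is one way to do that; the function name `find_repeated_dna_sequences` and the parameter `k` are assumptions for illustration, not part of the original post.

```python
def find_repeated_dna_sequences(s, k=10):
    """Return each length-k substring of s that occurs more than once."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    seen, reported, result = set(), set(), []
    value = 0
    for i, ch in enumerate(s):
        # Shift in the new letter and drop the letter that left the window.
        value = ((value << 2) | code[ch]) & mask
        if i < k - 1:
            continue  # the window is not full yet
        if value in seen:
            if value not in reported:
                reported.add(value)
                result.append(s[i - k + 1 : i + 1])
        else:
            seen.add(value)
    return result
```

Calling it on the example input from the problem statement should yield the two sequences listed there, in the order in which their second occurrences appear.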

Python:

class Solution(object):
    def findRepeatedDnaSequences(self, s):
        """
        :type s: str
        :rtype: List[str]
        """
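        # `seen` maps a window hash to True (seen once) or False (already reported).
        seen, rolling_hash, res = {}, 0, []

        for i in range(len(s)):
            # Each letter contributes its low 3 bits; the mask keeps 30 bits = 10 letters.
            rolling_hash = ((rolling_hash << 3) & 0x3fffffff) | (ord(s[i]) & 7)
            if rolling_hash not in seen:
                seen[rolling_hash] = True
            elif seen[rolling_hash]:
                # Second occurrence: report the window once.
                res.append(s[i - 9: i + 1])
                seen[rolling_hash] = False
        return res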
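The rolling hash above works because the low 3 bits of the ASCII codes of A, C, G, and T are all different and all nonzero, and because `0x3fffffff` keeps exactly 30 bits, that is 10 letters at 3 bits each. Since no letter maps to 0, a hash built from fewer than 10 letters has leading zero groups and cannot collide with the hash of a full window. A short check of these facts, written as a sketch with variable names chosen here for illustration:

```python
# Low 3 bits of each letter's ASCII code.
codes = {ch: ord(ch) & 7 for ch in "ACGT"}

# 10 letters at 3 bits each fill 30 bits, which is the mask used above.
window_bits = 3 * 10
mask = (1 << window_bits) - 1

# The four codes are distinct and none of them is zero.
all_distinct = len(set(codes.values())) == 4
none_zero = all(v != 0 for v in codes.values())
```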

Python:

import collections

class Solution(object):
    def findRepeatedDnaSequences2(self, s):
        """
        :type s: str
        :rtype: List[str]
        """
        if len(s) < 10:
            return []
        # Collect every 10-letter window, then keep those counted more than once.
        windows = [s[i:i + 10] for i in range(len(s) - 9)]
        return [k for k, v in collections.Counter(windows).items() if v > 1]

C++:

#include <string>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        unordered_set<int> seen;
        unordered_set<int> dup;
        vector<string> result;
        // Map each letter to a 2-bit code.
        vector<char> m(26);
        m['A' - 'A'] = 0;
        m['C' - 'A'] = 1;
        m['G' - 'A'] = 2;
        m['T' - 'A'] = 3;

        for (int i = 0; i + 10 <= (int)s.size(); ++i) {
            string substr = s.substr(i, 10);
            int v = 0;
            for (int j = i; j < i + 10; ++j) { // 20 bits fit in a 32-bit int
                v <<= 2;
                v |= m[s[j] - 'A'];
            }
            if (seen.count(v) == 0) { // not seen yet
                seen.insert(v);
            } else if (dup.count(v) == 0) { // seen once, not yet reported
                dup.insert(v);
                result.push_back(substr);
            } // otherwise already reported
        }
        return result;
    }
};

  

Original article: https://www.cnblogs.com/lightwindy/p/9770417.html

Time: 2024-09-29 17:19:49
