Find repeating strings in text python

I'm looking for a way to test whether or not a given string repeats itself for the entire string or not.

Examples:

[ '0045662100456621004566210045662100456621', # '00456621' '0072992700729927007299270072992700729927', # '00729927' '001443001443001443001443001443001443001443', # '001443' '037037037037037037037037037037037037037037037', # '037' '047619047619047619047619047619047619047619', # '047619' '002457002457002457002457002457002457002457', # '002457' '001221001221001221001221001221001221001221', # '001221' '001230012300123001230012300123001230012300123', # '00123' '0013947001394700139470013947001394700139470013947', # '0013947' '001001001001001001001001001001001001001001001001001', # '001' '001406469760900140646976090014064697609', # '0014064697609' ]

are strings which repeat themselves, and

[ '004608294930875576036866359447', '00469483568075117370892018779342723', '004739336492890995260663507109', '001508295625942684766214177978883861236802413273', '007518796992481203', '0071942446043165467625899280575539568345323741', '0434782608695652173913', '0344827586206896551724137931', '002481389578163771712158808933', '002932551319648093841642228739', '0035587188612099644128113879', '003484320557491289198606271777', '00115074798619102416570771', ]

are examples of ones that do not.

The repeating sections of the strings I'm given can be quite long, and the strings themselves can be 500 or more characters, so looping through each character trying to build a pattern then checking the pattern vs the rest of the string seems awful slow. Multiply that by potentially hundreds of strings and I can't see any intuitive solution.

I've looked into regexes a bit and they seem good for when you know what you're looking for, or at least the length of the pattern you're looking for. Unfortunately, I know neither.

How can I tell if a string is repeating itself and if it is, what the shortest repeating subsequence is?

asked Apr 6, 2015 at 23:02

JohnJohn

14.7k11 gold badges43 silver badges64 bronze badges

24

Here's a concise solution which avoids regular expressions and slow in-Python loops:

def principal_period(s): i = (s+s).find(s, 1, -1) return None if i == -1 else s[:i]

See the Community Wiki answer started by @davidism for benchmark results. In summary,

David Zhang's solution is the clear winner, outperforming all others by at least 5x for the large example set.

(That answer's words, not mine.)

This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s in (s+s)[1:-1], and for informing me of the optional start and end arguments of Python's string.find.

answered Apr 7, 2015 at 10:58

David ZhangDavid Zhang

4,2841 gold badge13 silver badges16 bronze badges

29

Here's a solution using regular expressions.

import re REPEATER = re.compile(r"(.+?)\1+$") def repeated(s): match = REPEATER.match(s) return match.group(1) if match else None

Iterating over the examples in the question:

examples = [ '0045662100456621004566210045662100456621', '0072992700729927007299270072992700729927', '001443001443001443001443001443001443001443', '037037037037037037037037037037037037037037037', '047619047619047619047619047619047619047619', '002457002457002457002457002457002457002457', '001221001221001221001221001221001221001221', '001230012300123001230012300123001230012300123', '0013947001394700139470013947001394700139470013947', '001001001001001001001001001001001001001001001001001', '001406469760900140646976090014064697609', '004608294930875576036866359447', '00469483568075117370892018779342723', '004739336492890995260663507109', '001508295625942684766214177978883861236802413273', '007518796992481203', '0071942446043165467625899280575539568345323741', '0434782608695652173913', '0344827586206896551724137931', '002481389578163771712158808933', '002932551319648093841642228739', '0035587188612099644128113879', '003484320557491289198606271777', '00115074798619102416570771', ] for e in examples: sub = repeated(e) if sub: print("%r: %r" % (e, sub)) else: print("%r does not repeat." % e)

... produces this output:

'0045662100456621004566210045662100456621': '00456621' '0072992700729927007299270072992700729927': '00729927' '001443001443001443001443001443001443001443': '001443' '037037037037037037037037037037037037037037037': '037' '047619047619047619047619047619047619047619': '047619' '002457002457002457002457002457002457002457': '002457' '001221001221001221001221001221001221001221': '001221' '001230012300123001230012300123001230012300123': '00123' '0013947001394700139470013947001394700139470013947': '0013947' '001001001001001001001001001001001001001001001001001': '001' '001406469760900140646976090014064697609': '0014064697609' '004608294930875576036866359447' does not repeat. '00469483568075117370892018779342723' does not repeat. '004739336492890995260663507109' does not repeat. '001508295625942684766214177978883861236802413273' does not repeat. '007518796992481203' does not repeat. '0071942446043165467625899280575539568345323741' does not repeat. '0434782608695652173913' does not repeat. '0344827586206896551724137931' does not repeat. '002481389578163771712158808933' does not repeat. '002932551319648093841642228739' does not repeat. '0035587188612099644128113879' does not repeat. '003484320557491289198606271777' does not repeat. '00115074798619102416570771' does not repeat.

The regular expression (.+?)\1+$ is divided into three parts:

  1. (.+?) is a matching group containing at least one (but as few as possible) of any character (because +? is non-greedy).

  2. \1+ checks for at least one repetition of the matching group in the first part.

  3. $ checks for the end of the string, to ensure that there's no extra, non-repeating content after the repeated substrings (and using re.match() ensures that there's no non-repeating text before the repeated substrings).

In Python 3.4 and later, you could drop the $ and use re.fullmatch() instead, or (in any Python at least as far back as 2.3) go the other way and use re.search() with the regex ^(.+?)\1+$, all of which are more down to personal taste than anything else.

answered Apr 6, 2015 at 23:18

Zero PiraeusZero Piraeus

53.5k27 gold badges149 silver badges159 bronze badges

7

You can make the observation that for a string to be considered repeating, its length must be divisible by the length of its repeated sequence. Given that, here is a solution that generates divisors of the length from 1 to n / 2 inclusive, divides the original string into substrings with the length of the divisors, and tests the equality of the result set:

from math import sqrt, floor def divquot(n): if n > 1: yield 1, n swapped = [] for d in range(2, int(floor(sqrt(n))) + 1): q, r = divmod(n, d) if r == 0: yield d, q swapped.append((q, d)) while swapped: yield swapped.pop() def repeats(s): n = len(s) for d, q in divquot(n): sl = s[0:d] if sl * q == s: return sl return None

EDIT: In Python 3, the / operator has changed to do float division by default. To get the int division from Python 2, you can use the // operator instead. Thank you to @TigerhawkT3 for bringing this to my attention.

The // operator performs integer division in both Python 2 and Python 3, so I've updated the answer to support both versions. The part where we test to see if all the substrings are equal is now a short-circuiting operation using all and a generator expression.

UPDATE: In response to a change in the original question, the code has now been updated to return the smallest repeating substring if it exists and None if it does not. @godlygeek has suggested using divmod to reduce the number of iterations on the divisors generator, and the code has been updated to match that as well. It now returns all positive divisors of n in ascending order, exclusive of n itself.

Further update for high performance: After multiple tests, I've come to the conclusion that simply testing for string equality has the best performance out of any slicing or iterator solution in Python. Thus, I've taken a leaf out of @TigerhawkT3 's book and updated my solution. It's now over 6x as fast as before, noticably faster than Tigerhawk's solution but slower than David's.

answered Apr 6, 2015 at 23:22

ShashankShashank

13.4k5 gold badges35 silver badges61 bronze badges

21

Here are some benchmarks for the various answers to this question. There were some surprising results, including wildly different performance depending on the string being tested.

Some functions were modified to work with Python 3 (mainly by replacing / with // to ensure integer division). If you see something wrong, want to add your function, or want to add another test string, ping @ZeroPiraeus in the Python chatroom.

In summary: there's about a 50x difference between the best- and worst-performing solutions for the large set of example data supplied by OP here (via this comment). David Zhang's solution is the clear winner, outperforming all others by around 5x for the large example set.

A couple of the answers are very slow in extremely large "no match" cases. Otherwise, the functions seem to be equally matched or clear winners depending on the test.

Here are the results, including plots made using matplotlib and seaborn to show the different distributions:

Corpus 1 (supplied examples - small set)

mean performance: 0.0003 david_zhang 0.0009 zero 0.0013 antti 0.0013 tigerhawk_2 0.0015 carpetpython 0.0029 tigerhawk_1 0.0031 davidism 0.0035 saksham 0.0046 shashank 0.0052 riad 0.0056 piotr median performance: 0.0003 david_zhang 0.0008 zero 0.0013 antti 0.0013 tigerhawk_2 0.0014 carpetpython 0.0027 tigerhawk_1 0.0031 davidism 0.0038 saksham 0.0044 shashank 0.0054 riad 0.0058 piotr

Corpus 2 (supplied examples - large set)

mean performance: 0.0006 david_zhang 0.0036 tigerhawk_2 0.0036 antti 0.0037 zero 0.0039 carpetpython 0.0052 shashank 0.0056 piotr 0.0066 davidism 0.0120 tigerhawk_1 0.0177 riad 0.0283 saksham median performance: 0.0004 david_zhang 0.0018 zero 0.0022 tigerhawk_2 0.0022 antti 0.0024 carpetpython 0.0043 davidism 0.0049 shashank 0.0055 piotr 0.0061 tigerhawk_1 0.0077 riad 0.0109 saksham

Corpus 3 (edge cases)

mean performance: 0.0123 shashank 0.0375 david_zhang 0.0376 piotr 0.0394 carpetpython 0.0479 antti 0.0488 tigerhawk_2 0.2269 tigerhawk_1 0.2336 davidism 0.7239 saksham 3.6265 zero 6.0111 riad median performance: 0.0107 tigerhawk_2 0.0108 antti 0.0109 carpetpython 0.0135 david_zhang 0.0137 tigerhawk_1 0.0150 shashank 0.0229 saksham 0.0255 piotr 0.0721 davidism 0.1080 zero 1.8539 riad

The tests and raw results are available here.

Non-regex solution:

def repeat(string): for i in range(1, len(string)//2+1): if not len(string)%len(string[0:i]) and string[0:i]*(len(string)//len(string[0:i])) == string: return string[0:i]

Faster non-regex solution, thanks to @ThatWeirdo (see comments):

def repeat(string): l = len(string) for i in range(1, len(string)//2+1): if l%i: continue s = string[0:i] if s*(l//i) == string: return s

The above solution is very rarely slower than the original by a few percent, but it's usually a good bit faster - sometimes a whole lot faster. It's still not faster than davidism's for longer strings, and zero's regex solution is superior for short strings. It comes out to the fastest (according to davidism's test on github - see his answer) with strings of about 1000-1500 characters. Regardless, it's reliably second-fastest (or better) in all cases I tested. Thanks, ThatWeirdo.

Test:

print(repeat('009009009')) print(repeat('254725472547')) print(repeat('abcdeabcdeabcdeabcde')) print(repeat('abcdefg')) print(repeat('09099099909999')) print(repeat('02589675192'))

Results:

009 2547 abcde None None None

answered Apr 6, 2015 at 23:20

TigerhawkT3TigerhawkT3

47.4k6 gold badges56 silver badges89 bronze badges

9

First, halve the string as long as it's a "2 part" duplicate. This reduces the search space if there are an even number of repeats. Then, working forwards to find the smallest repeating string, check if splitting the full string by increasingly larger sub-string results in only empty values. Only sub-strings up to length // 2 need to be tested since anything over that would have no repeats.

def shortest_repeat(orig_value): if not orig_value: return None value = orig_value while True: len_half = len(value) // 2 first_half = value[:len_half] if first_half != value[len_half:]: break value = first_half len_value = len(value) split = value.split for i in (i for i in range(1, len_value // 2) if len_value % i == 0): if not any(split(value[:i])): return value[:i] return value if value != orig_value else None

This returns the shortest match or None if there is no match.

answered Apr 7, 2015 at 2:42

davidismdavidism

113k24 gold badges367 silver badges323 bronze badges

0

The problem may also be solved in O(n) in worst case with prefix function.

Note, it may be slower in general case(UPD: and is much slower) than other solutions which depend on number of divisors of n, but usually find fails sooner, I think one of bad cases for them will be aaa....aab, where there are n - 1 = 2 * 3 * 5 * 7 ... *p_n - 1 a's

First of all you need to calculate prefix function

def prefix_function(s): n = len(s) pi = [0] * n for i in xrange(1, n): j = pi[i - 1] while(j > 0 and s[i] != s[j]): j = pi[j - 1] if (s[i] == s[j]): j += 1 pi[i] = j; return pi

then either there's no answer or the shortest period is

k = len(s) - prefix_function(s[-1])

and you just have to check if k != n and n % k == 0 (if k != n and n % k == 0 then answer is s[:k], else there's no answer

You may check the proof here (in Russian, but online translator will probably do the trick)

def riad(s): n = len(s) pi = [0] * n for i in xrange(1, n): j = pi[i - 1] while(j > 0 and s[i] != s[j]): j = pi[j - 1] if (s[i] == s[j]): j += 1 pi[i] = j; k = n - pi[-1] return s[:k] if (n != k and n % k == 0) else None

answered Apr 8, 2015 at 16:04

RiaDRiaD

45.8k10 gold badges75 silver badges119 bronze badges

3

This version tries only those candidate sequence lengths that are factors of the string length; and uses the * operator to build a full-length string from the candidate sequence:

def get_shortest_repeat(string): length = len(string) for i in range(1, length // 2 + 1): if length % i: # skip non-factors early continue candidate = string[:i] if string == candidate * (length // i): return candidate return None

Thanks to TigerhawkT3 for noticing that length // 2 without + 1 would fail to match the abab case.

answered Apr 7, 2015 at 6:13

2

Here's a straight forward solution, without regexes.

For substrings of s starting from zeroth index, of lengths 1 through len(s), check if that substring, substr is the repeated pattern. This check can be performed by concatenating substr with itself ratio times, such that the length of the string thus formed is equal to the length of s. Hence ratio=len(s)/len(substr).

Return when first such substring is found. This would provide the smallest possible substring, if one exists.

def check_repeat(s): for i in range(1, len(s)): substr = s[:i] ratio = len(s)/len(substr) if substr * ratio == s: print 'Repeating on "%s"' % substr return print 'Non repeating' >>> check_repeat('254725472547') Repeating on "2547" >>> check_repeat('abcdeabcdeabcdeabcde') Repeating on "abcde"

answered Apr 7, 2015 at 1:55

2

I started with more than eight solutions to this problem. Some were bases on regex (match, findall, split), some of string slicing and testing, and some with string methods (find, count, split). Each had benefits in code clarity, code size, speed and memory consumption. I was going to post my answer here when I noticed that execution speed was ranked as important, so I did more testing and improvement to arrive at this:

def repeating(s): size = len(s) incr = size % 2 + 1 for n in xrange(1, size//2+1, incr): if size % n == 0: if s[:n] * (size//n) == s: return s[:n]

This answer seems similar to a few other answers here, but it has a few speed optimisations others have not used:

  • xrange is a little faster in this application,
  • if an input string is an odd length, do not check any even length substrings,
  • by using s[:n] directly, we avoid creating a variable in each loop.

I would be interested to see how this performs in the standard tests with common hardware. I believe it will be well short of David Zhang's excellent algorithm in most tests, but should be quite fast otherwise.

I found this problem to be very counter-intuitive. The solutions I thought would be fast were slow. The solutions that looked slow were fast! It seems that Python's string creation with the multiply operator and string comparisons are highly optimised.

answered Apr 9, 2015 at 16:15

8

This function runs very quickly (tested and it's over 3 times faster than fastest solution here on strings with over 100k characters and the difference gets bigger the longer the repeating pattern is). It tries to minimise the number of comparisons needed to get the answer:

def repeats(string): n = len(string) tried = set([]) best = None nums = [i for i in xrange(2, int(n**0.5) + 1) if n % i == 0] nums = [n/i for i in nums if n/i!=i] + list(reversed(nums)) + [1] for s in nums: if all(t%s for t in tried): print 'Trying repeating string of length:', s if string[:s]*(n/s)==string: best = s else: tried.add(s) if best: return string[:best]

Note that for example for string of length 8 it checks only fragment of size 4 and it does not have to test further because pattern of length 1 or 2 would result in repeating pattern of length 4:

>>> repeats('12345678') Trying repeating string of length: 4 None # for this one we need only 2 checks >>> repeats('1234567812345678') Trying repeating string of length: 8 Trying repeating string of length: 4 '12345678'

answered Apr 22, 2015 at 0:10

Piotr DabkowskiPiotr Dabkowski

5,4434 gold badges37 silver badges47 bronze badges

1

In David Zhang's answer if we have some sort of circular buffer this will not work: principal_period('6210045662100456621004566210045662100456621') due to the starting 621, where I would have liked it to spit out: 00456621.

Extending his solution we can use the following:

def principal_period(s): for j in range(int(len(s)/2)): idx = (s[j:]+s[j:]).find(s[j:], 1, -1) if idx != -1: # Make sure that the first substring is part of pattern if s[:j] == s[j:][:idx][-j:]: break return None if idx == -1 else s[j:][:idx] principal_period('6210045662100456621004566210045662100456621') >>> '00456621'

answered Dec 23, 2016 at 1:30

sachinruksachinruk

9,06310 gold badges48 silver badges80 bronze badges

If you only want to know if a string repeats itself in another string, then this is the best (in my opinion) and shortest snippet for this:

def rep(repeat,fullstring): return len(fullstring.split(repeat)) > 2 #rep('test', 'test -others-') - no #rep('hello', 'hello world') - yes

How do you check for repeating strings in Python?

Approach is simple,.
First split given string separated by space..
Now convert list of words into dictionary using collections. Counter(iterator) method. Dictionary contains words as key and it's frequency as value..
Now traverse list of words again and check which first word has frequency greater than 1..

How do I find a repeated pattern in a string?

If you use regex, you only need one line: String repeated = str. replaceAll("(.

How do you find occurrences in a string in Python?

Python Find All Occurrences in String.
Use the string.count() Function to Find All Occurrences of a Substring in a String in Python..
Use List Comprehension and startswith() to Find All Occurrences of a Substring in a String in Python..
Use the re.finditer() to Find All Occurrences of a Substring in a String in Python..

How do you print a repeated character in a string in Python?

Method 2:.
Define a function which will take a word, m, n values as arguments..
if M is greater than length of word. set m value equal to length of word..
Now store the characters needed to be repeated into a string named repeat_string using slicing..
Multiply the repeat_string with n..
Now print the string..

Chủ đề