regex - Removing lines from a text file using Python and regular expressions
I have some text files, and I want to remove all lines that begin with an asterisk (“*”).
Made-up example:
    words
    *remove me
    words
    words
    *remove me
My current code fails. It follows below:
    import re

    program = open(program_path, "r")
    program_contents = program.readlines()
    program.close()

    new_contents = []
    pattern = r"[^*.]"
    for line in program_contents:
        match = re.findall(pattern, line, re.DOTALL)
        if match.group(0):
            new_contents.append(re.sub(pattern, "", line, re.DOTALL))
        else:
            new_contents.append(line)

    print new_contents
This produces ['', '', '', '', '', '', '', '', '', '', '*', ''], which is no good.
I’m a Python novice, but I’m eager to learn. Eventually I’ll bundle this into a function (right now I’m just trying to figure it out in an IPython notebook).
Thanks for the help!
Your regular expression seems to be incorrect:
[^*.]
This matches any single character that isn't * or . (the leading ^ negates the class). Inside a bracket expression, characters such as * and . lose their special meaning and are treated as literals, so the . in your expression matches a literal . character, not a wildcard.
This is why you get "*" for lines starting with *: you're replacing every character except the *! You would also keep any . present in the original string. Since the other lines contain neither * nor ., all of their characters are replaced.
If you want to match lines beginning with *:
^\*.*
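To see the difference concretely, here is a small sketch that applies both patterns to a few made-up lines. The `lines` list is an assumption standing in for the contents of your file:

```python
import re

# Hypothetical sample lines standing in for the contents of the file.
lines = ["words\n", "*remove me\n", "some. words\n"]

# [^*.] is a negated class: it matches every character except "*" and ".".
# Substituting those matches away leaves only the "*" and "." characters.
stripped = [re.sub(r"[^*.]", "", line) for line in lines]

# ^\*.* matches only when the string begins with a literal "*".
starts_with_star = [bool(re.match(r"^\*.*", line)) for line in lines]

print(stripped)
print(starts_with_star)
```

The first list shows why nearly everything came back empty in the question; the second shows the anchored pattern picking out only the starred line.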
What might be easier is something like this:
    pat = re.compile("^[^*]")
    for line in contents:
        # Keep the line only if its first character is not "*".
        if re.search(pat, line):
            new_contents.append(line)
This code just keeps any line that does not start with *.
In the pattern ^[^*], the first ^ matches the start of the string. The expression [^*] matches any character except *. Together, the pattern matches any string whose first character isn't *.
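Since you mentioned wanting to bundle this into a function, one possible shape is sketched below. The names `KEEP_PATTERN` and `keep_unstarred` and the sample list are my own placeholders, not anything from your code:

```python
import re

# Matches any string whose first character is something other than "*".
KEEP_PATTERN = re.compile(r"^[^*]")


def keep_unstarred(lines):
    """Return only the lines that do not begin with an asterisk."""
    return [line for line in lines if KEEP_PATTERN.search(line)]


# Hypothetical sample input, shaped like the output of readlines().
sample = ["words\n", "*remove me\n", "words\n", "\n"]
kept = keep_unstarred(sample)
print(kept)
```

One caveat: a completely empty string has no first character, so this pattern would drop it as well. Lines returned by readlines() keep their trailing newline, so blank lines in a file still pass the filter.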
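It is a good habit to think about what you really need when using regular expressions. Do you simply need to assert something about a string, do you need to change or remove characters in a string, or do you need to match substrings?
In terms of Python, you need to think about what each function is giving you, and what you need to do with it. Sometimes, as in my example, you only need to know that a match was found. Other times you might need to do something with the match itself.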
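As a rough illustration of "what each function is giving you", the sketch below compares the return values of re.findall and re.search on a made-up line. It also hints at why calling .group(0) on the result of re.findall, as the code in the question does, cannot work:

```python
import re

line = "*remove me"  # made-up sample line

# re.findall returns a plain list of matched strings; a list has no .group().
found = re.findall(r"^\*", line)

# re.search returns a match object on success, or None when nothing matches.
hit = re.search(r"^\*", line)
miss = re.search(r"^\*", "words")

print(found)          # a list of strings
print(hit.group(0))   # the text that matched
print(miss is None)   # no match gives None, which is falsy in an if test
```

If all you need is a yes/no answer, testing the result of re.search directly in an if statement is usually enough.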
Sometimes re.sub isn't the fastest or the best approach. Why bother going through each line and replacing all of its characters, when you can just skip the line entirely? There's no sense in producing an empty string when what you're really doing is filtering.
Most importantly: do I really need a regex? (Here you don't!)
You don't really need a regular expression here. Since you know the size and position of your delimiter, you can simply check it like this:
if line[0] != "*":
This will be faster than a regex. Regular expressions are very powerful tools and can be neat puzzles to figure out, but for delimiters with a fixed width and position, you don't really need them. A regex is a more expensive approach than one that makes use of this information.
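Putting the regex-free check into a function might look something like the sketch below. The function name `remove_starred_lines` and the temporary demo file are assumptions of mine for illustration. I've used str.startswith rather than line[0] because it also behaves on an empty string, where indexing would raise an IndexError:

```python
import os
import tempfile


def remove_starred_lines(path):
    """Read the file at `path` and return its lines, minus those starting with "*"."""
    with open(path, "r") as handle:
        # str.startswith is safe on empty strings, unlike indexing with line[0].
        return [line for line in handle if not line.startswith("*")]


# Demo on a throwaway file with made-up contents.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("words\n*remove me\nwords\n")
    demo_path = tmp.name

result = remove_starred_lines(demo_path)
os.remove(demo_path)
print(result)
```

Iterating over the file object directly avoids loading everything with readlines() first, and the with block closes the file for you.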