Using regex on TextWrangler

Why TextWrangler?

Some Regex

BBEdit / TextWrangler Regular Expression Guide

—————————————————————————————————————————————————————————————————————
BBEdit / BBEdit-Lite / TextWrangler Regular Expression Guide Modified: 2018/08/10 01:19
—————————————————————————————————————————————————————————————————————
NOTES:

The PCRE engine (Perl Compatible Regular Expressions) is what BBEdit and TextWrangler use.

Items I’m unsure of are marked ‘# PCRE?’. The list, while fairly comprehensive, is not complete.

————————————————————————————————————————————————————————————————————
PATTERN MODIFIERS (switches)
————————————————————————————————————————————————————————————————————

i Case-insensitive
m Multiline : allows ^ and $ to match after and before a \r or \n.
s Magic Dot : allows . to match \r and \n
x Free-spacing: ignore unescaped white space; allow inline comments in grep patterns.

(?imsx) On
(?-imsx) Off
(?i-msx) Mixed
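
As a rough illustration of the switches above, here is a small sketch in Python. That choice is an assumption on my part: Python's `re` module is not the PCRE engine that BBEdit and TextWrangler use, but it accepts the same `(?imsx)` inline syntax when the switches sit at the start of the pattern. The sample text and variable names are made up for the example.

```python
import re

text = "Alpha\nbeta\nGAMMA"

# (?i) case-insensitive: "gamma" matches "GAMMA"
case_insensitive = re.findall(r"(?i)gamma", text)

# (?m) multiline: ^ and $ match at every line break, not only at the ends of the text
without_m = re.findall(r"^\w+$", text)
with_m = re.findall(r"(?m)^\w+$", text)

# (?s) "magic dot": . also matches the line break between Alpha and beta
without_s = re.search(r"Alpha.beta", text)
with_s = re.search(r"(?s)Alpha.beta", text)

# (?x) free-spacing: unescaped spaces are ignored and # starts a comment
free_spacing = re.findall(r"(?x) \d+   # one or more digits", "a 12 b 345")

print(case_insensitive, without_m, with_m, bool(without_s), bool(with_s), free_spacing)
```

Python only allows switching options off inside a group, as in `(?-i:…)`, so the global `(?-imsx)` form listed above is not shown here.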

———————————————————————————————————————————————————————————————————
Regex Meta-Characters:
———————————————————————————————————————————————————————————————————
. Any character except newline or carriage return
[ ] Any single character of set
[^ ] Any single character NOT of set

* 0 or more previous regular expression
*? 0 or more previous regular expression (non-greedy)
+ 1 or more previous regular expression
+? 1 or more previous regular expression (non-greedy)
? 0 or 1 previous regular expression
| Alternation
( ) Grouping regular expressions
^ Beginning of a line or string
$ End of a line or string
{m,n} At least m but at most n previous regular expression
{m,n}? At least m but at most n previous regular expression (non-greedy)
\1-9 Nth previous captured group
\& Whole match # BBEdit: ‘&’ only – no escape needed
\` Pre-match # PCRE? NOT BBEdit
\’ Post-match # PCRE? NOT BBEdit
\+ Highest group matched # PCRE? NOT BBEdit
\A Beginning of a string
\b Backspace (0x08) (inside [] only) # PCRE?
\b Word boundary (outside [] only)
\B Non-word boundary
\d Digit, same as [0-9]
\D Non-digit
\G Assert position at end of previous match or start of string for first match
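
A few of the meta-characters above in action. This is only a sketch in Python's `re` module (an assumption, since it is not the engine BBEdit uses), and the sample strings are invented; the items shown here behave the same way in PCRE as far as I know.

```python
import re

# Greedy .+ runs to the last ">", non-greedy .+? stops at the first one
greedy = re.findall(r"<.+>", "<a><b>")
lazy = re.findall(r"<.+?>", "<a><b>")

# {m,n}: whole numbers of two or three digits, delimited by \b word boundaries
two_or_three = re.findall(r"\b\d{2,3}\b", "1 22 333 4444")

# \1 refers back to the first captured group: find a doubled word
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(0)

# [^ ] negated set: runs of characters that are neither vowels nor spaces
no_vowels = re.findall(r"[^aeiou ]+", "regex guide")

print(greedy, lazy, two_or_three, doubled, no_vowels)
```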

—————————————————————————————————————————————————————————————————
Case-Change Operators
—————————————————————————————————————————————————————————————————
\E Change case – acts as an end delimiter to terminate runs of \L & \U.
\l Change case of only the first character to the right to lowercase. (Note: lowercase ‘L’)
\L Change case of all text to the right to lowercase.
\u Change case of only the first character to the right to uppercase.
\U Change case of all text to the right to uppercase.
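
Python's `re` module (used here only for illustration, not the engine BBEdit uses) has no `\U`, `\L`, `\u` or `\l` in replacement strings, so the sketch below emulates them with a replacement function. Going by the descriptions above, a BBEdit replacement pattern of `\u\1\L\2` should have the same effect, but I have not verified that in BBEdit. The function name `title_case` is hypothetical.

```python
import re

def title_case(text):
    # Group 1 is the first character of each word, group 2 is the rest of the word
    def fix(match):
        return match.group(1).upper() + match.group(2).lower()
    return re.sub(r"\b(\w)(\w*)", fix, text)

print(title_case("hELLO wORLD"))
```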

——————————————————————————————————————————————————————————————
White-Space or Non-White-Space
——————————————————————————————————————————————————————————————
\t Tab
\n Linefeed
\r Return
\R Return or Linefeed or Windows CRLF (matches any Unicode newline sequence).
\f Formfeed
\s Whitespace character equivalent to [ \t\n\r\f]
\S Non-whitespace character
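
A small sketch of the whitespace escapes, again in Python's `re` as an assumed stand-in for the BBEdit engine. Python's `re` has no `\R`, so the alternation `\r\n|\r|\n` is used in its place; it covers only the three common line endings, not every Unicode newline sequence. The sample strings are made up.

```python
import re

mixed = "one\r\ntwo\rthree\nfour"

# \r\n is listed first so a Windows CRLF counts as one line break, not two
lines = re.split(r"\r\n|\r|\n", mixed)

# \s+ collapses any run of spaces, tabs and line breaks into a single space
collapsed = re.sub(r"\s+", " ", "a \t b\n\n c")

# \S+ picks out the non-whitespace runs
words = re.findall(r"\S+", "a \t b\n\n c")

print(lines, collapsed, words)
```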
——————————————————————————————————————————————————————

\W Non-word character
\w Word character [0-9A-Za-z_]
\z End of a string
\Z End of a string, or before newline at the end
(?#) Comment
(?:) Grouping without backreferences
(?=) Zero-width positive look-ahead assertion
(?!) Zero-width negative look-ahead assertion
(?>) Nested anchored sub-regexp stops backtracking
(?imx-imx) Turns on/off imx options for rest of regexp
(?imx-imx:…) Turns on/off imx options, localized in group # ‘…’ indicates added regex pattern
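
To make the difference between capturing and non-capturing groups concrete, here is a short sketch in Python's `re` (an assumption; it is not the PCRE engine, though these constructs share the same syntax). The sample strings are invented.

```python
import re

# (ab)+ captures, so findall returns only the last captured "ab" of each match;
# (?:ab)+ groups without capturing, so findall returns the whole match
capturing = re.findall(r"(ab)+", "ababab ab")
non_capturing = re.findall(r"(?:ab)+", "ababab ab")

# (?i:...) switches case-insensitivity on for the group only
scoped = re.findall(r"(?i:te)st", "TEst teST test")

# (?#...) is an inline comment and does not affect matching
commented = re.findall(r"\d+(?#digits)px", "12px 3em 40px")

print(capturing, non_capturing, scoped, commented)
```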

———————————————————————————————————————————————————————————————
PERL-STYLE PATTERN EXTENSIONS : BBEdit Documentation : ‘…’ indicates added regex pattern
————————————————————————————————————————————————————————————————
Extension Meaning
————————————————————————————————————————————————————————————————
(?:…) Cluster-only parentheses, no capturing
(?#…) Comment, discard all text between the parentheses
(?imsx-imsx) Enable/disable pattern modifiers
(?imsx-imsx:…) Cluster-only parens with modifiers
(?=…) Positive lookahead assertion
(?!…) Negative lookahead assertion
(?<=…) Positive lookbehind assertion
(?<!…) Negative lookbehind assertion
(?>…) Match non-backtracking subpattern (“once-only”)
(?R) Recursive pattern

—————————————————————————————————————————————————————————————————
POSITIONAL ASSERTIONS (duplication of above)
—————————————————————————————————————————————————————————————————

POSITIVE LOOKAHEAD ASSERTION: (?=’pattern’)
NEGATIVE LOOKAHEAD ASSERTION: (?!’pattern’)

POSITIVE LOOKBEHIND ASSERTION: (?<=’pattern’) # Lookbehind assertions must be of fixed length
NEGATIVE LOOKBEHIND ASSERTION: (?<!’pattern’)
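
The four assertions can be tried out with the sketch below. It uses Python's `re` as an assumed stand-in for the BBEdit engine; the lookaround syntax is the same, and Python also requires fixed-length lookbehinds. The sample strings are made up.

```python
import re

# Positive lookbehind + positive lookahead: insert a comma at every position
# preceded by a digit and followed by complete groups of three digits
with_commas = re.sub(r"(?<=\d)(?=(?:\d{3})+$)", ",", "1234567")

# Negative lookahead: words starting with "foo" that do not continue with "bar"
not_foobar = re.findall(r"\bfoo(?!bar)\w*", "foobar foobaz food")

# Negative lookbehind: numbers not preceded by a dollar sign
no_dollar = re.findall(r"(?<!\$)\b\d+", "$100 200 $300 400")

print(with_commas, not_foobar, no_dollar)
```

Because assertions are zero-width, the first substitution inserts commas without consuming any of the digits around them.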

————————————————————————————————————————————————————————————————
SPECIAL CHARACTER CLASSES (POSIX standard except where ‘Perl Extension’ is indicated):
———————————————————————————————————————————————————————————————
CLASS MEANING
———————————————————————————————————————————————————————————————
[[:alnum:]] Alpha-numeric characters
[[:alpha:]] Alphabetic characters
[[:ascii:]] Character codes 0-127 # Perl Extension
[[:blank:]] Horizontal whitespace
[[:cntrl:]] Control characters
[[:digit:]] Decimal digits (same as \d)
[[:graph:]] Printing characters, excluding spaces
[[:lower:]] Lower case letters
[[:print:]] Printing characters, including spaces
[[:punct:]] Punctuation characters
[[:space:]] White space (same as \s)
[[:upper:]] Upper case letters
[[:word:]] Word characters (same as \w) # Perl Extension
[[:xdigit:]] Hexadecimal digits

Usage example of multiple character classes:

[[:alpha:][:digit:]]

«Negated» character class example:

[[:^digit:]]+

** POSIX-style character class names are case-sensitive

** The outermost brackets above indicate a character set; the class name itself looks like this: [:alnum:]
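
Python's `re` module does not understand POSIX class names, so they cannot be demonstrated there directly. The sketch below is a workaround rather than anything BBEdit does: a hypothetical helper, `expand_posix`, rewrites a handful of class names into explicit ASCII ranges before the pattern is compiled. The table is my own assumption, covers only a few classes, and ignores negated forms such as `[:^digit:]`.

```python
import re

# ASCII-only stand-ins for a few POSIX class names (assumed, not exhaustive)
POSIX_ASCII = {
    "alnum": "0-9A-Za-z",
    "alpha": "A-Za-z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern):
    # Replace each [:name:] with its explicit range; unknown names raise KeyError
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

translated = expand_posix("[[:alpha:][:digit:]]+")
matches = re.findall(translated, "ab12 -- cd")

print(translated, matches)
```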

—————————————————————————————————————————————————————————————
CONDITIONAL SUBPATTERNS
—————————————————————————————————————————————————————————————
Conditional subpatterns allow you to apply “if-then” or “if-then-else” logic to pattern matching.
The “if” portion can either be an integer between 1 and 99, or an assertion.

The forms of syntax for an ordinary conditional subpattern are:

if-then: (?(condition)yes-pattern)

if-then-else: (?(condition)yes-pattern|no-pattern)

and for a named conditional subpattern are:

if-then: (?P<NAME>(condition)yes-pattern)

if-then-else: (?P<NAME>(condition)yes-pattern|no-pattern)

If the condition evaluates as true, the “yes-pattern” portion attempts to match. Otherwise, the
“no-pattern” portion does (if there is a “no-pattern”).
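
Here is a sketch of the ordinary (numbered) form, written in Python's `re`, which is an assumption on my part: it supports `(?(1)…)` with a group number as the condition, but not an assertion as the condition. The patterns and test strings are invented for the example.

```python
import re

# if-then: group 1 optionally matches "(", and (?(1)\)) demands ")" only if it did
balanced = re.compile(r"^(\()?\d+(?(1)\))$")
if_then = {s: bool(balanced.match(s)) for s in ["(123)", "123", "(123", "123)"]}

# if-then-else: after "<" the word must end with ">", otherwise with ";"
tag_or_stmt = re.compile(r"^(<)?\w+(?(1)>|;)$")
if_then_else = {s: bool(tag_or_stmt.match(s)) for s in ["<tag>", "word;", "<tag;", "word>"]}

print(if_then)
print(if_then_else)
```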

———————————————————————————————————————————————————————————————
REVISION NOTES:
———————————————————————————————————————————————————————————————

2016/02/29 17:23

\G metacharacter added.

Tested with BBEdit 11.5.1 & TextWrangler 5.0.2.

Also available in ICU RegEx:

http://userguide.icu-project.org/strings/regexp#TOC-Regular-Expression-Metacharacters

————————————————————————————————————————————————————————————————

Free copy of our latest paper in Computer Assisted Language Learning

Our article, Language teachers’ perceptions on the use of OER language processing technologies in MALL, has just been published in the journal Computer Assisted Language Learning, Taylor & Francis Online.

50 free eprints can be downloaded from the following URL:

http://www.tandfonline.com/eprint/epWFWhVAGFZ4yRSIaMcA/full

Get yours now!!!!

Abstract

Combined with the ubiquity and constant connectivity of mobile devices, and with innovative approaches such as Data-Driven Learning (DDL), Natural Language Processing Technologies (NLPTs) as Open Educational Resources (OERs) could become a powerful tool for language learning as they promote individual and personalized learning. Using a questionnaire that was answered by language teachers (n = 230) in Spain and the UK, this research explores the extent to which OER NLPTs are currently known and used in adult foreign language learning. Our results suggest that teachers’ familiarity and use of OER NLPTs are very low. Although online dictionaries, collocation dictionaries and spell checkers are widely known, NLPTs appear to be generally underused in foreign language teaching. It was found that teachers prefer computer-based environments over mobile devices such as smartphones and tablets and that teachers’ qualification determines their familiarity with a wider range of OER NLPTs. This research offers insight into future applications of Language Processing Technologies as OERs in language learning.

KEYWORDS: Language learning, teachers’ perceptions, OER, MALL, natural language processing technologies, higher education

Graphic Online Language Diagnostic

 

The Graphic Online Language Diagnostic (“GOLD”) is a corpus tool that allows language educators to submit and analyze language data. GOLD was developed by the Center for Advanced Language Proficiency Education and Research (“CALPER”) at The Pennsylvania State University (“PSU”), University Park, PA, USA under a grant from the U.S. Department of Education (Title VI, P229A060003 and P229A020010).

Link here: http://gold.gwserver1.net

TAALES 2.2 is out : automatic analysis of lexical sophistication, Windows and Mac

From the TAALES website:

Kyle, K. & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly 49(4), pp. 757-786. doi: 10.1002/tesq.194

TAALES is a tool that measures over 400 classic and new indices of lexical sophistication, and includes indices related to a wide range of sub-constructs. TAALES indices have been used to inform models of second language (L2) speaking proficiency, first language (L1) and L2 writing proficiency, spoken and written lexical proficiency, genre differences, and satirical language.

Starting with version 2.2, TAALES provides comprehensive index diagnostics, including text-level coverage output (i.e., the percent of words/bigrams/trigrams in a text covered by the index) AND individual word/bigram/trigram index coverage information.

TAALES takes plain text files as input (it will process all plain text files in a particular folder) and produces a comma separated values (.csv) spreadsheet that is easily read by any spreadsheet software.

 

You can find all the info here. Windows and Mac versions available for free.