Shell File Manager
1. Tree equivalencies. Currently, /contend/ /content/ /resend/ /resent/
produces (?:conten[dt]|resend[dt]) but it is possible to produce
(?:cont|res)en[dt] if one can spot the common tail nodes (and walk back
the equivalent paths). Or be by me my => /[bm][ey]/ in the simplest case.
To do this requires a certain amount of restructuring of the code.
Currently, the algorithm uses a two-phase approach. In the first
phase, the trie is traversed and reductions are performed. In the
second phase, the reduced trie is traversed and the pattern is
emitted.
What has to occur is that the reduction and emission have to occur
together. As a node is completed, it is replaced by its string
representation. This then allows child nodes to be compared for
equality with a simple 'eq'. Since there is only a single traversal,
the overall generation time might drop, even though the context
baggage required to delve through the tree will be more expensive
to carry along (a hash rather than a couple of scalars).
Actually, a simpler approach is to take on a secret sentinel
atom at the end of every pattern, which gives the reduction
algorithm sufficient traction to create a perfect trie.
I'm rewriting the reduction code using this technique.
2. Investigate how (?>foo) works. Can it be applied?
5. How can a tracked pattern be serialised? (Add freeze and thaw methods).
6. Store callbacks per tracked pattern.
12. utf-8... hmmmm...
14. Adding qr//'ed patterns. For example, consider
$r->add ( qr/^abc/i )
->add( qr/^abd/ )
->add( qr/^ab e/x );
this should admit abc abC aBc aBC abd abe as matches
16. Allow a fast, unsafe tracking mode, that can be used if a(?bc)?
can't happen. (Possibly carp if it does appear during traversal)?
17. given a-\d+-\d+-\d+-\d+-b, produce a(?:-\d+){4}-b. Something
along the lines of (.{4))(\1+) would let the regexp engine
itself be brought to bear on the matter, which is a rather
appealing idea. Consider
while(/(?!\+)(\S{2,}?)(\1+)/g) { ... $1, $2 ... }
as a starting point.
19. The reduction code has become unbelievably baroque. Adding code
to handle (sting,singing,sing) => s(?:(?:ing)?|t)ing was far
too difficult. Adding more stuff just breaks existing behaviour.
And fixing the ^abcd$ ... bug broke stuff all over again.
Now that the corner cases are more clearly identified, a full
rewrite of the reduction code is needed. And would admit the
possibility of implementing items 1 and 17.
20. Handle debug unrev with a separate bit
23. Japhy's http://www.perlmonks.org/index.pl?node_id=90876 list2range
regexp
24. Lookahead assertions contain serious bugs (as shown by
assembling powersets. Need to save more context during reduction,
which in turn will simplify the preparation of the lookahead
classes. See also 19.
26. _lex() swamps the overall run-time. It stems from the decision
to use a single regexp to pull apart any pattern. A suite of
simpler regexp to pick of parens, char classes, quantifiers
and bare tokens may be faster. (This has been implemented as
_fastlex(), but it's only marginally faster. Perhaps split-by-
char and lex a la C?
27. We don't, as yet, unroll_plus a paren e.g. (abc)+?
28. We don't reroll unrolled a a* to a+ in indented or tracked
output
29. Use (*MARK n) in blead for tracked patterns, and use (*FAIL) for
the unmatchable pattern.
Shell File Manager Version 1.1, Coded By Shell
Email: [email protected]