htmanning has asked for the wisdom of the Perl Monks concerning the following question:
Monks,
I have a log with a field called $text. I recognize 4 digit numbers in $text which represent apartment numbers. I bold those numbers and link to another script like this:
my $digits_4 = qr{ (?<!-)\b[0-9]{4}\b(?!-) }xms;
$text =~ s{ ($digits_4) }
{<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a>}
+xmsg;
It works, but it sometimes tags other 4 digits numbers like a year or part of a phone number. I tried setting all unit numbers to a var then using an unless statement like this:
$unitslist_4 = "1001,1002,1003,1004,1101,1102,1103,1104,1201,1202,1203
+,1204,1301,1302,1303,1304,1401,1402,1403,1404,
1501,1502,1503,1504,1601,1602,1603,1604,1701,1702,1703,1704,1801,1802,
+1803,1804,1901,1902,1903,1904,2001,2002,2003,2004,
2101,2102,2103,2104,2201,2202,2203,2204,2301,2302,2303,2304,2401,2402,
+2403,2501,2502,2503,2504,2505";
unless ($digits_4 =~ /$unitslist/) {
$text =~ s{ ($digits_4) }
{<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a>}
+xmsg;
}
It works for the first few $text fields, but then stops working altogether. This can't be the right way to do something like this. Any help would be appreciated.
Re: Text switching
by Fletch (Bishop) on Sep 13, 2022 at 01:42 UTC
|
If you've got a list of valid apt numbers probably the most efficient thing to do would be create a hash and use that to validate (although declaring it with a comma separated list likewise is a minor kludge, but you get the idea):
my %valid_apt_numbers;
for my $unit ( split( /,/, q{1001,1002,...YADDAYADDA...,2505} ) ) {
$valid_apt_numbers{ $unit } = 1;
}
if( exists $valid_apt_numbers{ $digits_4 } ) {
$text =~ s{\b($digits_4)\b}{<a href=YADDAYADDA</a>}xmsg;
}
The cake is a lie.
The cake is a lie.
The cake is a lie.
| [reply] [d/l] |
|
If you've got a list of valid apt numbers probably the most efficient thing to do would be create a hash and use that to validate
The danger here is when a valid apartment number (either now or in the future) is also a valid year.
The OP appears to have listed 2022 as being an apartment number - and that's certainly a valid year...
| [reply] |
|
if( exists $valid_apt_numbers{ $digits_4 } ) {
$text =~ s{\b($digits_4)\b}{<a href=YADDAYADDA</a>}xmsg;
}
If $digits_4 is defined as in the OP, i.e., as a Regexp object, I don't see how its stringization will ever match any key in %valid_apt_numbers, all of which are "unit" numbers of the format '1001', '2202', etc.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
DERP, good point; for some reason I read that as having the matched number in it. Handwaving attempt below using /e and adding a bit of logic to the RHS of the substitution would work. Still has the problem of the fuzziness that it's non-trivial to be sure you're only pulling apartment numbers with just the regex and not getting years or what not.
A better solution would be to annotate explicitly in the source text as mentioned elsewhere. Better still use a real templating solution to explicitly mark up things. (e.g. TT like See more about [% apt_link( 1002 ) %] )
#!/usr/bin/env perl
use 5.034;
my $text = <<'EOT';
This is an appt: 1203
So is 1001.
But 8675 is not.
EOT
chomp($text);
my $digits_4_re = qr{ (?<!-)\b[0-9]{4}\b(?!-) }xms;
my %valid_appt_numbers = ( 1001 => 1, 1203 => 1, );
sub _anchor_if_apt {
my $candidate = shift;
if ( exists $valid_appt_numbers{$candidate} ) {
return _anchor_for_num($candidate);
} else {
return $candidate;
}
}
sub _anchor_for_num {
my $unit = shift;
return qq{<a href="apartment.pl?do_what=view&unit=$unit"><b>$unit<
+/b></a>};
}
$text =~ s{($digits_4_re)}{ _anchor_if_apt( $1 ) }xsmeg;
say $text;
exit 0;
__END__
$ perl texty.plx
This is an appt: <a href="apartment.pl?do_what=view&unit=1203"><b>1203
+</b></a>
So is <a href="apartment.pl?do_what=view&unit=1001"><b>1001</b></a>.
But 8675 is not.
The cake is a lie.
The cake is a lie.
The cake is a lie.
| [reply] [d/l] [select] |
|
|
Wow, thanks. It works, but only on the first $text field. This whole section is in a while loop. So we search the database then print the latest 50 records after doing the text switching like this:
while (($pointer = $sth->fetchrow_hashref) && ($current_count <= $stop
+count)){
$current_count++;
if ($current_count >= $startcount && $current_count <= $st
+opcount) {
my %valid_apt_numbers;
for my $unit ( split( /,/, q{1001,1002,1003,1004,1101,1102,1103,1104,1
+201,1202,1203,1204,1301,1302,1303,1304,1401,1402,1403,1404,
1501,1502,1503,1504,1601,1602,1603,1604,1701,1702,1703,1704,1801,1802
+,1803,1804,1901,1902,1903,1904,2001,2002,2003,2004,
2101,2102,2103,2104,2201,2202,2203,2204,2301,2302,2303,2304,2401,2402
+,2403,2501,2502,2503,2504,2505} ) ) {
$valid_apt_numbers{ $unit } = 1;
}
if( exists $valid_apt_numbers{ $digits_4 } ) {
$text =~ s{\b($digits_4)\b}{<a href="apartments.pl?do_what=view&unit
+=$1"><b>$1</b></a>}xmsg;
}
print qq~
$text
~;
}
}
| [reply] [d/l] |
Re: Text switching (updated)
by AnomalousMonk (Archbishop) on Sep 14, 2022 at 05:22 UTC
|
my $digits_4 = qr{ (?<!-)\b[0-9]{4}\b(?!-) }xms;
...
$unitslist_4 = "1001,1002,1003,1004,...,2504,2505";
...
unless ($digits_4 =~ /$unitslist/) { ... }
$digits_4 =~ /$unitslist/ seems to be matching the stringization of Regexp (regex or qr//) object $digits_4 to a string $unitslist of all possible unit numbers (assuming $unitslist is the same as $unitslist_4) that has been compiled as a match regex. I don't understand the purpose of this.
unless ($digits_4 =~ /$unitslist/) { ... } is doing the match described above and executing a conditional block (which contains the substitution) if there is no match. I don't understand this either.
A simple approach to the general problem would be to uniquely mark all unit number sub-strings: %1234% or {{1234}}. This would also allow for easy support of unit numbers like 123A or 12-B or C-4 or even Penthouse! Substitution matching then becomes straightforward and unambiguous.
Update: A spelling correction, and a few minor wording changes for clarity.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Text switching
by duelafn (Parson) on Sep 13, 2022 at 13:02 UTC
|
use strict;
my $unitslist = join "|", qw(
1001 1002 1003 1004 1101 1102 1103 1104 1201 1202 1203 1204 1301 130
+2
1303 1304 1401 1402 1403 1404 1501 1502 1503 1504 1601 1602 1603 160
+4
1701 1702 1703 1704 1801 1802 1803 1804 1901 1902 1903 1904 2001 200
+2
2003 2004 2101 2102 2103 2104 2201 2202 2203 2204 2301 2302 2303 230
+4
2401 2402 2403 2501 2502 2503 2504 2505
);
my $text = "Visit units 1101 or 2202, call us at 555-555-5555\n";
$text =~ s{ \b($unitslist)\b }
{<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a>}
+xg;
print $text;
Update: Added \b boundary matches to pattern matcher
| [reply] [d/l] [select] |
|
Win8 Strawberry 5.8.9.5 (32) Tue 09/13/2022 15:47:56
C:\@Work\Perl\monks
>perl
use strict;
use warnings;
# use Data::Dump qw(dd); # for debug
my $unitslist = join "|", qw(
1001 1002 1003 1004 1101 1102 1103 1104 1201 1202 1203 1204 1301 130
+2
1303 1304 1401 1402 1403 1404 1501 1502 1503 1504 1601 1602 1603 160
+4
1701 1702 1703 1704 1801 1802 1803 1804 1901 1902 1903 1904 2001 200
+2
2003 2004 2101 2102 2103 2104 2201 2202 2203 2204 2301 2302 2303 230
+4
2401 2402 2403 2501 2502 2503 2504 2505
);
my $text =
"Visit units 1101 or 2202, call us at 555-555-2202 and
call before 13 Aug, 2202\n";
$text =~ s{ \b($unitslist)\b }
{\n<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a
+>\n}xg;
print $text;
^Z
Visit units
<a href="apartments.pl?do_what=view&unit=1101"><b>1101</b></a>
or
<a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a>
, call us at 555-555-
<a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a>
and
call before 13 Aug,
<a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a>
Indeed, it doesn't seem as if the problem can be entirely eliminated unless input text can be specified to be much more specialized. E.g., uniquely delimit all unit number sub-strings: %1234% or {{1234}}. This would also allow for easy support of unit numbers like 123A or 12-B.
It's possible to somewhat mitigate the problems associated with completely free-form text by adding more boundary conditions.
Win8 Strawberry 5.8.9.5 (32) Wed 09/14/2022 0:21:27
C:\@Work\Perl\monks
>perl
use strict;
use warnings;
# use Data::Dump qw(dd); # for debug
my ($rx_all_units) =
map qr{ (?<! [-.:]) \b (?: $_) \b (?! [-.:]) }xms,
join '|',
reverse sort
qw(
1001 1002 1003 1004 1101 1102 1103 1104 1201 1202 1203 1204 1301 1
+302
1303 1304 1401 1402 1403 1404 1501 1502 1503 1504 1601 1602 1603 1
+604
1701 1702 1703 1704 1801 1802 1803 1804 1901 1902 1903 1904 2001 2
+002
2003 2004 2101 2102 2103 2104 2201 2202 2203 2204 2301 2302 2303 2
+304
2401 2402 2403 2501 2502 2503 2504 2505
);
my $text =
"Visit units 1101 or 2202, call us at 555-555-2202 and
call before 13 Aug, 2202\n";
$text =~ s{ ($rx_all_units) }
{\n<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a
+>\n}xg;
print $text;
^Z
Visit units
<a href="apartments.pl?do_what=view&unit=1101"><b>1101</b></a>
or
<a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a>
, call us at 555-555-2202 and
call before 13 Aug,
<a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a>
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
|