PerlMonks
Script to scrape data

by Anonymous Monk
on Dec 01, 2024 at 23:38 UTC ( [id://11162964] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am trying to extract, in CSV, TXT or any human-readable format, the data from the results table on this website: https://desmace.com/provincia/asturias/

According to the table headings, I want my script to ask for two dates for the top-left column "FECHA DEL TRÁMITE" (procedure date), and to set two fixed dates (always 01/01/1900 and 31/12/2000) in the "FECHA MATRÍCULA" (registration date) column.

You can also add more columns to the table by clicking "Columnas" at the top right. Doing this, I add "Prov. Matriculación", and I also want my script to ask for input for this field.

Then, according to these criteria, the table results are displayed, and this is what I want to store in CSV, TXT or a similar format.

I have this code so far... it runs OK, but it does not store the data properly.

I would be super grateful if I can get some help to make the script work.

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TreeBuilder;
    use Text::CSV;

    # URL of the website
    my $url = 'https://desmace.com/provincia/asturias/';

    # Filter criteria
    my $fecha_tramite_min   = '24/11/2024';
    my $fecha_tramite_max   = '27/11/2024';
    my $fecha_matricula_min = '01/01/1900';
    my $fecha_matricula_max = '31/12/2000';
    my $prov_matriculacion_filtro = 'ASTURIAS';

    # HTML page content
    my $html = get($url) or die "Could not fetch the URL: $!";

    # Parse HTML
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);

    # Open the CSV file
    my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
    open my $fh, ">", "resultados.csv" or die "Could not create the CSV file: $!";

    # Headings for the CSV file
    $csv->print($fh, ['FECHA DEL TRÁMITE', 'TRÁMITE', 'FECHA MATRÍCULA',
                      'MARCA', 'MODELO', 'BASTIDOR (VIN)', 'PROV. MATRICULACIÓN']);

    # This is the table that contains the data
    my @rows = $tree->look_down(_tag => 'tr');
    foreach my $row (@rows) {
        my @columns = $row->look_down(_tag => 'td');
        my @data;

        # Extract the values from the columns
        foreach my $col (@columns) {
            push @data, $col->as_text;
        }

        # Row filtering
        if (@data >= 9) {
            my ($fecha_tramite, $fecha_matricula, $prov_matriculacion) = @data[0, 2, 7];
            if (   $fecha_tramite ge $fecha_tramite_min && $fecha_tramite le $fecha_tramite_max
                && $fecha_matricula ge $fecha_matricula_min && $fecha_matricula le $fecha_matricula_max
                && $prov_matriculacion eq $prov_matriculacion_filtro) {
                $csv->print($fh, \@data);
            }
        }
    }

    # Close the file
    close $fh;
    $tree->delete;
    print "Filtered data saved to 'resultados.csv'.\n";
Many thanks in advance!

Replies are listed 'Best First'.
Re: Script to scrape data
by marto (Cardinal) on Dec 02, 2024 at 10:29 UTC

    The site in question uses the DataTables JavaScript library to populate the table from a query to the back end, with various parameters: the columns you want to display, how many entries to show per page, etc. Using your browser's developer tools you can see this query being sent for processing (the entire payload and the URL it hits), with the results returned as a JSON object. Personally, I'd skip the HTML-parsing approach and automate the interaction with the back-end query instead. I'd use Mojo::UserAgent, my go-to choice for web work/scraping: send a request with the parameters you want (copy/paste from Developer Tools once satisfied) and process the JSON result however you want. Super Search will show some interesting results.

      That was my first thought. However, when I went to that website and looked at the network traffic in Developer Tools, it showed that the page makes two Ajax requests, both of them using this URL: https://desmace.com/wp-admin/admin-ajax.php?action=get_wdtable&table_id=414

      Interestingly, when I copied and pasted this address to see what would be downloaded, it showed nothing. Why is that?

        You're not posting anything along with that request, so it's unlikely to perform any action without parameters. You can see what's being sent by clicking the row in question and then selecting 'Request' in the right-hand pane, which lists all of the parameters and values, or by right-clicking the row in question -> 'Copy Value' and looking at all the options provided.

      Hi again! Thank you for your answers; following them, I looked at the DevTools and I think I am close to my goal.

      More precisely, what I need are the lines of code to perform the Ajax request on admin-ajax.php?action=get_wdtable&table_id=414, including the payload data (the bunch of parameters starting with "draw" that contains all the values I have entered as input in the table).

      I need to insert this request into my Perl script so that I get the JSON response in a human-readable format that I can store in TXT, CSV or whatever.
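A minimal sketch of such a request with Mojo::UserAgent, as suggested above. Note the form-field names below (draw, start, length) follow the generic DataTables server-side protocol and are assumptions; the real payload must be copied from the 'Request' pane in your browser's Developer Tools, and the structure of the JSON response (a "data" array of rows) is likewise assumed:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;

my $ua  = Mojo::UserAgent->new;
my $url = 'https://desmace.com/wp-admin/admin-ajax.php'
        . '?action=get_wdtable&table_id=414';

# POST the DataTables parameters as a form. The names and values here
# are placeholders -- paste the full parameter list from DevTools.
my $tx = $ua->post($url => form => {
    draw   => 1,
    start  => 0,
    length => 100,
    # ... remaining column/filter parameters from DevTools go here
});

die 'Request failed: ', $tx->result->code unless $tx->result->is_success;

# Decode the JSON response and dump the rows (assumed structure).
my $json = $tx->result->json;
for my $row (@{ $json->{data} // [] }) {
    print join(';', @$row), "\n";
}
```

Once the output looks right, writing it with Text::CSV instead of join/print is a small change.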

      Many thanks in advance!

Re: Script to scrape data
by hippo (Archbishop) on Dec 02, 2024 at 09:42 UTC
    my $fecha_tramite_min   = '24/11/2024';
    my $fecha_tramite_max   = '27/11/2024';
    my $fecha_matricula_min = '01/01/1900';
    my $fecha_matricula_max = '31/12/2000';
    # ...
    if (   $fecha_tramite ge $fecha_tramite_min && $fecha_tramite le $fecha_tramite_max
        && $fecha_matricula ge $fecha_matricula_min && $fecha_matricula le $fecha_matricula_max
        && $prov_matriculacion eq $prov_matriculacion_filtro) {

    At least one of the problems is that you are comparing dates lexically. This will not work at all where the least significant part of the date (the day of the month) comes first. E.g. in your current code, any $fecha_tramite which is the 25th or 26th of any month of any year will match, which is presumably not what you want.

    There are any number of ways to solve this. I would reach for Time::Piece, convert all the dates into Time::Piece objects and then use the less-than and greater-than operators (< and >) to compare them.
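A short sketch of that Time::Piece approach, using the date bounds from the posted code; the example date (25/12/2023) is hypothetical, chosen to show how the lexical comparison goes wrong:

```perl
use strict;
use warnings;
use Time::Piece;

# Parse dd/mm/yyyy strings into Time::Piece objects so that comparisons
# are chronological rather than character-by-character.
sub fecha { Time::Piece->strptime($_[0], '%d/%m/%Y') }

my $min = fecha('24/11/2024');
my $max = fecha('27/11/2024');
my $dia = fecha('25/12/2023');    # the 25th, but of December 2023

# Lexical comparison wrongly accepts this date ("25..." sorts between
# "24..." and "27..."); numeric comparison correctly rejects it.
print 'lexical: ', ('25/12/2023' ge '24/11/2024' && '25/12/2023' le '27/11/2024'
                    ? 'match' : 'no match'), "\n";    # match
print 'numeric: ', ($dia >= $min && $dia <= $max
                    ? 'match' : 'no match'), "\n";    # no match
```

Time::Piece overloads the numeric comparison operators, so once the strings are parsed the filter condition keeps its shape.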

    Looking at the big picture, ensure that you have the right to extract these data sets in the first place. And if you do, consider searching for an API which would be less brittle than parsing HTML for the results.


    🦛

Re: Script to scrape data
by soonix (Chancellor) on Dec 02, 2024 at 15:14 UTC
    According to Google Translate, the site states that all its data comes from the public domain. The first two commenters ended their posts with a hint about an API/"connection point", which the operator of that site (someone called Diego, apparently) could probably tell you about if you ask him. Another possibility would be looking at his source, which might be an API at the department of traffic or the corresponding ministry.
Re: Script to scrape data
by 1nickt (Canon) on Dec 02, 2024 at 11:17 UTC

    Hi,

    what does "does not properly store the data" mean? It's too vague a description for anyone to be able to help.


    The way forward always starts with a minimal test.
Re: Script to scrape data
by harangzsolt33 (Deacon) on Dec 02, 2024 at 01:22 UTC
    Well... well... I will be very curious to see what others tell you.

    With my limited knowledge, this is how I would solve this problem: I would write a JScript on Windows that loads the website, enters the dates, performs the search, presses CTRL+A to select all and then copies. Then save the data from the clipboard. Fortunately, it comes in a format that can easily be pasted into an Excel spreadsheet or a text file, so all the data can be processed easily. Once the first page is saved, I would load the next page, save that one, load the next, save, and so forth.

    How much data do you need to download using this method? How many pages? Do you want to download the entire website?

    This site contains millions of entries (73,902,123 records, to be precise). Each one can be accessed individually from the website like this, where you can modify the record number: https://desmace.com/tramite?id=71015976 That number at the end of the URL is the record number. You can change it to any number between 1 and 73,902,123, and it will show you one record. The problem is, of course, that the individual records are not sorted by date, so you can't just download all the records between X and Y; they seem to be in random order. And if you tried to download each piece of data, it might take about 100 years to fetch every individual page. But if you could do that, then you would have a copy of the entire database on your device, which you could format, search and filter in any way imaginable using nothing but Perl.
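For completeness, a sketch of that record-by-id approach for a small sample. The URL pattern is taken from the post; everything else (what the page returns, the output filenames) is an assumption, and as noted above, fetching all ~74 million records this way is impractical:

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch a handful of individual records by id (URL pattern from the post).
# Be polite to the server: take only a small sample and sleep between hits.
for my $id (1 .. 5) {
    my $html = get("https://desmace.com/tramite?id=$id");
    unless (defined $html) {
        warn "record $id: fetch failed\n";
        next;
    }
    # The page structure is unknown here, so just save the raw HTML
    # (hypothetical filenames) for later parsing.
    open my $out, '>', "tramite_$id.html" or die "cannot write: $!";
    print {$out} $html;
    close $out;
    sleep 1;    # rate-limit
}
```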

    I am not sure what your goal is, so I don't know what to tell you.

    What I did find is that this website is also available as an app for Android and Windows... which means it probably has a connection point where you could just ask the site for a short list of raw data. There is also an English version of this website, which helped me understand what this whole thing is about: https://stolen.desmace.com/list-stolen-cars-spain/#search-form (I don't understand any Spanish.)
