When extracting data from details pages (pages reached by following a link from the start page), use the ‘Capture Following Text’ option whenever possible so that data is scraped correctly and consistently.
This is because the layout and the amount of data displayed on details pages may not be consistent. For example, if you are scraping Amazon product listings, the data displayed on the product details page (the page reached by clicking the product link in the search results) may vary slightly from product to product. Here, if you are trying to extract the Shipping Weight under Product Details, instead of clicking on the data (example: ‘1.2 pounds’), click on the heading ‘Shipping Weight’ and apply the ‘Capture following text’ option under the ‘More Options’ button.
In summary, if the data to be extracted appears under a heading, always click the heading and apply the ‘Capture following text’ option. This ensures that the data is scraped from all similar pages without missing any, even if the page content varies slightly.
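The reason anchoring on the heading is more robust can be sketched outside WebHarvy. The snippet below is only an illustration of the idea, not WebHarvy's implementation: the function name `capture_following_text` and both sample pages are made up for this example. It finds the heading text and returns whatever text comes next, so the value is found even when its position in the page changes.

```python
import re
from html import unescape
from typing import Optional

def capture_following_text(html: str, heading: str) -> Optional[str]:
    """Return the first non-empty text that follows `heading`, or None."""
    # Split on tags so that only the text between tags remains, in page order.
    chunks = [unescape(c).strip(" :\n\t") for c in re.split(r"<[^>]+>", html)]
    chunks = [c for c in chunks if c]
    for i, chunk in enumerate(chunks):
        if chunk.lower() == heading.lower() and i + 1 < len(chunks):
            return chunks[i + 1]
    return None

# Two hypothetical details pages with different layouts and field order.
page_a = "<ul><li><b>Shipping Weight:</b> 1.2 pounds</li><li><b>Color:</b> Red</li></ul>"
page_b = ("<table><tr><td>Item Weight</td><td>1 pound</td></tr>"
          "<tr><td>Shipping Weight</td><td>2.5 pounds</td></tr></table>")

print(capture_following_text(page_a, "Shipping Weight"))
print(capture_following_text(page_b, "Shipping Weight"))
```

A selector tied to the position of ‘1.2 pounds’ on the first page would point at the wrong cell on the second page, while the heading-based lookup still finds the right value.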
WebHarvy allows you to scrape the HTML of page content in addition to plain text. In the Capture window, click the ‘More Options’ button and select the ‘Capture HTML’ option to scrape the HTML of the selected content.
To capture only a portion of the displayed HTML, you may select and highlight the required portion before clicking the Capture button.
Typically, Regular Expressions are then applied to the HTML source of the content to extract the data of interest, such as an image URL, or hidden fields such as a phone number.
The following video shows how the ‘Capture HTML’ option is used along with Regular Expressions to correctly extract the product price.
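As a rough illustration of why the HTML source can be easier to match than the visible text, here is a minimal Python sketch. The markup, the `data-amount` attribute and the function name are assumptions made up for this example; real pages will differ, and the pattern you type into WebHarvy has to be adapted to the actual HTML.

```python
import re

# Hypothetical markup: the visible text is split across tags ("$19" and "99"),
# but the full amount sits in an attribute that only the HTML source exposes.
SAMPLE_HTML = '<span class="price" data-amount="19.99"><b>$19</b><sup>99</sup></span>'

# The parenthesised group is the part of the match that gets extracted.
PRICE_ATTR = re.compile(r'data-amount="(\d+(?:\.\d+)?)"')

def extract_price(html):
    """Return the price captured by the first group, or None if absent."""
    match = PRICE_ATTR.search(html)
    return match.group(1) if match else None

print(extract_price(SAMPLE_HTML))  # prints 19.99
```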
Try out the free evaluation copy of WebHarvy from https://www.webharvy.com/download.html.
Certain web pages require you to click a link or button before the data is displayed. On many websites, email addresses or phone numbers are only partially displayed and are shown in full only when you click on them.
The ‘Click’ option under ‘More Options’ button in the Capture Window lets you scrape data in such scenarios. (See https://www.webharvy.com/tour1.html#ScrapeHidden).
The following video shows how this option can be used to scrape hidden fields.
Here the phone numbers are partially displayed. Using the Click option, they can be made fully visible and then scraped.
To know more about the features of WebHarvy, see the product feature tour at https://www.webharvy.com/tour.html.
WebHarvy is designed as a ‘point and click’ visual Web Scraper. The design concentrates on ease of use, so that you can start scraping data within a few minutes of downloading the software.
If you need more control over what is extracted, you can use Regular Expressions (RegEx) with WebHarvy. WebHarvy allows you to extract data by matching RegEx strings against the text content as well as the HTML source of the web page.
If you are new to Regular Expressions, see http://en.wikipedia.org/wiki/Regular_expression.
The following video shows how WebHarvy can be used to scrape the image URL from a web page by applying a Regular Expression.
The ‘Capture More Content’ feature comes in handy here (as shown in the video) to make sure that the selected text contains the data (text or HTML code) of interest before the RegEx string is applied.
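For readers new to RegEx, the image URL case can be tried out in a few lines of Python before typing a pattern into WebHarvy. This is a minimal sketch under stated assumptions: the sample HTML and the helper name `image_urls` are invented for illustration, and a regular expression like this handles simple quoted `src` attributes only, not every form of HTML.

```python
import re

# Match an <img> tag and capture the quoted value of its src attribute.
# \s before "src" avoids matching attributes such as data-src.
IMG_SRC = re.compile(r'<img[^>]*\ssrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def image_urls(html):
    """Return the src value of every <img> tag found in the HTML string."""
    return IMG_SRC.findall(html)

# Hypothetical captured HTML, as might be obtained after widening the selection.
sample = (
    '<div class="product">'
    '<img class="thumb" src="https://example.com/a.jpg" alt="A">'
    "<IMG SRC='/b.png'>"
    "</div>"
)

print(image_urls(sample))
```

If the selected HTML does not contain the `<img>` tag at all, no pattern can match it, which is why widening the selection first matters.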
Regular Expressions can also be applied directly on the text content of the page as shown in the following video.
To explore further, download the latest version of WebHarvy from https://www.webharvy.com/download.html.
The 3.1 update of WebHarvy, which was released yesterday (July 24), has the following changes.
- Added option to Tag captured data rows with corresponding Keyword/Category. (Applicable only for Keyword/Category based Scraping). See the new Miner Settings Window (Edit menu – Settings)
- Option to separately set Page Load Timeout and AJAX Load Wait Time in Miner Settings.
- Option to edit the start URL / Post Data / Headers for the configuration directly from the UI, without editing the XML configuration file. (under Edit menu – Edit Options)
- Updates related to Category Scraping, Capture Text following a Heading, Mining multiple pages
- Bug Fixes
Download and install the latest update from https://www.webharvy.com/download.html.
We are happy to announce the release of WebHarvy 3.0. We have added a lot of new features in this major update. The feature/changes list for this update is the longest of all the product updates we have released to date. Here we go.
- Added the following options in the Capture Window (grouped under ‘More Options’)
  - Capture following text: Improved by using brute force search for all elements in the page
  - Capture HTML: Option to scrape HTML of selected element
  - Capture Text as File: Option to scrape text and save it as a local file (useful while scraping articles and blog posts)
  - Click: Ability to scrape hidden (partially displayed) fields in webpages which require a click from the user to be displayed in full. For example, phone numbers or email addresses which are displayed completely only if you click them.
  - Apply Regular Expression: Option to apply Regular Expressions (RegEx) on captured text. RegEx can be applied even after applying the ‘Capture following text’, ‘Capture HTML’ and ‘Capture More Content’ options.
  - Capture More Content: Option to capture more text than the selected text by capturing the parent element’s text. For example, this would capture the entire article if you apply this option after having selected the first paragraph.
- Option to individually select categories/links (one by one) for Category Scraping (Mine menu – Scrape a list of similar links)
- Export captured data as JSON
- Ability to mine data from tables (row-column / grid layout)
- Ability to mine pages which have fewer (less than 10) data items
- Option to test proxies before using them (Edit menu – Settings – Proxy Settings)
- Non-responsive proxies are skipped during mining. Mining does not stop because of a bad/non-responsive proxy in the list.
- Option to manually add URLs to an existing configuration (Edit menu – Add URLs to configuration)
- Option to remove duplicates while mining (Edit menu – Settings – Miner)
- Added ‘Hourly’ frequency option in Scheduler (Mine menu – Scheduler)
- Added option to export data directly to database for scheduled mining tasks & command line
- Added ‘Clear’ option in Edit menu which will clear both the browser and data preview pane
- Language encoding defaulted to ‘utf-8’ for file exports (XML, CSV etc)
- CSV/Database export: handles delimiters (comma, quotes etc) in captured data
- Keyword/Category scraping allowed for 2 entries in evaluation version
- Rendering issues with in-built browser fixed – defaults to IE 9 rendering
- New Installer built with InstallShield
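Several items in the list above (duplicate removal, delimiter handling in CSV export, JSON export) concern what happens to captured rows after scraping. The following Python sketch shows, in general terms, what those steps involve. It is not WebHarvy's code, and the sample rows and function names are invented for illustration.

```python
import csv
import io
import json

# Hypothetical captured rows; the first name contains a comma and quotes.
rows = [
    {"name": 'Widget, "Deluxe"', "price": "19.99"},
    {"name": "Gadget", "price": "5.00"},
    {"name": "Gadget", "price": "5.00"},  # exact duplicate of the row above
]

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def to_csv(rows):
    """Serialise rows to CSV text; the csv module quotes delimiters as needed."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

unique = drop_duplicates(rows)
csv_text = to_csv(unique)
json_text = json.dumps(unique, ensure_ascii=False, indent=2)
print(csv_text)
print(json_text)
```

Reading the CSV text back with `csv.DictReader` returns the original values, embedded comma and quotes included, which is the property that delimiter handling is meant to guarantee.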
Download the latest installation of WebHarvy Web Scraper from https://www.webharvy.com/download.html.
We’ve just released the latest update of USBTrace, the USB analyzer software for Windows. The changes in this update are:
- Added option to timestamp captured requests in ‘system’ time (HH:MM:SS:milliSeconds)
- Added decoding of bConfigurationValue in Configuration Descriptor
- Captured Data Export (HTML, CSV, XML) made faster
- Added headers for HTML and XML export files
- Updated USB device list for VID/PID decoding
- Layout of Search/Filter/Trigger windows changed
- More support for Windows 8 and USB 3.0 (SuperSpeed USB)
- Minor bug fixes
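The ‘system’ time format mentioned in the first item (HH:MM:SS:milliSeconds) can be illustrated with a short Python sketch. This only shows how such a string can be produced from a timestamp. The function name and the three-digit zero padding of the milliseconds are assumptions, not details taken from USBTrace.

```python
from datetime import datetime

def format_system_time(ts: datetime) -> str:
    """Format a timestamp as HH:MM:SS:mmm, with zero-padded milliseconds."""
    millis = ts.microsecond // 1000
    return f"{ts:%H:%M:%S}:{millis:03d}"

# An arbitrary example timestamp (42000 microseconds = 42 milliseconds).
example = datetime(2013, 1, 2, 14, 5, 9, 42000)
print(format_system_time(example))  # prints 14:05:09:042
```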
You may download the 15-day free trial version of USBTrace USB Analyzer from http://www.sysnucleus.com/usbtrace_download.html.