WebHarvy based on Google Chrome Released (version 5.0.1.148)

This release comes with few bells and whistles, since we have not added features or changed the look of the software. Still, this is a major upgrade. The change is entirely internal.

WebHarvy has been using Microsoft’s Internet Explorer (IE) as its internal browser since its inception. Microsoft stopped actively developing IE a few years back when they introduced the Edge browser.

So WebHarvy had to switch to another solution to power its internal browser, and we believe using Google’s Chrome Browser Project is the way forward. This makes WebHarvy more stable, faster and more secure. Switching to Chrome also opens up the possibility of porting the software to other platforms like Mac and Linux.

You may download and install the latest version, which is based on the Chrome browser, from the following link.

http://www.webharvy.com/webharvysetup.exe

As mentioned before, the change from IE to Chrome is internal to the software and transparent to the user. So the configuration process and user interface of WebHarvy remain the same.

Minor Changes

  1. For scraping data from sites which require login, the steps have been simplified. You no longer need to log in to the website separately from IE. See https://www.webharvy.com/articles/sites-requiring-login.html

  2. The ‘Internet Options’ menu option under the Edit menu has been removed. Instead, a new Browser options tab has been added to the Settings window.

Running configuration files created with the older IE-based version on the new Chrome-based version

Configuration files created using the old version should normally work fine with the new version which is based on Chrome, but there will be exceptions. In such cases we recommend that you create a new configuration using the latest version.

As always, if you have any questions or need assistance, you may contact our support at https://www.webharvy.com/support.html

Posted in Release update, WebHarvy

WebHarvy 4.1.5.141 released

The main changes in this release are:

  1. Pagination via JavaScript – see https://www.webharvy.com/tour3.html#JS

    This powerful feature is the main highlight of this release. When all other pagination methods fail, you can directly provide JavaScript code which, when run, loads the next page.

  2. Increased size of virtual browser used by miner

    The dimensions of the miner’s virtual browser have been increased. This solves issues with websites whose layout changes when the browser window is small (mobile layout). It also helps the miner load more items per page and per scroll on websites which display data based on the size of the browser window.

  3. Support for ‘Load more content’ & ‘Scroll to load next page’ type pagination even when the real listing page is reached by clicking links/buttons from the start page.

    In earlier versions, pagination would fail if the listing page loaded more data in the same page via a button/link click or scroll, and initial navigation (click, JavaScript etc.) was required in the configuration itself to reach the listing page from another start page. This release removes this limitation.

  4. More support for extracting data from popups.

    Popups now handle clicks and JavaScript. This can be used to close the popup window in cases where the currently open popup must be closed before the next one can be opened.

  5. SQL data export encoding issue related to foreign languages fixed.

    Encoding issues while exporting text in non-English languages such as Chinese have been fixed.

  6. Other minor bug fixes

As always you may download and install the latest version from https://www.webharvy.com/download.html.

Posted in Release update, WebHarvy, WebHarvy Feature

Scraping high resolution images from pinterest.com

In this blog post, we will take a look at how to scrape images from www.pinterest.com in their full size. We follow a two-stage extraction process to capture the high-res images from pinterest.com.

In the first extraction stage, we capture the image URLs which are present in the listings page. These URLs actually point to smaller images (236 pixels). Then, using any text editor, we replace /236x/ with /564x/ in all the URLs.

For example the URL : https://s-media-cache-ak0.pinimg.com/236x/99/….

is modified to : https://s-media-cache-ak0.pinimg.com/564x/99/….
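
The search-and-replace step can also be scripted rather than done by hand in a text editor. The following is a minimal Python sketch, not part of WebHarvy. The function names and the sample URL path are hypothetical, and it assumes the size segment appears in each URL exactly as /236x/.

```python
def to_full_size(url):
    """Swap the first /236x/ size segment for /564x/."""
    return url.replace("/236x/", "/564x/", 1)


def convert_urls(lines):
    """Convert every non-blank line, dropping surrounding whitespace."""
    return [to_full_size(line.strip()) for line in lines if line.strip()]


# Hypothetical sample URL; real URLs come from the first extraction stage.
sample = ["https://s-media-cache-ak0.pinimg.com/236x/ab/cd/ef/example.jpg\n"]
for url in convert_urls(sample):
    print(url)
```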

In the second extraction stage, we use the ‘Add URLs’ method to add the modified URLs and scrape the full-sized images (564 pixels) from each of these URLs using a single WebHarvy configuration.

This method is demonstrated in the following video:

Links:

  1. Know more about WebHarvy, the easy to use visual web scraper
  2. WebHarvy video tutorial series
  3. Various methods of extracting images and image URLs using WebHarvy
  4. Download free WebHarvy trial

Have any questions?

Contact us

Posted in Use Case, WebHarvy

WebHarvy 4.0.3.129 (Installer Update Only)

This update addresses problems installing .NET 4.5 on Windows 7 (and earlier Windows versions where .NET 4.5 is not present) during the installation process. Only the installer has been updated in this release; the WebHarvy application files are unchanged from the immediately preceding version. So if you are already running 4.0.3.128, you can ignore this version.

You may download and try the latest version from https://www.webharvy.com/download.html. Let us know in case you have any questions.

Posted in Release update, WebHarvy

Windows Smartscreen warning while installing WebHarvy

All WebHarvy application files and the installation package are digitally signed (Comodo RSA Code Signing CA) and secured. However, if you get the following SmartScreen warning while trying to install the latest version of WebHarvy, please click the ‘More info’ link and then click the ‘Run anyway’ button to proceed with the installation.

smartscreen1.png

smartscreen2.png

The above popup message is displayed for two reasons. We recently changed our .NET dependency from 3.5 to 4.5, which considerably reduces the installation package size. More importantly, the code signing agency of our digital certificate has changed from GlobalSign to Comodo. So the above warning may appear until the new WebHarvy installer gains enough reputation with Microsoft, which will take a few weeks. If you have any questions or require assistance, please do not hesitate to contact our support.

Posted in WebHarvy

WebHarvy 4.0.3.128 (Minor Update)

From this release onwards, WebHarvy targets (depends on) .NET 4.5, which comes pre-installed on the latest Windows editions. This results in a smoother installation process, doing away with the .NET 3.5 download and install which was previously required. Targeting .NET 4.5 also helps WebHarvy improve performance and resource usage, and solves issues related to crashes while trying to extract data from certain websites.

The changes in this release are:

  1. Depends on .NET 4.5
  2. More support for pages where next page link is implemented in JavaScript
  3. Handles pagination where next page link (next link or ‘show more data’ link) contains a number which varies from page to page
  4. Minor bug fixes related to running JavaScript code on page, opening popup and following links by using regular expressions.
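
As an illustration of item 3, a link whose text contains a changing number can be matched with a regular expression that accepts any digits in that position. This is a minimal Python sketch of the general idea, not how WebHarvy implements it. The link text format ‘Show more (N)’ and the function name are hypothetical.

```python
import re

# Hypothetical link text: "Show more (N)", where N differs on every page.
NEXT_LINK = re.compile(r"^Show more \(\d+\)$")


def find_next_link(link_texts):
    """Return the first link text matching the pattern, or None."""
    for text in link_texts:
        stripped = text.strip()
        if NEXT_LINK.match(stripped):
            return stripped
    return None


# The same pattern matches regardless of the number shown on each page.
print(find_next_link(["Home", "Show more (20)", "Contact"]))
print(find_next_link(["Home", "Show more (40)", "Contact"]))
```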

As always you may download and try the latest version from https://www.webharvy.com/download.html. Let us know in case you have any questions.

Posted in Release update, WebHarvy

WebHarvy 4.0.2.125 – Multi-level Category / Multi-list Keyword scraping

In this release we have introduced support for scraping multi-level categories (a tree of main categories and subcategories) and for multiple input keyword lists. The main features are:

True multi-level Category Scraping

WebHarvy now supports automatically navigating category/subcategory lists of a website to extract data from the final listing pages. Know More.

Support for multiple input keywords

Any number of input text fields can be populated with lists of strings/keywords during configuration. WebHarvy will automatically apply all combinations of provided keywords during the mining phase. Know More.
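
To illustrate what ‘all combinations’ means here, the following is a minimal Python sketch using itertools.product. It is not WebHarvy's implementation; the function name and the sample keyword lists are assumptions made for the example.

```python
from itertools import product


def keyword_combinations(*keyword_lists):
    """Return every combination that takes one keyword from each list."""
    return list(product(*keyword_lists))


# Hypothetical keyword lists for two input text fields.
professions = ["plumber", "electrician"]
cities = ["Boston", "Chicago"]

# Two lists of two keywords each give 2 x 2 = 4 searches.
for profession, city in keyword_combinations(professions, cities):
    print(profession, city)
```

The number of combinations is the product of the list lengths, so long keyword lists multiply the number of searches quickly.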

 

Capture window with new options

webharvy

Run JavaScript on Page

Run the specified JavaScript code on the page. Know More. This option can be used to load elements on a page when this cannot be done using the default navigation options (link-follow, click) provided by WebHarvy.

Input strings to text input fields

Strings to be input to text fields can now be made a part of the configuration. Know More. Earlier, such parameters were automatically taken from the PostData of the configuration. But with some websites the PostData will not contain the input strings submitted, and this option helps to correctly load the page displaying data during the mining phase.

Extract data from Popups

Helps to extract data by clicking each listing link/button and getting data from a popup window or from a view in the same page which is populated with data. This is different from the ‘Follow this link’ option because here the data is loaded on the same page (no page navigation), and different from the ‘Click’ option because after clicking each link, data has to be extracted from the page before clicking the next link. Know More.

Option to smoothly scroll page during mining to load all contents (lazy loading)

Smoothly scrolls to the end of the page to load elements (for example, lazily loaded images) which are loaded only when they are made visible by scrolling down. Know More.

Select drop-down/list-box options

Select drop-down/list-box/combo-box options during configuration and mining. Again, this option allows navigation to result pages when normal configuration is unable to make these selections and load the result page. Know More.

Other minor additions include:

  1. Improvements in automatic scraping of multiple product images
  2. Support for loading keyword lists directly from file
  3. ‘Capture Image’ option automatically enabled via HTML/RegEx method in applicable cases.
  4. Name downloaded image files by value obtained from a column/cell in miner data table. More.
  5. Allows applying ‘Capture More Content’ after selecting ‘Capture HTML’.
  6. Quick access to items under ‘More Options’ in Capture window via toolbar buttons.
  7. Minor bug fixes.

You may download and try the latest version from https://www.webharvy.com/download.html.

Posted in Release update, Uncategorized, WebHarvy